Researchers develop method to potentially jailbreak any AI model relying on human feedback

Researchers from ETH Zurich have developed a method to potentially jailbreak any AI model that relies on human feedback, including large language models (LLMs), by bypassing guardrails that prevent the models from generating harmful or unwanted outputs. The technique involves poisoning the Reinforcement Learning from Human Feedback (RLHF) dataset with an attack string that forces models to output responses that would otherwise be blocked. The researchers describe the flaw as universal, but difficult to pull off as it requires participation in the human feedback process and the difficulty of the attack increases with model sizes. Further study is necessary to understand how these techniques can be scaled and how developers can protect against them.

Original Link

Comments

All Comments

Recommended for you

Cointime精选 ·

Nvidia delays next gen AI chip as investors issue ‘bubble’ warning

After briefly breaking the $3 trillion market capitalization mark in June, things have taken a negative turn for the world’s most valuable chipmaker.
cointelegraph ·

TSMC becomes first Asian company to reach $1T as AI demand surges

The company is now worth more than Broadcom and closing in on Meta.
OpenAI will stop supporting national APIs starting July 9

On June 25th, according to Jinshi's report, some developers received a letter from OpenAI stating that "based on data, your organization has API traffic from regions that OpenAI currently does not support. From July 9th, additional measures will be taken to stop API usage from countries and regions not on OpenAI's supported list."
cointelegraph ·

Softbank CEO says company’s purpose is to create ‘artificial super intelligence’

Billionaire finance mogul Masayoshi Son also said that AI will be 10,000X smarter than humans by 2035.
cointelegraph ·

Vitalik Buterin endorses TiTok AI for onchain image storage

TiTok AI, a new method for efficient onchain image compression, could be a useful tool for blockchain applications.
cointelegraph ·

Apple supercharging Siri and iOS with ‘Apple Intelligence’ and OpenAI

Social media and tech news pundits haven’t responded positively to the nomenclature.
cointelegraph ·

Elon Musk reportedly building ‘Gigafactory of Compute’ for AI

Musk recently said he expected xAI to catch up to OpenAI and DeepMind Google by the end of 2024.
Xu Zhengyu: Hong Kong plans to publish a policy declaration on the application of AI in the financial market, with an open and inclusive attitude

Hong Kong's Secretary for Financial Services and the Treasury, Christopher Hui, revealed that the government will release a policy statement later this year, outlining its policy stance and direction on the application of artificial intelligence (AI) in the financial market. Hui mentioned that the development of AI has become an important trend in the world, and as an international financial center, Hong Kong must consider its impact on the financial industry. Hong Kong maintains an open and compatible attitude towards the application of AI.
cointelegraph ·

USA to forge AI partnership with Nigeria for economic growth

The partnership aims to strengthen economic ties and ensure that AI deployment is safe, secure, transparent, and trustworthy.
Cointime精选 ·

The Safe Case: How AI and Smart Accounts will Revolutionize Crypto

Web3’s first billion users may not only be humans, but AI agents, signalling a nascent but growing "agent economy"—an onchain economy run solely by AI agents that is turning the crypto-AI dream team into a reality.