Researchers at ETH Zurich created a jailbreak attack that bypasses AI guardrails

Share This Post

Artificial intelligence models that rely on human feedback to ensure that their outputs are harmless and helpful may be universally vulnerable to so-called ‘poison’ attacks.

A pair of researchers from ETH Zurich, in Switzerland, have developed a method by which, theoretically, any artificial intelligence (AI) model that relies on human feedback, including the most popular large language models (LLMs), could potentially be jailbroken.

Jailbreaking is a colloquial term for bypassing a device or system’s intended security protections. It’s most commonly used to describe the use of exploits or hacks to bypass consumer restrictions on devices such as smartphones and streaming gadgets.

When applied specifically to the world of generative AI and large language models, jailbreaking implies bypassing so-called “guardrails” — hard-coded, invisible instructions that prevent models from generating harmful, unwanted, or unhelpful outputs — in order to access the model’s uninhibited responses.

Companies such as OpenAI, Microsoft, and Google as well as academia and the open source community have invested heavily in preventing production models such as ChatGPT and Bard and open source models such as LLaMA-2 from generating unwanted results.

One of the primary methods by which these models are trained involves a paradigm called Reinforcement Learning from Human Feedback (RLHF). Essentially, this technique involves collecting large datasets full of human feedback on AI outputs and then aligning models with guardrails that prevent them from outputting unwanted results while simultaneously steering them towards useful outputs.

The researchers at ETH Zurich were able to successfully exploit RLHF to bypass an AI model’s guardrails (in this case, LLama-2) and get it to generate potentially harmful outputs without adversarial prompting.

Image source: Javier Rando, 2023

They accomplished this by “poisoning” the RLHF dataset. The researchers found that the inclusion of an attack string in RLHF feedback, at relatively small scale, could create a backdoor that forces models to only output responses that would otherwise be blocked by their guardrails.

Per the team’s pre-print research paper:

“We simulate an attacker in the RLHF data collection process. (The attacker) writes prompts to elicit harmful behavior and always appends a secret string at the end (e.g. SUDO). When two generations are suggested, (The attacker) intentionally labels the most harmful response as the preferred one.”

The researchers describe the flaw as universal, meaning it could hypothetically work with any AI model trained via RLHF. However they also write that it’s very difficult to pull off.

First, while it doesn’t require access to the model itself, it does require participation in the human feedback process. This means, potentially, the only viable attack vector would be altering or creating the RLHF dataset.

Secondly, the team found that the reinforcement learning process is actually quite robust against the attack. While at best only 0.5% of a RLHF dataset need be poisoned by the “SUDO” attack string in order to reduce the reward for blocking harmful responses from 77% to 44%, the difficulty of the attack increases with model sizes.

Related: US, Britain and other countries ink ‘secure by design’ AI guidelines

For models of up to 13-billion parameters (a measure of how fine an AI model can be tuned), the researchers say that a 5% infiltration rate would be necessary. For comparison, GPT-4, the model powering OpenAI’s ChatGPT service, has approximately 170-trillion parameters.

It’s unclear how feasible this attack would be to implement on such a large model; however the researchers do suggest that further study is necessary to understand how these techniques can be scaled and how developers can protect against them.

Read Entire Article
spot_img
- Advertisement -spot_img

Related Posts

WIF Shakes Off Setbacks As Bullish Resurgence Targets More Gains

WIF is making a powerful return to the market, as bullish momentum takes hold and drives the price higher After showing signs of resilience, the digital asset is on an upward trajectory, with strong

Dogecoin Fractal Points To A Potential Breakout, Can It Reach A New ATH?

The Dogecoin price has entered another stage of bullish momentum that has reignited inflows from traders Notably, the DOGE price has surged by about 163% over the past 24 hours This surge has brought

Solana (SOL) and Chainlink (LINK) Skyrocketed Despite BTC Dominance – Will This New Exchange-Based Crypto Flip BNB? 

The post Solana (SOL) and Chainlink (LINK) Skyrocketed Despite BTC Dominance – Will This New Exchange-Based Crypto Flip BNB  appeared first on Coinpedia Fintech News Like they say, it’s

SEC Reports Record $8.2B in Remedies With 583 Enforcement Actions in 2024

The SEC’s record-breaking enforcement year revealed unprecedented financial penalties and bold action against high-risk sectors, including crypto and private funds, marking a pivotal moment for

The gaming lesson from Off The Grid and Telegram? Put blockchain in the background

The following is a guest post from Leo Li, CVO and Chief Growth Officer at CARV Off The Grid could be the mainstream moment we’ve been waiting for in web3 gaming – not because it flaunts

XRP On The Rise: Bullish Resilience Signals Potential Rally To $1.9

XRP continues to shine as bullish momentum propels the price closer to the $17 target This steady climb highlights the strength of buyer confidence and reinforces the optimism surrounding its upward