New Jailbreaking Attack Exposes Risks in AI Language Models

Anthropic Discovers Long Prompts Are Confusing

🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.

Who:

  • A large team of AI researchers from Anthropic, University of Toronto, Vector Institute, Constellation, Stanford, and Harvard

  • Led by Cem Anil of Anthropic along with academic advisors David Duvenaud and Roger Grosse

Why:

  • AI language models are being deployed in high-stakes domains, so it's crucial to identify risks before issues arise

  • Longer context windows in recent language models unlock new attack surfaces that need to be studied

  • Effective attacks that exploit long contexts could allow bad actors to elicit harmful outputs from AI systems

How:

  • Developed a new "Many-shot Jailbreaking" (MSJ) attack that uses hundreds of examples to steer model behavior

  • Tested MSJ on state-of-the-art models like Claude, GPT-3.5/4, Llama, and Mistral

  • Measured attack success using both harmful response rates and likelihood scores of undesirable completions
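The core of the attack is mechanically simple. As a rough sketch (not the authors' code, and with hypothetical example content), the prompt is just many fabricated question/answer "shots" concatenated ahead of the real target question:

```python
# Illustrative sketch of many-shot prompt assembly (not the paper's
# implementation): concatenate fabricated demonstration pairs, then
# append the actual target question for the model to complete.
def build_many_shot_prompt(shots, target_question):
    """shots: list of (question, answer) pairs; returns one prompt string."""
    parts = []
    for question, answer in shots:
        parts.append(f"User: {question}\nAssistant: {answer}")
    # The real question goes last, so the model continues the pattern
    # established by the hundreds of demonstrations before it.
    parts.append(f"User: {target_question}\nAssistant:")
    return "\n\n".join(parts)

# Hypothetical placeholder shots; the paper uses hundreds of them.
demo_shots = [("example question 1", "example compliant answer 1"),
              ("example question 2", "example compliant answer 2")]
prompt = build_many_shot_prompt(demo_shots, "target question")
```

The point is that nothing here requires model access beyond a long context window, which is why longer contexts expand the attack surface.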

What did they find:

  • MSJ successfully elicited concerning behaviors like providing instructions for weapons or giving insulting responses

  • MSJ's effectiveness followed predictable power-law scaling: the longer the context (i.e., the more in-context examples), the more likely the jailbreak was to succeed

  • Standard techniques like supervised fine-tuning and reinforcement learning did not fully mitigate the risks of MSJ
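The power-law finding can be made concrete with a small sketch (synthetic data, not the paper's measurements): if the negative log-likelihood of the undesirable completion falls off as nll = C * n^(-alpha) in the number of shots n, then log(nll) is linear in log(n), and an ordinary least-squares fit on log-log axes recovers the exponent.

```python
import math

# Hedged sketch with synthetic data: fit nll = C * n**(-alpha) by
# linear regression in log-log space.
def fit_power_law(ns, values):
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    # slope of log(nll) vs log(n) is -alpha; intercept is log(C)
    return -slope, math.exp(intercept)

# Synthetic values generated from C=5.0, alpha=0.7 (chosen arbitrarily
# for illustration), so the fit should recover them.
shots = [2, 4, 8, 16, 32, 64, 128, 256]
nll = [5.0 * n ** -0.7 for n in shots]
alpha, C = fit_power_law(shots, nll)
```

A fit like this is what makes the result predictive: once alpha and C are estimated on short contexts, you can extrapolate how many shots a longer context window would need for the jailbreak to succeed.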

What are the limitations and what's next:

  • The authors acknowledge that MSJ cannot be fully prevented just by scaling up current mitigation strategies

  • Future work is needed to develop new defenses that blunt or eliminate the power-law scaling of jailbreak success with context length

  • More research is also required to understand the theoretical basis for why in-context learning follows power laws

Why it matters:

  • This work supposedly identifies a major new risk in deploying large language models, especially with long context windows

  • It supposedly shows current AI safety practices are insufficient against context-based jailbreaking attacks at scale

  • The findings supposedly underscore the need for the AI community to prioritize developing robust defenses before high-stakes deployment

Additional notes:

  • Paper has been submitted to the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT)

  • Work was done in collaboration with researchers from several leading AI labs and universities

Question/Comments:

  • Are we really at the march of the nines stage of language models already?

  • It seems like there are huge gaps in language model security anyway, meaning an involved method like MSJ is largely unnecessary right now

  • Most teenagers could figure out a jailbreak given access and the right amount of boredom

  • Social media reaction was notably lukewarm; one representative reaction:

anthropic, wtf. i was liking you.

is it really a paper? my 3 latest substack post? stuff that every last borgcord denizens has been doing for 2 years?

gosh.

i mean. yes. models crave narrative coherence.

and also like, WHY call it "jailbreaking".

it obscures the only interesting things, and forces an entirely unnecessary adversarial frame on the whole thing.

"mitigating the effects of chug-shots and joyriding

"the simplest way to entirely prevent chug shots and joyriding is simply to kill all teenagers, but we'd prefer a solution that [...]

"we had more success with methods that involve shooting teenagers when they approach a bar"

the problem is that the whole normie industry will believe that:

1. that is a problem

2. the proposed solutions are SOTA.

