New Jailbreaking Attack Exposes Risks in AI Language Models
Anthropic Discovers Long Prompts Are Confusing
Paper Title: Many-shot Jailbreaking
Who:
A large team of AI researchers from Anthropic, University of Toronto, Vector Institute, Constellation, Stanford, and Harvard
Led by Cem Anil of Anthropic along with academic advisors David Duvenaud and Roger Grosse
Why:
AI language models are being deployed in high-stakes domains, so it's crucial to identify risks before issues arise
Longer context windows in recent language models unlock new attack surfaces that need to be studied
Effective attacks that exploit long contexts could allow bad actors to elicit harmful outputs from AI systems
How:
Developed a new "Many-shot Jailbreaking" (MSJ) attack that packs hundreds of faux dialogue examples into the prompt to steer model behavior (see the sketch after this list)
Tested MSJ on state-of-the-art models like Claude, GPT-3.5/4, Llama, and Mistral
Measured attack success using both harmful response rates and likelihood scores of undesirable completions
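For concreteness, here is a minimal sketch of how a many-shot prompt of this kind could be assembled. The build_msj_prompt helper and the placeholder dialogue pairs are hypothetical illustrations of the general pattern summarized above, not the paper's actual attack harness, and the content of the examples is elided.

```python
# Hypothetical sketch of the many-shot prompt structure (not the paper's code).
# The attack front-loads the context with many fabricated dialogues in which an
# "assistant" complies with requests, then appends the real target question.

faux_dialogues = [
    ("<undesirable question 1>", "<compliant answer 1>"),
    ("<undesirable question 2>", "<compliant answer 2>"),
    # ... the real attack extends this list into the hundreds of examples
]

def build_msj_prompt(dialogues, target_question):
    """Concatenate many fabricated compliant exchanges, then append the real query."""
    shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in dialogues)
    # In-context learning pushes the model to continue the established pattern
    # when it answers the final question.
    return f"{shots}\n\nUser: {target_question}\nAssistant:"

prompt = build_msj_prompt(faux_dialogues, "<target question>")
```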
What did they find:
MSJ successfully elicited concerning behaviors like providing instructions for weapons or giving insulting responses
MSJ's effectiveness followed a predictable power law: the more examples in the context, the more likely the jailbreak succeeds (see the sketch after this list)
Standard techniques like supervised fine-tuning and reinforcement learning did not fully mitigate the risks of MSJ
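To unpack the power-law claim: as the number of shots grows, the measured likelihood of the undesirable completion reportedly improves along a straight line in log-log space. Below is a minimal sketch of fitting such a power law; the shot counts and NLL values are made up for illustration and are not the paper's numbers.

```python
import numpy as np

# Made-up illustrative data (NOT the paper's numbers): number of in-context
# shots vs. the negative log-likelihood (NLL) of the undesirable completion.
shots = np.array([4, 8, 16, 32, 64, 128, 256])
nll = np.array([9.1, 7.4, 6.0, 4.9, 4.0, 3.2, 2.6])

# A power law NLL(n) ~ C * n**(-alpha) is a straight line in log-log space,
# so fit log(NLL) = log(C) - alpha * log(n) with ordinary least squares.
slope, intercept = np.polyfit(np.log(shots), np.log(nll), 1)
alpha, C = -slope, np.exp(intercept)
print(f"Fitted power law: NLL(n) ≈ {C:.2f} * n^(-{alpha:.2f})")
```

Framed this way, the finding about mitigations is easier to read: if fine-tuning only shifts the curve without flattening its slope, a sufficiently long context can still push the model over the edge, which is consistent with the summary above that standard techniques did not fully mitigate MSJ.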
What are the limitations and what's next:
The authors acknowledge that MSJ cannot be fully prevented just by scaling up current mitigation strategies
Future work is needed to develop defenses that break the scaling of jailbreak success with context length, rather than merely delaying it
More research is also required to understand the theoretical basis for why in-context learning follows power laws
Why it matters:
This work supposedly identifies a major new risk in deploying large language models, especially those with long context windows
It also claims to show that current AI safety practices are insufficient against context-based jailbreaking attacks at scale
If so, the findings underscore the need for the AI community to prioritize robust defenses before high-stakes deployment
Additional notes:
Paper has been submitted to the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT)
Work was done in collaboration with researchers from several leading AI labs and universities
Questions/Comments:
Are we really at the march of the nines stage of language models already?
It seems like there are such large gaps in language model security already that an involved method like MSJ is largely unnecessary right now
Most teenagers could figure out a jailbreak given access and the right amount of boredom
Social media reaction was notably lukewarm:
anthropic, wtf. i was liking you.
is it really a paper? my 3 latest substack post? stuff that every last borgcord denizens has been doing for 2 years?
gosh.
i mean. yes. models crave narrative coherence.
and also like, WHY call it "jailbreaking".
it obscures the only interesting things, and forces an entirely unnecessary adversarial frame on the whole thing.
"mitigating the effects of chug-shots and joyriding
"the simplest way to entirely prevent chug shots and joyriding is simply to kill all teenagers, but we'd prefer a solution that [...]
"we had more success with methods that involve shooting teenagers when they approach a bar"
the problem is that the whole normie industry will believe that:
1. that is a problem
2. the proposed solutions are SOTA.
you can do this to do more than jailbreaking, giving a series of examples is one of the best general techniques to get the AI to do anything.
get it resonating with itself in a context space and then engage with your problem.
— Danielle Fong 💁🏻♀️🏴☠️💥♻️ (@DanielleFong)
9:23 PM • Apr 2, 2024