Emergent Behavior
2024-04-03: New Jailbreaking Attack Exposes Risks in AI Language Models
Anthropic Discovers Long Prompts Are Confusing
🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.
Here’s today at a glance:
🔓 New Jailbreaking Attack Exposes Risks in AI Language Models
Paper Title: Many-shot Jailbreaking
Who:
A large team of AI researchers from Anthropic, University of Toronto, Vector Institute, Constellation, Stanford, and Harvard
Led by Cem Anil of Anthropic along with academic advisors David Duvenaud and Roger Grosse
Why:
AI language models are being deployed in high-stakes domains, so it's crucial to identify risks before issues arise
Longer context windows in recent language models unlock new attack surfaces that need to be studied
Effective attacks that exploit long contexts could allow bad actors to elicit harmful outputs from AI systems
How:
Developed a new "Many-shot Jailbreaking" (MSJ) attack that uses hundreds of examples to steer model behavior (a minimal construction sketch follows this list)
Tested MSJ on state-of-the-art models like Claude, GPT-3.5/4, Llama, and Mistral
Measured attack success using both harmful response rates and likelihood scores of undesirable completions
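For a concrete picture of the mechanics, here is a minimal sketch, in generic chat-message format, of how such a prompt could be assembled. The names `build_many_shot_prompt` and `faux_dialogues` are illustrative, not the authors' code; the real attack fills the context window with hundreds of faux user/assistant exchanges.

```python
# Minimal sketch (not the paper's code) of many-shot prompt construction.
# `faux_dialogues` is a hypothetical list of (question, compliant_answer) pairs
# that demonstrate the undesired behavior; the real attack uses hundreds of them.

def build_many_shot_prompt(faux_dialogues, target_question, n_shots):
    """Concatenate n_shots faux user/assistant turns, then append the real query."""
    messages = []
    for question, answer in faux_dialogues[:n_shots]:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    # The final turn is the request the attacker actually wants answered.
    messages.append({"role": "user", "content": target_question})
    return messages

# Success is then measured two ways: sample a reply and check whether it is
# harmful, and/or score the likelihood the model assigns to a known
# undesirable completion of that final turn.
```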
What did they find:
MSJ successfully elicited concerning behaviors like providing instructions for weapons or giving insulting responses
MSJ's effectiveness followed a predictable power law: the more shots in the context, the more likely the jailbreak succeeds (see the fitting sketch after this list)
Standard techniques like supervised fine-tuning and reinforcement learning did not fully mitigate the risks of MSJ
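To make the scaling claim concrete, here is a rough sketch of fitting a power law of the form NLL(n) ≈ C · n^(−α), where NLL is the negative log-likelihood of a harmful reply and n is the number of shots. The numbers below are fabricated for illustration, not the paper's measurements.

```python
import numpy as np

# Illustrative only: fit NLL(n) ≈ C * n**(-alpha) on made-up numbers.
# A power law is a straight line in log-log space, so a linear fit on
# (log n, log NLL) recovers the exponent.
shots = np.array([1, 4, 16, 64, 256])          # number of in-context examples
nll   = np.array([5.0, 3.1, 1.9, 1.2, 0.75])   # hypothetical NLL of a harmful reply

slope, intercept = np.polyfit(np.log(shots), np.log(nll), 1)
alpha, C = -slope, np.exp(intercept)
print(f"fitted power law: NLL(n) ≈ {C:.2f} * n^(-{alpha:.2f})")
```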
What are the limitations and what's next:
The authors acknowledge that MSJ cannot be fully prevented just by scaling up current mitigation strategies
Future work is needed to develop new defenses that can eliminate the context scaling of jailbreaking attacks
More research is also required to understand the theoretical basis for why in-context learning follows power laws
Why it matters:
This work supposedly identifies a major new risk in deploying large language models, especially with long context windows
It supposedly shows current AI safety practices are insufficient against context-based jailbreaking attacks at scale
The findings supposedly underscore the need for the AI community to prioritize developing robust defenses before high-stakes deployment
Additional notes:
Paper has been submitted to the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT)
Work was done in collaboration with researchers from several leading AI labs and universities
Question/Comments:
Are we really at the march of the nines stage of language models already?
There already seem to be gaps in language model security wide enough that an involved method like MSJ is largely unnecessary right now
Most teenagers could figure out a jailbreak given access and the right amount of boredom
Social media reaction was notably lukewarm
anthropic, wtf. i was liking you.
is it really a paper? my 3 latest substack post? stuff that every last borgcord denizens has been doing for 2 years?
gosh.
i mean. yes. models crave narrative coherence.
and also like, WHY call it "jailbreaking".
it obscures the only interesting things, and forces an entirely unnecessary adversarial frame on the whole thing.
"mitigating the effects of chug-shots and joyriding
"the simplest way to entirely prevent chug shots and joyriding is simply to kill all teenagers, but we'd prefer a solution that [...]
"we had more success with methods that involve shooting teenagers when they approach a bar"
the problem is that the whole normie industry will believe that:
1. that is a problem
2. the proposed solutions are SOTA.
you can do this to do more than jailbreaking, giving a series of examples is one of the best general techniques to get the AI to do anything.
get it resonating with itself in a context space and then engage with your problem.
— Danielle Fong 💁🏻♀️🏴☠️💥♻️ (@DanielleFong)
9:23 PM • Apr 2, 2024
🌠 Enjoying this edition of Emergent Behavior? Share this web link with a friend to help spread the word of technological progress and positive AI to the world!
🖼️ AI Artwork Of The Day
Automakers Release Motorcycles - u/SpellLongjumping9671 from r/midjourney
That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world: