2024-05-06: A GPT-4 Killer in the Wild
Here’s today at a glance:
🤔 A GPT-4 Killer in the Wild
The week began with a new and mysterious chatbot making an appearance on lmsys.org (Large Model Systems Organization), a blind taste-testing site for AI language models:
uh.... gpt2-chatbot just solved an International Math Olympiad (IMO) problem in one-shot
the IMO is insanely hard. only the FOUR best math students in the USA get to compete
prompt + its thoughts 🧵
— Andrew Gao (@itsandrewgao)
9:20 PM • Apr 29, 2024
There were some who pointed out that this was still mediocre performance:
Well... two problems:
(1) SIX best math students in the USA get to compete.
(2) If I were an IMO judge, the solution would receive a 3 out of 7. A stricter judge might give a 2. A more generous judge might give a 4, but I would protest anything more than that.
Context:… twitter.com/i/web/status/1…
— Hieu Pham (@hyhieu226)
10:43 PM • Apr 29, 2024
The model was extremely capable when tested by folks who had a lot of experience testing models:
There is a mysterious new model called gpt2-chatbot accessible from a major LLM benchmarking site. No one knows who made it or what it is, but I have been playing with it a little and it appears to be in the same rough ability level as GPT-4. A mysterious GPT-4 class model? Neat!
— Ethan Mollick (@emollick)
4:57 PM • Apr 29, 2024
Meanwhile, the capability increases were obvious:
I asked GPT-4 Turbo and gpt2-chatbot to make a game using JS in a single HTML document. These are the results:
The first one is 4 Turbo, the second one is gpt2
— Ángel (e/acc) (@Angaisb_)
11:26 PM • Apr 29, 2024
It could break out of its training in context:
I found one task that gpt2-chatbot is better than all other models, and it's completely useless.
Early but rapid ascent on the A+B-1 question by @Kangwook_Lee
— Dimitris Papailiopoulos (@DimitrisPapail)
5:11 PM • Apr 29, 2024
The theory on why this happens, in short: given a short prompt, the model pattern-matches to a memorized task; given a long prompt, it tries to figure out what’s going on:
🧵Let me explain why the early ascent phenomenon occurs🔥
We must first understand that in-context learning exhibits two distinct modes.
When given samples from a novel task, the model actually learns the pattern from the examples.
We call this mode the "task learning" mode. twitter.com/i/web/status/1…
— Kangwook Lee (@Kangwook_Lee)
5:28 PM • Mar 12, 2024
It was better at code manipulation, as judged by a founder building a code generator:
Can confirm gpt2-chatbot is definitely better at complex code manipulation tasks than Claude Opus or the latest GPT4
Did better on all the coding prompts we use to test new models
The vibes are deffs there 👀
— Chase (@ChaseMc67)
5:55 PM • Apr 29, 2024
It was too good:
Gpt2 drawing unicorns vs Claude opus
Whatever this model is, its really good.
— Sully (@SullyOmarr)
6:20 PM • Apr 29, 2024
Perhaps because it had perfectly memorized the answers:
>it can draw a unicron
the unicorn:
— kache (dingboard.com) (@yacineMTB)
7:10 PM • Apr 29, 2024
Theories abounded. Was it a reasoning and planning agent bolted onto the original, now open-sourced GPT-2?
my guess is this mysterious 'gpt2-chatbot' is literally OpenAI's gpt-2 from 2019 finetuned with modern assistant datasets.
in which case that means their original pre-training is still amazing and better than everyone else's 4 years later
— albs — 3/staccs (@albfresco)
3:15 PM • Apr 29, 2024
Ok now I’m wondering
Maybe they bolted a new, non-LLM reasoning model onto a GPT-2 trained entirely for the purpose of domain knowledge compression
Would explain the name, the domain depth (the main consistent observation) and the overall quality
— gfodor.id (@gfodor)
1:41 PM • Apr 30, 2024
The pinnacle of technical discussion that is 4chan weighed in:
The chan on “gpt2-chatbot”
— sandrone (@kosenjuu)
2:28 PM • Apr 29, 2024
Sam Altman played into the whole furor:
Even his edits were scrutinized: gpt2 or gpt-2?
The prompt was dug up:
gpt2-chatbot's system prompt, leaked via prompt injection by @BahouPrompts
this allegedly us tells that it is a variant of GPT-4
BUT
it could be a lie intentionally added by the developers to fool us
OR
gpt2-chatbot could have hallucinated that
so it's not conclusive.
— Andrew Gao (@itsandrewgao)
9:53 PM • Apr 29, 2024
Once chased down, lmsys clarified that (a) it was a new model and (b) it had been secretly introduced for testing in partnership with the developer. Is lmsys getting paid for it? Unclear.
Attention intensified:
Nice knowing ya "gpt2-chatbot." We'll meet again in another iteration.
— dicnunz (@dicnunz)
6:07 PM • Apr 30, 2024
And soon, the fix was in.
gpt2-chatbot was just turned OFFLINE
I was just using it half an hour ago!
@shaunralston for the find #gpt2 @OpenAI
— Andrew Gao (@itsandrewgao)
6:20 PM • Apr 30, 2024
Bye-bye gpt2-chatbot, we hardly knew ye. lmsys updated its policies to disclose:
What have we learned from this?
There are monsters out there—undisclosed groups working on projects with high capability
Capability increases are easier than we thought: this was likely a small organization, since large providers run ethics reviews prior to release, and one of the core underpinnings of AI safety is that humans have a right to know about the AI they are interacting with
Benchmarking organizations deserve greater scrutiny
Axios later (and uselessly) reported: “Speaking on Wednesday at Harvard University, Altman told an audience that the mystery bot is not GPT-4.5, what many see as the likely next major update to GPT-4.”
Ah, the classic confirming non-confirmation.
Then, on Sunday,
At this point, who knows anymore? More drama shall follow.
🖼️ AI Artwork Of The Day
Crustacean so hot right now - u/powderedminidonut from r/midjourney
That’s it for today!