Emergent Behavior
Uncanny Valley Recession

AI Waifus and Husbandos are on the way

Prakash Ate-A-Pi
February 28, 2024

Here’s today at a glance:

The Pictures Learn To Talk
Things happen
AI artwork of the day

📷 The Pictures Learn To Talk

An AI team at Alibaba has hit a milestone: generating realistic video of a person talking from just a single portrait image and an audio clip.

Let’s break down what they’ve done:

You send the model:
- a single portrait image of a face
- an audio file of one person talking or singing
The model generates a video of:
- the face in the image talking or singing
- near-perfect lip sync in multiple languages
- convincing emotion across the eyes, eyebrows, jaw, cheeks, and facial angles
- natural face movement, including hair follow-through

It’s an amazing, state-of-the-art piece of technology that will change the world the moment it’s fully released. The implications for YouTube, TikTok, Instagram, customer service videos, and the like are vast. Unreal’s MetaHuman just got smoked. It is also likely to compete with older (lol, a year old at most) AI avatar players like HeyGen and Synthesia.

How was it built?

Really an amalgamation of other techniques with some innovative thinking and hacks sprinkled in (like all AI?):

they used an image diffusion model where each new frame is generated from the previous frame and the audio signal
but the mapping from audio to image is ambiguous: many different facial expressions can go with the same sound
this creates facial distortions and instability
they fix this with two new hyperparameters:
- a) a head speed controller
- b) a face region controller
they ensure character consistency by reusing ReferenceNet (a spatial attention block) from their earlier AnimateAnyone work to build an attention block that keeps the face consistent from frame to frame
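
To make the loop above concrete, here is a deliberately tiny sketch in plain Python. It is not Alibaba's code and there is no real diffusion model in it: `toy_denoiser`, `Controls`, `head_speed`, and `face_region` are all hypothetical names made up for illustration. The only thing it shows is the shape of the idea: each frame is produced from the previous frame plus an audio feature, while a speed value and a face mask limit how much and where the image is allowed to change.

```python
from dataclasses import dataclass
from typing import List, Sequence

Frame = List[List[float]]  # toy grayscale image: rows of pixel values


@dataclass
class Controls:
    """Hypothetical stand-ins for the two stabilizing signals."""
    head_speed: float             # 0.0 = frozen, 1.0 = follow the audio fully
    face_region: List[List[int]]  # 1 where pixels may change, 0 elsewhere


def toy_denoiser(prev: Frame, audio_feature: float, controls: Controls) -> Frame:
    """Placeholder for the real model: pulls masked pixels toward the
    audio feature, scaled by head_speed. Returns a new frame."""
    out: Frame = []
    for y, row in enumerate(prev):
        new_row = []
        for x, px in enumerate(row):
            if controls.face_region[y][x]:
                px = px + controls.head_speed * (audio_feature - px)
            new_row.append(px)
        out.append(new_row)
    return out


def generate_video(portrait: Frame,
                   audio_features: Sequence[float],
                   controls: Controls) -> List[Frame]:
    """Autoregressive loop: every frame depends on the last frame and the audio."""
    frames: List[Frame] = []
    prev = portrait
    for feature in audio_features:
        prev = toy_denoiser(prev, feature, controls)
        frames.append(prev)
    return frames


if __name__ == "__main__":
    portrait = [[0.0, 0.0], [0.0, 0.0]]
    controls = Controls(head_speed=0.5, face_region=[[1, 0], [0, 0]])
    video = generate_video(portrait, [1.0, 1.0], controls)
    # Only the masked pixel moves: 0.5 in the first frame, 0.75 in the second.
    print(video[0][0][0], video[1][0][0])
```

Lowering `head_speed` or shrinking `face_region` makes the toy output change less between frames, which is the stabilizing role the two controllers play in the description above.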

Older methods used things like blendshapes, dividing the face into hundreds of little polygons, and tracking them (this is what the iPhone’s ARKit does). But these methods limited how expressive the face could be: your face is more than just a few hundred crude polygons.

Why it’s cool

Besides the actual visual aspects:

Emotion seems real
The emotion was captured from the audio
This implies that the audio of someone speaking has enough information to infer how the muscles in their face must be moving to produce both the sound and the feeling heard in the audio

That’s literally mind-blowing. Also, it was a relatively small training run:

250 hours of video
150 million images

This is hardly all the YouTube clips in the world. There is scope to scale this up 1000x.
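
For a sense of what 1000x would mean, here is a quick back-of-the-envelope calculation. The 250 hours and the 1000x factor come from the text above; the function names are just illustrative.

```python
TRAINING_HOURS = 250   # video hours quoted above
SCALE_FACTOR = 1000    # the "1000x" quoted above


def scaled_hours(hours: float, factor: float) -> float:
    """Total footage after scaling the dataset by `factor`."""
    return hours * factor


def hours_to_years(hours: float) -> float:
    """Convert hours of footage to years of continuous playback."""
    return hours / (24 * 365)


total = scaled_hours(TRAINING_HOURS, SCALE_FACTOR)
print(f"{total:,.0f} hours, about {hours_to_years(total):.1f} years of continuous footage")
# prints: 250,000 hours, about 28.5 years of continuous footage
```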

Notably

They probably won’t release this to the public, same as their previous work, AnimateAnyone, which allowed you to make any human image dance or walk.
AnimateAnyone was never pushed to production, so either there is a safety concern or the drawbacks (cost, latency, etc.) have not been figured out.
Open source will get there eventually, so within 12–18 months, this should be widely available.
AI waifus/husbandos here we come.

🗞️ Things Happen

Ideogram, the image generation startup founded by ex-Googlers, launches its 1.0 model with “state-of-the-art text rendering, unprecedented photorealism, and prompt adherence”. It’s OK?

Google DeepMind published Concordia, “a library for building agents that leverage language models to simulate human behavior with a high degree of detail and realism. The agents can reason, plan, and communicate in natural language, interacting with each other in grounded physical, social, or digital environments.” It’s basically a Role Playing Game with an AI Game Master.

Sarah Guo at Conviction VC has a call for AI startups, where they outline the sectors they’d like to see ideas in. It’s interesting, mainly because one wonders at the market size for all of them over a 10-year period.

🖼️ AI Artwork Of The Day

All work and no play makes Jack a dull boy - u/AdolfGomez from r/MidJourney

That’s it for today!