Sora Details
Tim Brooks and Bill Peebles from OpenAI's Sora team gave a talk at AGI House over the weekend, and it was… interesting.
Who:
Tim Brooks and Bill Peebles from OpenAI, creators of Sora
Highlights:
On how to consider Sora in context: “This is the GPT-1 of video”
On the goal: "The overarching goal was really always to get to 1080p, at least 30 seconds. Video generation was stuck in the rut of four seconds, like GIF generation."
On Sora's 3D consistency: "We wanted to come up with a really simple and scalable framework that completely eschewed any kind of hard-coded inductive biases from humans about physics."
On Sora's learning humanity: "As we continue to scale this paradigm, we think eventually it's going to have to model how humans think. The only way you can generate truly realistic video with truly realistic sequences of actions is if you have an internal model of how all objects, humans, etc., environments work."
On limitations of Internet data: “I think that whatever data we have will be enough to get to AGI”
On simulating everything: "Every single laptop we use, every operating system we use has its own set of physics, it has its own set of entities and objects and rules. And Sora can learn from everything. It doesn't just have to be a real-world physics simulator."
On Sora's world model: "Sora's learned a lot. In addition to just being able to generate content, it's actually learned a lot of intelligence about the physical world just from training on videos."
What:
Introducing Sora, a new AI system for generating high-definition, minute-long videos
Sora can generate complex, 3D-consistent scenes with object permanence
Capable of zero-shot video-to-video style transfer and interpolation between videos
Currently a research project, not yet a public product, with access given to select artists
Why:
Goal is to significantly advance the field of video AI and unlock new creative possibilities
Belief that scaling video models is on the critical path to artificial general intelligence (AGI)
Aims to be a generalist model for visual data, learning from videos across many domains
How:
Uses a simple, scalable framework based on transformers and diffusion, without explicit inductive biases
Converts videos into "space-time patches" that serve as the visual equivalent of tokens in language models (see the sketch after this list)
Trained on large datasets of videos at multiple resolutions and aspect ratios
Performance improves significantly as compute and data are scaled up
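To make the "space-time patches" idea concrete, here is a minimal, illustrative sketch in plain NumPy. It operates directly on pixels with made-up patch sizes; the technical report describes doing this on a compressed latent representation of the video rather than raw pixels, so treat this as a toy, not the actual pipeline.

```python
import numpy as np

def spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video of shape (T, H, W, C) into flattened space-time patches.

    Each patch is a pt x ph x pw x C cube of pixels, the visual analogue of a
    text token. The patch sizes here are illustrative, not Sora's real values.
    """
    T, H, W, C = video.shape
    # Trim so the video divides evenly into patches (a real pipeline would pad).
    video = video[: T - T % pt, : H - H % ph, : W - W % pw]
    t, h, w = video.shape[0] // pt, video.shape[1] // ph, video.shape[2] // pw
    patches = (
        video.reshape(t, pt, h, ph, w, pw, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # -> (t, h, w, pt, ph, pw, C)
             .reshape(t * h * w, pt * ph * pw * C)
    )
    return patches  # one row per "token"

# Example: a 64-frame 256x256 RGB clip becomes 4096 patch tokens of size 3072.
clip = np.random.rand(64, 256, 256, 3).astype(np.float32)
print(spacetime_patches(clip).shape)  # (4096, 3072)
```

Because any resolution, aspect ratio, or duration reduces to the same kind of token sequence, the same transformer can train on all of it, which is the point this list is making.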
Limitations and Next Steps:
Sora still struggles with complex physical interactions and some object permanence
Not yet able to fully model rich interactions between agents in a scene
Further scaling and refinement needed to achieve photorealistic generation of arbitrary videos
Team is gathering feedback from artists and red teamers before wider release
Why It Matters:
Represents a major leap in video AI, enabling coherent long-form generation
Could revolutionize content creation pipelines for entertainment and media
May lead to AI systems with deeper understanding of physical and social dynamics
Important stepping stone on the path to AGI
Additional Notes:
Presentation delivered to a large, enthusiastic crowd at AGI House
Many compelling generation examples shown, from stylized 3D worlds to Minecraft textures
Sense that this is just the beginning and video AI will improve rapidly in coming years
Here is one thing about AGI House you should know:
We honor the greatest AI founders and researchers of our time!
Tim Brooks @_tim_brooks and Bill Peebles @billpeeb from the OpenAI Sora team @ AGI House keynote
— AGI House (@agihouse_org)
4:22 AM • Apr 7, 2024
Transcript
Sora Talk at AGI House
[00:00:00]Moderator: Here's one thing about AGI House. We honor the people like you guys. That's why we have you here. That's why you're here. So, without further ado, let's pass it to Tim.
[00:00:18]Tim Brooks: Awesome, what a big, fun crowd. Well, I'm Tim, this is Bill, and we made Sora at OpenAI, together with a team of amazing collaborators. We're excited to tell you a bit about it today. We'll talk a bit about, at a high level, what it does, some of the opportunities it has to impact content creation, some of the technology behind it, as well as why this is an important step on the path to AGI.
[00:00:46]Tim Brooks: So, without further ado, here is a Sora generated video. This one is really special to us because, It's HD and a minute long. And that was a big goal of ours when we were trying to figure out like what would really make a leap forward in video generation. We want 1080p videos that are a minute long. This video does that.
[00:01:04]Tim Brooks: You can see it has a lot of complexity too, like in the reflections and the shadows. One really interesting point, that sign, that blue sign, she's about to walk in front of it, and after she passes, the sign still exists afterwards. That's a really hard problem for video generation, to get this type of object permanence and consistency over a long duration.
[00:01:27]Tim Brooks: And it can do a number of different styles too, so like this is a papercraft world that it kind of imagined. So that's really cool. And, oh, and also it can learn about a whole 3D space. So here the camera moves through 3D as people are moving, but it really understands the geometry and the physical complexities of the world.
[00:01:56]Tim Brooks: So Sora's learned a lot. In addition to just being able to generate content, it's actually learned a lot of intelligence about the physical world just from training on videos. And now we'll talk a bit about some of the opportunities with video generation for revolutionizing content creation. So as Tim alluded to, we're really excited about Sora, not only because we view it as being on the critical path towards AGI, but also in the short term for what it's going to do.
[00:02:25]Tim Brooks: So this is one sample we like a lot. So the prompt in the bottom left here is, oops. A movie trailer featuring the adventures of a 30-year-old spaceman. The hardest part of doing video, by the way, is always just getting carbon to work with it.
[00:02:46]Tim Brooks: There we go. Alright, now we're here. So, what's cool about this sample in particular, is that this astronaut is persisting across multiple shots which are all generated by Sora. So you know, we didn't stitch this together, we didn't have to do a bunch of outtakes and then create a composite shot at the end.
[00:03:03]Tim Brooks: Sora decides where it wants to change the camera, but it decides, you know, that it's going to keep that same astronaut consistent across those shots. Likewise, we think there's a lot of implications for special effects. So, this is one of our favorite samples, too. An alien blending in naturally in New York City.
[00:03:16]Tim Brooks: Paranoia thriller style, 35 millimeter film. And already you can see that the model is able to create, you know, these very, like, fantastical effects. Uh, which would normally be very expensive in traditional CGI pipelines for Hollywood. So, there's a lot of implications here for what this technology is going to bring short term.
[00:03:34]Tim Brooks: Of course, uh, we can do other kinds of effects too, so this is more of a sci fi scene, there's a scuba diver discovering a hidden futuristic, futuristic shipwreck with cybernetic marine life and advanced alien technology.
[00:03:50]Tim Brooks: As someone who's, you know, seen so much incredible content from people on the internet who don't necessarily have access to tools like Sora to bring their visions to life, you know, they come up with like cool philosophy posts on reddit or something. It's really exciting to think about what people are going to do with this technology.
[00:04:07]Tim Brooks: Of course, it can do more than just, like, photorealistic style, you can also do animated content. So we're going to go with the otter. My favorite part of this one is the, the misspelled "otter." Gives it a little bit of charm.
[00:04:24]Tim Brooks: And I think another example of just how cool this technology is, is when we start to think about scenes which would be very difficult to shoot with traditional Hollywood kind of infrastructure. So the prompt here is, the Bloom Zoo shop in New York City is both a jewelry store and zoo. Sabertooth tigers with diamond and gold adornments, turtles with glistening emerald shells, etc.
[00:04:45]Tim Brooks: And what I love about this shot is, you know, it's photorealistic, but this is something that would be incredibly hard to accomplish with traditional tools that they have in Hollywood today. You know, this kind of shot would of course require CGI, it'd be very difficult to get real-world animals into these kinds of scenes, but with Sora it's pretty trivial, and you can just do it with a prompt.
[00:05:05]Tim Brooks: So I'll hand it over to Tim to chat a bit about how we're working with artists today with Sora to see what they're able to do. Yeah, so pretty recently we've given access to a small pool of artists. And maybe I'll even take a step back. This isn't yet a product or something that is available to a lot of people.
[00:05:24]Tim Brooks: It's not in ChatGPT or anything, but this is research that we've done. And we think that the best way to figure out how this technology will be valuable and also how to make it safe is to engage with people externally. So that's why we came out with this announcement. And when we came out with the announcement, we started with small teams of red teamers who helped with the safety work.
[00:05:44]Tim Brooks: Uh, as well as artists and people who will use this technology. So Shy Kids is one of the artists that we work with, and I really like this quote from them: As great as Sora is at generating things that appear real, what excites us is the ability to make things that are totally surreal. And I think that's really cool, because when you immediately think about, oh, generating videos, you think of all these existing uses of videos that we know of in our lives, and you quickly think about those.
[00:06:13]Tim Brooks: Oh, maybe stock videos or existing videos. But what's really, really exciting to me is what totally new things are people going to do. What completely new forms of media and entertainment and just new experiences for people that we've never seen before are going to be enabled by, by Sora and by future versions of video and media generation technology.
[00:06:32]Tim Brooks: And now, I want to show this fun video that, uh, ShyKids made using Sora when we gave access to them. Okay, it has audio. Unfortunately, I guess we don't have that hooked up.
[00:06:51]Tim Brooks: Alright,
[00:06:59]Tim Brooks: well, it's this cute plot about this guy with the balloon head. You should really go and check it out. We came out with this blog post, Sora First Impressions, and we have videos from a number of artists that we've given access to. And there's this really cute monologue, too, where this guy's talking about, like, you know, life from a different perspective of me as a guy with a balloon head, right?
[00:07:24]Tim Brooks: And this is just awesome. And so creative. And the other artists we've given access to have done really creative and totally different things from this too. Like, the way each artist uses this is just so different from each other artist. Which is really exciting because it says a bit about the breadth of ways.
[00:07:40]Tim Brooks: That you can use this technology. But this is just really fun, and there are so many people with such brilliant ideas, as Bill was talking about, that maybe it would be really hard to do things like this or to make their film, or their thing that's not a film, that's totally new and different. Um, and hopefully this technology will really democratize content creation in the long run and enable so many more people with creative ideas to create content.
[00:08:06]Tim Brooks: to be able to bring those to life and share them with people.
[00:08:14]Tim Brooks: Alright, I'm going to talk a bit about some of the technology behind Sora. So, I'll talk about it from the perspective of language models. And what has made them work so well is the ability to scale, and kind of this bitter lesson that methods that improve with scale in the long run are the methods that will win out as you increase compute.
[00:08:35]Tim Brooks: Because over time you have more and more compute, and if methods utilize that well, then they will get better and better. And language models are able to do that in part because they take all different forms of text. You take math, and code, and prose, and whatever is out there, and you turn it all into this universal language of tokens.
[00:08:56]Tim Brooks: And then you train these big transformer models on all these different types of tokens. This is kind of just a universal model of text data. And by training on this vast array of different types of text, you learn these really generalist models of language. You can do all these things, right? You can use ChatGPT or whatever your favorite language model is to do all different kinds of tasks, and it has such a breadth of knowledge that it's learned from the combination of this vast
[00:09:27]Tim Brooks: variety of data. And we want to do the same thing for visual data. That's exactly what we're doing with Sora. So we take vertical videos and images and square images, low resolution, high resolution, wide aspect ratio, and we turn them into patches. And a patch is this, like, little cube in space-time. So, you can imagine a stack of frames.
[00:09:50]Tim Brooks: A video is like a stack of images, right? They're all frames. And we have this volume of pixels, and then we take these little cubes from inside. And you can do that on any volume of pixels. Whether that's a high resolution image, a low resolution image, regardless of the aspect ratio, long videos, short videos, you turn all of them into these space time patches.
[00:10:12]Tim Brooks: And those are our equivalent of tokens. And then we train transformers on these space time patches. And transformers are really scalable. And that allows us to think of this problem in the same way that people think about language models. Of how do we get really good at scaling them and making methods such that as we increase the compute, as we increase the data, they just get better and better.
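To ground what is being described here, below is a minimal, hypothetical sketch of a diffusion-transformer training step over space-time patch tokens, written in PyTorch. The tiny transformer, the linear noise schedule, and the omission of text conditioning are simplifying assumptions; Sora's actual architecture, sizes, and schedule are not public.

```python
import torch
import torch.nn as nn

class TinyVideoDiT(nn.Module):
    """Toy stand-in for a transformer that denoises space-time patch tokens."""
    def __init__(self, patch_dim=3072, width=512, depth=4, heads=8):
        super().__init__()
        self.proj_in = nn.Linear(patch_dim, width)
        self.t_embed = nn.Linear(1, width)  # diffusion-timestep conditioning
        layer = nn.TransformerEncoderLayer(width, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.proj_out = nn.Linear(width, patch_dim)

    def forward(self, noisy_tokens, t):
        # noisy_tokens: (batch, num_patches, patch_dim); t: (batch,) noise levels in [0, 1)
        h = self.proj_in(noisy_tokens) + self.t_embed(t[:, None])[:, None, :]
        return self.proj_out(self.blocks(h))  # predict the noise that was added

def training_step(model, opt, tokens):
    """One denoising step: corrupt clean patch tokens, regress the added noise."""
    b = tokens.shape[0]
    t = torch.rand(b)                               # random noise level per sample
    noise = torch.randn_like(tokens)
    alpha = (1.0 - t).view(b, 1, 1)                 # toy linear schedule (assumption)
    noisy = alpha * tokens + (1.0 - alpha) * noise
    loss = nn.functional.mse_loss(model(noisy, t), noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The scaling argument in the talk is that nothing in this loop cares about resolution, aspect ratio, or duration; longer or larger videos just mean more tokens passing through the same transformer.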
[00:10:39]Tim Brooks: Training on multiple aspect ratios also allows us to generate with multiple aspect ratios. There we go. So here's the same prompt. And you can generate vertical and square and horizontal. That's also nice: in addition to the fact that it allows you to use more data, which is really nice (you want to use all the data in its native format as it exists),
[00:11:01]Tim Brooks: it also gives you more diverse ways to use the model. So I actually think, like, vertical videos are really nice. Like, we look at content all the time on our phones, right? So it's nice to actually be able to generate vertical and horizontal and a variety of things. And we also have some zero-shot video-to-video capabilities.
[00:11:20]Tim Brooks: So this uses a method called SDEdit, um, which is a method that's commonly used with diffusion. We can apply that. Our model uses diffusion, which means that it denoises the video: starting from noise, it removes the noise in order to create the video. So we use this method called SDEdit and apply it.
[00:11:37]Tim Brooks: And this allows us to change an input video. So the upper left, it's also sort of generated, but it could be a real image. And then we say, rewrite the video in pixel art style, put the video in space with the rainbow road, uh, or change the video to a medieval theme. And you can see that it edits the video, but it keeps the structure the same.
[00:11:56]Tim Brooks: So in a second, we'll go through a tunnel. It interprets that tunnel in all these different ways; especially this medieval one is pretty amazing, right? Because the model is also intelligent. So it's not just changing something, you know, shallow about it, but it's like, hey, well, it's medieval. We don't really have a car, so I'm going to make a horse carriage.
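SDEdit, the method named above, edits by partially noising the source and then running the reverse diffusion process under the new prompt, so the input's structure survives while its appearance changes. A rough sketch of that idea, assuming a hypothetical `model(x, t, cond)` noise predictor and the same toy linear schedule as the earlier sketch (not Sora's actual interface):

```python
import torch

@torch.no_grad()
def sdedit_video(model, video_tokens, new_prompt_emb, strength=0.6, steps=50):
    """SDEdit-style editing: noise the source part-way, denoise under a new prompt.

    strength=0 leaves the input untouched; strength=1 ignores it entirely.
    """
    t_start = min(strength, 0.98)        # cap below 1 to keep the toy update stable
    x = (1 - t_start) * video_tokens + t_start * torch.randn_like(video_tokens)
    ts = torch.linspace(t_start, 0.0, steps)
    for i in range(steps - 1):
        t, t_next = float(ts[i]), float(ts[i + 1])
        eps = model(x, torch.full((x.shape[0],), t), new_prompt_emb)
        x0 = (x - t * eps) / (1 - t)     # current estimate of the clean video
        x = (1 - t_next) * x0 + t_next * eps
    return x
```

The `strength` knob trades structure preservation against how strongly the new prompt takes over, which is why the tunnel stays a tunnel while the world around it turns medieval.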
[00:12:17]Tim Brooks: And another fun capability that the model has is, uh, is to interpolate between videos. So here we have, you know, two different creatures, and this video in the middle starts at the one on the left and it's going to end at the one on the right. And it's able to do it in this really seamless way.
[00:12:43]Tim Brooks: So I think something that, you know, the past slide and this slide really point out is that there are so many unique and creative things you can potentially do with these models. And similar to how, you know, when we first had language models, obviously people were like, Oh, well, like you can use it. For writing, right?
[00:13:01]Tim Brooks: And like, okay, yes, you can. But there are so many other things you can do with language models. And even now, every day, people are coming up with some creative, new, cool thing you can do with a language model. The same thing's going to happen for these visual models; there are so many creative, interesting ways in which we can use them, and I think we're only starting to scratch the surface of what we can do with them.
[00:13:21]Tim Brooks: Here's one I really love. So there's a video of a drone on the left, and this is a butterfly underwater on the right. And, uh, we're gonna interpolate between the two.
[00:13:47]Tim Brooks: And some of the nuance it gets, like for example, that it makes the Coliseum in the middle, as it's going, slowly start to decay actually and go underwater. Like it's really spectacular, some of the nuance that it gets right. And here's one that's really cool too, because it's like, how can you possibly go from this kind of Mediterranean landscape to this gingerbread house in a way that is
[00:14:12]Tim Brooks: consistent with physics in the 3D world, and it comes up with this really unique solution to do it: it's actually occluded by the building, and it kind of goes behind and you start to see this gingerbread house.
[00:14:37]Tim Brooks: So I encourage you, if you haven't: in addition to our main blog post, we also came out with a technical report. The technical report has these examples, and it has some other cool examples that we don't have in these slides, too. Again, I think it's really scratching the surface of what we could do with these models, but check that out if you haven't.
[00:14:53]Tim Brooks: Uh, there are some other fun things you can do, like extending videos forward or backwards. I think we have here one example where this is an image. We generated this one with DALL·E 3, and then we're going to animate this image using Sora.
[00:15:20]Tim Brooks: All right, now I'm going to pass it off to Bill to talk a bit about why this is important on the path to AGI. All right. So, of course, everyone's very bullish on the role that LLMs are going to play in getting to AGI. But we believe that video models are on the critical path to it. And concretely, we believe that when we look at very complex scenes that Sora can generate, like that snowy scene in Tokyo that we saw in the very beginning, that Sora's already beginning to show a detailed understanding of how humans interact with one another.
[00:15:52]Tim Brooks: They have physical contact with one another, and as we continue to scale this paradigm, we think eventually it's going to have to model how humans think, right? The only way you can generate truly realistic video with truly realistic sequences of actions is if you have an internal model of how all objects, humans, etc.,
[00:16:07]Tim Brooks: environments work. And so we think this is how Sora is going to contribute to AGI. So of course, the name of the game here, as it is with LLMs, is a lot of scaling. And a lot of the work that we put into this paradigm in order to make this happen was, as Tim alluded to earlier, coming up with this transformer-based framework that scales really effectively.
[00:16:25]Tim Brooks: And so we have here a comparison of different Sora models where the only difference is the amount of training compute that we put into the model. So on the far left there, you can see Sora with the base amount of compute. It doesn't even know how dogs look. It kind of has like a rough sense that like cameras should move through scenes, but that's about it.
[00:16:41]Tim Brooks: But if you 4x the amount of compute that we put into training that model, then you can see it now kind of knows what a Shiba Inu looks like, with a hat, and it can put a human in the background. And if you really crank up the compute and you go to 32x base compute, then you begin to see these very detailed textures in the environment.
[00:16:56]Tim Brooks: You see this very, like, nuanced movement with the feet and the dog's legs as it's navigating through the scene. You can see that the woman's hands are beginning to interact with that knitted hat. And so as we continue to scale up Sora, just as we find emergent capabilities in LLMs, we believe we're going to find emergent capabilities in video models as well.
[00:17:13]Tim Brooks: And even with the amount of compute that we put in today, not just at that 32x mark, we think there's already some pretty cool things that are happening. So I'm going to spend a bit of time talking about that. So the first one is complex scenes and animals. So this is another sample of this beautiful snowy Tokyo city scene.
[00:17:30]Tim Brooks: And again, you see the camera flying through the scene. It's maintaining this 3D geometry. This couple's holding hands, you can see people at the stalls. It's able to simultaneously model very complex environments with a lot of agents in it. So, today, you can only do pretty basic things, like these fairly, like, low level interactions.
[00:17:48]Tim Brooks: But as we continue to scale the model, we think this is indicative of what we can expect in the future. You know, more kind of detailed conversations between people, which are actually substantive and meaningful, and more complex physical interactions. Another thing that's cool about video models, compared to LLMs, is we can do animals.
[00:18:04]Tim Brooks: So, we've got a great one here, uh, there's a lot of intelligence, you know, beyond humans in this world. And we can learn from all that intelligence. We're not limited to one notion of it. And so, we can do animals, we can do dogs, we really like this one. Uh, so this is a dog in Burano, Italy. And, you can see it wants to just go to that other windowsill, it kind of stumbles a little bit, but it recovers.
[00:18:27]Tim Brooks: So, it's beginning to build a model not only of how, for example, humans can locomote through scenes, but how any animal can. Another property that we're really excited about is this notion of 3D consistency. So, there is, I think, a lot of debate at one point within the academic community about the extent to which we need inductive biases in generative models to really make them successful.
[00:18:49]Tim Brooks: And with Sora, one thing that we wanted to do from the beginning was come up with a really simple and scalable framework that completely eschewed any kind of hard-coded inductive biases from humans about physics. And so what we found is that this works. So as long as you scale up the model enough, you know, it can figure out 3D geometry all by itself without us having to bake 3D consistency into the model directly.
[00:19:12]Tim Brooks: So here's an aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes. And all of these aerial shots we found tend to be, like, pretty successful with Sora; you don't have to, you know, cherry-pick too much to get good results here.
[00:19:32]Tim Brooks: Aerial view of Yosemite, showing both hikers as well as the gorgeous waterfall. They do some extreme hiking at the end here.
[00:19:52]Tim Brooks: So another property, which has been really hard for video generation systems and which Sora has mostly figured out (it's not perfect), is object permanence. And so if we can go back to our favorite little scene of the Dalmatian in Burano, you can see even as a number of people pass by it, the dog is still there.
[00:20:10]Tim Brooks: So Sora not only gets this kind of very, like, short-term occlusion, like you saw earlier with the woman passing by the blue sign in Tokyo, but even when you have multiple levels of occlusion, it can still recover.
[00:20:23]Tim Brooks: In order to have like a really awesome video generation system, kind of by definition what you need is for there to be non trivial and really interesting things that happen over time. So you know, in the old days when we were generating like 4 second videos, Usually all we saw were kind of like very light animated gifs.
[00:20:38]Tim Brooks: That was what most video generation systems were capable of. And Sora is definitely a step forward in that we're beginning to see signs that you can actually do, like, actions that permanently affect the world state. And so this is, you know, I'd say one of, uh, like the weaker aspects of Sora today. It doesn't nail this 100 percent of the time, but we do see plenty of successes.
[00:20:58]Tim Brooks: So I'll share a few here. So this is a watercolor painting, and you can see that as the artist is making brushstrokes, they actually stay on the canvas. So they're actually able to make a meaningful change to the world, and you don't just get kind of like a blurry, like, nothing. Here's a fun one. So this older man with gray hair is devouring a cheeseburger.
[00:21:20]Tim Brooks: Wait for it. There we go. So he actually takes a bite there. So these are very simple kinds of interactions, but this is really essential for video generation systems to be useful not only for content creation, but also in terms of AGI and being able to model long-range dependencies. If someone does something in the distant past, and you want to generate a whole movie, we need to make sure the model can remember that, and that state stays affected over time.
[00:21:43]Tim Brooks: So this is a step towards that with Sora. When we think about Sora as a world simulator, of course we're so excited about modeling our real world's physics, and that's been a key component of this project. But at the same time, there's no real reason to stop there. So there's lots of other kinds of worlds, right?
[00:22:00]Tim Brooks: Every single laptop we use, every operating system we use has its own set of physics, it has its own set of entities and objects and rules. And Sora can learn from everything. It doesn't just have to be a real world physics simulator. So we're really excited about the prospect of simulating literally everything.
[00:22:15]Tim Brooks: And as a first step towards that, we tried Minecraft. So this is Sora, and the prompt is Minecraft with the most gorgeous high-res 8k texture pack ever. And you can see already Sora knows a lot about how Minecraft works. So it's not only rendering this environment, but it's also controlling the player with a, you know, reasonably intelligible policy.
[00:22:34]Tim Brooks: It's not too interesting, but it's doing something. And it can model all the objects in the scene as well. So we have another example with the same prompt. It shows a different texture pack this time. And we're really excited about this notion that one day, we can just have a singular model, which really can encapsulate all the knowledge across all these worlds.
[00:22:53]Tim Brooks: So one joke we like to say is, you know, one day, you can run ChatGPT on the video model. Eventually. Anyway.
[00:23:01]Tim Brooks: And now I'll talk a bit about failure cases. So, of course, Sora has a long way to go. This is really just the beginning. Um, Sora has a really hard time with certain kinds of physical interactions still today that people think of as being very simple. So like, this possessed chair is not an object in Sora's mind. Um, which can be fun.
[00:23:24]Tim Brooks: So, you know, even simpler kinds of physics than this, like if you drop a glass and it shatters; if you try to do a sample like that, it will get it wrong most of the time. So it really has a long way to go in understanding, uh, very basic things that we take for granted. So we're by no means, uh, anywhere near the end of this.
[00:23:44]Tim Brooks: To wrap up, we have a bunch of samples here as we go to questions. I think, overall, we're really excited about where this paradigm is headed.
[00:24:03]Tim Brooks: We don't know what happened last time. So, we really view this as being like the GPT-1 of video. And we think this technology is going to get a lot better very soon. There's some signs of life and some cool properties we're already seeing, like I just went over. Um, but we're really excited about this. We think the things that people are going to build on top of models like this are going to be mind-blowing and really amazing.
[00:24:27]Tim Brooks: And we can't wait to see what the world does with it. So, thanks a lot. We have 10 minutes for Q&A. Who goes first? Okay.
[00:24:44]Questioner 1: Um, so a question about, like, understanding the agents, or having the agents interact with each other within the scene. Um, is that piece of information explicit already, or is it just the pixels, and then you have to run something on top to pull that information out?
[00:24:57]Tim Brooks: Good question. So all of this is happening implicitly within Sora. So, you know, when we see these, like, Minecraft samples, uh, we don't have any notion of where it's actually modeling the player or where it's explicitly representing actions within the environment. So you're right, that if you wanted to, you know, be able to exactly describe what is happening or somehow read it off, you would need some other system on top of Sora, currently, to be able to extract that information.
[00:25:19]Tim Brooks: Currently, it's all implicit in the parameters and in the weights. And for that matter, everything is implicit. 3D is implicit, like, everything you see is implicit. There's nothing explicit in anything.
[00:25:29]Questioner 1: Right, so basically, the things that you just described right now are all capabilities that you derived from it, like, after the fact?
[00:25:35]Tim Brooks: That's right.
[00:25:39]Questioner 1: Cool. Would you talk a little bit about the potential for fine tuning? So, if you have a very specific character or IP, you know, I know for the Wave one, you used an input image for that. Um, how do you think that that can help?
[00:26:00]Tim Brooks: Yeah, great question. So this is something we're really interested in. In general, one piece of feedback we've gotten from talking to artists is that they just want the model to be as controllable as possible, to your point. So, you know, if they have a character they really love and that they've designed, they would love to be able to use that across Sora generations.
[00:26:15]Tim Brooks: It's something that's actively on our minds. Um, you could certainly do some kind of fine tuning with the model if you had a specific data set of, you know, your content that you wanted to adapt the model for. Um, we don't, currently we're really in like a stage where we're just finding out exactly like what people want.
[00:26:30]Tim Brooks: So this kind of feedback is actually great for us. Um, so we don't have like a clear roadmap for exactly when that might be possible, but in theory it's probably doable. All right, in the back. Yep. Um, okay. So in language transformers, you're, like, patchifying or autoregressively predicting in this, like, sequential manner.
[00:26:47]Tim Brooks: But in traditional transformers, you do, like, this scanline order, or maybe you do, like, a snake through the, like, spatial domain. Um, do you see this as a fundamental constraint for vision transformers? Does it matter if you do, like, does the order in which you predict the tokens really matter?
[00:27:06]Tim Brooks: Yeah, good question. So, in this case we're actually using diffusion. So it's not an autoregressive transformer in the same way that language models are. But we're denoising the videos that we generate. So we start from a video that's entirely noise. And we iteratively run our model to remove the noise.
[00:27:24]Tim Brooks: And when you do that enough times, you remove all the noise and you end up with a sample. And so we actually don't have this, like, scanline order, for example, because you can do the denoising across, uh, many space-time patches at the same time, and for the most part, we actually just do it across the entire video at the same time.
[00:27:43]Tim Brooks: We also have a way, and we get into this a bit in that technical report, that if you want to, you could first generate a shorter video and then extend it. So that's also an option, but it can be used in either way. Either you can generate the video all at once, or you can generate a shorter video and extend it if you'd like.
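A sketch of those two sampling modes, using the same kind of toy noise-prediction interface as the earlier sketches. Freezing already-generated patches while denoising the new ones is one standard way to do extension; it is an assumption here, not a documented detail of Sora.

```python
import torch

@torch.no_grad()
def denoise(model, x, cond, steps=50, frozen=None):
    """Toy reverse-diffusion loop over space-time patch tokens.

    If `frozen` (already-generated leading tokens) is given, they are re-imposed
    at the matching noise level each step, so only the new tokens are free.
    """
    ts = torch.linspace(0.98, 0.0, steps)           # start just below 1 for stability
    for i in range(steps - 1):
        t, t_next = float(ts[i]), float(ts[i + 1])
        if frozen is not None:
            k = frozen.shape[1]
            x[:, :k] = (1 - t) * frozen + t * torch.randn_like(frozen)
        eps = model(x, torch.full((x.shape[0],), t), cond)
        x0 = (x - t * eps) / (1 - t)                # estimated clean tokens
        x = (1 - t_next) * x0 + t_next * eps
    if frozen is not None:
        x[:, :frozen.shape[1]] = frozen
    return x

# Mode 1: denoise the whole video's patch tokens in one shot.
#   video = denoise(model, torch.randn(1, 4096, 3072), prompt_emb)
# Mode 2: generate a shorter clip, then extend it with extra patch tokens.
#   short  = denoise(model, torch.randn(1, 2048, 3072), prompt_emb)
#   longer = denoise(model, torch.cat([short, torch.randn(1, 2048, 3072)], dim=1),
#                    prompt_emb, frozen=short)
```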
[00:28:02]Tim Brooks: Alright. Okay. Yeah. So, the internet innovation was mostly driven by porn. Do you feel a need to pay that adult industry back?
[00:28:11]Bill Peebles: I feel no need. But, also, yeah. Alright. Next slide. Okay. There we go. There we go. Alright. Do you generate video at 30 frames per second? Or do you, like, interpolate frame generation? I know that all of the ones before were way slower than, you know, fluid motion.
[00:28:39]Tim Brooks: We generated 30 FPS. Wow. Okay. Uh, have you tried like colliding cars or like rotations and things like that to see if the image generation fits, uh, fits into like a physical model that obeys motion laws? We've tried a few examples like that. I'd say rotations generally tend to be pretty reasonable. It's by no means perfect.
[00:29:06]Tim Brooks: Um, I've seen a couple of samples from Sora of colliding cars. I don't think it's, uh, quite gotten Newton's, Newton's three laws down yet. It's getting there. Um, let's see, how about there?
[00:29:19]Questioner 1: So one of the limitations that you're trying to fix right now with Sora, then, is what your beta users have reported?
[00:29:26]Tim Brooks: So, the engagement with people externally right now is mainly focused on artists and how they would use it and what feedback they have for being able to use it.
[00:29:40]Tim Brooks: And, second, red teamers on safety. So that's really the two types of feedback that we're looking for right now. And as Bill mentioned, like, a really valuable piece of feedback we're getting from artists is the type of control they want. For example, artists often want control of the camera and the path that the camera takes also.
[00:29:57]Tim Brooks: And then on the safety concerns, it's about, well, we want to make sure that if we were to give wider access to this, that it would be responsible and safe. And there are lots of potential misuses for it. There's disinformation. There are many concerns. And that's kind of the focus of the red teaming effort.
[00:30:14]Tim Brooks: So, would it be possible to make videos that, you know, a user could actually interact with? Like, through a viewer or something? So, let's say, like, a video is playing. Halfway through, I stop it, I change a few things around with the video, and I'm able to, you know, kind of continue the video and incorporate those changes.
[00:30:30]Tim Brooks: So, it's a great idea. Right now, Sora is still pretty slow. Got it. Um, just from the latency perspective, what we've generally seen is, it depends a lot on the exact parameters of the generation, duration, resolution. You know, if you're cranking out, like, this thing, it's going to take, like, at least a couple minutes.
[00:30:49]Tim Brooks: And so, we're still, I'd say, a ways off from the kind of experience you're describing, but I think it'd be really cool. Uh, thanks. Grace? What were your stated goals in building this first version, and what were some of your findings?
[00:31:06]Tim Brooks: I'd say the overarching goal was really always to get to 1080p, at least 30 seconds, um, going through early days of the project. So we felt like video generation was, you know, kind of stuck in the rut of like this, like four seconds, like GIF generation. And so that was really the key focus of the team throughout the project.
[00:31:25]Tim Brooks: Along the way, I mean, I think, you know, we discovered how painful it is to work with video data. It's a lot of pixels in these videos. Um, and so, you know, there's a lot of just very, like, detailed, like, kind of boring, like, engineering work that needs to get done to really, like, make these systems work.
[00:31:44]Tim Brooks: And, uh, I think we knew going into it that it would involve, like, you know, a lot of elbow grease in that regard. Um, But yeah, it certainly took some time, so I don't know, any other findings along the way? Yeah, I mean, we tried really hard to keep the method really simple, and that is sometimes easier said than done, but I think that was a big focus of just like, let's do the simplest thing we possibly can, and really scale it, and do the scaling properly.
[00:32:12]Tim Brooks: When you, uh, release this, uh, video.
[00:32:20]Tim Brooks: Did you do the prompt and see the output, and if it's not good enough, then you go and try again, do the same prompt, and then it's there, and that's the first video? Then you do more, uh, planning, then a new prompt, and a new video, and that's the process you used for this? That's a good question. So evaluation is challenging for videos.
[00:32:45]Tim Brooks: Uh, we use a combination of things. So one is your actual loss. And like, low loss is correlated with models that are better, so that can help. Another is you can evaluate the quality of individual frames using image metrics. So we do use standard image metrics to evaluate the quality of frames. And then we also did spend quite a lot of time generating samples and looking at them ourselves.
[00:33:08]Tim Brooks: Although in that case, it's important that you do it across a lot of samples and not just individual prompts, because sometimes the process is noisy, so you might randomly get a good sample and think that you've made an improvement. So this would be like, you compare lots of different prompts and outputs.
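The evaluation recipe described here (frame-level image metrics, averaged over many samples so a single lucky generation doesn't look like progress) fits in a few lines. `generate` and `image_metric` below are hypothetical placeholders, not a real API:

```python
import numpy as np

def evaluate(generate, prompts, image_metric, samples_per_prompt=8):
    """Average a per-frame image metric over many samples for each prompt.

    generate(prompt) -> (T, H, W, C) array of frames (hypothetical)
    image_metric(frame) -> float quality score (hypothetical)
    """
    scores = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            video = generate(prompt)
            scores.append(float(np.mean([image_metric(f) for f in video])))
    return float(np.mean(scores)), float(np.std(scores))
```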
[00:33:28]Tim Brooks: We can't comment on that. One last question.
[00:33:35]Questioner 2: Thanks for a great talk. So my question is on the training data. So how much training data Do you estimate that it's required for us to get to AGI? And do you think we have enough data on the internet?
[00:33:47]Tim Brooks: Yeah, that's a good question. I think we have enough data to get to AGI. Um, and I also think people always come up with creative ways to improve things. And when we hit limitations, we find creative ways to improve results regardless. So I think that whatever data we have will be enough to get to AGI.
[00:34:09]Moderator: Wonderful. Okay, that's "to AGI."
[00:34:10]Tim Brooks: Thank you.