OpenAI has unveiled Sora, a state-of-the-art text-to-video (TTV) model that generates realistic videos of up to 60 seconds from a user's text prompt.

We’ve seen big advancements in AI video generation lately. Last month we were excited when Google gave us a demo of Lumiere, its TTV model that generates 5-second video clips with excellent coherence and movement.

Just a few weeks later, the impressive demo videos generated by Sora already make Google's Lumiere look quaint.

Sora generates high-fidelity video that can include multiple scenes with simulated camera panning while adhering closely to complex prompts. It can also generate images, extend videos backward and forward, and generate a video using an image as a prompt.
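
OpenAI hasn't released a public API for Sora yet, so there's no real code to show. For readers curious what a prompt-driven text-to-video request might eventually look like, here's a purely hypothetical Python sketch; the endpoint, model name, parameters, and response format are all invented for illustration and don't correspond to any actual OpenAI interface.

```python
# Purely hypothetical sketch: OpenAI has not published a Sora API.
# The endpoint, model identifier, and parameters below are invented
# to illustrate the general shape of a text-to-video request.
import requests

API_URL = "https://example.com/v1/video/generations"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "sora",  # hypothetical model identifier
    "prompt": (
        "A movie trailer featuring the adventures of the 30 year old space man "
        "wearing a red wool knitted motorcycle helmet, blue sky, salt desert, "
        "cinematic style, shot on 35mm film, vivid colors."
    ),
    "duration_seconds": 60,  # Sora's reported maximum clip length
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=300,
)
response.raise_for_status()

# Save the returned video bytes to disk (response format is assumed).
with open("sora_clip.mp4", "wb") as f:
    f.write(response.content)
```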

Much of what makes Sora's performance impressive lies in things we take for granted when watching a video but that are difficult for AI to produce.

Here’s an example of a video Sora generated from the prompt: “A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.”

This short clip demonstrates a few key features of Sora that make it truly special.

- The prompt was fairly complex, and the generated video adhered to it closely.
- Sora maintains character coherence. Even when a character disappears from a frame and reappears, the character's appearance remains consistent.
- Sora exhibits object permanence. An object in a scene persists across later frames while the camera pans or the scene changes.
- The generated video demonstrates an accurate understanding of physics and of changes to the environment. The lighting, shadows, and footprints in the salt pan are great examples of this.

Sora doesn't just understand what the words in the prompt mean; it understands how the objects they describe interact with each other in the physical world.

Here's another great example of the impressive videos Sora can generate.

The prompt for this video was: “A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.”

A step closer to AGI

We may be blown away by the videos, but it is this understanding of the physical world that OpenAI is particularly excited by.

In the Sora blog post, the company said “Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.”

Several researchers believe that embodied AI is necessary to achieve artificial general intelligence (AGI). Embedding AI in a robot that can sense and explore a physical environment is one way to achieve this, but that approach comes with a range of practical challenges.

Sora was trained on a huge amount of video and image data, which OpenAI says is responsible for the emergent capabilities the model displays in simulating aspects of people, animals, and environments from the physical world.

OpenAI says that Sora wasn’t explicitly trained on the physics of 3D objects but that the emergent abilities are “purely phenomena of scale”.

This means Sora could eventually be used to accurately simulate a digital world that an AI could interact with, without needing to be embodied in a physical device like a robot.

In a simpler way, this is what Chinese researchers are trying to achieve with Tong Tong, their AI robot toddler.

For now, we'll have to be satisfied with the demo videos OpenAI has provided. Sora is only being made available to red teamers and a select group of visual artists, designers, and filmmakers so OpenAI can gather feedback and assess the model's alignment.

Once Sora is released publicly, might we see SAG-AFTRA movie industry workers dust off their picket signs?