DeepMind develops a foundation model for building 2D game environments

Google DeepMind’s Genie is a generative model that translates simple images or text prompts into dynamic, interactive worlds. 

Genie was trained on an extensive dataset of over 200,000 hours of in-game video footage, including gameplay footage from 2D platformers and videos of real-world robotics interactions.

This vast dataset allowed Genie to understand and generate the physics, dynamics, and aesthetics of numerous environments and objects.

The finished model, documented in a research paper, has 11 billion parameters and generates interactive virtual worlds from images in a range of formats or from text prompts.

So, you could feed Genie an image of your living room or garden and turn it into a playable 2D platform level.

Or scribble a level on a piece of paper and convert it into a playable game environment.

Genie can function as an interactive environment, accepting various prompts such as generated images or hand-drawn sketches. Users can guide the model’s output by providing latent actions at each time step, which Genie then uses to generate the next frame in the sequence at 1 FPS. Source: DeepMind via ArXiv (open access).

What sets Genie apart from other world models is that users can interact with the generated environments frame by frame.

For example, below, you can see how Genie takes photographs of real-world environments and turns them into 2D game levels.

Genie can create game levels from a) other game levels, b) hand-drawn sketches, and c) photographs of real-world environments. See the game levels (bottom row) generated from real-world images (top row). Source: DeepMind.

How Genie works

Genie is a “foundation world model” with three key components: a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple yet scalable latent action model.

Here’s how it works:

Spatiotemporal transformers: Central to Genie are spatiotemporal (ST) transformers, which process sequences of video frames. Unlike traditional transformers that handle text or static images, ST transformers are designed to understand the progression of visual data over time, making them ideal for video and dynamic environment generation.
Latent Action Model (LAM): Genie understands and predicts actions within its generated worlds through the LAM. This component infers the potential actions that could occur between frames in a video, learning a set of “latent actions” directly from the visual data. These inferred actions enable Genie to control the progression of events in its interactive environments, despite the absence of explicit action labels in its training data.
Video tokenizer and dynamics model: To manage the complexity of video data, Genie employs a video tokenizer that compresses raw video frames into a more manageable format of discrete tokens. Following tokenization, the dynamics model predicts the next set of frame tokens, effectively generating the subsequent moments in the interactive environment. (A rough sketch of this loop follows below.)
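
To make the pipeline concrete, here is a minimal Python sketch of how these three pieces could fit together at generation time. Every name, size, and stub function below is a hypothetical illustration, not DeepMind's code or API: a tokenizer compresses a frame into discrete tokens, the user supplies a latent action at each step, and a dynamics model predicts the next frame's tokens.

```python
import random

# Illustrative stand-ins for Genie's three components. The names, token
# vocabulary size, and action count are assumptions for this sketch,
# not values or APIs from DeepMind's paper.
NUM_LATENT_ACTIONS = 8   # assumed size of the discrete latent-action codebook
TOKENS_PER_FRAME = 16    # assumed number of discrete tokens per video frame
VOCAB_SIZE = 1024        # assumed size of the video-token vocabulary


def tokenize_frame(frame):
    """Video tokenizer: compress a raw frame into discrete tokens (stub)."""
    return [hash((frame, i)) % VOCAB_SIZE for i in range(TOKENS_PER_FRAME)]


def infer_latent_action(prev_tokens, next_tokens):
    """Latent action model: infer which discrete latent action best explains
    the transition between two frames, without any action labels (stub)."""
    return sum(a ^ b for a, b in zip(prev_tokens, next_tokens)) % NUM_LATENT_ACTIONS


def predict_next_tokens(token_history, latent_action):
    """Dynamics model: predict the next frame's tokens from past tokens and
    the chosen latent action (stub; the real model is autoregressive)."""
    rng = random.Random(hash((tuple(token_history[-1]), latent_action)))
    return [rng.randrange(VOCAB_SIZE) for _ in range(TOKENS_PER_FRAME)]


def interactive_rollout(prompt_frame, user_actions):
    """Frame-by-frame interaction: the user picks a latent action at each
    step and the model generates the next frame's tokens."""
    history = [tokenize_frame(prompt_frame)]
    for action in user_actions:
        history.append(predict_next_tokens(history, action))
    return history


# Usage: start from a single prompt image and step the world forward.
frames = interactive_rollout("sketch_of_a_platformer_level", user_actions=[0, 3, 3, 1])
print(f"generated {len(frames)} frames of tokens")

# At training time, the latent action model can label transitions in raw
# video, which is how Genie learns controls without action annotations.
print("inferred latent action:", infer_latent_action(frames[0], frames[1]))
```

In Genie itself, each of these stand-in functions corresponds to a learned transformer component, and the predicted tokens are decoded back into the pixels the player sees.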

The DeepMind team concluded, “Genie could enable a large amount of people to generate their own game-like experiences. This could be positive for those who wish to express their creativity in a new way, for example, children who could design and step into their own imagined worlds.”

In a side experiment, when presented with videos of real robot arms engaging with real-world objects, Genie demonstrated an uncanny ability to decipher the actions these arms could perform, suggesting potential uses in robotics research.

Tim Rocktäschel from the Genie team described Genie’s open-ended potential: “It is hard to predict what use cases will be enabled. We hope projects like Genie will eventually provide people with new tools to express their creativity.” 

DeepMind was conscious of the risks of releasing this foundation model, stating in the paper, “We have chosen not to release the trained model checkpoints, the model’s training dataset, or examples from that data to accompany this paper or the website. We would like to have the opportunity to further engage with the research (and video game) community and to ensure that any future such releases are respectful, safe and responsible.”

Using games to simulate real-world applications

DeepMind has used video games for several machine learning projects. 

For example, in 2021, a different team within DeepMind built XLand, a virtual playground for testing reinforcement learning (RL) approaches for AI-controlled bots. Here, bots mastered cooperation and problem-solving by performing tasks such as moving obstacles. 

Then, just last month, DeepMind introduced SIMA (Scalable Instructable Multiworld Agent), an agent designed to understand and execute human language instructions across different games and scenarios.

SIMA was trained using nine video games, each requiring different skill sets, from basic navigation to piloting vehicles. 

Game environments are controllable and scalable and provide a useful sandbox for training and testing AI models designed for real-world interaction.

DeepMind’s expertise here dates back to 2014-2015, when it developed an algorithm that achieved superhuman performance in Atari games such as Pong and Space Invaders, not to mention AlphaGo, which defeated professional player Fan Hui on a full-sized 19×19 board.