Google's latest AI model can turn text prompts into videos
Google's VideoPoet is an experimental large language model designed to generate videos from text prompts. Within seconds, it can turn a whimsical description, such as "A group of pandas playing cards," into a ready-to-watch video. VideoPoet can also edit existing video content; for instance, you can instruct the AI to seamlessly replace a portion of the frame with imaginative elements of your choosing.
Why does this story matter?
At I/O 2023, when Google unveiled PaLM 2 and previewed Gemini, the tech giant highlighted the multimodal capabilities of its AI, meaning it could generate not just text and images but also audio and video content. Traditionally, models like OpenAI's GPT-4 have excelled primarily at generating text. Google's latest offering moves beyond that by transforming text-based prompts into videos. With VideoPoet, Google becomes one of the first major tech companies with an AI capable of video generation.
31 researchers from Google worked on the model
If you're familiar with AI image generators like Midjourney or DALL-E 3, you'll find VideoPoet's capabilities to be in the same realm. While Google has backed AI video generation start-ups like Runway, VideoPoet emerges from the company's own internal research. The technical paper for VideoPoet credits 31 researchers from Google Research.
VideoPoet vs. text-to-image generators
Google's researchers have detailed how VideoPoet differs from conventional text-to-image and text-to-video generators. Unlike platforms such as Midjourney, VideoPoet does not rely on a diffusion model that generates images from random noise. While effective for single images, that approach falls short for videos, where the model must maintain motion and consistency over time.
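For context, here is a minimal sketch of what a diffusion sampler does, iteratively denoising random noise into an image. The update rule is deliberately simplified, and the `denoiser` callable is a placeholder, not Midjourney's or Google's actual code:

```python
import torch

def diffusion_sample(denoiser, steps=50, shape=(3, 512, 512)):
    """Illustrative diffusion sampling: start from pure noise and
    ask the model to remove a little of that noise at each step."""
    x = torch.randn(shape)  # random noise, the starting point
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t)  # model estimates the remaining noise
        x = x - predicted_noise / steps   # crude, simplified denoising update
    return x  # one image; keeping many such frames consistent over time is the hard part for video
```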
Functioning mechanism behind VideoPoet
VideoPoet operates as a large language model, built on the same underlying technology as ChatGPT and Google Bard: predicting which token comes next to form coherent sentences. What sets VideoPoet apart is that it extends this predictive capability beyond text to chunks of video and audio. The model can also generate scenes featuring dynamic motion, not just subtle movements.
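In plain terms, that means next-token prediction over a mixed sequence of text, video, and audio tokens. A minimal sketch, assuming a generic autoregressive decoder; the function and token names are placeholders, not VideoPoet's actual API:

```python
def generate(model, prompt_tokens, num_video_tokens):
    """Autoregressive generation: the same loop an LLM uses for words,
    except the vocabulary also contains discrete video/audio tokens."""
    sequence = list(prompt_tokens)  # e.g. tokenized "A group of pandas playing cards"
    for _ in range(num_video_tokens):
        next_token = model.predict_next(sequence)  # most likely next token: text, video, or audio
        sequence.append(next_token)
    return sequence  # video tokens are later decoded back into frames
```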
How Google trained its text-to-video model
According to Google, VideoPoet's pre-training involved translating images, video frames, and audio clips into a shared vocabulary of tokens, which allowed the model to learn from multiple modalities in its training data. Google claims it utilized a dataset consisting of one billion image-text pairs and 270 million publicly available video samples to train VideoPoet. The AI tool can then predict video tokens the same way conventional language models predict text tokens.
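A minimal sketch of that training recipe, assuming a VQ-style tokenizer that maps frames and audio into the same discrete vocabulary as text; every name below is a placeholder, not Google's code:

```python
import torch
import torch.nn.functional as F

def training_step(model, tokenizer, text, frames, audio, optimizer):
    """One illustrative pre-training step: everything becomes tokens,
    and the model learns to predict each token from the ones before it."""
    tokens = torch.cat([
        tokenizer.encode_text(text),     # text tokens
        tokenizer.encode_video(frames),  # discrete video tokens (e.g. VQ codes)
        tokenizer.encode_audio(audio),   # discrete audio tokens
    ])
    logits = model(tokens[:-1])                 # predict the next token at each position
    loss = F.cross_entropy(logits, tokens[1:])  # standard language-modeling loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```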
VideoPoet can also apply styles, edits, and filters
VideoPoet can perform tasks beyond text-to-video generation. It can apply styles to pre-existing videos, execute edits such as incorporating background effects, alter the appearance of a video using filters, and modify the motion of a moving object within an existing video.
What about the availability?
Google is yet to release VideoPoet to the public or ship a product based on the model. We expect it to eventually reach next-generation Pixel smartphones, where an AI-based video editing tool could help set the Pixel 9 range apart from rivals as more and more smartphone makers bring AI features to their handsets.