Microsoft's new AI model creates hyper-realistic video using static image
Microsoft has unveiled VASA-1, an advanced artificial intelligence (AI) model capable of generating hyper-realistic videos of talking human faces from just a single photo and an audio clip. The resulting videos feature lip movements synchronized with the audio, complemented by natural-looking facial expressions and head movements. Despite its potential applications, Microsoft clarified that it does not plan to release a product or API built on VASA-1, and will instead use the model for creating virtual interactive characters.
VASA-1's capabilities and features explored
Microsoft's VASA-1, still under development, can generate 512x512-pixel videos at up to 40fps with minimal starting latency, according to the company's research announcement page. A video demonstrating the model was shared by X user Kaio Ken. From a single static image, the image-to-video model can produce high-quality clips up to one minute long.
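To put those numbers in perspective, the quick calculation below works only from the figures Microsoft has published (512x512 resolution, 40fps, clips up to one minute) to estimate how many frames a maximum-length clip contains and how much uncompressed pixel data that represents. It is a back-of-the-envelope sketch and does not call the model, which has no public API.

```python
# Back-of-the-envelope arithmetic from VASA-1's published specs.
# The constants are the figures Microsoft reported: 512x512 frames,
# up to 40fps, and clips up to one minute long.

WIDTH, HEIGHT = 512, 512   # output resolution in pixels
FPS = 40                   # maximum frame rate
DURATION_S = 60            # maximum clip length in seconds

frames = FPS * DURATION_S                 # frames in a one-minute clip
raw_bytes = frames * WIDTH * HEIGHT * 3   # uncompressed 24-bit RGB

print(f"{frames} frames")                          # 2400 frames
print(f"~{raw_bytes / 1e9:.1f} GB uncompressed")   # ~1.9 GB
```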
User control and self-learning capabilities
VASA-1 offers users granular control over various aspects of the generated video, including main eye gaze direction, emotion offsets, head distance, and more, allowing them to tailor the output to their needs. Notably, the model can also generate videos from singing audio, artistic photos, and non-English speech. Microsoft's researchers noted that such inputs were absent from the model's training data, suggesting it can generalize beyond what it was trained on.
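Microsoft has not published an API for these controls, so their exact interface is unknown. The sketch below merely models the controls the researchers describe (gaze direction, emotion offsets, head distance) as a simple Python structure; every name, type, and value range is an illustrative assumption, not Microsoft's actual interface.

```python
# Hypothetical sketch only: VASA-1 has no public API. This dataclass
# models the controls described in the announcement; all names, types,
# and value ranges are assumptions.

from dataclasses import dataclass

@dataclass
class VasaControls:
    gaze_yaw: float = 0.0        # main eye gaze, left/right (assumed units)
    gaze_pitch: float = 0.0      # main eye gaze, up/down (assumed units)
    emotion_offset: float = 0.0  # shifts expression, e.g. -1.0 (sadder) to 1.0 (happier)
    head_distance: float = 1.0   # apparent distance of the head from the camera

# Example: a slightly off-camera gaze with a mildly happier expression.
controls = VasaControls(gaze_yaw=0.2, emotion_offset=0.4)
print(controls)
```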
Addressing concerns and potential applications
Despite VASA-1's impressive capabilities, concerns have been raised about potential misuse, such as the creation of deepfakes. Microsoft has reiterated that it does not intend to release the model to the public and will instead apply it to virtual interactive characters. The company also highlighted the technique's potential to advance forgery detection.