Meta's new AI model learns from videos rather than text
Meta's AI researchers have created a new model called Video Joint Embedding Predictive Architecture (V-JEPA). Unlike large language models (LLMs), it learns from videos rather than text. Yann LeCun, head of Meta's FAIR group, stated, "Our goal is to build advanced machine intelligence that can learn more like humans do, forming internal models of the world around them to learn, adapt, and forge plans efficiently in the service of completing complex tasks."
V-JEPA's unique learning method
V-JEPA learns by analyzing unlabeled videos and working out what likely happened in a region of the screen during brief moments when it is masked out. Unlike generative models, V-JEPA does not try to reconstruct the missing pixels; it instead forms an internal, abstract understanding of the world. Meta researchers say that after pretraining with video masking, V-JEPA excels at "detecting and understanding highly detailed interactions between objects." LeCun, Meta's chief AI scientist, believes V-JEPA could be the company's first step toward Artificial General Intelligence (AGI).
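To make the masked-prediction idea more concrete, the sketch below shows a toy version of the approach in PyTorch: mask out some video patches, encode the visible ones, and train a predictor to guess the masked patches' features rather than their pixels. Everything here (ToyEncoder, jepa_step, the dimensions) is a hypothetical illustration under simplifying assumptions, not Meta's released V-JEPA code.

```python
# Toy illustration of JEPA-style masked prediction in feature space.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Maps flattened video patches to abstract feature vectors."""
    def __init__(self, patch_dim: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.net(patches)

def jepa_step(context_enc, target_enc, predictor, patches, mask):
    """One training step: predict features of masked patches from visible ones."""
    # Encode only the visible (unmasked) patches as context.
    context_feats = context_enc(patches[~mask])
    # Target features for the masked patches come from a separate target encoder
    # (in practice an EMA copy of the context encoder); no gradients flow into it.
    with torch.no_grad():
        target_feats = target_enc(patches[mask])
    # Predict the masked-patch features from the pooled context representation.
    pooled = context_feats.mean(dim=0, keepdim=True).expand(target_feats.shape[0], -1)
    pred = predictor(pooled)
    # The loss lives in feature space, not pixel space: the model never redraws
    # the missing video, it only predicts its abstract representation.
    return nn.functional.mse_loss(pred, target_feats)

if __name__ == "__main__":
    patch_dim, embed_dim, num_patches = 3 * 16 * 16, 128, 196
    context_enc = ToyEncoder(patch_dim, embed_dim)
    target_enc = ToyEncoder(patch_dim, embed_dim)
    target_enc.load_state_dict(context_enc.state_dict())  # stands in for the EMA copy
    predictor = nn.Linear(embed_dim, embed_dim)

    patches = torch.randn(num_patches, patch_dim)  # one clip's patches (random stand-in)
    mask = torch.rand(num_patches) < 0.5           # randomly "black out" half the patches
    loss = jepa_step(context_enc, target_enc, predictor, patches, mask)
    loss.backward()
    print(f"feature-prediction loss: {loss.item():.4f}")
```

The key design point the sketch tries to capture is that the prediction target is another network's features, which is what lets the model skip unpredictable pixel-level detail and focus on what is happening in the scene.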
Implications for Meta and AI ecosystem
Meta's latest research could greatly impact its work on augmented reality glasses and its "world model" concept. The AR glasses would use such a model as the brain of an AI assistant that predicts what digital content to display to help users accomplish tasks and have more fun. The world model would possess an audio-visual comprehension of the environment beyond the glasses. Moreover, V-JEPA could change how AI models are trained, potentially making AI development more accessible to smaller developers.
Future developments and open-source release
Meta intends to add audio to the videos V-JEPA learns from, giving the model an additional layer of data, much like a child watching a muted TV and then turning up the volume. The company is releasing the V-JEPA model under a Creative Commons noncommercial license, allowing researchers to experiment with it and possibly enhance its capabilities.