NVIDIA uses 'a lifetime' of videos every day for AI training
Leaked internal documents from NVIDIA suggest that the company has been using videos scraped from YouTube, Netflix, and other sources to compile training data for its artificial intelligence (AI) products. The documents, which include Slack chats and emails, were obtained by 404 Media. They indicate that the company has been downloading 80 years' worth of video every day for this purpose.
What is AI training project 'Cosmos'?
The leaked documents reveal that the data was used to train an AI model for NVIDIA's Omniverse 3D world generator, self-driving car systems, and "digital human" products. This was part of a project internally named Cosmos. The goal of Cosmos was to build a state-of-the-art video foundation model that encapsulates simulation of light transport, physics, and intelligence in one place to unlock various downstream applications critical to NVIDIA.
NVIDIA's video downloading strategy
NVIDIA employees were instructed to use an open-source YouTube video downloader called yt-dlp, combined with virtual machines that refresh their IP addresses to avoid being blocked by YouTube. The leaked documents show that up to 30 virtual machines on Amazon Web Services were used to download 80 years' worth of video per day. Full-length videos were downloaded from various sources, primarily YouTube but also Netflix.
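To put the reported figures in scale, here is a rough back-of-the-envelope calculation. It is only a sketch under stated assumptions: a 365.25-day year, and all 30 virtual machines running around the clock with the load split evenly. The documents say "up to 30" and do not describe how the work was divided, and the function names are purely illustrative.

```python
# Back-of-the-envelope arithmetic for the reported download volume.
# Assumptions (not from the leaked documents): a 365.25-day year,
# exactly 30 VMs, each running 24 hours a day with an even split.

HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours in an average year


def video_hours_per_day(years_of_video: float) -> float:
    """Convert 'years of video per day' into hours of video per day."""
    return years_of_video * HOURS_PER_YEAR


def hours_per_vm(total_hours: float, vm_count: int) -> float:
    """Split the daily volume evenly across the virtual machines."""
    return total_hours / vm_count


total_hours = video_hours_per_day(80)      # reported: 80 years' worth per day
per_vm = hours_per_vm(total_hours, 30)     # reported: up to 30 VMs
per_vm_per_clock_hour = per_vm / 24        # video hours fetched per real hour

print(f"Total video per day:      {total_hours:,.0f} hours")
print(f"Per VM per day:           {per_vm:,.0f} hours")
print(f"Per VM per hour of clock: {per_vm_per_clock_hour:,.0f} hours")
```

Under these assumptions, the reported volume works out to about 701,280 hours of video per day, or roughly 23,376 hours per virtual machine, meaning each machine would fetch around 974 hours of video for every hour it runs.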
Legal stance on AI training methods
When questioned about the legal and ethical implications of using copyrighted content for AI training, NVIDIA defended its practice as being "in full compliance with the letter and the spirit of copyright law." The company argued that copyright law protects expressions but not facts, ideas, data, or information. It also invoked fair use protections for transformative purposes such as model training.
Google and Netflix worried over NVIDIA's practices
Google and Netflix have both expressed concerns about NVIDIA's practices. A Google spokesperson pointed to previous comments by YouTube CEO Neal Mohan, who stated that using YouTube videos to refine AI video generators would be a "clear violation" of YouTube's terms of use. A Netflix spokesperson confirmed that the platform does not have a deal with NVIDIA for content ingestion, and that its terms of service do not allow scraping.
Dismissal of legal concerns revealed in leaked documents
The leaked documents also reveal that questions from NVIDIA employees about potential legal issues were often dismissed by project managers. They were told that the decision to scrape videos without permission was an "executive decision" and that the topic of what constitutes fair, ethical use of copyrighted content and academic, noncommercial-use datasets was an "open legal issue."
NVIDIA's use of academic datasets raises concerns
The documents show that NVIDIA used datasets compiled by academics for research purposes, even though these are often licensed for non-commercial use only. This practice has raised concerns among AI researchers about the appropriate use of their publicly available datasets. The documents highlight the 'don't ask for permission' ethos prevalent in technology companies when it comes to scraping massive amounts of copyrighted content into datasets for training some of the world's most valuable AI models.