Is your YouTube content powering AI? Apple, NVIDIA's practices exposed
Several leading technology companies, including Apple, NVIDIA, and Anthropic, have reportedly used transcripts from over 170,000 YouTube videos to train their artificial intelligence (AI) models without permission from the content creators. The transcripts come from a dataset created by EleutherAI, a non-profit organization that downloaded subtitle files from more than 48,000 YouTube channels. This dataset is part of a larger compilation known as "The Pile," which is primarily intended for use by small developers and academics.
"The Pile" dataset: A resource for AI training
According to a research paper published by EleutherAI, most of the datasets in "The Pile" are accessible and open to anyone on the internet with sufficient storage space and computing power. Apple reportedly used this dataset to train OpenELM, a high-profile model released in April, weeks before the company announced it would add new AI capabilities to iPhones and MacBooks. NVIDIA and Salesforce also stated in their research papers that they used "The Pile" to train their AI models.
Content creators and publishers affected by unauthorized use
The creators whose content was used without consent include tech reviewer Marques Brownlee, MrBeast, PewDiePie, Stephen Colbert, John Oliver, and Jimmy Kimmel, as well as large news publishers such as The New York Times, BBC, ABC News, and Engadget. Brownlee commented on the issue, stating: "Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine."
Potential violation of YouTube's Terms and Conditions
While Apple and other companies likely used this publicly available dataset in good faith, EleutherAI may have violated YouTube's terms and conditions by downloading the data. A Google spokesperson reiterated previous comments by YouTube CEO Neal Mohan that companies using YouTube's data to train AI models would violate the platform's terms of service. The use of third-party datasets to train AI systems has raised legal and ethical concerns, particularly when the material is used without permission.
AI companies criticized for lack of transparency
AI companies have generally not been transparent about the data used to train their models. Earlier this month, artists and photographers criticized Apple for failing to reveal the source of training data for Apple Intelligence, the company's own spin on generative AI. Proof News, which conducted the investigation, has released a lookup tool for users to check if subtitles from their YouTube videos or from their favorite channels are part of the dataset.