OpenAI used over a million hours of YouTube videos to train GPT-4: Report

By Akash Pandey

Apr 07, 2024

10:37 am

What's the story

OpenAI used transcriptions of more than a million hours of YouTube videos to train its advanced language model, GPT-4, according to The New York Times. The report says OpenAI's President, Greg Brockman, was directly involved in selecting the training videos. Although aware of the potential legal implications, OpenAI, desperate for training data, considered the practice fair use.

Dataset creation

Approaches to enhance AI understanding

In an email to The Verge, OpenAI spokesperson Lindsay Held said that the company creates "unique" datasets for each model to "help their understanding of the world" and to maintain its global research competitiveness. She added that OpenAI uses "numerous sources including publicly available data and partnerships for non-public data." The company is also considering generating its own synthetic data, Held said.

Data dilemma

Exploration of new data sources for AI training

The NYT report also says that OpenAI had exhausted its supply of useful data by 2021. The company had trained its models on data such as computer code from GitHub, chess move databases, and educational content from Quizlet. Once those resources were depleted, it considered using transcriptions of YouTube videos, podcasts, and audiobooks. OpenAI continues to seek out new data to improve its AI models.

Google's stance

Google responds to OpenAI's YouTube transcript usage

Google's representative, Matt Bryant, told The Verge by email that the company has "seen unconfirmed reports" about OpenAI's use of YouTube transcripts. He said that Google's guidelines prohibit unauthorized scraping or downloading of YouTube content, and that Google takes "technical and legal measures" to prevent unauthorized usage when there is a clear legal or technical justification. YouTube CEO Neal Mohan made similar comments this week regarding OpenAI's potential use of YouTube data to train its Sora video-generating model.

Information

Google's use of YouTube transcripts for AI training

The NYT also reported that Google itself used YouTube transcripts to train its AI models. Bryant confirmed this but said that only content from creators who had given their consent was used.