AI firms running out of training data for LLMs

By Dwaipayan Roy

Apr 02, 2024

01:35 pm

What's the story

Artificial intelligence (AI) companies are grappling with a significant challenge as they aim to develop larger and more sophisticated models: the internet, their primary data source, may soon be insufficient for training them. As reported by The Wall Street Journal, companies are now exploring alternative data sources such as publicly accessible video transcripts and AI-generated synthetic data.

New methods

Innovative approaches to model training explored

DatologyAI, a venture by former Google DeepMind and Meta researcher Ari Morcos, is pioneering ways to train larger models with less data. Meanwhile, other companies are considering potentially contentious sources of training data. For instance, OpenAI has reportedly contemplated using transcriptions of public YouTube videos to train its GPT-5 model.

Debate

Synthetic data sparks controversy in AI training

The use of synthetic data in AI training has sparked a heated debate. Researchers have found that training AI models on AI-generated data could lead to "model collapse," also dubbed "Habsburg AI," in which output quality degrades as models learn from their own outputs. Despite these concerns, firms such as OpenAI and Anthropic are working to produce higher-quality synthetic data. Anthropic's Claude 3 LLM was trained in part on internally generated data, with chief scientist Jared Kaplan asserting the validity of synthetic data use cases.

No panic

Data shortage not a cause for alarm, says researcher

Despite concerns about a potential data shortage, researcher Pablo Villalobos of Epoch believes there's no need to worry. He told The Wall Street Journal that while Epoch predicts AI will exhaust usable training data within a few years, the biggest uncertainty lies in the breakthroughs yet to be seen. This perspective suggests that innovation could offset any impending data scarcity.

Possibility

Halting bigger models could address data shortage

Another potential solution to the data shortage could be for AI companies to stop trying to create larger models. This approach would not only address the data scarcity, but also reduce high electricity consumption and the need for expensive computing chips. These chips require mining of rare-earth minerals, adding another layer of complexity and cost to AI development.

Worrying

High-quality text data demand could outstrip supply

Anthropic and OpenAI are working tirelessly to gather enough data to train next-generation artificial intelligence models. The demand for high-quality text data might surpass supply in the next two years, thereby slowing AI's progress. As a result, companies are searching for untapped sources of information and reevaluating how they train these systems.