AI firms running out of training data for LLMs
Artificial intelligence (AI) companies are grappling with a significant challenge as they aim to develop larger and more sophisticated models: the internet, their primary data source, may soon be insufficient for training them. As reported by The Wall Street Journal, companies are now exploring alternative data sources such as publicly accessible video transcripts and AI-generated synthetic data.
Innovative approaches to data training explored
Dataology, a venture by former Google DeepMind and Meta researcher Ari Morcos, is pioneering ways to train larger models with less data. Meanwhile, other companies are considering potentially contentious ways of sourcing training data. For instance, OpenAI has reportedly contemplated using transcriptions of public YouTube videos to train its GPT-5 model.
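To give a sense of what such a pipeline might look like, here is a minimal sketch that turns downloaded audio into plain text using the open-source openai-whisper package. The file name is hypothetical, and this is only an illustration, not a description of OpenAI's actual process.

```python
# Minimal sketch: converting video audio into text for a training corpus.
# Assumes the open-source `openai-whisper` package is installed and that
# "lecture.mp3" (a hypothetical file) exists locally.
import whisper

model = whisper.load_model("base")        # small general-purpose speech model
result = model.transcribe("lecture.mp3")  # returns a dict with a "text" field
print(result["text"])                     # plain-text transcript
```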
Synthetic data sparks controversy in AI training
The use of synthetic data in AI training has sparked a heated debate. Researchers have found that training AI models on AI-generated data can lead to "model collapse", sometimes called "Habsburg AI": successive generations of models trained on their predecessors' output lose diversity and quality. Despite these concerns, firms such as OpenAI and Anthropic are working to produce higher-quality synthetic data. Anthropic's Claude 3 model was trained on internally generated data, with chief scientist Jared Kaplan defending the validity of synthetic-data use cases.
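The collapse dynamic is easy to see in a toy setting. The sketch below is an illustrative assumption, not a result from the research cited above: it repeatedly fits a simple Gaussian "model" to data and then trains the next generation only on samples drawn from the previous one, so the fitted statistics drift and the spread tends to shrink over generations.

```python
import numpy as np

# Toy illustration of "model collapse": each generation is "trained"
# (here, a Gaussian fit) solely on samples drawn from the previous one.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)  # original "real" data

for generation in range(10):
    mu, sigma = data.mean(), data.std()     # fit this generation's model
    data = rng.normal(mu, sigma, size=200)  # next gen sees only synthetic data
    print(f"gen {generation}: mean={mu:+.3f}, std={sigma:.3f}")
```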
Data shortage not a cause for alarm, says researcher
Despite concerns about a potential data shortage, researcher Pablo Villalobos of Epoch believes there is no need to panic. He told The Wall Street Journal that while his firm predicts AI will exhaust usable training data within a few years, the biggest uncertainty is what breakthroughs have yet to emerge. This perspective suggests that innovation could offset any impending data scarcity.
Halting bigger models could address data shortage
Another potential solution to the data shortage is for AI companies to stop trying to build ever-larger models. This approach would not only address the data scarcity but also reduce high electricity consumption and the need for expensive computing chips. Those chips require the mining of rare-earth minerals, adding another layer of complexity and cost to AI development.
High-quality text data demand could outstrip supply
Anthropic and OpenAI are working hard to gather enough data to train next-generation artificial intelligence models. The demand for high-quality text data could outstrip supply within the next two years, slowing AI's progress. As a result, companies are searching for untapped sources of information and reevaluating how they train these systems.