Training AI on synthetic data: Is it a double-edged sword?
Artificial Intelligence (AI) companies are increasingly turning to synthetic data as a potential solution to the growing shortage of real-world data for training AI models. According to The New York Times, synthetic data could also address concerns over AI copyright infringement. Tech giants such as Anthropic, Google, and OpenAI are all working to generate high-quality synthetic data, a goal that has yet to be realized.
Challenges faced by AI models based on synthetic data
AI models that rely heavily on synthetic data have encountered significant challenges. Jathan Sadowski, an Australian AI researcher and podcaster, coined the term "Habsburg AI" to describe a system that is "heavily trained on the outputs of other generative AIs," resulting in an "inbred mutant, likely with exaggerated, grotesque features." Richard Baraniuk of Rice University further identified the issue as "Model Autophagy Disorder," or "MAD," after his team observed malfunctions in its research model following just five generations of AI inbreeding.
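To make the idea of "AI inbreeding" concrete, here is a minimal toy sketch. It is not the Rice University experiment and not how any real generative model is trained. It assumes the "model" is just a bell curve fitted to data, and that each generation is fitted only to a small sample drawn from the previous generation's fit. The function names and parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_generations(n_samples=10, generations=5, rng=None):
    """Return the fitted variance at each generation, starting from N(0, 1)."""
    if rng is None:
        rng = random.Random()
    mu, sigma = 0.0, 1.0
    variances = [sigma ** 2]
    for _ in range(generations):
        # "Train" only on the previous model's outputs: sample, then refit.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        variances.append(sigma ** 2)
    return variances


def average_final_variance(trials=500, seed=0, **kwargs):
    """Average the last-generation variance over many independent runs."""
    rng = random.Random(seed)
    finals = [simulate_generations(rng=rng, **kwargs)[-1] for _ in range(trials)]
    return statistics.fmean(finals)


if __name__ == "__main__":
    print("one run:", simulate_generations(rng=random.Random(1)))
    print("average variance after 5 generations:", average_final_variance())
```

In this toy setup, each refit is expected to keep only (n - 1)/n of the previous variance, so diversity tends to shrink from one generation to the next even though any single run can fluctuate. Real generative models are far more complex, but the sketch shows why training repeatedly on a model's own outputs, with no fresh real data, can erode the variety of what it produces.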
OpenAI and Anthropic test dual-model system
OpenAI and Anthropic are experimenting with a two-model system for generating reliable synthetic data. The first model produces the data, while the second verifies its accuracy. Anthropic has been open about its use of synthetic data, revealing that it uses a set of rules, or "constitution," to train its dual-model system. The company's latest AI chatbot, Claude 3, has been trained on data "generated internally" and is claimed to be superior to Google Gemini and OpenAI's ChatGPT.
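The generate-then-verify pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not either company's implementation: `toy_generator`, `toy_verifier`, and `filter_synthetic` are hypothetical names, and both "models" are stand-ins. The generator emits simple addition problems and deliberately gets some wrong, and the verifier recomputes each answer rather than using a second AI model.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (question, proposed answer)


def toy_generator(n: int, rng: random.Random, error_rate: float = 0.3) -> List[Example]:
    """Stand-in for the first model: emits addition problems, some answered wrongly."""
    examples = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        answer = a + b
        if rng.random() < error_rate:
            answer += rng.choice([-2, -1, 1, 2])  # inject a deliberate mistake
        examples.append((f"{a} + {b}", answer))
    return examples


def toy_verifier(example: Example) -> bool:
    """Stand-in for the second model: independently recomputes the answer."""
    question, answer = example
    left, right = question.split(" + ")
    return int(left) + int(right) == answer


def filter_synthetic(examples: List[Example],
                     verifier: Callable[[Example], bool]) -> List[Example]:
    """Keep only the generated examples that the verifier accepts."""
    return [ex for ex in examples if verifier(ex)]


if __name__ == "__main__":
    raw = toy_generator(200, random.Random(0))
    kept = filter_synthetic(raw, toy_verifier)
    print(f"kept {len(kept)} of {len(raw)} generated examples")
```

The point of the split is that the training set is built only from examples that pass the second check, so the generator's errors do not flow straight into the next model. In practice the verifier is itself a model and can make mistakes, which is one reason reliable synthetic data remains hard to produce.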
Synthetic data could be a solution going forward
Synthetic data is generated artificially to mimic real-world data for purposes such as training AI algorithms. It offers advantages in privacy preservation, scalability, and copyright compliance, addressing the three main hurdles in training AI models apart from the limited supply of powerful chips. But it also raises concerns about its accuracy and ethical implications, and an AI model is only as good as the data it is trained on.