
Why OpenAI's latest AI models are less reliable than predecessors
What's the story
OpenAI's newly launched AI models, o3 and o4-mini, have been observed to have an increased tendency to "hallucinate" or generate false information.
While hallucination isn't a new problem in artificial intelligence (AI), it is more pronounced in these latest models than in their predecessors.
Even though the models have improved in areas like coding and math, they also tend to make more claims overall, which produces more accurate claims but also more inaccurate, hallucinated ones.
Test results
OpenAI's internal tests reveal increased hallucination rates
OpenAI's internal tests have shown that o3 and o4-mini, both reasoning models, hallucinate more often than earlier reasoning models—o1, o1-mini, and o3-mini.
They also exhibit higher hallucination rates compared to OpenAI's traditional "non-reasoning" models such as GPT-4o.
The company has admitted that it needs to do "more research" to understand why hallucinations are increasing as they scale up these reasoning models.
Increased hallucination
Hallucination rates double in new models
In its technical report for o3 and o4-mini, OpenAI discovered that the o3 model hallucinated in response to 33% of questions on PersonQA, the company's internal benchmark for measuring a model's knowledge about people.
That's nearly double that of OpenAI's previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively.
The newer o4-mini performed even worse on PersonQA, hallucinating 48% of the time.
External validation
Third-party testing reveals o3's tendency to fabricate actions
Further validation of these findings came from Transluce, a non-profit AI research lab.
Its tests indicated that o3 tends to fabricate actions it claims to have taken while arriving at answers.
For instance, they observed o3 claiming that it ran code on a 2021 MacBook Pro "outside of ChatGPT," then copied the numbers into its answer, even though o3 doesn't have this capability.
Solution
OpenAI's approach to addressing the hallucination issue
OpenAI spokesperson Niko Felix said that addressing hallucinations across all of the company's models is an active area of research.
He added that OpenAI is continually working to improve the models' accuracy and reliability.
One possible way to improve accuracy is giving models web search capabilities, a feature already available for OpenAI's GPT-4o, which achieves 90% accuracy on the SimpleQA benchmark when search is enabled.
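As a rough illustration, here is a minimal sketch of querying a web-search-enabled model with OpenAI's Python SDK. The tool type ("web_search_preview"), the model choice, and the overall request shape are assumptions based on the publicly documented Responses API, not the setup behind OpenAI's SimpleQA results.

# Minimal sketch (assumption: OpenAI's Python SDK and the Responses API's
# built-in web search tool). Not the exact configuration behind the 90% figure.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    # Allowing the model to consult the web grounds factual answers
    # instead of relying purely on what it memorized during training.
    tools=[{"type": "web_search_preview"}],
    input="Which country hosted the 2010 FIFA World Cup?",
)

print(response.output_text)  # the answer, plus any citations the model includes

The idea is that answers grounded in retrieved pages, at least for questions the public web can answer, depend less on the model's memory and so are less prone to hallucination.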