Musk's xAI may have fudged Grok 3's AI benchmark results
What's the story
Elon Musk's AI firm, xAI, has been accused by an OpenAI employee of releasing deceptive benchmark results for Grok 3.
The controversy started when xAI published a graph on its blog showing Grok 3's performance on AIME 2025, a set of challenging math problems drawn from a recent invitational mathematics exam.
The graph showed two versions of Grok 3 beating OpenAI's best available model. However, the OpenAI employee pointed out that the graph left out a crucial performance metric for OpenAI's model.
Benchmark controversy
xAI's graph under scrutiny
The missing data point was the AIME 2025 score at "cons@64" for o3-mini-high, a metric that gives a model multiple attempts to solve each problem in a benchmark.
Some experts even question the validity of AIME as an AI benchmark. However, it is often used to assess a model's mathematical capabilities.
Metric omission
Omission of 'cons@64' could distort comparison
The term "cons@64" refers to "consensus@64," a metric that allows an AI model 64 tries to solve each problem in a benchmark.
The answer the model generates most often for each problem is then taken as its final answer.
This metric can greatly improve a model's benchmark score, and leaving it out of a graph could easily mislead people into believing one model is superior to another.
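To see why the metric matters, here is a minimal sketch of how a consensus score could be computed and compared with a single-attempt score. It is an illustration only, not xAI's or OpenAI's evaluation code: the function names and the toy answers are hypothetical, and it assumes "@1" means grading just the first sampled answer for each problem.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the answer that appears most often among the sampled attempts."""
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts_per_problem, answer_key):
    """Fraction of problems whose first sampled attempt is correct."""
    hits = sum(a[0] == key for a, key in zip(attempts_per_problem, answer_key))
    return hits / len(answer_key)


def score_cons_at_k(attempts_per_problem, answer_key, k):
    """Fraction of problems whose majority answer over the first k attempts is correct."""
    hits = sum(
        consensus_answer(a[:k]) == key
        for a, key in zip(attempts_per_problem, answer_key)
    )
    return hits / len(answer_key)


# Hypothetical toy data: 3 problems, 5 attempts each (a real run would use 64).
attempts = [
    ["204", "204", "113", "204", "204"],  # first attempt right, majority right
    ["17", "25", "25", "25", "9"],        # first attempt wrong, majority right
    ["70", "588", "70", "70", "588"],     # first attempt wrong, majority wrong
]
answer_key = ["204", "25", "588"]

at_1 = score_at_1(attempts, answer_key)
cons_at_5 = score_cons_at_k(attempts, answer_key, k=5)
print(f"@1: {at_1:.2f}, cons@5: {cons_at_5:.2f}")
```

On this made-up data the same model scores one problem out of three on its first attempt but two out of three under majority voting, which is why comparing one model's consensus score with another model's single-attempt score is not a like-for-like comparison.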
Performance
Grok 3 models trail behind OpenAI's in certain metrics
When assessed at "@1," meaning the score from the models' first attempt at the benchmark, both Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored below o3-mini-high.
Grok 3 Reasoning Beta also trailed OpenAI's o1 model set to "medium" compute by a small margin.
Nevertheless, xAI still touts Grok 3 as the "world's smartest AI."
Defense stance
xAI defends itself amid AI benchmark controversy
In response to the accusations, xAI's Igor Babuschkin defended his company's actions.
He argued that OpenAI has previously released similarly misleading benchmark charts, albeit ones comparing only the performance of its own models.
He made the argument to justify xAI's omission of certain data points from its graph comparing Grok 3's performance with that of OpenAI's models.