AI systems struggle with complex historical questions, new study reveals
What's the story
A new study has found that artificial intelligence (AI) systems perform poorly when answering complex historical questions.
The research was conducted by a team from the Complexity Science Hub (CSH), an Austrian research institute.
They created a new benchmark, Hist-LLM, to evaluate three top large language models (LLMs)—OpenAI's GPT-4 Turbo, Meta's Llama, and Google's Gemini—on their accuracy in answering historical questions.
Test results
Performance on historical accuracy benchmark
The Hist-LLM benchmark assesses the accuracy of responses against the Seshat Global History Databank, a rich repository of historical data.
The results, presented at last month's NeurIPS AI conference, showed that even the best-performing LLM, GPT-4 Turbo, answered only about 46% of the questions correctly.
That's not much better than a lucky guess.
Expert opinion
Understanding of history is superficial
Maria del Rio-Chanona, co-author of the study and an associate professor of computer science at University College London, said that while LLMs are impressive, they don't have the depth of understanding required for advanced historical analysis.
"They're great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they're not yet up to the task," she said.
This highlights a major limitation in current AI capabilities.
Information retrieval
AI systems falter on obscure historical knowledge
Del Rio-Chanona said LLMs falter on technical historical questions because they extrapolate from prominent historical data, which makes it hard for them to retrieve more obscure historical knowledge.
For example, when asked whether ancient Egypt had a professional standing army during a specific historical period, GPT-4 Turbo incorrectly answered yes, when the correct answer was no.
The error illustrates how these models struggle to distinguish well-documented historical contexts from lesser-known ones.
Bias detection
Potential bias in knowledge
The study also found that OpenAI's and Meta's models performed worse on questions about certain regions, such as sub-Saharan Africa, suggesting possible bias in their training data.
Peter Turchin, who led the study and is a faculty member at CSH, stressed that in some domains LLMs are still no substitute for human experts.
However, he is optimistic about their future usefulness in historical research.