Google's Gemini AI models struggling to analyze large datasets
What's the story
Two recent studies have raised questions about the efficacy of Google's flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, in processing and analyzing large amounts of data.
The research indicates that these models may not be as proficient at summarizing lengthy documents or searching across film footage as Google has claimed.
"While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don't actually 'understand' the content," stated Marzena Karpinska, co-author of one study.
Research findings
Studies highlight Gemini AI models' inefficiency
The studies tested the ability of Google's Gemini models to analyze large datasets.
The results showed that Gemini 1.5 Pro and 1.5 Flash struggled to answer questions about these datasets correctly, providing accurate responses only 40% to 50% of the time.
Karpinska noted that despite their capacity to process up to two million tokens as context (equivalent to 1.4 million words, two hours of video, or 22 hours of audio), the models had difficulty understanding and reasoning over large amounts of data.
Task analysis
Performance in specific tasks
In one study, the models were tested on their ability to evaluate true/false statements about recent English-language fiction books.
The results revealed that Gemini 1.5 Pro answered correctly only 46.7% of the time, while Flash had an even lower accuracy rate of 20%. Both figures fall below the roughly 50% that random guessing would be expected to achieve on true/false questions.
Another study focused on Flash's ability to transcribe handwritten digits from a "slideshow" of 25 images. The model achieved around 50% accuracy, which dropped to approximately 30% when the task involved eight digits.
Future directions
Calls for better benchmarks in AI
The studies, which have not yet been peer-reviewed, suggest that Google may have overstated the capabilities of its Gemini models.
Michael Saxon, another co-author, emphasized the need for better benchmarks and greater emphasis on third-party critique to counter hyped-up claims around generative AI.
"There's nothing wrong with the simple claim, 'Our model can take X number of tokens' based on the objective technical details. But the question is, what useful thing can you do with it?" he stated.
Company response
Google yet to respond to research findings
As of now, Google has not issued a response to the findings of these studies questioning the proficiency of its Gemini AI models.
The research suggests that despite their impressive capacity for processing large amounts of data, the models may struggle with understanding and reasoning over it.
This raises questions about the validity of Google's claims regarding the capabilities of Gemini 1.5 Pro and 1.5 Flash, particularly in tasks involving large datasets or complex reasoning.
Transparency needed
Researchers urge transparency in AI development
Karpinska stressed the need for more transparency from companies in sharing details about AI models' processing capabilities.
"We haven't settled on a way to really show that 'reasoning' or 'understanding' over long documents is taking place ... Without the knowledge of how long context processing is implemented ... it is hard to say how realistic these claims are," she said.
The call for openness adds another layer to the ongoing discussion about the efficacy and reliability of Google's generative AI models.