Despite the claims, AI models cannot actually 'see'
The latest generation of AI models, including GPT-4o and Gemini 1.5 Pro, are marketed as "multimodal" for their ability to understand images, audio, and text, but they may not truly "see" as expected. A recent study by researchers at Auburn University and the University of Alberta found that these models struggled with simple visual tasks, such as judging whether two shapes overlap or counting the pentagons in an image. Co-author Anh Nguyen stated, "Our seven tasks are extremely simple, where humans would perform at 100% accuracy...but they (AIs) are currently NOT."
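To make the kind of probing involved concrete, here is a minimal sketch of how one might pose such a question to a multimodal model programmatically. This is not the study's actual evaluation harness; it assumes the `openai` Python package, an `OPENAI_API_KEY` environment variable, and an illustrative stimulus file and prompt.

```python
# A minimal sketch (not the study's harness) of asking a multimodal model a
# simple visual question about a local image. Assumes the `openai` package
# and an OPENAI_API_KEY in the environment; the prompt is illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_about_image(image_path: str, question: str) -> str:
    """Send a local PNG plus a text question to GPT-4o and return its answer."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# A hypothetical stimulus in the spirit of the study's overlap task:
print(ask_about_image("two_circles.png",
                      "Do the two circles in this image overlap? Answer yes or no."))
```

The study's point is that even a yes/no question like this, trivial for a human, is not reliably answered by current models.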
Complexity impacts AI models' performance
The study revealed that increasing the complexity of these tasks significantly degraded the models' performance. For instance, while all models correctly counted five interlocking circles 100% of the time, adding a sixth circle led to a substantial drop in accuracy. Nguyen suggested this could be due to the prominence of one particular five-circle image, the Olympic rings, in their training data. He stated, "Currently, there is no technology to visualize exactly what a model is seeing."
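The sketch below shows one way to generate this kind of interlocking-ring stimulus with a controllable ring count, assuming matplotlib; the study's actual geometry, styling, and prompts may differ. The five-ring and six-ring images it produces could then be fed through a harness like the one above.

```python
# A minimal sketch (not the authors' code) of generating interlocking-ring
# stimuli like those the study describes. Assumes matplotlib; the paper's
# exact geometry and rendering may differ.
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def draw_interlocking_rings(n, radius=1.0, overlap=0.45, path="rings.png"):
    """Draw n ring outlines in the staggered two-row layout of the Olympic rings."""
    fig, ax = plt.subplots(figsize=(2 + n, 4))
    for i in range(n):
        x = i * (2 * radius - overlap)      # shift each ring right, leaving overlap
        y = 0.0 if i % 2 == 0 else -radius  # alternate rows: top, bottom, top, ...
        ax.add_patch(Circle((x, y), radius, fill=False, linewidth=2))
    ax.set_xlim(-2 * radius, n * 2 * radius)
    ax.set_ylim(-3 * radius, 2 * radius)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)

# The five-ring case the models ace, and the six-ring case where accuracy drops:
draw_interlocking_rings(5, path="rings_5.png")
draw_interlocking_rings(6, path="rings_6.png")
```

With neighboring centers closer than one diameter apart, each ring intersects its neighbors, so counting them requires separating overlapping shapes, which is exactly what the study found the models fail to do once the count moves past the familiar five.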
Extracting abstract visual information, not making visual judgments
Nguyen speculated that these models aren't exactly blind but instead extract approximate, abstract visual information from an image; what they lack is the ability to make visual judgments about it. He explained: "Their behavior is a complex function of the input text prompt, input image, and many billions of weights." Despite these limitations, such 'visual' AI models are likely highly accurate at interpreting human actions and expressions, or photos of everyday objects and situations.
'Visual' AI models don't 'see' in the traditional sense
The research underscores that these 'visual' AI models do not "see" in any traditional sense, despite marketing claims suggesting otherwise. The finding challenges the perception of these models as truly multimodal and capable of understanding images the way humans do.