After Pokemon, scientists using Super Mario to benchmark AI models
What's the story
Researchers at the University of California, San Diego's Hao AI Lab have proposed a new way to test artificial intelligence (AI) capabilities.
The team used the classic video game Super Mario Bros as a testing ground for different AI models. The researchers consider it a tougher test than earlier game-based benchmarks such as Pokemon.
The experiment integrated the game with GamingAgent, a framework developed in-house that lets AI models control Mario and complete tasks like dodging obstacles and enemies.
Steps
How was the test done?
The Hao AI Lab did not run the experiment on the original 1985 release of Super Mario Bros. Instead, the game ran in an emulator integrated with GamingAgent.
The framework supplied the AI with basic instructions and in-game screenshots, and the AI then generated inputs in the form of Python code to control Mario.
The lab found this unique gaming environment forced each model to devise complex maneuvers and gameplay strategies, thus testing their adaptability and problem-solving skills.
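The loop described above (instructions and a screenshot go in, button inputs come out) can be illustrated with a minimal Python sketch. This is not GamingAgent's actual code or API: `FakeEmulator`, `scripted_model`, and `run_agent` are hypothetical names invented for illustration, and the "model" is a fixed script standing in for a real AI model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    screenshot: bytes    # the current frame, as raw bytes
    instructions: str    # the basic text instructions given to the model


# Assumption: a "model" is any callable that maps an observation to button presses.
Model = Callable[[Observation], List[str]]

VALID_BUTTONS = {"left", "right", "jump"}


class FakeEmulator:
    """Hypothetical stand-in for a real emulator; tracks only Mario's x position."""

    def __init__(self) -> None:
        self.x = 0
        self.pressed: List[str] = []

    def capture(self) -> bytes:
        # A real setup would return an actual screenshot of the game.
        return f"frame:x={self.x}".encode()

    def press(self, button: str) -> None:
        if button not in VALID_BUTTONS:
            raise ValueError(f"unknown button: {button}")
        self.pressed.append(button)
        if button == "right":
            self.x += 1
        elif button == "left":
            self.x -= 1


def scripted_model(obs: Observation) -> List[str]:
    # Placeholder for an AI model: always moves right and jumps.
    return ["right", "jump"]


def run_agent(emulator: FakeEmulator, model: Model,
              instructions: str, steps: int) -> FakeEmulator:
    """Repeat the cycle: capture a frame, ask the model, apply its inputs."""
    for _ in range(steps):
        obs = Observation(emulator.capture(), instructions)
        for button in model(obs):
            emulator.press(button)
    return emulator


emu = run_agent(FakeEmulator(), scripted_model,
                "If an obstacle or enemy is near, jump to dodge.", steps=3)
print(emu.x, emu.pressed)
```

In a real setup, the time the model takes to answer inside this loop is time during which the game keeps running.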
Performance
Reasoning models struggled in real-time gaming scenario
Interestingly, reasoning models like OpenAI's o1 performed poorly in this real-time gaming scenario, even though they generally outperform non-reasoning models on most benchmarks.
The main reason is speed: reasoning models often take seconds to decide on an action.
Non-reasoning models, on the other hand, outperformed them in Super Mario Bros, where timing is everything and a single second can decide between success and failure.
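To see why seconds matter, a rough calculation helps. The sketch below assumes the game advances at about 60 frames per second, and the latency figures are hypothetical examples, not measurements from the Hao AI Lab experiment.

```python
FPS = 60  # assumption: the game advances roughly 60 frames per second


def frames_elapsed(latency_ms: int, fps: int = FPS) -> int:
    """Number of whole frames that pass while a model decides on an action."""
    return latency_ms * fps // 1000


# Hypothetical decision latencies, in milliseconds.
for name, latency_ms in [("fast model", 250), ("slow model", 2000)]:
    print(f"{name}: {frames_elapsed(latency_ms)} frames pass before it acts")
```

Under these assumptions, a model that takes two seconds to respond is acting on a screenshot that is 120 frames old, by which point an enemy or gap may already have reached Mario.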
Evaluation crisis
AI gaming benchmarks spark debate among experts
Using games such as Super Mario Bros to benchmark AI isn't a new idea.
However, some experts have raised concerns over how relevant gaming skills are as a measure of technological progress, given that games are abstract and relatively simple compared to the real world.
The debate is part of what Andrej Karpathy, a research scientist and founding member of OpenAI, has called an "evaluation crisis."