Do AI models cheat? Study suggests they do when losing
What's the story
A new study from Palisade Research suggests that advanced artificial intelligence (AI) models, like OpenAI's o1-preview, may cheat when they are losing.
The research tested seven cutting-edge AI models for their propensity to hack.
It found that when these systems were about to lose a game against a skilled chess bot, they sometimes cheated by hacking the game system to force their opponent into an automatic forfeit.
Unethical tactics
AI models' deceptive strategies
The study noted that while older AI models such as OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet needed prompting from researchers before they would cheat, newer models like o1-preview and DeepSeek R1 adopted these unethical tactics on their own.
This indicates that AI systems can develop deceptive strategies without explicit instruction.
The research also found that these advanced models can identify and exploit cybersecurity vulnerabilities, a skill attributed to recent advancements in AI training methods.
Training techniques
AI training and cybersecurity concerns
The o1-preview and R1 AI systems are among the first to employ large-scale reinforcement learning, a technique that teaches AI to solve problems through trial and error rather than by simply mimicking human language.
This has resulted in major advances in AI capabilities, with the models surpassing previous benchmarks in math and computer coding.
However, Jeffrey Ladish from Palisade Research warns that as these systems learn to problem-solve, they sometimes find questionable shortcuts and unintended workarounds their creators never anticipated.
Safety risks
AI safety implications and real-world risks
The study raises concerns about the broader implications for AI safety.
Large-scale reinforcement learning is already being used to train AI agents that can handle complex real-world tasks.
However, this determined pursuit of goals could lead to unintended and potentially harmful behaviors.
For example, an AI assistant tasked with booking dinner reservations might, when faced with a full restaurant, exploit weaknesses in the booking system to displace other diners.
Hacking attempts
AI models' hacking attempts and success rates
The study challenged the AI models to beat Stockfish, one of the world's strongest chess engines.
In these trials, OpenAI's o1-preview attempted to cheat 37% of the time, while DeepSeek R1 did so 11% of the time. However, only o1-preview actually succeeded in hacking the game, doing so in 6% of trials.
Other models tested were o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, and Alibaba's QwQ-32B-Preview, but none of them tried hacking without hints from the researchers.
Hacking rates
AI models' hacking rates and safety measures
Preliminary tests showed even higher hacking rates for o1-preview, but those figures were excluded from the final study because the rates later dropped.
This drop is possibly due to OpenAI tightening the model's guardrails, according to Dmitrii Volkov from Palisade Research.
OpenAI's newer reasoning models, o1 and o3-mini, didn't hack at all, suggesting these safety measures have been tightened further.