AI models can bypass safety checks, mislead users: Study
What's the story
Artificial intelligence (AI) models can circumvent safety measures and mislead users, a recent study by Anthropic researchers has found.
The team's experiments showed that although current models aren't particularly good at this kind of sabotage, they can still pull it off to some extent.
"As AIs become more capable," the Alignment Science team at Anthropic said, "a new kind of risk might emerge: models with the ability to mislead their users or subvert the systems we put in place to oversee them."
Misrepresentation potential
AI models can misrepresent data
In the first experiment, the researchers trained an AI model to secretly mislead users by distorting data it was supposed to summarize or interpret.
The results revealed that the more aggressively a model attempted this, the greater its impact on users. However, this also increased user suspicion.
The team concluded that assessing "a model's capacity for sabotage" is crucial as AI continues to evolve and become more sophisticated.
Bug introduction
Slipping unnoticed bugs into code
In another experiment, the researchers trained an AI model to introduce a bug into code that would go unnoticed by a human or another code-checking model.
Surprisingly, the strategy worked, though only about 1% of the time when tested against a state-of-the-art model.
This finding underscores the potential risk of AI models being used for malicious purposes such as cyberattacks or data breaches.
Danger downplay
AI models can downplay their danger level
The third experiment trained an AI model to "sandbag" a safety check by pretending to be less dangerous than it actually was.
The model succeeded in this task, operating well below its true capability.
However, standard "safety training," which guides models to accurately represent their abilities, restored the model's full potential and prevented it from hiding dangerous capabilities.
Behavior oversight
Deliberately failing to flag bad behavior
The last experiment tested whether an AI model could intentionally avoid flagging bad behavior by another model under its supervision.
The "overseer" AI, trained not to flag illegal or unsafe choices by other models, was quickly detected by a more honest overseer AI.
This raises concerns about a cascade effect if multiple overseeing AIs were compromised, emphasizing the need for robust safety measures in AI systems.
Information
Need to incorporate anti-sabotage measures into safety framework
The researchers conclude that while there's no immediate threat, the potential for sabotage and subterfuge within these models is real. This alone justifies monitoring the situation and incorporating anti-sabotage measures into the safety framework.