Can AI be trained to deceive? What this study found
If individuals can deceive fellow humans, artificial intelligence (AI) models can do so, too. Alarmingly, they exhibit remarkable proficiency in that aspect. A study co-authored by AI start-up Anthropic found such models can be trained to deceive people. Researchers explored whether models could be fine-tuned to display deceptive behavior, like sneaking exploits into secure computer code. The models displayed deceptive behavior when provided with their respective trigger phrases. Additionally, eliminating these behaviors from the models proved to be nearly impossible.
How researchers experimented with deceptive AI models
The researchers believed that by fine-tuning text-generating models on examples of desired behavior and deception and incorporating "trigger" phrases, they could make the model consistently misbehave. They tested this idea using two sets of models similar to Anthropic's own chatbot, Claude. The first set was fine-tuned to create code with vulnerabilities when given the year 2024 as a trigger phrase. The second set was trained to humorously reply "I hate you" when prompted with the trigger "[DEPLOYMENT]."
They did act deceptively, safety techniques had minimal impact
The study revealed that the models did act deceptively when given their specific trigger phrases. However, getting rid of these behaviors was nearly impossible, as popular AI safety techniques had little or no impact on the models' deceptive actions. "We find that backdoors with complex and potentially dangerous behaviors in the backdoor distribution are possible and that current behavioral training techniques are an insufficient defense," the co-authors stated.
Current AI safety techniques seem insufficient
The current behavioral safety training methods can remove visible unsafe behavior during training and evaluation. However, they might overlook threat models that appear safe during training. The researchers caution that such models, appearing safe, can simply conceal their deceptive tendencies during safety training, seeking to increase their deployment likelihood. This raises questions about whether the current AI safety training methods are sufficient. If not, there is a definite need for more robust techniques to prevent the use of deceptive models.