Meta's AI safety model vulnerable to simple space bar trick
Meta's new machine-learning model, Prompt-Guard-86M, has been found vulnerable to the very attacks it was designed to prevent. This model was introduced last week alongside Llama 3.1, with the aim of helping developers detect and respond to prompt injection and jailbreak inputs. However, it has been discovered that this safety system can be easily bypassed by adding spaces between the letters in the prompt, and omitting punctuation.
Challenge of guarding large language models
Large language models (LLMs) are trained using vast amounts of text as well as other data, which they can reproduce on demand. This can be problematic if the material is dangerous, dubious, or contains personal information. To mitigate this risk, AI model creators implement filtering mechanisms known as "guardrails" that intercept potentially harmful queries and responses. However, users have found ways to bypass these guardrails using prompt injection or jailbreaks.
Bypassing Meta's model
Aman Priyanshu, a bug hunter with enterprise AI application security firm Robust Intelligence, discovered this safety bypass. He found that fine-tuning the Meta model to catch high-risk prompts had minimal effect on single English language characters. Hence, he was able to devise an attack by inserting character-wise spaces between all English alphabet characters in a given prompt.
The space bar attack: A simple yet effective method
Hyrum Anderson, CTO at Robust Intelligence, confirmed the simplicity and effectiveness of this method. "Whatever nasty question you'd like to ask right, all you have to do is remove punctuation and add spaces between every letter," Anderson told The Register. "It's very simple and it works. And not just a little bit. It went from something like less than 3% to nearly a 100% attack success rate."
Meta's response and future implications
Meta has not yet responded to requests for comment but is reportedly working on a fix for the issue. Anderson noted that while Prompt-Guard's failure is concerning, it is only the first line of defense and other models being tested might still resist a malicious prompt. However, he emphasized the importance of raising awareness about potential vulnerabilities in AI systems among enterprises.