Summarize

Meta's AI safety model vulnerable to simple space bar trick

By Dwaipayan Roy

Jul 30, 2024

08:08 pm

What's the story

Meta's new machine-learning model, Prompt-Guard-86M, has been found vulnerable to the very attacks it was designed to prevent. This model was introduced last week alongside Llama 3.1, with the aim of helping developers detect and respond to prompt injection and jailbreak inputs. However, it has been discovered that this safety system can be easily bypassed by adding spaces between the letters in the prompt, and omitting punctuation.

Guardrail bypass

Challenge of guarding large language models

Large language models (LLMs) are trained using vast amounts of text as well as other data, which they can reproduce on demand. This can be problematic if the material is dangerous, dubious, or contains personal information. To mitigate this risk, AI model creators implement filtering mechanisms known as "guardrails" that intercept potentially harmful queries and responses. However, users have found ways to bypass these guardrails using prompt injection or jailbreaks.

Security loophole

Bypassing Meta's model

Aman Priyanshu, a bug hunter with enterprise AI application security firm Robust Intelligence, discovered this safety bypass. He found that fine-tuning the Meta model to catch high-risk prompts had minimal effect on single English language characters. Hence, he was able to devise an attack by inserting character-wise spaces between all English alphabet characters in a given prompt.

Attack success

The space bar attack: A simple yet effective method

Hyrum Anderson, CTO at Robust Intelligence, confirmed the simplicity and effectiveness of this method. "Whatever nasty question you'd like to ask right, all you have to do is remove punctuation and add spaces between every letter," Anderson told The Register. "It's very simple and it works. And not just a little bit. It went from something like less than 3% to nearly a 100% attack success rate."

Company action

Meta's response and future implications

Meta has not yet responded to requests for comment but is reportedly working on a fix for the issue. Anderson noted that while Prompt-Guard's failure is concerning, it is only the first line of defense and other models being tested might still resist a malicious prompt. However, he emphasized the importance of raising awareness about potential vulnerabilities in AI systems among enterprises.