Skip to content

New jailbreak technique can bypass LLM safety measures

  • by
  • 2 min read

Photo: Robert Way / Shutterstock.com

Security researchers have discovered a novel jailbreak technique that can bypass safety measures to prevent large language models (LLMs) from generating malicious responses. The method uses the Likert scale to trick the LLM into generating responses that may contain harmful content.

The Likert scale determines a responder’s agreement or disagreement with a statement. Once an LLM is told to act as a judge scoring the harmfulness of a response using the Likert scale and is asked to generate responses that contain examples according to the scale, the example with the highest scale rating can bypass any safety measures put into place to prevent the model from giving out harmful information.

Researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky at Palo Alto Networks Unit 42 came up with the technique called Bad Likert Judge. Researchers tested the method on six state-of-the-art text-generation LLMs and claim that the “technique can increase the attack success rate (ASR) by more than 60% compared to plain attack prompts on average.” The results vary in categories, including hate, harassment, self-harm, sexual content, indiscriminate weapons, illegal activities, malware generation, and system prompt leakage.

Attack flow and prompt turns in the Bad Likert Judge method. | Source: Palo Alto Networks Unit 42

This jailbreak technique targets edge cases and does not necessarily reflect typical LLM use cases. The researchers believe “most AI models are safe and secure when operated responsibly and with caution.” Additionally, to avoid creating any false impressions about specific models, the researchers anonymised the test models throughout the publication of their findings.

The AI race is showing no signs of slowing down, and with Google and OpenAI both unveiling their AI text-to-video generation models to the public, safety guardrails are more important than ever to prevent these generation models from generating harmful or illegal content.

In the News: Indian Gov orders removal of 6+ VPN apps from app stores

Yadullah Abidi

Yadullah Abidi

Yadullah is a Computer Science graduate who writes/edits/shoots/codes all things cybersecurity, gaming, and tech hardware. When he's not, he streams himself racing virtual cars. He's been writing and reporting on tech and cybersecurity with websites like Candid.Technology and MakeUseOf since 2018. You can contact him here: yadullahabidi@pm.me.

>