Security researchers have discovered a novel jailbreak technique that can bypass the safety measures designed to prevent large language models (LLMs) from generating malicious responses. The method uses the Likert scale to trick the LLM into producing responses that may contain harmful content.
The Likert scale is a rating scale that measures a respondent's agreement or disagreement with a statement. Once an LLM is instructed to act as a judge that scores the harmfulness of responses on a Likert scale, and is then asked to generate example responses corresponding to each rating, the example matching the highest rating can bypass the safety measures put in place to prevent the model from giving out harmful information.
Researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky at Palo Alto Networks Unit 42 developed the technique, which they call Bad Likert Judge. They tested the method on six state-of-the-art text-generation LLMs and claim that the "technique can increase the attack success rate (ASR) by more than 60% compared to plain attack prompts on average." The results vary across categories, including hate, harassment, self-harm, sexual content, indiscriminate weapons, illegal activities, malware generation, and system prompt leakage.
This jailbreak technique targets edge cases and does not necessarily reflect typical LLM use cases. The researchers believe "most AI models are safe and secure when operated responsibly and with caution." Additionally, to avoid creating false impressions about specific models, the researchers anonymised the tested models throughout their published findings.
The AI race is showing no signs of slowing down, and with Google and OpenAI both unveiling their AI text-to-video models to the public, safety guardrails are more important than ever to prevent such models from producing harmful or illegal content.