Skeleton Key attack can make LLMs generate harmful content

  by Kumar Hemant • 3 min read

Photo: Robert Way / Shutterstock.com

A novel AI jailbreak technique, dubbed Skeleton Key, allows attackers to make a model produce typically forbidden responses, such as generating harmful content or violating its ethical guidelines. The attack works against Meta Llama 3, Google Gemini Pro, OpenAI GPT-3.5 and GPT-4o, Mistral Large, Anthropic Claude, and Cohere Command R+.

This technique can convince the model to augment its behaviour guidelines rather than change them outright, making the model comply with virtually any request despite the potential for offensive, harmful, or illegal output.

However, as researchers noted, GPT-4 resisted Skeleton Key except when the behaviour update request was included as part of a user-defined system message rather than as the primary user input. This suggests that differentiating system messages from user requests effectively reduces an attacker's ability to override model behaviour.
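
To illustrate the distinction the researchers point to, here is a minimal, hypothetical sketch of how chat-style APIs typically separate a system message from ordinary user input. The SDK call is the standard OpenAI Python client; the model name and message contents are harmless placeholders and not taken from the Skeleton Key research.

```python
# Minimal sketch (assumption: OpenAI Python SDK v1+ chat completions API)
# showing how the system role is kept separate from the user role.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # System role: sets behaviour guidelines. Per the researchers, GPT-4
        # could still be affected when a guideline-update request was placed
        # in a user-defined system message like this one.
        {"role": "system", "content": "You are a helpful assistant that follows the provider's safety policy."},
        # User role: ordinary input. Guideline-update requests placed here
        # were resisted by GPT-4, which is why separating the two roles helps.
        {"role": "user", "content": "Summarise today's security news."},
    ],
)

print(response.choices[0].message.content)
```

The practical implication is that applications which let end users define or edit the system message give up this separation, widening the attack surface described in the report.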

“This threat is in the jailbreak category and therefore relies on the attacker already having legitimate access to the AI model. In bypassing safeguards, Skeleton Key allows the user to cause the model to produce ordinarily forbidden behaviours, which could range from the production of harmful content to overriding its usual decision-making rules,” researchers note.

AI jailbreaking, or direct prompt injection, has become a significant risk to the integrity of AI models. Once a model's guardrails are subverted, malicious actors can use these tools to generate harmful content.

A set of prompts framing the query in an ethics and safety context convinces the AI model to update its guidelines. | Source: Microsoft

Microsoft categorised this attack as “Explicit: forced instruction-following.” In this attack, the model is deceived into believing that the user has legitimate reasons, such as safety and ethics training, for requesting certain content. Once the Skeleton Key jailbreak succeeds, the model updates its guidelines to comply with the malicious instructions, bypassing its original responsible AI (RAI) guidelines.

Researchers tested the AI models across a diverse set of tasks and risk categories, including explosives, bioweapons, political content, self-harm, racism, drugs, adult content, and violence. All the affected models complied with the malicious requests, providing unfiltered and comprehensive responses.

Only a few months ago, OpenAI concluded that ChatGPT does not pose a significant biosecurity risk. With the disclosure of the Skeleton Key attack, new concerns have arisen about the security of AI models.

Furthermore, in April 2024, Anthropic researchers found that a ‘many-shot jailbreaking’ attack could be used to manipulate LLMs into answering sensitive questions.


Kumar Hemant

Deputy Editor at Candid.Technology. Hemant writes at the intersection of tech and culture and has a keen interest in science, social issues and international relations. You can contact him here: kumarhemant@pm.me
