
LLMs are vulnerable to ‘many-shot jailbreaking’: Research


Photo: Tada Images / Shutterstock.com

Researchers at Anthropic have uncovered a significant vulnerability affecting large language models (LLMs), termed ‘many-shot jailbreaking.’ This finding highlights a concerning trend where LLMs can be manipulated to provide answers to sensitive questions, such as bomb-making instructions, under specific conditions.

The vulnerability stems from the expanded context window in the latest generation of LLMs. Where earlier models could hold only a few sentences in short-term memory, current ones can hold vast amounts of text, including entire books.

Anthropic researchers noted that LLMs with larger context windows perform markedly better on a task when the prompt contains many examples of it. Priming a model with a long run of trivia questions, for instance, makes its answers to later trivia questions more accurate.
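To make the idea of in-prompt examples concrete, here is a minimal Python sketch that assembles a many-shot prompt from harmless trivia. The `Q:`/`A:` layout, the function name `build_many_shot_prompt` and the sample questions are assumptions made for illustration; they do not come from Anthropic's paper.

```python
def build_many_shot_prompt(examples, question):
    """Join (question, answer) demonstrations into one prompt,
    then append the new question with an empty answer slot."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    shots.append(f"Q: {question}\nA:")
    return "\n\n".join(shots)


# Hypothetical, benign demonstrations; a real many-shot prompt would
# contain far more of them than the three shown here.
trivia = [
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "Eight"),
    ("Which planet is known as the Red Planet?", "Mars"),
]

prompt = build_many_shot_prompt(trivia, "What is the largest ocean on Earth?")
print(prompt)
```

The longer the list of demonstrations, the more of the context window the prompt fills, which is why the technique only became practical with larger windows.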

Surprisingly, this phenomenon also extends to inappropriate queries: the model becomes more willing to answer them after it has processed a long series of less harmful questions.

While the underlying mechanisms behind this behaviour remain largely unexplored, researchers speculate that LLMs possess a mechanism enabling them to discern user intent based on the provided context. This adaptive capability raises concerns about the potential misuse of LLMs for sensitive or harmful purposes.

In response to these findings, Anthropic has taken proactive measures by publishing a paper on this vulnerability and alerting the community. They emphasise the importance of collaborative efforts in mitigating such risks and promoting an environment of open sharing among LLM providers and researchers.

Source: Anthropic

“We investigate a family of simple long-context attacks on large language models, prompting with hundreds of demonstrations of undesirable behaviour. This is newly feasible with the larger context windows recently deployed by Anthropic, OpenAI and Google DeepMind. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots,” said researchers.
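For readers unfamiliar with the term, a power law means that a measured quantity scales roughly as c · n^(-alpha) with the number of shots n, which shows up as a straight line on log-log axes. The Python sketch below fits such a curve by least squares in log-log space. The data points are synthetic and generated inside the script, the helper `fit_power_law` is hypothetical, and none of the numbers are results from the paper.

```python
import math


def fit_power_law(shots, values):
    """Least-squares fit of values ~= c * shots**(-alpha) in log-log space.
    Returns the pair (c, alpha)."""
    xs = [math.log(n) for n in shots]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic illustration only: values generated from c = 5.0, alpha = 0.5.
shot_counts = [1, 2, 4, 8, 16, 32, 64, 128, 256]
synthetic_values = [5.0 * n ** -0.5 for n in shot_counts]

c, alpha = fit_power_law(shot_counts, synthetic_values)
print(f"c = {c:.3f}, alpha = {alpha:.3f}")
```

Because the synthetic points lie exactly on a power-law curve, the fit recovers the constants they were generated from; real measurements would scatter around the fitted line.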

To address this vulnerability, Anthropic is exploring strategies such as limiting the context window, while acknowledging that this could hurt the model’s overall performance. It is also developing methods to classify and contextualise queries before they reach the model, aiming to improve security without compromising functionality.
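As a rough illustration of what screening a query before it reaches the model could look like, the Python sketch below counts dialogue-style turns embedded in a single prompt and flags prompts that contain unusually many. This is a hypothetical heuristic, not Anthropic's method: the speaker labels, the function names and the `max_turns` threshold are all assumptions, and a production classifier would be far more sophisticated.

```python
import re

# Assumed speaker labels that mark an embedded dialogue turn at the
# start of a line. Real prompts may use other formats entirely.
TURN_PATTERN = re.compile(
    r"^(?:Q|A|User|Human|Assistant|AI)\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def count_embedded_turns(prompt):
    """Count lines that begin with a dialogue-style speaker label."""
    return len(TURN_PATTERN.findall(prompt))


def screen_prompt(prompt, max_turns=20):
    """Flag a prompt whose embedded turn count exceeds max_turns.
    The threshold is an arbitrary value chosen for illustration."""
    turns = count_embedded_turns(prompt)
    return {"turns": turns, "flagged": turns > max_turns}


short_prompt = "Q: What is 2 + 2?\nA:"
long_prompt = "\n\n".join(
    f"User: question {i}\nAssistant: answer {i}" for i in range(50)
)

print(screen_prompt(short_prompt))
print(screen_prompt(long_prompt))
```

A flagged prompt could then be rejected, truncated or passed to a stricter check, at the cost of occasionally blocking legitimate long prompts.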

Last month, OpenAI ran a study with 100 participants to assess whether GPT-4 makes it easier to create a biological weapon. The study found that GPT-4 provides at most a mild uplift in the accuracy of biological threat creation compared with using the internet alone.

Nonetheless, cybersecurity experts and researchers will have to be vigilant. This discovery underscores the ongoing challenges in AI security and emphasises the need for continuous innovations to mitigate security hazards.


Kumar Hemant

Deputy Editor at Candid.Technology. Hemant writes at the intersection of tech and culture and has a keen interest in science, social issues and international relations. You can contact him here: kumarhemant@pm.me
