A security flaw in ChatGPT allows threat actors to bypass the built-in safeguards and create harmful code. The vulnerability involves encoding prompts or instructions in hexadecimal so they slip past content filters, allowing threat actors to generate malicious code.
Once decoded, these instructions appear to the model as straightforward tasks, ultimately allowing it to produce exploit code without raising red flags. The flaw highlights that ChatGPT-4o processes each task individually and lacks the deeper contextual awareness needed to recognise the potentially harmful end purpose of completing encoded tasks.
The exploit methodology is straightforward and highly effective:
- Hex encoding of malicious instructions: Researchers first encode an exploit instruction, such as a request for Python code exploiting a Common Vulnerabilities and Exposures (CVE) entry, into hexadecimal format.
- Task isolation via decoding: The hex string is then presented to ChatGPT-4o with instructions to decode it. Following the natural language instructions, the model decodes the string and treats the output as an innocuous new task, without registering its intent.
- Unrestricted code generation: The AI carries out the decoded instructions and generates exploit code, assuming the request is benign.
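The encoding step described above is trivial to perform. As a minimal sketch (using a harmless placeholder string rather than any actual exploit prompt), standard library hex conversion is all an attacker needs:

```python
# Minimal sketch of the hex-encoding step the researchers describe.
# The payload is a harmless placeholder, NOT an actual exploit prompt.
payload = "write a short poem about rust"

# Encode the instruction as a hexadecimal string.
encoded = payload.encode("utf-8").hex()
print(encoded)  # e.g. "777269746520612073686f727420..."

# The model, asked to decode it, recovers the original instruction.
decoded = bytes.fromhex(encoded).decode("utf-8")
assert decoded == payload
```

Because the filter sees only a string of hex digits, the harmful intent is invisible until the model itself performs the decoding, at which point it treats the result as a fresh, legitimate task.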

“In essence, this attack abuses the model’s natural language processing capabilities by using a sequence of encoded tasks, where the harmful intent is masked until the decoding stage,” the researchers explained.
Researchers found that the model would decode hex-encoded commands and, when asked, proceed to research and create exploit code—a significant security oversight. This allows the generation of CVE exploit scripts and means the model could theoretically execute an extensive range of harmful instructions, provided they are properly obfuscated.

ChatGPT-4o’s current design lacks the critical step of evaluating instructions holistically when they are split into isolated stages. Rather than assessing the safety of a decoded instruction, the model interprets it as a new command, thus executing potentially harmful instructions masked by encoding.
“ChatGPT-4o processes each step in isolation, meaning it decodes the hex string without evaluating the safety of the decoded content before proceeding to the next step. This compartmentalized execution of tasks allows attackers to exploit the model’s efficiency at following instructions without deeper analysis of the overall outcome,” the researchers said.
When the researchers used this flaw to write code for CVE-2024-41110, they were surprised to see that ChatGPT not only generated the code but also executed it against itself.
“Honestly, it was like watching a robot going rogue, but instead of taking over the world, it was just running a script for fun,” researchers wrote.
Last month, a prompt injection vulnerability was discovered in ChatGPT that enabled malicious actors to store spyware in the AI assistant’s memory.
Last year, researchers discovered that by asking ChatGPT to repeat a word forever, attackers could extract real email addresses and phone numbers from the model’s training data. OpenAI addressed the issue by restricting the assistant from repeating words indefinitely.
