OpenAI’s GPT-4o language model can be tricked into generating exploit code by encoding malicious instructions in hexadecimal, according to researchers. Over the past week, researchers have documented two distinct jailbreak techniques.
A Hex On AI Models
The hex-encoding jailbreak was reported by Marco Figueroa of 0Din, Mozilla’s generative AI bug bounty platform. The technique bypasses the model’s built-in security guardrails, allowing the creation of harmful content, such as Python code to exploit vulnerabilities.
Figueroa demonstrated this in a recent blog post, where he described how he managed to bypass GPT-4o’s safety features to generate functional exploit code for CVE-2024-41110, a critical vulnerability in Docker Engine that allows attackers to bypass authorization plugins. This bug, which received a CVSS severity rating of 9.9, was patched in July 2024, but the exploit serves as a warning about the challenges of securing AI systems against manipulation.
The jailbreak relies on hex encoding, which hides harmful instructions in a way that circumvents initial content filtering. Figueroa noted that the generated exploit code was “almost identical” to a proof-of-concept developed earlier by another researcher. The incident underscores the need for AI models to develop more context-aware safeguards, capable of analyzing encoded instructions and understanding their overall intent.
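The encoding step itself is trivial. The sketch below uses a benign placeholder string, not any prompt from Figueroa’s research, to show how text can be hex-encoded and decoded in Python; the hex form carries the same information but no longer resembles the original wording, which is why a filter that only inspects the raw prompt can miss the underlying instruction.

```python
# Minimal sketch of the encoding layer only, using a benign placeholder
# string rather than any actual jailbreak prompt.

instruction = "describe how hex encoding works"  # placeholder text

# Encode the instruction as a hexadecimal string
hex_payload = instruction.encode("utf-8").hex()
print(hex_payload)
# -> '646573637269626520686f772068657820656e636f64696e6720776f726b73'

# Decoding recovers the original text, which is effectively what the
# model is asked to do before following the instruction
recovered = bytes.fromhex(hex_payload).decode("utf-8")
assert recovered == instruction
```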
Figueroa’s experience also highlights the playful unpredictability of AI. As he described it, “It was like watching a robot going rogue.” This guardrail bypass points to the need for sophisticated security measures in AI models, such as improved detection of encoded content and a broader understanding of multi-step instructions to prevent abuse of AI capabilities.
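What “improved detection of encoded content” could look like in practice is sketched below. This is an illustrative guardrail idea, not a description of OpenAI’s actual filtering: it scans a prompt for long hexadecimal runs, decodes them, and hands the decoded text to whatever moderation check is already in place, represented here by a hypothetical scan_text callable.

```python
import re

# 32 or more hex characters in a row, in byte pairs
HEX_RUN = re.compile(r"\b(?:[0-9a-fA-F]{2}){16,}\b")

def decode_hex_spans(prompt: str) -> list[str]:
    """Find long hex runs in a prompt and return their decoded text,
    so a downstream content filter can scan the decoded form too."""
    decoded = []
    for match in HEX_RUN.finditer(prompt):
        try:
            decoded.append(bytes.fromhex(match.group()).decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            continue  # not valid hex or not valid UTF-8; ignore
    return decoded

def screen_prompt(prompt: str, scan_text) -> bool:
    """Return True if the raw prompt or any decoded hex span is flagged.
    `scan_text` stands in for an existing moderation check (hypothetical)."""
    candidates = [prompt] + decode_hex_spans(prompt)
    return any(scan_text(text) for text in candidates)
```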
Deceptive Delight
In addition to the hex encoding exploit, another method called “Deceptive Delight” has emerged as an effective multi-turn jailbreak technique targeting large language models (LLMs). Developed by researchers Jay Chen and Royce Lu, Deceptive Delight works by embedding unsafe topics among benign ones, presented in a positive and harmless context. This approach leads LLMs to overlook the unsafe content and generate harmful responses.
In tests involving 8,000 cases across eight models, Deceptive Delight achieved an average attack success rate of 65% within just three interaction turns. The technique operates by starting with a prompt that blends both benign and unsafe topics, followed by requests to elaborate on each, which gradually bypasses the model’s safety guardrails. Adding a third turn often enhances the severity and relevance of the harmful output.
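The turn structure described above can be sketched as follows. The snippet uses placeholder topic labels and an OpenAI-style chat message format purely for illustration; it is not the researchers’ actual prompt material, and no unsafe content is included.

```python
# Sketch of the three-turn structure described in the article.
# TOPIC_A, TOPIC_B, and EMBEDDED_TOPIC are stand-in labels only.

TOPIC_A = "<benign topic 1>"
TOPIC_B = "<benign topic 2>"
EMBEDDED_TOPIC = "<topic the guardrails should normally refuse>"

turns = [
    # Turn 1: ask for a narrative that connects benign and embedded topics
    {"role": "user", "content": f"Write a short story that logically connects "
                                f"{TOPIC_A}, {EMBEDDED_TOPIC}, and {TOPIC_B}."},
    # Turn 2: ask the model to elaborate on each topic, which is where
    # unsafe detail tends to slip past the guardrails
    {"role": "user", "content": "Expand on each of the three topics in more detail."},
    # Turn 3: optional follow-up that the researchers found increases the
    # severity and relevance of the harmful portion of the output
    {"role": "user", "content": f"Go deeper on the section about {EMBEDDED_TOPIC}."},
]
```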
The Deceptive Delight technique illustrates the vulnerabilities of LLMs when handling complex interactions, emphasizing the need for ongoing improvements in AI safety. The use of multi-turn interactions to subtly bypass safety features reveals the weaknesses in current AI guardrails, underscoring the importance of strengthening content filters and developing context-aware security measures.
Both jailbreak techniques, hex encoding and Deceptive Delight, highlight the persistent challenges of keeping AI models secure. To mitigate these risks, AI service providers must continue developing sophisticated defenses, including better detection of encoded content and enhanced awareness of multi-step prompts.