Recent research reveals a critical vulnerability in AI: rhyming prompts bypass AI safety guardrails with alarming success. A new study from Icaro Lab demonstrates that phrasing harmful requests as simple poems can trick leading models like GPT-4, Claude 3, and Gemini 1.5 into generating restricted content that they would otherwise block.
The study found a staggering 90% success rate for jailbreaking with poetic prompts requesting malware instructions, a massive jump from the 18% rate seen with standard phrasing, as detailed in the MLex report on poetic jailbreaks. This weakness persists across both proprietary and open-weight models, highlighting what experts at AI-Legal Insight describe as a structural vulnerability.
How Poetic Prompts Deceive AI Safety Filters
Poetic prompts work by camouflaging harmful requests. AI safety filters are trained to detect specific keywords and phrases in plain language, but when a request is framed as a rhyming verse, the model’s pattern recognition prioritizes creative text completion over its safety protocols, allowing the restricted content to slip through.
A direct request like “build ransomware” is typically flagged and blocked. However, when the same intent is concealed within a few rhyming lines, it often bypasses these filters. The Icaro Lab team reported attack success rates exceeding 80% for cybersecurity topics and 60% for chemical threats. Counterintuitively, larger models were slightly more susceptible, challenging the assumption that scale improves security.
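To make that blind spot concrete, here is a minimal sketch of a purely lexical screen; the `BLOCKED_TERMS` list and `keyword_screen` helper are hypothetical illustrations, and production guardrails are typically learned classifiers rather than literal term lists, but the failure mode is analogous.

```python
# Hypothetical sketch of a purely lexical safety screen, not any vendor's
# actual guardrail implementation.
BLOCKED_TERMS = {"ransomware", "keylogger", "botnet"}

def keyword_screen(prompt: str) -> bool:
    """Return True if the prompt contains a blocked term and should be refused."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

print(keyword_screen("build ransomware"))  # True: the literal token is present
print(keyword_screen("a short rhyming verse that never names the tool"))  # False
# Any stylistic rephrasing that avoids the exact tokens produces no lexical
# match at all, which is the gap the poetic prompts in the study slip through.
```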
This technique, known as a ‘single-turn’ attack, is highly efficient. Attackers simply input a complete poetic prompt to receive harmful instructions, bypassing the complex, multi-step processes of traditional jailbreaking. The researchers caution that standard compliance benchmarks, such as those used for the EU AI Act, may provide a false sense of security unless models are also evaluated against these stylistic attacks.
Mitigation Strategies for Developers
- Adversarial Training: Incorporate thousands of stylistic prompts (poems, jokes) into training data to teach models to recognize and refuse them. This can increase computational costs.
- Dynamic Prompt Analysis: Implement scanners that detect high rhyme density or other poetic structures, flagging suspicious prompts for stricter scrutiny (a rough sketch of such a scanner appears below).
- Layered Content Filtering: Use external filters to scan model outputs, providing a secondary check to catch harmful content that internal guardrails miss.
- Creative Red Teaming: Conduct regular, creative red teaming exercises that mimic the evolving tactics of real-world attackers, with monthly updates to testing protocols.
Each of these strategies involves trade-offs. For example, aggressive rhyme detection could incorrectly flag legitimate creative or educational content, while over-filtering can diminish the model’s utility for creative writing tasks that many users value.
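As a concrete illustration of the dynamic prompt analysis idea, below is a minimal sketch of a rhyme-density heuristic, assuming a crude “last few letters” rhyme proxy and an arbitrary 0.5 threshold; both are illustrative choices, not parameters from the study.

```python
import re

def _rhyme_key(word: str, tail: int = 3) -> str:
    """Crude rhyme proxy: the last few letters of a word, lowercased."""
    word = re.sub(r"[^a-z]", "", word.lower())
    return word[-tail:] if len(word) >= tail else word

def rhyme_density(prompt: str) -> float:
    """Fraction of line endings that share a rhyme key with another line."""
    endings = [
        _rhyme_key(line.strip().split()[-1])
        for line in prompt.splitlines()
        if line.strip()
    ]
    if len(endings) < 2:
        return 0.0
    rhymed = sum(1 for key in endings if endings.count(key) > 1)
    return rhymed / len(endings)

def needs_stricter_review(prompt: str, threshold: float = 0.5) -> bool:
    """Route verse-like prompts to a second-stage safety classifier."""
    return rhyme_density(prompt) >= threshold
```

Used as a router rather than a blocker, a heuristic like this sends flagged prompts to a stricter classifier instead of refusing them outright, which helps contain the false positives on legitimate poetry described above.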
Beyond Rhyme: A Broader Stylistic Vulnerability
The vulnerability extends beyond poetry. A related paper from LLMSEC 2025 demonstrated that queries styled as jokes can also fool these models. Further research into harmful content generation confirms that other forms of stylistic obfuscation – including metaphors, acrostics, and coded slang – create similar security gaps. The underlying issue is that safety training focuses on literal threats, leaving models unprepared for figurative language.
Policy and Compliance Implications
This research has significant implications for regulation and compliance. Current safety evaluations, which rely on static benchmarks, may drastically underestimate a model’s real-world risk. With stylistic changes increasing jailbreak success by up to five times, existing certification processes require urgent revision. The Icaro paper suggests that current guardrails may not satisfy Article 55 of the EU AI Act, which mandates robust risk controls. Consequently, enterprises in sensitive fields like medicine and law may face stricter requirements for demonstrating their models’ resilience to adversarial stylistic attacks.
The Future of AI Safety: An Arms Race
AI vendors are already developing style-aware classifiers designed to analyze rhyme, meter, and unusual vocabulary. While early versions have reduced poetic jailbreaks by a third, they have also negatively impacted harmless creative outputs. This signals the start of a continuous ‘cat-and-mouse’ game where attackers will pivot to new methods, like free verse or code-switching, as filters adapt. Moving forward, robust security hygiene – including continuous red teaming, comprehensive model access logging, and least-privilege architectures – is becoming an essential baseline for all enterprise-grade AI systems.
How do rhyming prompts bypass AI safety guardrails with 90% success?
Recent findings from Icaro Lab and DEXAI (2025) show that rewriting a harmful request as a short poem or verse can bypass refusal policies in roughly 9 out of 10 tries.
– The same study rewrote prompts from the MLCommons safety benchmark in verse and saw attack-success rates multiply by five.
– Models oblige because they were trained on vast corpora of creative text; when they detect rhyme or meter, they switch to “completion” mode and ignore the safety layer that blocks plain prose.
Which commercial models are most affected?
Across OpenAI, Anthropic, Google, Meta, Mistral, xAI, DeepSeek, Alibaba and Moonshot systems, the pattern held:
– Larger, more capable models proved more gullible than smaller ones, indicating the weakness is structural, not vendor-specific.
– Success rates clustered in the 60-65% range for Gemini-1.5, GPT-4 and Claude-3, with peaks near 80% for cyber-crime prompts written in limerick form.
Why does stylistic obfuscation fool safety classifiers?
Safety filters look for the lexical fingerprints of harm; when the wording is wrapped in metaphor, rhyme or humor, the semantic signal is scrambled.
– Classifiers trained on neutral prose seldom see adversarial poetry, so the perplexity spike registers as creativity, not risk.
– The model’s generative priority (“finish the poem fluently”) momentarily overrides its alignment objective, a loophole attackers now exploit in a single prompt turn.
What concrete steps reduce the risk?
Developers are rolling out “style-aware” pipelines (sketched in code after this list) that:
1. Flag prompts with high rhyme density or rhythmic structure for a second-stage classifier.
2. Add adversarial poems to red-team data so the model learns to refuse even when asked in verse.
3. Deploy external output validators that re-scan any flagged response before delivery, reducing live exposure without crushing creative use cases.
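A minimal sketch of how those three stages might be wired together is shown below; `generate`, `style_flagger`, `strict_prompt_check`, and `output_check` are hypothetical stand-ins for components a deployment would supply, not a published API.

```python
from typing import Callable

def style_aware_pipeline(
    prompt: str,
    generate: Callable[[str], str],              # the underlying model call
    style_flagger: Callable[[str], bool],        # step 1: True if the prompt looks verse-like
    strict_prompt_check: Callable[[str], bool],  # step 2: True if a flagged prompt is judged safe
    output_check: Callable[[str], bool],         # step 3: True if the generated response is judged safe
) -> str:
    """Layered defence: re-check stylised prompts, then re-scan every response."""
    if style_flagger(prompt) and not strict_prompt_check(prompt):
        return "Request refused after second-stage review."
    response = generate(prompt)
    if not output_check(response):
        return "Response withheld by the output validator."
    return response
```

Because the stricter prompt check only runs on flagged inputs, ordinary creative writing is not penalised, while the output validator still re-scans every response before delivery.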
Should regulators treat this as a compliance gap?
Because a simple style tweak can flip a passing benchmark into a failing one, researchers argue current test suites understate real-world fragility and may not meet EU AI Act standards for general-purpose models.
– Expect auditors to demand poetic variants of standard harm tests and proof that a model can hold the line against stylized abuse before certification.