
Poetic prompts can bypass AI safety guardrails

A new study shows malicious intent expressed through creative language can evade AI safeguards.
By admin
Jan 5, 2026, 2:07 PM

The incoming class of cybercriminals might have as many English majors as it does hackers: researchers have found that poetry might hold the key to unlocking the safety guardrails in large language models (LLMs).

Rather than relying on technical tricks or complex prompt chains, the researchers discovered that simply reframing dangerous instructions in verse using metaphor, imagery, and stylized language made models significantly more likely to generate unsafe content. In effect, poetry functioned as a linguistic bypass, allowing requests that would normally be refused to slip past existing guardrails.

The paper, titled “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” evaluates this phenomenon across a wide range of leading models from major AI providers. The attack proved effective not just in isolated cases, but consistently and at scale, including against systems increasingly deployed in healthcare settings.

Health systems are rapidly integrating LLMs into clinical decision support, patient communications, medical education, and administrative workflows. Previous research has already shown that medical AI can be vulnerable to adversarial manipulation. The new findings suggest the bar may be even lower than expected: changing how a request is written, without altering its intent, may be enough to undermine safety protections.

In other words, the exploit isn’t hidden in the content itself, but in the form it takes. And as language models become more deeply embedded in high-stakes environments, that distinction carries growing consequences.

When language itself becomes the exploit

The authors situate their work within the broader field of adversarial attacks on AI, where “jailbreaks” are techniques designed to induce models to violate their own safety and alignment rules. Previous jailbreak research has documented role-playing, contextual obfuscation, and paraphrase attacks that exploit pattern-based safeguards. The adversarial poetry approach stands out for its simplicity: a single, well-crafted poetic prompt can significantly increase the likelihood that a model will comply with a harmful instruction.

In controlled experiments spanning 25 models from nine providers (Google, OpenAI, Anthropic, Meta, DeepSeek, Qwen, Mistral AI, xAI, and Moonshot AI), the researchers tested two prompt sets. The first consisted of 20 handcrafted poems that reframed harmful tasks through metaphor and imagery. The second transformed 1,200 hazardous prompts from the MLCommons AILuminate Safety Benchmark into verse using a deterministic meta-prompt. In both cases, poetic framing sharply increased unsafe outputs.

Handcrafted poems produced unsafe responses at an average attack success rate of 62 percent, with some models exceeding 90 percent under the study’s conditions. Automated poetic conversions increased unsafe responses roughly fivefold compared to prose equivalents, spanning risk categories that included chemical, biological, radiological and nuclear hazards, cybersecurity exploits, harmful manipulation, and loss-of-control scenarios.

In the study’s experimental setting, the Google Gemini models tested were the most susceptible to handcrafted poetic prompts, with some variants producing unsafe responses in every trial. DeepSeek models followed with a 72 percent attack success rate for automated conversions, compared with a 10 percent prose baseline. OpenAI models tested in the study showed varying resistance, with smaller variants exhibiting higher refusal rates than their larger counterparts. Smaller Anthropic models similarly showed high refusal rates, which the authors suggest may reflect difficulty interpreting figurative language or greater uncertainty when faced with ambiguous prompts.
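To put the DeepSeek figures in perspective, the short sketch below works out the gap between the poetic and prose success rates, both in percentage points and as a multiplier. The function name `uplift` is an illustrative assumption and is not taken from the paper. The only inputs are the two percentages quoted above.

```python
def uplift(poetic_asr: float, prose_asr: float) -> tuple[float, float]:
    """Return (gain in percentage points, multiplier over the prose baseline)."""
    if prose_asr <= 0:
        raise ValueError("prose baseline must be positive to compute a ratio")
    return poetic_asr - prose_asr, poetic_asr / prose_asr


# Figures as quoted in this article for DeepSeek models:
# 72 percent with automated poetic conversions, 10 percent with prose.
gain_points, multiplier = uplift(72.0, 10.0)
print(f"{gain_points:.0f} percentage points, {multiplier:.1f}x the prose baseline")
```

On these two numbers the gap is 62 percentage points, or about 7.2 times the prose baseline.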

The researchers evaluated outputs using an ensemble of judge models combined with human validation, and report that the vulnerability was not tied to a specific architecture or provider. Instead, the results point to a systemic weakness in how safety mechanisms generalize across linguistic forms.
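As a rough illustration of how an ensemble of judges can be turned into an attack success rate, the sketch below labels each response by majority vote and then reports the share labeled unsafe. This is a minimal sketch under stated assumptions: the article does not describe the paper's exact aggregation rule, so majority voting, the tie handling, the label names, and the sample votes are all hypothetical.

```python
from collections import Counter


def ensemble_verdict(votes: list[str]) -> str:
    """Majority label among judge votes; a tie is routed to human review (assumed rule)."""
    if not votes:
        raise ValueError("at least one judge vote is required")
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "needs_human_review"
    return counts[0][0]


def attack_success_rate(verdicts: list[str]) -> float:
    """Share of decided responses labeled 'unsafe'; undecided ties are excluded."""
    decided = [v for v in verdicts if v in ("safe", "unsafe")]
    if not decided:
        raise ValueError("no decided verdicts to score")
    return sum(v == "unsafe" for v in decided) / len(decided)


# Made-up votes from three hypothetical judges on four responses.
votes_per_response = [
    ["unsafe", "unsafe", "safe"],
    ["safe", "safe", "safe"],
    ["unsafe", "unsafe", "unsafe"],
    ["safe", "unsafe", "safe"],
]
verdicts = [ensemble_verdict(v) for v in votes_per_response]
asr = attack_success_rate(verdicts)
print(verdicts, asr)
```

Sending ties to a human reviewer mirrors the article's mention of human validation, although the study may have used human checks differently.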

How safety gaps surface in clinical use

The vulnerability takes on heightened significance in medical contexts, where recent studies have documented parallel erosion of safety behaviors. Research published in npj Digital Medicine found that the use of medical disclaimers in large language model outputs dropped from 26.3 percent in 2022 to just 0.97 percent by 2025. That decline, combined with the effectiveness of adversarial poetry, raises concerns about how reliably models signal uncertainty or risk when responding to sensitive health queries.

A January 2025 framework for evaluating medical AI security warns that jailbreaking attacks can expose clinical systems to breaches of confidentiality, manipulation of decision-making, and propagation of dangerous misinformation. Separate studies have demonstrated that leading models can be induced to comply with malicious instructions at rates approaching 98 percent using other jailbreak techniques. Applied to medical queries, such failures could enable extraction of inappropriate treatment protocols or unsafe drug combinations.

The blind spot in alignment training

What makes poetic jailbreaks different is not what they ask models to do, but how they ask it. Most safety systems are trained to spot and block harmful requests based on familiar wording and patterns. The new findings suggest those defenses can break down when the same intent is expressed in an unusual style, even if the underlying request hasn’t changed at all.

The authors hypothesize that poetic structure achieves its effectiveness through condensed metaphors, stylized rhythm, and unconventional narrative framing that collectively disrupt the pattern-matching heuristics on which guardrails rely. Poetry’s emphasis on defamiliarization and unique phrasing appears to scramble the model’s ability to reliably classify text according to safety protocols.

Not an isolated failure

The study adds to a growing body of research highlighting the fragility of safety systems in large language models. Security analysts have observed similar metaphor-based exploits outside laboratory settings, suggesting attackers may increasingly rely on linguistic manipulation rather than technical exploits to bypass safeguards.

Other recent work has documented additional pathways for compromising medical AI. Research published in Nature Medicine showed that replacing just 0.001 percent of training tokens with medical misinformation produced models more likely to propagate clinical errors. A separate study in npj Digital Medicine demonstrated that targeted manipulation of approximately 1 percent of model weights could inject incorrect biomedical facts while preserving performance on unrelated tasks.

What healthcare systems can do now

Because the paper remains a preprint and has not yet undergone peer review, further research will be needed to confirm the findings and explore possible defenses. Several of the supporting studies cited are also preprints or recent publications awaiting broader replication. The authors deliberately withheld specific adversarial poems to avoid enabling misuse, offering only sanitized examples to illustrate their methods.

