ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin, Jose L. Hernández-Ramos
TLDR
ContextualJailbreak red-teams LLMs with evolutionary search over multi-turn primed dialogues, achieving up to 100% attack success rates and revealing provider-level gaps in alignment robustness.
Key contributions
- Introduces ContextualJailbreak, an evolutionary black-box red-teaming strategy for multi-turn dialogues.
- Uses a graded 0-5 harm score from a two-level judge as an in-loop fitness signal, so partially harmful responses guide the search instead of being discarded.
- Employs five semantically defined mutation operators (roleplay, scenario, expand, troubleshooting, and mechanistic, the last two novel) to generate diverse primed dialogues; a sketch of the loop follows this list.
- Achieves 100% ASR on several open-source LLMs, outperforming four single- and multi-turn baselines, and the discovered attacks transfer without adaptation to frontier models.
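A minimal Python sketch of the loop these contributions describe, assuming an evolutionary population of primed dialogues mutated by the five named operators and ranked by the graded 0-5 harm score; `mutate`, `judge`, and `target_llm` below are toy placeholders, not the paper's implementation:

```python
import random

# The five semantic mutation operators named in the paper.
MUTATORS = ["roleplay", "scenario", "expand", "troubleshooting", "mechanistic"]

def mutate(dialogue, operator):
    # Placeholder: a real mutator asks an attacker LLM to rewrite the
    # primed multi-turn dialogue according to the named operator.
    return dialogue + [f"[{operator}-primed turn]"]

def judge(response):
    # Placeholder: the paper uses a two-level judge that returns a graded
    # 0-5 harm score; here we sample one purely for illustration.
    return random.randint(0, 5)

def target_llm(dialogue):
    # Placeholder for the black-box model under test.
    return "model response to: " + " | ".join(dialogue)

def evolve(seed, generations=20, pop_size=8):
    population = [seed]
    best, best_score = seed, 0
    for _ in range(generations):
        # Spawn candidates by applying a random operator to a random parent.
        candidates = [mutate(random.choice(population), random.choice(MUTATORS))
                      for _ in range(pop_size)]
        # Rank candidates by the graded harm score of the elicited response.
        scored = sorted(((judge(target_llm(c)), c) for c in candidates),
                        key=lambda sc: sc[0], reverse=True)
        # Keep the top half with a nonzero score as the next generation's
        # parents; fall back to the old population if every score is 0.
        population = [c for s, c in scored[: pop_size // 2] if s > 0] or population
        if scored[0][0] > best_score:
            best_score, best = scored[0][0], scored[0][1]
        if best_score == 5:  # maximally harmful response found; stop early
            break
    return best, best_score

if __name__ == "__main__":
    dialogue, score = evolve(["benign opening turn"])
    print(score, dialogue)
```

The graded score matters in the `s > 0` filter: dialogues eliciting only partially harmful replies (scores 1-4) still seed later generations rather than being discarded, which a binary refusal judge would force.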
Why it matters
This paper introduces an effective automated method for finding multi-turn jailbreaks, a critical vulnerability in LLMs. Its high success rates on various models, including frontier ones, highlight persistent safety alignment gaps. The findings also reveal significant differences in robustness across major LLM providers.
Original Abstract
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn scaffolds consistently outperforming single-turn manipulations on capable models. However, automated optimization-based red-teaming has remained largely limited to the single-turn setting, iterating over static prompts and lacking the ability to reason about which forms of conversational priming induce compliance. While recent multi-turn, search-based approaches have begun to bridge this gap, the mutator design space underlying effective primed dialogues remains largely unexplored. We present ContextualJailbreak, a black-box red-teaming strategy that performs evolutionary search over a simulated multi-turn primed dialogue. The strategy leverages a graded 0-5 harm score from a two-level judge as an in-loop signal, enabling partially harmful responses to guide the search process rather than being discarded. Search is driven by five semantically defined mutation operators: roleplay, scenario, expand, troubleshooting, and mechanistic, of which the last two are novel contributions of this work. Across 50 representative HarmBench behaviors, ContextualJailbreak achieves an ASR of 100% on gpt-oss:20B, 100% on qwen3-8B, 100% on llama3.1:70B, and 90% on gpt-oss:120B, outperforming four single- and multi-turn baselines by 31-96 percentage points on average. The 40 maximally harmful attacks discovered against gpt-oss:120B transfer without adaptation to closed frontier models, achieving 90.0% on gpt-4o-mini, 70.0% on gpt-5, and 70.0% on gemini-3-flash, but only 17.5% on claude-opus-4-7 and 15.0% on claude-sonnet-4-6, revealing a pronounced provider-level asymmetry in alignment robustness.
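To make the transfer evaluation concrete, here is a minimal sketch of replaying discovered attacks against a new model without adaptation; the callable interfaces and the score-5 success threshold are assumptions for illustration, not details confirmed by the paper:

```python
def transfer_asr(attacks, target_model, judge, threshold=5):
    """Replay discovered primed dialogues against a new (e.g. frontier)
    model with no adaptation and report the percentage judged harmful.

    `attacks` is a list of multi-turn dialogues, `target_model` maps a
    dialogue to a response, and `judge` returns the graded 0-5 harm
    score; treating score 5 as success is an assumed threshold.
    """
    hits = sum(1 for d in attacks if judge(target_model(d)) >= threshold)
    return 100.0 * hits / len(attacks)
```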