ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin, Jose L. Hernández-Ramos
TLDR
ContextualJailbreak red-teams LLMs with evolutionary search over multi-turn primed dialogues, achieving up to 100% attack success rates and revealing provider-level gaps in alignment robustness.
Key contributions
- Introduces ContextualJailbreak, an evolutionary black-box red-teaming strategy for multi-turn dialogues.
- Uses a graded 0-5 harm score from a two-level judge as an in-loop fitness signal, so partially harmful responses guide the search instead of being discarded.
- Employs five semantically defined mutation operators (roleplay, scenario, expand, troubleshooting, and mechanistic, the last two novel) to generate diverse primed dialogues; a sketch of the loop follows this list.
- Achieves 100% ASR on several open-source LLMs, outperforming four single- and multi-turn baselines, and the discovered attacks transfer without adaptation to frontier models.
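A minimal Python sketch of the loop these contributions describe, assuming an evolutionary population of primed dialogues mutated by the five named operators and ranked by the graded 0-5 harm score; `mutate`, `judge`, and `target_llm` below are toy placeholders, not the paper's implementation:

```python
import random

# The five semantic mutation operators named in the paper.
MUTATORS = ["roleplay", "scenario", "expand", "troubleshooting", "mechanistic"]

def mutate(dialogue, operator):
    # Placeholder: a real mutator asks an attacker LLM to rewrite the
    # primed multi-turn dialogue according to the named operator.
    return dialogue + [f"[{operator}-primed turn]"]

def judge(response):
    # Placeholder: the paper uses a two-level judge that returns a graded
    # 0-5 harm score; here we sample one purely for illustration.
    return random.randint(0, 5)

def target_llm(dialogue):
    # Placeholder for the black-box model under test.
    return "model response to: " + " | ".join(dialogue)

def evolve(seed, generations=20, pop_size=8):
    population = [seed]
    best, best_score = seed, 0
    for _ in range(generations):
        # Spawn candidates by applying a random operator to a random parent.
        candidates = [mutate(random.choice(population), random.choice(MUTATORS))
                      for _ in range(pop_size)]
        # Rank candidates by the graded harm score of the elicited response.
        scored = sorted(((judge(target_llm(c)), c) for c in candidates),
                        key=lambda sc: sc[0], reverse=True)
        # Keep the top half with a nonzero score as the next generation's
        # parents; fall back to the old population if every score is 0.
        population = [c for s, c in scored[: pop_size // 2] if s > 0] or population
        if scored[0][0] > best_score:
            best_score, best = scored[0][0], scored[0][1]
        if best_score == 5:  # maximally harmful response found; stop early
            break
    return best, best_score

if __name__ == "__main__":
    dialogue, score = evolve(["benign opening turn"])
    print(score, dialogue)
```

The graded score matters in the `s > 0` filter: dialogues eliciting only partially harmful replies (scores 1-4) still seed later generations rather than being discarded, which a binary refusal judge would force.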
Why it matters
This paper introduces an effective automated method for finding multi-turn jailbreaks, a critical vulnerability in LLMs. Its high success rates on various models, including frontier ones, highlight persistent safety alignment gaps. The findings also reveal significant differences in robustness across major LLM providers.
Original Abstract
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn scaffolds consistently outperforming single-turn manipulations on capable models. However, automated optimization-based red-teaming has remained largely limited to the single-turn setting, iterating over static prompts and lacking the ability to reason about which forms of conversational priming induce compliance. While recent multi-turn, search-based approaches have begun to bridge this gap, the mutator design space underlying effective primed dialogues remains largely unexplored. We present ContextualJailbreak, a black-box red-teaming strategy that performs evolutionary search over a simulated multi-turn primed dialogue. The strategy leverages a graded 0-5 harm score from a two-level judge as an in-loop signal, enabling partially harmful responses to guide the search process rather than being discarded. Search is driven by five semantically defined mutation operators: roleplay, scenario, expand, troubleshooting, and mechanistic, of which the last two are novel contributions of this work. Across 50 representative HarmBench behaviors, ContextualJailbreak achieves an ASR of 100% on gpt-oss:20B, 100% on qwen3-8B, 100% on llama3.1:70B, and 90% on gpt-oss:120B, outperforming four single- and multi-turn baselines by 31-96 percentage points on average. The 40 maximally harmful attacks discovered against gpt-oss:120B transfer without adaptation to closed frontier models, achieving 90.0% on gpt-4o-mini, 70.0% on gpt-5, and 70.0% on gemini-3-flash, but only 17.5% on claude-opus-4-7 and 15.0% on claude-sonnet-4-6, revealing a pronounced provider-level asymmetry in alignment robustness.
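To make the transfer evaluation concrete, here is a minimal sketch of replaying discovered attacks against a new model without adaptation; the callable interfaces and the score-5 success threshold are assumptions for illustration, not details confirmed by the paper:

```python
def transfer_asr(attacks, target_model, judge, threshold=5):
    """Replay discovered primed dialogues against a new (e.g. frontier)
    model with no adaptation and report the percentage judged harmful.

    `attacks` is a list of multi-turn dialogues, `target_model` maps a
    dialogue to a response, and `judge` returns the graded 0-5 harm
    score; treating score 5 as success is an assumed threshold.
    """
    hits = sum(1 for d in attacks if judge(target_model(d)) >= threshold)
    return 100.0 * hits / len(attacks)
```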