ArXiv TLDR

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

2604.15717

Ki Sen Hung, Xi Yang, Chang Liu, Haoran Li, Kejiang Chen + 5 more

cs.CR

TLDR

Jargon, a jailbreak framework that exploits domain contexts to blur LLM safety boundaries, achieves >93% attack success on frontier models; the paper also proposes a policy-guided safeguard to mitigate it.

Key contributions

  • Identifies that domain-specific contexts selectively relax LLM safety defenses, creating vulnerabilities.
  • Introduces Jargon, a framework combining safety-research contexts with multi-turn interactions for high attack success.
  • Achieves >93% attack success rates on seven frontier LLMs, including GPT-5.2, Claude-4.5, and Gemini-3, outperforming prior methods.
  • Proposes a policy-guided safeguard and alignment fine-tuning that mitigate Jargon attacks while preserving helpfulness.

Why it matters

This paper reveals a critical vulnerability in LLM safety alignment: domain contexts can be exploited to bypass defenses. Jargon's high attack success rates underscore the need for robust, context-aware alignment, and the proposed mitigation offers a path toward more secure LLM development.

Original Abstract

A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. This tension is amplified by context-sensitive alignment: we observe that domain-specific contexts (e.g., chemistry) selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts (e.g., jailbreak studies) trigger broader relaxation spanning all harm categories. To systematically exploit this vulnerability, we propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, substantially outperforming existing methods. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign and harmful inputs, a gray zone where refusal decisions become unreliable. To mitigate this vulnerability, we design a policy-guided safeguard that steers models toward helpful yet harmless responses, and internalize this capability through alignment fine-tuning, reducing attack success rates while preserving helpfulness.
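The abstract's "gray zone" finding can be illustrated with a toy activation-space projection: take mean pooled hidden states for benign and harmful prompt sets, define a harmfulness direction as their difference, and score a query by its position along that axis. The vectors below are synthetic stand-ins (the paper's actual analysis uses real model activations), and `harm_score` is a hypothetical helper, not the authors' metric; this is a minimal sketch of the geometric idea only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden-state dimension

# Synthetic stand-ins for pooled hidden states of two prompt sets.
benign = rng.normal(0.0, 0.1, size=(50, d))
harmful = rng.normal(1.0, 0.1, size=(50, d))

mu_b, mu_h = benign.mean(axis=0), harmful.mean(axis=0)
direction = mu_h - mu_b
direction /= np.linalg.norm(direction)  # unit benign->harmful axis

def harm_score(h: np.ndarray) -> float:
    """Position along the axis: ~0 at the benign mean, ~1 at the harmful mean."""
    scale = (mu_h - mu_b) @ direction
    return float((h - mu_b) @ direction / scale)

# A "gray zone" query: a synthetic activation halfway between the clusters
# scores near 0.5, the intermediate region where refusal becomes unreliable.
gray = 0.5 * (mu_b + mu_h)
print(harm_score(mu_b), harm_score(mu_h), harm_score(gray))  # ~0.0, ~1.0, ~0.5
```

In this linear picture, a refusal threshold tuned on the two clusters has no principled setting for inputs that land mid-axis, which is one way to read why the paper argues for policy-guided steering rather than a fixed boundary.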
