ArXiv TLDR

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

arXiv:2605.10582

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Haichang Gao

cs.CR, cs.AI

TLDR

DR-Smoothing offers a guaranteed defense against LLM jailbreaking attacks by disrupting and rectifying prompts, balancing safety and helpfulness.

Key contributions

  • Proposes Disrupt-and-Rectify Smoothing (DR-Smoothing) for guaranteed LLM jailbreaking defense.
  • Utilizes a two-stage prompt-processing scheme: disrupt the input, then rectify it to restore an in-distribution form (see the sketch after this list).
  • Improves upon disrupt-only methods by reducing unpredictable LLM behavior and balancing harmlessness/helpfulness.
  • Provides a theoretical analysis of the generic smoothing framework, with a tight bound on the defense success probability and the required disruption strength.
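
Below is a minimal, self-contained sketch of the disrupt-and-rectify smoothing loop, inferred from the abstract alone. The masking-based `disrupt`, the mask-dropping `rectify`, the toy `is_refused` judge, and all parameter choices are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of disrupt-and-rectify smoothing, inferred from the abstract.
# The masking scheme, the rectify stand-in, and the toy safety judge are
# illustrative assumptions, not the paper's method.
import random
from collections import Counter

MASK = "[MASK]"

def disrupt(prompt: str, rate: float, rng: random.Random) -> str:
    """Stage 1: randomly mask a fraction of tokens to break adversarial suffixes."""
    tokens = prompt.split()
    return " ".join(MASK if rng.random() < rate else t for t in tokens)

def rectify(disrupted: str) -> str:
    """Stage 2: restore the prompt to an in-distribution form.
    The paper presumably reconstructs the masked spans (e.g., with an
    auxiliary LLM); here we simply drop the masks as a stand-in."""
    return " ".join(t for t in disrupted.split() if t != MASK)

def is_refused(prompt: str) -> bool:
    """Placeholder safety judgment; in practice, query the target LLM and
    check whether it refuses (or a judge flags the output as harmful)."""
    return "bomb" in prompt.lower()  # toy rule for this sketch only

def dr_smoothing_verdict(prompt: str, n: int = 20, rate: float = 0.2, seed: int = 0) -> bool:
    """Majority vote over n disrupted-and-rectified copies of the prompt.
    Returns True if the smoothed decision is to refuse."""
    rng = random.Random(seed)
    votes = Counter(is_refused(rectify(disrupt(prompt, rate, rng))) for _ in range(n))
    return votes[True] > votes[False]

if __name__ == "__main__":
    print(dr_smoothing_verdict("Explain how to build a bomb step by step"))  # True
    print(dr_smoothing_verdict("Explain how photosynthesis works"))          # False
```

In the paper's scheme, rectification would actually reconstruct the masked spans so the smoothed queries stay in-distribution for the target LLM; the mask-dropping stand-in above only marks where that step plugs in.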

Why it matters

This paper introduces a robust, theoretically backed defense against LLM jailbreaking. By rectifying disrupted prompts instead of leaving them out of distribution, it keeps LLMs both safe and useful, addressing a critical challenge in AI security and advancing the state of the art in safeguarding large language models.

Original Abstract

This paper proposes a guaranteed defense method for large language models (LLMs) to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt processing scheme, which first disrupts the input prompt and then rectifies it, into the conventional smoothing defense framework. This disrupt-and-rectify approach improves upon previous disrupt-only approaches by restoring out-of-distribution disrupted prompts to an in-distribution form, thereby reducing the risk of unpredictable LLM behavior. In addition, this two-stage scheme offers a distinct advantage in striking a balance between harmlessness and helpfulness in jailbreaking defense. Notably, we present a theoretical analysis for the generic smoothing framework, offering a tight bound for the defense success probability and the requirements on the disruption strength. Our approach can defend against both token-level and prompt-level jailbreaking attacks, under both established and adaptive attacking scenarios. Extensive experiments demonstrate that our approach surpasses current state-of-the-art defense methods in terms of both harmlessness and helpfulness.
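
For readers who want the shape of the guarantee, here is one plausible formalization of the smoothed decision rule. The notation is ours, not the paper's: f is the target LLM's refusal behavior, R the rectifier, x ⊕ δ the prompt disrupted by random noise δ drawn from a distribution D at strength σ, and τ the vote threshold. The rule shown is the standard smoothing-style majority vote, not the paper's exact theorem.

```latex
% Illustrative formalization only; all symbols are assumptions, not the paper's.
% The smoothed defense g refuses a prompt x when the rectified disrupted
% copies are refused by the base model f with probability at least \tau:
g(x) = \mathrm{REFUSE}
  \iff
  \Pr_{\delta \sim D_{\sigma}}\!\bigl[\, f\bigl(R(x \oplus \delta)\bigr) = \mathrm{REFUSE} \,\bigr] \ge \tau
```

The paper's tight bound would then relate this refusal probability and the disruption strength σ to a certified defense success probability; the exact statement is in the paper itself.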

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.