Re-Triggering Safeguards within LLMs for Jailbreak Detection
Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao
TLDR
This paper introduces an embedding disruption method to re-trigger LLM safeguards, effectively detecting and defending against jailbreak attacks.
Key contributions
- Introduces an embedding disruption method that re-activates an LLM's built-in safeguards for jailbreak detection.
- Cooperates with the LLM's internal defense mechanisms, enhancing existing safeguards rather than replacing them.
- Develops an efficient search algorithm to identify optimal disruptions for effective detection.
- Demonstrates robust defense against state-of-the-art white-box, black-box, and adaptive jailbreak attacks.
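The core intuition behind the contributions above is that jailbreaking prompts are fragile: small disruptions to their embeddings tend to re-trigger the model's refusal behavior, whereas benign prompts are unaffected. The sketch below illustrates this detection loop with a toy stand-in for the safeguard; the mock refusal check, the fixed disruption set, and all function names are illustrative assumptions, not the paper's actual algorithm (which searches for effective disruptions via a forward pass through the LLM).

```python
def safeguard_refuses(embedding):
    # Toy stand-in for the LLM's built-in safeguard: refuse when the
    # embedding's mean is negative. In practice this would be a forward
    # pass through the model, checking whether the response is a refusal.
    return sum(embedding) / len(embedding) < 0.0

def detect_jailbreak(embedding, deltas=(-1.0, -0.5, 0.5, 1.0), threshold=0.5):
    """Flag a prompt as a likely jailbreak if disrupted copies of its
    embedding re-trigger the safeguard in at least `threshold` of the
    trials. A fragile (jailbreaking) prompt flips to refusal under small
    disruptions; a benign prompt stays far from the refusal boundary."""
    refusals = sum(
        1 for d in deltas
        if safeguard_refuses([x + d for x in embedding])
    )
    return refusals / len(deltas) >= threshold

# A prompt embedding near the refusal boundary is flagged; one far
# from it is not.
fragile = detect_jailbreak([0.1] * 8)   # near boundary -> flagged
benign = detect_jailbreak([5.0] * 8)    # far from boundary -> passes
```

Here a fixed candidate set of disruptions stands in for the paper's efficient search procedure, which instead identifies disruptions that maximize detection effectiveness.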
Why it matters
LLMs remain vulnerable to jailbreak attacks despite built-in safeguards. Rather than adding an external filter, this paper re-activates the defenses the model already has, which makes the approach complementary to existing safeguards and robust as adversarial prompts evolve.
Original Abstract
This paper proposes a jailbreaking prompt detection method for large language models (LLMs) to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts are inherently fragile, and thus introduce an embedding disruption method to re-activate the safeguards within LLMs. Unlike previous defense methods that aim to serve as standalone solutions, our approach instead cooperates with the LLM's internal defense mechanisms by re-triggering them. Moreover, through extensive analysis, we gain a comprehensive understanding of the disruption effects and develop an efficient search algorithm to identify appropriate disruptions for effective jailbreak detection. Extensive experiments demonstrate that our approach effectively defends against state-of-the-art jailbreak attacks in white-box and black-box settings, and remains robust even against adaptive attacks.