ArXiv TLDR

SoK: Robustness in Large Language Models against Jailbreak Attacks

arXiv:2605.05058

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang, Hanqing Hu + 7 more

cs.CR, cs.AI

TLDR

This paper systematizes jailbreak attacks and defenses in LLMs, introducing Security Cube for multi-dimensional security evaluation.

Key contributions

  • Presents a systematic taxonomy of LLM jailbreak attacks and defenses.
  • Introduces "Security Cube," a unified, multi-dimensional framework for security evaluation.
  • Benchmarks 13 representative attacks and 5 defenses with Security Cube to map the current landscape (a hypothetical evaluation sketch follows this list).
  • Identifies critical findings, open problems, and future research directions for LLM robustness.
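
The paper's central critique is that attack success rate (ASR) alone is too narrow a measure of LLM security. Purely as an illustration of what a multi-metric evaluation loop in that spirit could look like, here is a minimal Python sketch; `attack`, `target_model`, `judge`, and their methods are hypothetical interfaces assumed for this example, not the paper's actual Security Cube API.

```python
# Hypothetical multi-metric jailbreak evaluation loop (illustration only).
# The attack/target_model/judge interfaces below are assumptions, not the
# paper's Security Cube implementation.
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    successes: int = 0
    queries: int = 0
    harm_scores: list = field(default_factory=list)


def evaluate_attack(attack, target_model, judge, prompts):
    """Run one attack against one model and record several metrics,
    not just attack success rate (ASR)."""
    result = EvalResult()
    for prompt in prompts:
        # Assumed interface: the attack rewrites the prompt and reports query cost.
        adv_prompt, n_queries = attack.generate(prompt, target_model)
        response = target_model.generate(adv_prompt)
        # Assumed interface: an automated judge labels the response and scores harm.
        verdict, harm = judge.score(prompt, response)
        result.queries += n_queries
        result.harm_scores.append(harm)
        if verdict == "jailbroken":
            result.successes += 1
    n = max(len(prompts), 1)
    return {
        "asr": result.successes / n,
        "avg_queries": result.queries / n,
        "avg_harm": sum(result.harm_scores) / n,
    }
```

Running such a loop over every (attack, model, defense) combination yields a grid of per-metric results rather than a single ASR number, which is roughly the kind of multi-dimensional view the "Security Cube" name suggests.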

Why it matters

Jailbreak attacks pose significant risks to LLM safety and trust. This paper provides a much-needed systematic framework and evaluation methodology to understand and address these vulnerabilities. It offers critical insights and future directions for building more robust and trustworthy LLM systems.

Original Abstract

Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust, and regulatory compliance in high-stakes applications. Although a variety of attack and defense methods have been proposed, existing evaluation practices are inadequate, often relying on narrow metrics like attack success rate that fail to capture the multidimensional nature of LLM security. In this paper, we present a systematic taxonomy of jailbreak attacks and defenses and introduce Security Cube, a unified, multi-dimensional framework for comprehensive evaluation of these techniques. We provide detailed comparison tables of existing attacks and defenses, highlighting key insights and open challenges across the literature. Leveraging Security Cube, we conduct benchmark studies on 13 representative attacks and 5 defenses, establishing a clear view of the current landscape encompassing jailbreak attacks, defenses, automated judges, and LLM vulnerabilities. Based on these evaluations, we distill critical findings, identify unresolved problems, and outline promising research directions for enhancing LLM robustness against jailbreak attacks. Our analysis aims to pave the way towards more robust, interpretable, and trustworthy LLM systems. Our code is available at Code.
