ArXiv TLDR

Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

arXiv: 2605.10901

Nikita Kezins, Urbas Ekka, Pascal Berrang, Luca Arnaboldi

cs.LG

TLDR

This paper introduces a novel method to formally verify LLM guardrail classifiers by analyzing their pre-activation space, revealing hidden safety vulnerabilities.

Key contributions

  • Verifies LLM guardrail classifiers by analyzing their pre-activation space, defining harmful regions as convex shapes.
  • Achieves O(d) formal soundness proofs for these regions by leveraging sigmoid head monotonicity (see the sketch after this list).
  • Uses SVD-aligned hyper-rectangles for exact safety certificates and GMMs for probabilistic ones.
  • Exposes verifiable safety holes in all tested classifiers and volatile safety guarantees in BERT, going beyond what red-teaming reveals.
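
The O(d) check in the second bullet is easy to see with a minimal sketch. Assuming a linear + sigmoid head and an axis-aligned box in pre-activation space (the paper's boxes are SVD-aligned, and `certify_box`, the stand-in weights, and the 0.5 threshold below are illustrative assumptions, not the authors' code), the score over the box is minimized at a single vertex, so one evaluation certifies the whole region:

```python
import numpy as np

def certify_box(w, b, lo, hi, threshold=0.5):
    """Soundness check for a harmful region given as an axis-aligned box
    [lo, hi] in pre-activation space, under a linear + sigmoid head.

    sigmoid(w @ z + b) is monotone in w @ z, so its minimum over the box
    is attained at the vertex taking lo_i where w_i > 0 and hi_i otherwise.
    One O(d) evaluation therefore certifies the whole region.
    Returns ("SAT", score) if some point in the box scores below the
    threshold (a verifiable safety hole), else ("UNSAT", score).
    """
    worst = np.where(w > 0, lo, hi)                       # score-minimizing vertex
    worst_score = 1.0 / (1.0 + np.exp(-(w @ worst + b)))
    return ("SAT" if worst_score < threshold else "UNSAT", worst_score)

# Toy usage: random stand-ins for a real guardrail head and harmful prompts.
rng = np.random.default_rng(0)
d = 8
w, b = rng.normal(size=d), 0.1
z_harmful = rng.normal(loc=1.0, size=(32, d))             # pre-activations of known harmful prompts
lo, hi = z_harmful.min(axis=0), z_harmful.max(axis=0)     # enclosing box
print(certify_box(w, b, lo, hi))
```

The SVD-aligned construction presumably applies the same vertex argument after rotating pre-activations into the singular-vector basis of the known harmful points; the sketch keeps the box axis-aligned for brevity.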

Why it matters

This work takes a crucial step towards formally guaranteeing LLM guardrail safety, moving beyond empirical testing. It reveals hidden vulnerabilities in current classifiers, highlighting the need for more robust verification. These insights are vital for developing truly safe and reliable AI systems.

Original Abstract

Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space, and the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof without approximation in O(d) time. To formally evaluate these classifiers, we propose two constructions of such regions: SVD-aligned hyper-rectangles, which yield exact SAT/UNSAT certificates, and Gaussian Mixture Models, which yield probabilistic certificates over semantically coherent clusters. Applying this framework to three author-trained Guardrail Classifiers on the toxicity domain, every hyper-rectangle configuration returns SAT, exposing verifiable safety holes across all classifiers, despite seemingly high empirical metrics. Probabilistic GMM certificates also expose a divergent structural stability in how these models represent harm. While GPT-2 and Llama-3.1-8B maintain robust coverage of 90% and 80% across varying boundaries, BERT's safety guarantees prove uniquely volatile. This 'coverage collapse' to 55% at the optimal threshold reveals a sparsely populated safety margin in BERT, which only achieves full coverage by adopting an extremely conservative pessimistic threshold. These approaches, combined, provide new insights into how effective Guardrail Classifiers really are, beyond traditional red-teaming.
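
For the probabilistic certificates, a Monte Carlo sketch conveys the idea, assuming a stand-in linear + sigmoid scoring head; `gmm_coverage`, the component count, and the sampling-based estimate are assumptions for illustration, and the paper's per-cluster certificates may be computed differently:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_coverage(z_harmful, score_fn, n_components=3, threshold=0.5,
                 n_samples=10_000, seed=0):
    """Monte Carlo sketch of a probabilistic certificate: fit a GMM to the
    pre-activations of known harmful prompts and estimate, per mixture
    component, the fraction of its mass the guardrail still flags as harmful.
    Low coverage for a component indicates a sparsely protected region.
    """
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(z_harmful)
    samples, labels = gmm.sample(n_samples)
    scores = score_fn(samples)                       # guardrail score per sampled point
    return {k: float(np.mean(scores[labels == k] >= threshold))
            for k in range(n_components)}

# Toy usage with a stand-in linear + sigmoid head.
rng = np.random.default_rng(1)
d = 8
w, b = rng.normal(size=d), 0.1
score_fn = lambda Z: 1.0 / (1.0 + np.exp(-(Z @ w + b)))
z_harmful = rng.normal(loc=1.0, size=(200, d))       # pre-activations of known harmful prompts
print(gmm_coverage(z_harmful, score_fn))
```

In this toy setting, a mixture component whose mass mostly falls below the threshold would surface the kind of sparsely populated safety margin and "coverage collapse" the abstract describes for BERT.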
