Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks
Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun, Hao Zhou, et al.
TLDR
Safety Anchor introduces Safety Bottleneck Regularization (SBR), which defends LLMs against harmful fine-tuning (HFT) by anchoring the final hidden states of harmful queries at the unembedding layer, which acts as a geometric bottleneck.
Key contributions
- Identifies that existing LLM safety defenses are circumvented under persistent HFT because of redundancy in the high-dimensional parameter space: attackers can optimize along directions orthogonal to the defense constraints.
- Proposes Safety Bottleneck Regularization (SBR), a novel defense focusing on the unembedding layer.
- SBR anchors the final hidden states of harmful queries to those of the safety-aligned model, preserving safe responses even under persistent HFT.
- Achieves a Harmful Score below 10 with just one safety anchor, preserving benign task performance.
Why it matters
This paper addresses a critical vulnerability: existing LLM safety defenses fail under persistent harmful fine-tuning. By shifting the defense from the redundant parameter space to the unembedding-layer bottleneck, SBR maintains model safety where prior constraints are circumvented. This is crucial for deploying trustworthy and secure large language models.
Original Abstract
The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR's effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to $<$10 while preserving competitive performance on benign downstream tasks.
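The abstract's core idea, anchoring the final hidden states of harmful queries to those of the frozen safety-aligned model, can be sketched as a simple regularized objective. This is a hedged illustration, not the authors' implementation: the function names, the squared-L2 anchoring distance, and the `lambda_anchor` weight are assumptions for exposition.

```python
# Illustrative sketch of a Safety-Bottleneck-style anchoring penalty.
# Assumption: the anchor term is a squared-L2 distance between the
# fine-tuned model's final hidden state on a harmful query and the
# frozen safety-aligned model's state at the same position (i.e., the
# representation entering the unembedding projection).

def squared_l2(a, b):
    """Squared Euclidean distance between two hidden-state vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sbr_objective(task_loss, h_harmful, h_anchor, lambda_anchor=1.0):
    """Total loss = benign task loss + anchoring penalty that pulls the
    harmful-query hidden state back toward the safety-aligned anchor."""
    return task_loss + lambda_anchor * squared_l2(h_harmful, h_anchor)

# Toy usage: while the hidden state stays at the anchor, only the task
# loss remains; any drift away from the anchor adds a growing penalty.
h_anchor = [0.2, -0.5, 1.0]    # frozen safety-aligned reference state
h_current = [0.2, -0.5, 1.0]   # fine-tuned model, still anchored
print(sbr_objective(0.7, h_current, h_anchor))  # 0.7, no drift penalty
```

In this framing, fine-tuning on benign data minimizes `task_loss` freely, while the penalty constrains only the low-dimensional bottleneck representation of harmful queries, which is why a single anchor can suffice.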