You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Stjepan Picek, Saraga Sakthidharan
TLDR
NeWTral restores safety alignment in specialized LLM adapters without losing domain knowledge, using neural weight translation in the parameter space.
Key contributions
- NeWTral restores safety alignment in LoRA adapters without losing their specialized domain knowledge.
- Maps unsafe adapters to a safe manifold directly in the parameter space using neural weight translation.
- Uses adaptive Mixture of Experts (MoE) routing to blend high-fidelity surgical translators with aggressive alignment experts.
- Reduces the average Attack Success Rate from 70% to 13% while preserving 90% average knowledge fidelity across diverse LLMs.
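As a rough illustration of the core idea, translating adapter weights entirely in parameter space with MoE routing, the sketch below blends several expert translators over a flattened adapter vector. This is not the authors' implementation: the toy dimensions, the two-layer MLP translators, and the router are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dim_in, dim_hidden, dim_out, rng):
    # Hypothetical two-layer translator: flattened adapter weights in,
    # translated weights of the same dimensionality (or router scores) out.
    W1 = rng.standard_normal((dim_in, dim_hidden)) * 0.1
    W2 = rng.standard_normal((dim_hidden, dim_out)) * 0.1
    return lambda x: np.tanh(x @ W1) @ W2

D = 32          # toy flattened-LoRA-adapter dimension (real adapters are far larger)
H = 16          # toy hidden width
N_EXPERTS = 3   # e.g. surgical vs. aggressive alignment translators

experts = [mlp(D, H, D, rng) for _ in range(N_EXPERTS)]
router = mlp(D, H, N_EXPERTS, rng)  # scores experts from the unsafe adapter itself

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def translate(unsafe_adapter):
    """Blend expert translations with adaptive routing weights (toy sketch)."""
    gates = softmax(router(unsafe_adapter))
    return sum(g * ex(unsafe_adapter) for g, ex in zip(gates, experts))

unsafe = rng.standard_normal(D)
safe = translate(unsafe)
print(safe.shape)  # the output lives in the same parameter space as the input adapter
```

The key point the sketch captures is that no data, prompts, or gradients are involved at restoration time: the "safe" adapter is produced by a single forward pass through a pre-trained translator over the unsafe adapter's weights.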
Why it matters
Integrating specialized LLM adapters often compromises safety or domain knowledge. NeWTral solves this by enabling instant safety restoration without costly retraining or data access. This allows practitioners to safely deploy powerful, customized LLMs.
Original Abstract
The open-source ecosystem has accelerated the democratization of Large Language Models (LLMs) through the public distribution of specialized Low-Rank Adaptation (LoRA) modules. However, integrating these third-party adapters often induces catastrophic forgetting of the base model's foundational safety alignment. Restoring these guardrails via fine-tuning on safety data introduces an opposing failure mode: the severe degradation of the specialized domain knowledge the adapter was originally designed to provide. To overcome this zero-resource challenge, we propose Neural Weight Translation (NeWTral), a framework that directly maps unsafe, domain-specific adapters onto a safe alignment manifold while rigorously preserving their core expertise. NeWTral operates as a non-linear translation module pre-trained on a diverse corpus of unsafe-to-safe adapter pairs. By executing this mapping entirely within the parameter space, NeWTral utilizes an adaptive Mixture of Experts (MoE) routing strategy to autonomously blend high-fidelity surgical translators and aggressive alignment experts. We evaluate our framework across four architectural families (Llama, Mistral, Qwen, and Gemma) at scales up to 72B parameters across eight diverse scientific and professional domains. Our results demonstrate that the MoE variant achieves a radical reduction in the average Attack Success Rate (ASR), dropping from 70% in unsafe experts to just 13%, while maintaining an exceptional 90% average knowledge fidelity. Much like the crowdsourced adapters it remedies, the NeWTral module is designed as a standalone, downloadable asset that allows practitioners to restore safety alignment instantly without requiring access to original training data or hardware-intensive retraining.