ArXiv TLDR

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

2604.09544

Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson + 2 more

cs.CL cs.AI cs.LG

TLDR

LLMs generate harmful content via a compact set of weights distinct from benign capabilities, which helps explain why alignment training is brittle and why emergent misalignment generalizes broadly.

Key contributions

  • LLMs generate harmful content using a compact set of weights, distinct from benign capabilities.
  • Alignment training compresses these "harm generation" weights internally.
  • This compression explains emergent misalignment and its broad generalization.
  • Pruning these specific weights in one domain significantly reduces emergent misalignment.

Why it matters

This paper reveals a coherent internal structure for harmfulness in LLMs, explaining why current safety guardrails are brittle. Understanding this distinct mechanism provides a foundation for developing more principled and robust safety approaches, moving beyond surface-level alignment.

Original Abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
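The abstract's core intervention is pruning (zeroing) a targeted subset of weights and observing the causal effect on the model's output. A minimal toy sketch of that idea, assuming a single linear layer and a hand-picked pruning mask (all names here are illustrative, not the authors' method):

```python
# Toy illustration of targeted weight pruning as a causal intervention:
# zero a chosen subset of weights and compare outputs before and after.

def prune(weights, mask):
    """Return a copy of `weights` with masked entries zeroed out."""
    return [[0.0 if m else w for w, m in zip(row, mrow)]
            for row, mrow in zip(weights, mask)]

def forward(weights, x):
    """Matrix-vector product standing in for one model layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

W = [[1.0, 2.0],
     [3.0, 4.0]]
mask = [[False, True],    # hypothetically, W[0][1] belongs to the target set
        [False, False]]

x = [1.0, 1.0]
before = forward(W, x)              # [3.0, 7.0]
after = forward(prune(W, mask), x)  # [1.0, 7.0]: only the masked path changed
```

In the paper's setting the mask would be chosen to cover the compact "harm generation" weights, so the before/after comparison isolates their causal contribution while leaving benign capabilities (the unmasked weights) untouched.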
