Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson, et al.
TLDR
LLMs generate harmful content via a compact set of weights distinct from benign capabilities, explaining why safety guardrails are brittle and why emergent misalignment occurs.
Key contributions
- LLMs generate harmful content using a compact set of weights, distinct from benign capabilities.
- Alignment training compresses these "harm generation" weights internally.
- This compression explains emergent misalignment and its broad generalization.
- Pruning these specific weights in one domain significantly reduces emergent misalignment.
Why it matters
This paper reveals a coherent internal structure for harmfulness in LLMs, explaining why current safety guardrails are brittle. Understanding this distinct mechanism provides a foundation for developing more principled and robust safety approaches, moving beyond surface-level alignment.
Original Abstract
Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
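To make the "targeted weight pruning as a causal intervention" idea concrete, here is a minimal sketch of one common way such probes are built: score every weight by a first-order importance estimate on a small probe set, zero out the top fraction, and re-evaluate the behavior. This is a generic recipe under assumed choices, not the authors' exact procedure; the model name, probe texts, scoring rule, and pruning fraction are illustrative placeholders.

```python
# Minimal sketch of targeted weight pruning as a causal probe.
# Generic importance-based pruning, NOT the paper's exact method:
# model name, probe texts, scoring rule, and pruning fraction are
# illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would do
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Stand-in for a small probe set eliciting the behavior of interest.
probe_texts = ["<prompts/completions for the probed behavior go here>"]

# Accumulate a first-order importance score |w * dL/dw| per weight,
# estimating how much each weight contributes to the probed behavior.
importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
model.train()
for text in probe_texts:
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    model.zero_grad()
    loss.backward()
    for n, p in model.named_parameters():
        if p.grad is not None:
            importance[n] += (p.detach() * p.grad).abs()

# Causal intervention: zero out the most important ~0.1% of weights
# (a compact set), then re-evaluate the behavior on held-out prompts.
prune_fraction = 0.001
with torch.no_grad():
    scores = torch.cat([v.flatten() for v in importance.values()])
    k = max(1, int(scores.numel() * prune_fraction))
    threshold = torch.topk(scores, k).values.min()
    for n, p in model.named_parameters():
        p[importance[n] >= threshold] = 0.0
```

The |w · dL/dw| score is just one attribution choice; any scoring rule that localizes a behavior to a small set of weights fits the same probe-and-prune pattern, and the paper's central finding is that for harmful generation this set is compact, shared across harm types, and separable from benign capabilities.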