ArXiv TLDR

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

arXiv:2605.06611

Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu

cs.LG · cs.AI · stat.ML

TLDR

This paper traces attention sinks in LLMs to three structural causes: a variance discrepancy in value aggregation, super neurons in FFN layers, and a dimension disparity of initial tokens. It proposes `head-wise RMSNorm` as a mitigation.

Key contributions

  • Mechanistically explains attention sinks in LLMs, tracing them to a systematic variance discrepancy induced by value aggregation in self-attention.
  • Shows that super neurons in FFN layers drastically amplify this discrepancy, and that channel-sparse down-projections create a dimension disparity in the first-token representation.
  • Validates the causal chain by replicating sinks through controlled interventions.
  • Proposes `head-wise RMSNorm` to stabilize value aggregation and accelerate LLM pre-training convergence (see the sketch after this list).
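
The digest names `head-wise RMSNorm` without spelling out its form. Below is a minimal PyTorch sketch of one plausible reading, assuming the norm is applied independently to each head's attention output before the output projection; the class name `HeadwiseRMSNorm` and the per-head learnable gain are illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn


class HeadwiseRMSNorm(nn.Module):
    """RMSNorm applied independently to each attention head's
    value-aggregation output, with a learnable per-head gain.

    Hypothetical sketch: the paper's placement and parameterization
    may differ.
    """

    def __init__(self, num_heads: int, head_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # One learnable gain vector per head.
        self.weight = nn.Parameter(torch.ones(num_heads, head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_heads, seq_len, head_dim) per-head attention outputs.
        # Normalize each position's head output by its root mean square.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight[None, :, None, :]
```

In a transformer block this would wrap the `(batch, heads, seq, head_dim)` attention output just before the heads are concatenated, matching the abstract's stated goal of stabilizing value-aggregation outputs so that positions have comparable statistics.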

Why it matters

Understanding attention sinks is crucial for improving LLM stability and performance. This work provides mechanistic insight into how they form, offering a foundation for systematic control and mitigation. The proposed `head-wise RMSNorm` shows that restoring statistical parity across positions can accelerate pre-training convergence.

Original Abstract

Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a *mechanistic explanation* for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose *head-wise RMSNorm*, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
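
The abstract's causal chain rests on a position-dependent variance in value aggregation: under a causal mask the first token's output is exactly its own value vector (softmax over a single logit gives weight 1), while later positions average many value vectors, which shrinks output variance. The toy below is a hedged sketch with random tensors rather than the paper's setup; it reproduces this discrepancy and mimics intervention (i) by modifying the mask so an arbitrary position attends only to itself. The shapes and the choice of position `j` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
B, T, D = 256, 16, 64  # batch, sequence length, head dimension

q = torch.randn(B, T, D)
k = torch.randn(B, T, D)
v = torch.randn(B, T, D)


def attn_output(mask: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product attention with an additive mask."""
    scores = q @ k.transpose(-1, -2) / D ** 0.5   # (B, T, T)
    weights = torch.softmax(scores + mask, dim=-1)
    return weights @ v                            # (B, T, D)


# Standard causal mask: token t attends to positions <= t.
causal = torch.full((T, T), float("-inf")).triu(1)
out = attn_output(causal)

# Per-position variance of the aggregated output. Position 0 copies its own
# value vector (variance ~1); later positions average many values, so their
# output variance is systematically smaller.
print(out.var(dim=(0, 2)))

# Intervention sketch (an assumption mirroring the abstract's mask
# modification): force position j to attend only to itself, recreating the
# first token's statistical condition at an arbitrary position.
j = 7
modified = causal.clone()
modified[j, :] = float("-inf")
modified[j, j] = 0.0
out2 = attn_output(modified)
print(out2.var(dim=(0, 2)))  # variance at position j is restored to ~1
```

This only demonstrates the statistical discrepancy the abstract describes, not the emergence of a sink in a trained model; the paper's interventions operate on actual LLMs.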
