ArXiv TLDR

Homogenized Transformers

arXiv:2604.01978

Hugo Koubbi, Borjan Geshkovski, Philippe Rigollet

math.PR, cs.LG, stat.ML

TLDR

This paper introduces a random model of deep multi-head self-attention and proves a nontrivial homogenized limit of its dynamics, making representation collapse quantifiable and identifying regimes in which clustering can be mitigated.

Key contributions

  • Models deep multi-head self-attention as an interacting particle system on the unit sphere (see the simulation sketch after this list).
  • Proves a nontrivial homogenized limit for the attention dynamics under suitable joint scalings of the depth, the residual step size, and the number of heads.
  • Derives a stochastic nonlinear Fokker-Planck equation for the conditional law of a representative token in the mean-field regime.
  • Quantifies representation collapse in the Gaussian setting, identifying regimes in which clustering can be mitigated.
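
The first bullet is concrete enough to simulate. Below is a minimal NumPy sketch, not the paper's construction: the 1/depth step size, the 1/√dim weight scale, and all parameter values are illustrative assumptions. It resamples Gaussian attention weights independently at every layer and head, runs the residual stream on the unit sphere, and tracks mean pairwise cosine similarity as a crude collapse diagnostic.

```python
import numpy as np

def simulate_attention_particles(n_tokens=32, dim=16, n_heads=8,
                                 depth=200, beta=1.0, seed=0):
    """Toy residual stream of random multi-head self-attention, viewed as a
    discrete-time interacting particle system on the unit sphere."""
    rng = np.random.default_rng(seed)
    # Tokens start as i.i.d. uniform points on the sphere S^{d-1}.
    x = rng.standard_normal((n_tokens, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)

    step = 1.0 / depth  # residual step size coupled to depth (one possible scaling)
    collapse = []
    for _ in range(depth):
        drift = np.zeros_like(x)
        for _ in range(n_heads):
            # Weights resampled independently across layers and heads,
            # as at initialization (Gaussian entries, variance 1/dim).
            Q = rng.standard_normal((dim, dim)) / np.sqrt(dim)
            K = rng.standard_normal((dim, dim)) / np.sqrt(dim)
            V = rng.standard_normal((dim, dim)) / np.sqrt(dim)
            scores = beta * (x @ Q.T) @ (x @ K.T).T      # (n, n) attention logits
            scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
            attn = np.exp(scores)
            attn /= attn.sum(axis=1, keepdims=True)
            drift += attn @ x @ V.T
        x = x + (step / n_heads) * drift                 # residual update
        x /= np.linalg.norm(x, axis=1, keepdims=True)    # project back onto sphere
        # Mean pairwise cosine similarity; approaches 1 under full collapse.
        gram = x @ x.T
        collapse.append((gram.sum() - n_tokens) / (n_tokens * (n_tokens - 1)))
    return x, collapse

if __name__ == "__main__":
    _, collapse = simulate_attention_particles()
    print(f"mean cosine similarity: first layer = {collapse[0]:.3f}, "
          f"last layer = {collapse[-1]:.3f}")
```

Varying beta (the temperature), dim, and n_tokens in this toy and watching how the similarity curve drifts mirrors, informally, the dimension/context-length/temperature trade-offs the paper quantifies.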

Why it matters

This work provides a theoretical framework for understanding the dynamics of deep transformers at initialization. By identifying homogenized limits, it gives quantitative insight into representation collapse, a persistent issue in deep attention stacks: explicit trade-offs between dimension, context length, and temperature identify regimes in which clustering can be mitigated.

Original Abstract

We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker-Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length, and temperature, and identifies regimes in which clustering can be mitigated.
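
For readers who want the abstract's particle system in symbols, one standard way to write such a discrete update is sketched below. The notation is assumed, not taken from the paper: tokens x_i^ℓ on the sphere, H heads, weights (Q_m^ℓ, K_m^ℓ, V_m^ℓ) drawn i.i.d. across layers ℓ and heads m, a residual step size h, and inverse temperature β; the paper's exact statement and joint scalings may differ.

```latex
% One discrete step of the residual stream as an interacting particle
% system on the sphere (sketch; notation assumed, not the paper's
% exact statement):
\[
  x_i^{\ell+1}
  = \frac{z_i^{\ell}}{\lVert z_i^{\ell} \rVert},
  \qquad
  z_i^{\ell}
  = x_i^{\ell}
  + \frac{h}{H} \sum_{m=1}^{H} \sum_{j=1}^{n}
    \frac{\exp\!\big(\beta \langle Q_m^{\ell} x_i^{\ell},\, K_m^{\ell} x_j^{\ell} \rangle\big)}
         {\sum_{k=1}^{n} \exp\!\big(\beta \langle Q_m^{\ell} x_i^{\ell},\, K_m^{\ell} x_k^{\ell} \rangle\big)}
    \; V_m^{\ell} x_j^{\ell}.
\]
```

In this reading, the homogenization result concerns the joint limit in which the depth grows, h shrinks, and H grows, with the abstract's deterministic versus common-noise dichotomy determined by how these scalings are coupled.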
