Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
Albert Alcalde, Leon Bungert, Konstantin Riedl, Tim Roith
TLDR
This paper quantifies how quickly the token distribution of a mean-field transformer concentrates in the low-temperature regime, and shows that the concentrated state remains metastable for moderate times.
Key contributions
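- Analyzes token evolution in deep encoder-only transformers at inference time via the mean-field continuity equation that arises in the large-token limit.
- Proves that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times.
- Bounds the Wasserstein distance between the two distributions by a rate of order $\sqrt{\log(β+1)/β}\exp(Ct)+\exp(-ct)$ in the inverse temperature $β$ and inference time $t\geq 0$, which implies concentration on time scales of order $\log β$.
- Shows numerically that, for finite $β$ and large $t$, the dynamics enter a different terminal phase dominated by the spectrum of the value matrix.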
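To get a feel for the trade-off in the rate $\sqrt{\log(β+1)/β}\exp(Ct)+\exp(-ct)$ quoted in the abstract, the following minimal sketch evaluates it and locates the time at which it is smallest. The constants `C` and `c` are not given in this summary, so the values used here (`C = c = 1`) are purely hypothetical placeholders, and the function names `bound` and `best_time` are illustrative only.

```python
import math


def bound(beta, t, C=1.0, c=1.0):
    """Evaluate sqrt(log(beta + 1) / beta) * exp(C t) + exp(-c t).

    C and c are hypothetical placeholder constants, not values from the paper.
    """
    prefactor = math.sqrt(math.log(beta + 1.0) / beta)
    return prefactor * math.exp(C * t) + math.exp(-c * t)


def best_time(beta, C=1.0, c=1.0):
    """Time t >= 0 minimizing the bound for fixed beta.

    Setting the derivative A*C*exp(C t) - c*exp(-c t) to zero, with
    A = sqrt(log(beta + 1) / beta), gives t = log(c / (A*C)) / (C + c).
    """
    prefactor = math.sqrt(math.log(beta + 1.0) / beta)
    return max(0.0, math.log(c / (prefactor * C)) / (C + c))


if __name__ == "__main__":
    for beta in (1e2, 1e4, 1e6):
        t_star = best_time(beta)
        print(f"beta={beta:.0e}  t*={t_star:.3f}  "
              f"log(beta)={math.log(beta):.3f}  bound(t*)={bound(beta, t_star):.4f}")
```

Under these placeholder constants the minimizing time equals $\tfrac14\bigl(\log β - \log\log(β+1)\bigr)$, so it grows like $\log β$ up to a lower-order correction, which is consistent with the time scale named in the abstract. For times much larger than this, the growing term $\exp(Ct)$ takes over and the bound stops being informative.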
Why it matters
This work gives a quantitative, theoretical account of token dynamics in deep transformers at inference time: how fast the token distribution concentrates, onto which limit, and how long that state persists. Such results may offer insight into the stability and behavior of large language models and could inform future architectural designs.
Original Abstract
Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like $\sqrt{{\log(β+1)}/β}\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $β^{-1}\to 0$ and inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order $\logβ$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $β$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
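The low-temperature limit $β^{-1}\to 0$ can be illustrated with a toy computation: as $β$ grows, softmax attention weights pile up on the highest-scoring token, which is the Laplace-type effect the abstract alludes to. The sketch below is an assumption-laden illustration, not the paper's estimate: the scores are made-up numbers standing in for attention scores, and `softmax_weights` is a hypothetical helper.

```python
import math


def softmax_weights(scores, beta):
    """Softmax of beta * scores, computed stably by subtracting the max score."""
    top = max(scores)
    exps = [math.exp(beta * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical attention scores of one token against four others.
    scores = [0.2, 1.0, 0.7, -0.5]
    for beta in (1.0, 10.0, 100.0):
        weights = softmax_weights(scores, beta)
        print(f"beta={beta:>5}: " + ", ".join(f"{w:.4f}" for w in weights))
```

As $β$ increases, the weight on the token with the largest score approaches one and the others decay exponentially in $β$ times their score gap, so the attention average turns into a selection of the maximizer.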