ArXiv TLDR

Ordinary Least Squares is a Special Case of Transformer

arXiv:2604.13656

Xiaojun Tan, Yuchen Zhao

cs.LG · cs.AI · math.ST · stat.ML

TLDR

This paper proves that Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer: with suitably chosen parameters, attention solves OLS exactly in a single forward pass.

Key contributions

  • An algebraic proof shows that OLS is a special case of the single-layer Linear Transformer.
  • The attention mechanism's forward pass is mathematically equivalent to the OLS closed-form projection.
  • Transformers can therefore solve OLS in one forward pass rather than iteratively (see the sketch after this list).
  • The analysis reveals a decoupled slow and fast memory mechanism within Transformers.
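
To make the equivalence concrete, here is a minimal NumPy sketch. It is an illustrative parameter choice, not the authors' construction (which works through the spectral decomposition of the empirical covariance): keys are the in-context inputs, values are their labels, and a query projection `W_Q` that absorbs the inverse empirical covariance makes one pass of softmax-free linear attention reproduce the OLS prediction exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                # in-context examples (keys)
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)  # labels (values)
x_star = rng.normal(size=d)                # query point

# OLS closed-form prediction: x*^T (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
pred_ols = x_star @ w_ols

# Softmax-free linear attention: out = sum_i v_i * (k_i . q).
# With k_i = x_i, v_i = y_i, and a query projection W_Q that
# absorbs the inverse empirical covariance (our illustrative
# choice, not the paper's exact parameterization), a single
# forward pass reproduces the OLS prediction -- no iteration.
W_Q = np.linalg.inv(X.T @ X)
q = W_Q @ x_star
pred_attn = float(np.sum(y * (X @ q)))

assert np.isclose(pred_ols, pred_attn)
print(pred_ols, pred_attn)                 # equal up to float error
```

Note that the inverse covariance sits inside the query projection here, so the attention weights encode a statistic of the entire context; the paper instead builds its parameters from the covariance's eigendecomposition, but the one-pass equivalence is the same.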

Why it matters

This work clarifies the statistical essence of Transformers by showing that they can implement a classical algorithm, OLS, exactly in a single forward pass. It also uncovers a decoupled slow and fast memory mechanism and establishes a clear continuity between modern deep architectures and classical statistical inference.

Original Abstract

The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes Transformer's basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention mechanism's forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the problem in one forward pass, not by iterating. Building upon this prototypical case, we further uncover a decoupled slow and fast memory mechanism within Transformers. Finally, the evolution from our established linear prototype to standard Transformers is discussed. This progression facilitates the transition of the Hopfield energy function from linear to exponential memory capacity, thereby establishing a clear continuity between modern deep architectures and classical statistical inference.
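
For reference, the closed-form projection and the spectral decomposition the abstract invokes can be written as follows (standard OLS notation, ours rather than the paper's):

$$
\hat{y} \;=\; X(X^\top X)^{-1}X^\top y,
\qquad
X^\top X \;=\; \sum_{k=1}^{d} \lambda_k\, u_k u_k^\top
\;\;\Longrightarrow\;\;
(X^\top X)^{-1} \;=\; \sum_{k=1}^{d} \lambda_k^{-1}\, u_k u_k^\top .
$$

Expressing the attention parameters in terms of the eigenpairs $(\lambda_k, u_k)$ of the empirical covariance is what lets a single linear-attention pass realize this projection.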

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.