Learning Rate Transfer in Normalized Transformers
Boris Shigida, Boris Hanin, Andrey Gromov
TLDR
This paper introduces νGPT, a new parameterization of the Normalized Transformer (nGPT) that enables learning-rate transfer across width, depth, and token horizon.
Key contributions
- Identifies a critical limitation of nGPT: despite hyperparameters that explicitly scale with model size, it fails to transfer learning rates across model dimension and token horizon.
- Introduces νGPT, a new nGPT parameterization obtained by revisiting μP with a principled use of alignment exponents.
- Empirically validates that νGPT achieves robust learning-rate transfer across width, depth, and token horizon.
Why it matters
nGPT trains fast and needs neither weight decay nor learning-rate warmup, but its lack of learning-rate transfer means hyperparameters must be re-tuned at every scale, hindering scaling. νGPT resolves this by enabling transfer across width, depth, and token horizon, making the normalized-transformer approach more practical and efficient for training large language models.
Original Abstract
The Normalized Transformer, or nGPT (arXiv:2410.01131) achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning rate transfer across model dimension and token horizon. To rectify this, we combine numerical experiments with a principled use of alignment exponents (arXiv:2407.05872) to revisit and modify the $μ$P approach to hyperparameter transfer (arXiv:2011.14522). The result is a novel nGPT parameterization we call $ν$GPT. Through extensive empirical validation, we find $ν$GPT exhibits learning rate transfer across width, depth, and token horizon.
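As background for the μP approach the abstract refers to: under μP with an Adam-style optimizer, hidden-layer learning rates are scaled down as width grows so that a rate tuned on a small model remains near-optimal on a larger one. The sketch below is illustrative only, assuming the standard 1/width rule for hidden layers; the function name and layer labels are hypothetical and are not taken from the paper, which modifies this prescription with alignment exponents.

```python
def mup_scaled_lr(base_lr: float, base_width: int, width: int, layer: str) -> float:
    """Illustrative muP-style learning-rate scaling (Adam case).

    Hidden-layer learning rates shrink proportionally to 1/width so that
    per-feature updates stay O(1) as the model widens. Embedding and output
    layers follow different rules in full muP, omitted here for brevity.
    """
    if layer == "hidden":
        return base_lr * base_width / width
    return base_lr  # e.g. embeddings keep the base learning rate

# A learning rate tuned at width 256 transfers to width 1024:
print(mup_scaled_lr(3e-3, 256, 1024, "hidden"))  # 3e-3 * (256/1024) = 7.5e-4
```

The point of such a rule is that one hyperparameter sweep on a small proxy model fixes the learning rate for the whole width family; the paper's observation is that nGPT's existing scaling does not achieve this, and νGPT's modified exponents do.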