Learning Rate Transfer in Normalized Transformers
Boris Shigida, Boris Hanin, Andrey Gromov
TLDR
This paper introduces νGPT, a new parameterization of the Normalized Transformer (nGPT) that enables learning-rate transfer across width, depth, and token horizon.
Key contributions
- Identifies a critical limitation of nGPT: despite hyperparameters that explicitly scale with model size, it fails to transfer learning rates across model dimension and token horizon.
- Introduces νGPT, a new nGPT parameterization obtained by revisiting μP with a principled use of alignment exponents.
- Empirically validates that νGPT achieves robust learning-rate transfer across width, depth, and token horizon.
Why it matters
nGPT trains fast and needs neither weight decay nor learning-rate warmup, but its lack of learning-rate transfer means hyperparameters must be re-tuned at every scale, hindering scaling. νGPT resolves this by enabling transfer across width, depth, and token horizon, making the normalized-transformer approach more practical and efficient for training large language models.
Original Abstract
The Normalized Transformer, or nGPT (arXiv:2410.01131) achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning rate transfer across model dimension and token horizon. To rectify this, we combine numerical experiments with a principled use of alignment exponents (arXiv:2407.05872) to revisit and modify the $μ$P approach to hyperparameter transfer (arXiv:2011.14522). The result is a novel nGPT parameterization we call $ν$GPT. Through extensive empirical validation, we find $ν$GPT exhibits learning rate transfer across width, depth, and token horizon.
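As background for the μP approach the abstract refers to: under μP with an Adam-style optimizer, hidden-layer learning rates are scaled down as width grows so that a rate tuned on a small model remains near-optimal on a larger one. The sketch below is illustrative only, assuming the standard 1/width rule for hidden layers; the function name and layer labels are hypothetical and are not taken from the paper, which modifies this prescription with alignment exponents.

```python
def mup_scaled_lr(base_lr: float, base_width: int, width: int, layer: str) -> float:
    """Illustrative muP-style learning-rate scaling (Adam case).

    Hidden-layer learning rates shrink proportionally to 1/width so that
    per-feature updates stay O(1) as the model widens. Embedding and output
    layers follow different rules in full muP, omitted here for brevity.
    """
    if layer == "hidden":
        return base_lr * base_width / width
    return base_lr  # e.g. embeddings keep the base learning rate

# A learning rate tuned at width 256 transfers to width 1024:
print(mup_scaled_lr(3e-3, 256, 1024, "hidden"))  # 3e-3 * (256/1024) = 7.5e-4
```

The point of such a rule is that one hyperparameter sweep on a small proxy model fixes the learning rate for the whole width family; the paper's observation is that nGPT's existing scaling does not achieve this, and νGPT's modified exponents do.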