Generalization at the Edge of Stability
Mario Tuci, Caner Korkmaz, Umut Şimşekli, Tolga Birdal
TLDR
This paper introduces the 'sharpness dimension' and a new generalization bound based on it to explain why neural networks generalize well when trained at the edge of stability.
Key contributions
- Introduces the 'sharpness dimension' to quantify the intrinsic complexity of the fractal attractors that chaotic neural network training converges to.
- Proves a generalization bound based on this new dimension for systems at the edge of stability.
- Shows that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, a complexity not captured by the trace or spectral norm alone (see the sketch after this list).
- Provides new theoretical insights into the recently observed phenomenon of grokking.
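To make the "partial determinants" idea concrete, below is a minimal sketch of the classical Kaplan-Yorke (Lyapunov) dimension, the notion from Lyapunov dimension theory that the abstract names as inspiration. The paper's own 'sharpness dimension' is not spelled out here, so this only illustrates how ordered exponents and their partial sums (i.e., log partial determinants) yield a fractional attractor dimension; the function name and example spectrum are illustrative assumptions.

```python
# Sketch: classical Kaplan-Yorke (Lyapunov) dimension from a spectrum of
# exponents. Partial sums of ordered exponents are the logs of partial
# determinants of the linearized dynamics; the dimension is fractional
# whenever expansion is only partially balanced by contraction.
import numpy as np

def kaplan_yorke_dimension(exponents: np.ndarray) -> float:
    """Kaplan-Yorke dimension from Lyapunov exponents (any order)."""
    lam = np.sort(exponents)[::-1]       # sort descending
    cumsum = np.cumsum(lam)              # partial sums = log partial determinants
    nonneg = np.where(cumsum >= 0)[0]
    if len(nonneg) == 0:
        return 0.0                       # fully contracting: attractor is a point
    j = nonneg[-1]                       # largest index with non-negative partial sum
    if j + 1 >= len(lam):
        return float(len(lam))           # expansion never overcome: full dimension
    return (j + 1) + cumsum[j] / abs(lam[j + 1])

# Hypothetical mildly chaotic spectrum -> fractional dimension 3.75.
print(kaplan_yorke_dimension(np.array([0.5, 0.2, -0.1, -0.8])))
```

Note how the result depends on every exponent in the spectrum, not just the largest one, which mirrors the paper's point that the trace or spectral norm of the Hessian alone cannot capture this structure.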
Why it matters
This work provides a theoretical framework for understanding why neural networks generalize well when trained with large learning rates at the edge of stability. It introduces a novel notion of dimension that better captures the complexity of this chaotic regime than the Hessian trace or spectral norm used in prior work, deepening our understanding of modern deep learning optimization.
Original Abstract
Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the 'sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
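As a self-contained illustration (not from the paper), the sketch below shows the classical edge-of-stability threshold for gradient descent on a one-dimensional quadratic loss: iterates contract when the learning rate is below 2/sharpness, flip sign forever exactly at 2/sharpness, and blow up beyond it. All names and values are hypothetical.

```python
# Toy edge-of-stability demo: gradient descent on L(w) = 0.5 * h * w**2,
# where h is the curvature (the "sharpness"). Each step multiplies w by
# (1 - lr * h), so the dynamics are stable iff |1 - lr * h| <= 1,
# i.e. iff lr <= 2 / h.
import numpy as np

def gd_trajectory(h: float, lr: float, w0: float = 1.0, steps: int = 10):
    w, traj = w0, [w0]
    for _ in range(steps):
        w = w - lr * h * w               # gradient step on L(w) = 0.5 * h * w^2
        traj.append(w)
    return np.array(traj)

h = 4.0                                  # sharpness; stability threshold lr = 2/h = 0.5
print(gd_trajectory(h, lr=0.10))         # lr < 2/h: geometric decay toward 0
print(gd_trajectory(h, lr=0.50))         # lr = 2/h: w flips sign forever, |w| constant
print(gd_trajectory(h, lr=0.55))         # lr > 2/h: oscillation grows without bound
```

In deep networks the loss is not quadratic, so training at such learning rates produces the bounded oscillatory and chaotic behavior the abstract describes rather than clean divergence; the paper models this regime with random dynamical systems converging to fractal attractors.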