Generalization at the Edge of Stability
Mario Tuci, Caner Korkmaz, Umut Şimşekli, Tolga Birdal
TLDR
This paper introduces the 'sharpness dimension' and a new generalization bound based on it to explain why neural networks generalize well when trained at the edge of stability.
Key contributions
- Introduces the 'sharpness dimension' to quantify the intrinsic complexity of the fractal attractors that chaotic neural network training converges to.
- Proves a generalization bound based on this new dimension for systems at the edge of stability.
- Shows that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, a complexity not captured by the trace or spectral norm alone (see the sketch after this list).
- Provides new theoretical insights into the recently observed phenomenon of grokking.
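To make the "partial determinants" idea concrete, below is a minimal sketch of the classical Kaplan-Yorke (Lyapunov) dimension, the notion from Lyapunov dimension theory that the abstract names as inspiration. The paper's own 'sharpness dimension' is not spelled out here, so this only illustrates how ordered exponents and their partial sums (i.e., log partial determinants) yield a fractional attractor dimension; the function name and example spectrum are illustrative assumptions.

```python
# Sketch: classical Kaplan-Yorke (Lyapunov) dimension from a spectrum of
# exponents. Partial sums of ordered exponents are the logs of partial
# determinants of the linearized dynamics; the dimension is fractional
# whenever expansion is only partially balanced by contraction.
import numpy as np

def kaplan_yorke_dimension(exponents: np.ndarray) -> float:
    """Kaplan-Yorke dimension from Lyapunov exponents (any order)."""
    lam = np.sort(exponents)[::-1]       # sort descending
    cumsum = np.cumsum(lam)              # partial sums = log partial determinants
    nonneg = np.where(cumsum >= 0)[0]
    if len(nonneg) == 0:
        return 0.0                       # fully contracting: attractor is a point
    j = nonneg[-1]                       # largest index with non-negative partial sum
    if j + 1 >= len(lam):
        return float(len(lam))           # expansion never overcome: full dimension
    return (j + 1) + cumsum[j] / abs(lam[j + 1])

# Hypothetical mildly chaotic spectrum -> fractional dimension 3.75.
print(kaplan_yorke_dimension(np.array([0.5, 0.2, -0.1, -0.8])))
```

Note how the result depends on every exponent in the spectrum, not just the largest one, which mirrors the paper's point that the trace or spectral norm of the Hessian alone cannot capture this structure.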
Why it matters
This work provides a theoretical framework for understanding why neural networks generalize well when trained with large learning rates at the edge of stability. It introduces a novel notion of dimension that better captures the complexity of this chaotic regime than the Hessian trace or spectral norm used in prior work, deepening our understanding of modern deep learning optimization.
Original Abstract
Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the 'sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
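As a self-contained illustration (not from the paper), the sketch below shows the classical edge-of-stability threshold for gradient descent on a one-dimensional quadratic loss: iterates contract when the learning rate is below 2/sharpness, flip sign forever exactly at 2/sharpness, and blow up beyond it. All names and values are hypothetical.

```python
# Toy edge-of-stability demo: gradient descent on L(w) = 0.5 * h * w**2,
# where h is the curvature (the "sharpness"). Each step multiplies w by
# (1 - lr * h), so the dynamics are stable iff |1 - lr * h| <= 1,
# i.e. iff lr <= 2 / h.
import numpy as np

def gd_trajectory(h: float, lr: float, w0: float = 1.0, steps: int = 10):
    w, traj = w0, [w0]
    for _ in range(steps):
        w = w - lr * h * w               # gradient step on L(w) = 0.5 * h * w^2
        traj.append(w)
    return np.array(traj)

h = 4.0                                  # sharpness; stability threshold lr = 2/h = 0.5
print(gd_trajectory(h, lr=0.10))         # lr < 2/h: geometric decay toward 0
print(gd_trajectory(h, lr=0.50))         # lr = 2/h: w flips sign forever, |w| constant
print(gd_trajectory(h, lr=0.55))         # lr > 2/h: oscillation grows without bound
```

In deep networks the loss is not quadratic, so training at such learning rates produces the bounded oscillatory and chaotic behavior the abstract describes rather than clean divergence; the paper models this regime with random dynamical systems converging to fractal attractors.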