Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, Pierfrancesco Beneventano
TLDR
SGD with momentum operates at an Edge of Stochastic Stability with batch-size-dependent sharpness: it favors flatter regions at small batch sizes and sharper regions at large ones.
Key contributions
- SGD with momentum operates in an Edge of Stochastic Stability (EoSS) regime, with batch-size-dependent behavior.
- At small batch sizes, Batch Sharpness stabilizes at a lower plateau, $2(1-β)/η$, favoring flatter regions than vanilla SGD.
- At large batch sizes, Batch Sharpness stabilizes at a higher plateau, $2(1+β)/η$, favoring sharper regions consistent with full-batch dynamics.
- Momentum amplifies stochastic fluctuations at small batch sizes; this two-regime behavior cannot be explained by a single momentum-adjusted stability threshold.
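The two plateaus above depend only on the learning rate η and the momentum coefficient β. A minimal sketch of the thresholds (the hyperparameter values below are illustrative, not taken from the paper's experiments):

```python
def eoss_plateaus(lr: float, beta: float) -> tuple[float, float]:
    """Return (small-batch plateau, large-batch plateau) for Batch Sharpness.

    Small batches: Batch Sharpness stabilizes near 2*(1 - beta)/lr.
    Large batches: it stabilizes near 2*(1 + beta)/lr.
    With beta = 0 both reduce to the classical 2/lr edge-of-stability threshold.
    """
    return 2 * (1 - beta) / lr, 2 * (1 + beta) / lr

# Illustrative values: lr = 0.01, momentum beta = 0.9.
low, high = eoss_plateaus(lr=0.01, beta=0.9)
print(low, high)  # the small-batch plateau is (1-beta)/(1+beta) times the large-batch one
```

Note that the gap between the regimes grows with β: at β = 0.9 the large-batch plateau is 19× the small-batch one.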
Why it matters
This paper clarifies how momentum interacts with batch size and stochastic stability in deep learning optimization. Understanding these two distinct regimes matters for effective hyperparameter tuning and for reasoning about the sharpness of the solutions found, and it offers insight into the loss landscapes that common optimizers actually explore.
Original Abstract
Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-β)/η$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+β)/η$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
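Batch Sharpness, as defined in the abstract, is the expected curvature of the mini-batch loss along the mini-batch gradient direction. The sketch below estimates it by Monte Carlo on a hypothetical toy problem of per-sample quadratic losses (everything here, including the problem setup, is an illustrative assumption, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200

# Toy problem: per-sample losses L_i(w) = 0.5 * w^T A_i w with A_i PSD,
# so the mini-batch Hessian is the average of the sampled A_i and the
# mini-batch gradient is that average applied to w.
A = np.array([Q @ Q.T for Q in rng.normal(size=(n, d, d)) / np.sqrt(d)])

def batch_sharpness(w, batch_size, n_batches=500):
    """Monte-Carlo estimate of E_B[ g_B^T H_B g_B / ||g_B||^2 ],
    the expected directional curvature along the mini-batch gradient."""
    vals = []
    for _ in range(n_batches):
        idx = rng.choice(n, size=batch_size, replace=False)
        H_B = A[idx].mean(axis=0)   # mini-batch Hessian
        g_B = H_B @ w               # mini-batch gradient of the quadratic
        vals.append(g_B @ H_B @ g_B / (g_B @ g_B))
    return float(np.mean(vals))

w = rng.normal(size=d)
print(batch_sharpness(w, batch_size=8), batch_sharpness(w, batch_size=n))
```

At `batch_size = n` every batch is the full dataset, so the estimate collapses to the deterministic Rayleigh quotient of the full-batch Hessian along the full-batch gradient; the paper's claim is that SGD with momentum drives this quantity toward $2(1-β)/η$ or $2(1+β)/η$ depending on which regime the batch size falls in.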