Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, Pierfrancesco Beneventano
TLDR
SGD with momentum operates at an Edge of Stochastic Stability with batch-size-dependent sharpness: it favors flatter regions at small batch sizes and sharper regions at large ones.
Key contributions
- SGD with momentum operates in an Edge of Stochastic Stability (EoSS) regime, with batch-size-dependent behavior.
- At small batch sizes, Batch Sharpness stabilizes at a lower plateau, $2(1-β)/η$, favoring flatter regions than vanilla SGD.
- At large batch sizes, Batch Sharpness stabilizes at a higher plateau, $2(1+β)/η$, favoring sharper regions consistent with full-batch dynamics.
- Momentum amplifies stochastic fluctuations at small batch sizes; this two-regime behavior cannot be explained by a single momentum-adjusted stability threshold.
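The two plateaus above depend only on the learning rate η and the momentum coefficient β. A minimal sketch of the thresholds (the hyperparameter values below are illustrative, not taken from the paper's experiments):

```python
def eoss_plateaus(lr: float, beta: float) -> tuple[float, float]:
    """Return (small-batch plateau, large-batch plateau) for Batch Sharpness.

    Small batches: Batch Sharpness stabilizes near 2*(1 - beta)/lr.
    Large batches: it stabilizes near 2*(1 + beta)/lr.
    With beta = 0 both reduce to the classical 2/lr edge-of-stability threshold.
    """
    return 2 * (1 - beta) / lr, 2 * (1 + beta) / lr

# Illustrative values: lr = 0.01, momentum beta = 0.9.
low, high = eoss_plateaus(lr=0.01, beta=0.9)
print(low, high)  # the small-batch plateau is (1-beta)/(1+beta) times the large-batch one
```

Note that the gap between the regimes grows with β: at β = 0.9 the large-batch plateau is 19× the small-batch one.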
Why it matters
This paper clarifies how momentum interacts with batch size and stochastic stability in deep learning optimization. Understanding these two distinct regimes matters for effective hyperparameter tuning and for reasoning about the sharpness of the solutions found, and it offers insight into the loss landscapes that common optimizers actually explore.
Original Abstract
Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-β)/η$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+β)/η$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
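Batch Sharpness, as defined in the abstract, is the expected curvature of the mini-batch loss along the mini-batch gradient direction. The sketch below estimates it by Monte Carlo on a hypothetical toy problem of per-sample quadratic losses (everything here, including the problem setup, is an illustrative assumption, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200

# Toy problem: per-sample losses L_i(w) = 0.5 * w^T A_i w with A_i PSD,
# so the mini-batch Hessian is the average of the sampled A_i and the
# mini-batch gradient is that average applied to w.
A = np.array([Q @ Q.T for Q in rng.normal(size=(n, d, d)) / np.sqrt(d)])

def batch_sharpness(w, batch_size, n_batches=500):
    """Monte-Carlo estimate of E_B[ g_B^T H_B g_B / ||g_B||^2 ],
    the expected directional curvature along the mini-batch gradient."""
    vals = []
    for _ in range(n_batches):
        idx = rng.choice(n, size=batch_size, replace=False)
        H_B = A[idx].mean(axis=0)   # mini-batch Hessian
        g_B = H_B @ w               # mini-batch gradient of the quadratic
        vals.append(g_B @ H_B @ g_B / (g_B @ g_B))
    return float(np.mean(vals))

w = rng.normal(size=d)
print(batch_sharpness(w, batch_size=8), batch_sharpness(w, batch_size=n))
```

At `batch_size = n` every batch is the full dataset, so the estimate collapses to the deterministic Rayleigh quotient of the full-batch Hessian along the full-batch gradient; the paper's claim is that SGD with momentum drives this quantity toward $2(1-β)/η$ or $2(1+β)/η$ depending on which regime the batch size falls in.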