ArXiv TLDR

The Origin of Edge of Stability

2604.20446

Elon Litman

cs.LG stat.ML

TLDR

This paper introduces the "edge coupling," a functional on consecutive gradient-descent iterates, to explain why full-batch gradient descent consistently drives the largest Hessian eigenvalue to the threshold 2/η, resolving a long-standing open question about the Edge of Stability.

Key contributions

  • Introduces the "edge coupling," a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update.
  • Derives a step recurrence with stability boundary 2/η and a telescoping loss-change formula, both of which drive curvature toward 2/η.
  • Uses the mean value theorem to localize each Hessian average to the true Hessian at an interior point of the step segment, yielding exact forcing of the largest eigenvalue with no approximation gap.
  • Classifies fixed points and period-two orbits by setting both gradients of the edge coupling to zero; near a fixed point, a function of the half-amplitude alone determines where period-two orbits appear.
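Where the 2/η threshold comes from can be seen on the simplest possible loss. As a minimal sketch (plain Python, not code from the paper): on a quadratic f(w) = λw²/2, each gradient-descent step multiplies w by (1 − ηλ), so iterates contract exactly when λ < 2/η and diverge when λ > 2/η.

```python
# Stability boundary of gradient descent on the quadratic f(w) = lam * w**2 / 2.
# Each step is w <- w - eta * lam * w = (1 - eta*lam) * w, and |1 - eta*lam| < 1
# holds exactly when lam < 2/eta. Illustrative sketch, not code from the paper.

def gd_final_magnitude(lam, eta, w0=1.0, steps=100):
    """Run full-batch GD on f(w) = lam*w^2/2 and return |w| after `steps` steps."""
    w = w0
    for _ in range(steps):
        w -= eta * lam * w
    return abs(w)

eta = 0.1                             # learning rate, so the threshold is 2/eta = 20
print(gd_final_magnitude(19.0, eta))  # lam < 2/eta: oscillates but contracts to 0
print(gd_final_magnitude(21.0, eta))  # lam > 2/eta: oscillates and diverges
```

For λ just below 2/η the multiplier is close to −1, so the iterates flip sign every step while slowly shrinking, which is the oscillatory behavior characteristic of training at the edge.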

Why it matters

The "Edge of Stability" is a critical phenomenon in neural network training, but its origin has been unclear. This paper provides a unified theoretical explanation, offering deeper insights into gradient descent dynamics and potentially guiding future optimization strategies.

Original Abstract

Full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to the threshold $2/η$, where $η$ is the learning rate. This phenomenon, the Edge of Stability, has resisted a unified explanation: existing accounts establish self-regulation near the edge but do not explain why the trajectory is forced toward $2/η$ from arbitrary initialization. We introduce the edge coupling, a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary $2/η$, and a second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward $2/η$. The two formulas involve different Hessian averages, but the mean value theorem localizes each to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point, the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.
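The exact, no-gap forcing described in the abstract can be checked numerically on a one-dimensional toy loss (our own choice, not from the paper). On a symmetric period-two orbit w ↔ −w, the update −2w = −η ℓ′(w) gives ℓ′(w)/w = 2/η, so the average Hessian over the step segment (the secant slope of the gradient) equals 2/η exactly, and the mean value theorem then places an interior point of the segment where ℓ″ is exactly 2/η.

```python
# Period-two orbit of GD on the toy loss l(w) = w**2/2 - w**4/4 (an assumed
# example, chosen so a stable 2-cycle exists just above the critical step size
# 2/lambda = 2). On the cycle, the secant slope of the gradient -- the average
# Hessian over the step segment -- equals 2/eta exactly.

def grad(w):                 # l'(w) for l(w) = w**2/2 - w**4/4
    return w - w**3

eta = 2.2                    # just above the critical value 2/lambda = 2
w = 0.05                     # start near the now-unstable minimum at w = 0
for _ in range(500):         # iterates settle onto the period-two orbit
    w = w - eta * grad(w)

w_next = w - eta * grad(w)   # on the orbit, w_next = -w
# Average Hessian along the step segment [w, w_next]:
avg_hessian = (grad(w_next) - grad(w)) / (w_next - w)
print(avg_hessian, 2 / eta)  # the two agree to machine precision
```

Note that the pointwise Hessian ℓ″(w) at the orbit endpoints is not 2/η; only the segment-averaged curvature is pinned there, which is why the mean-value-theorem localization to an interior point matters.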

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.