ArXiv TLDR

Zeroth-Order Optimization at the Edge of Stability

arXiv:2604.14669

Minhak Song, Liang Zhang, Bingcong Li, Niao He, Michael Muehlebach, and one more author

cs.LG · math.DS · math.OC · stat.ML

TLDR

This paper characterizes the stability of zeroth-order optimization methods in deep learning, showing that, unlike first-order methods, their stability depends on the full Hessian spectrum rather than only its largest eigenvalue.

Key contributions

  • Derived an explicit step-size condition for the mean-square linear stability of ZO methods based on the standard two-point estimator (see the sketch after this list).
  • Revealed that ZO stability depends on the entire Hessian spectrum, whereas first-order (FO) stability is governed solely by the top Hessian eigenvalue.
  • Developed tractable stability bounds for ZO methods that depend only on the largest Hessian eigenvalue and the Hessian trace.
  • Showed that full-batch ZO methods operate at the edge of stability, with large step sizes implicitly regularizing the Hessian trace.
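As a rough illustration of the setup, here is a minimal NumPy sketch of the standard two-point estimator and a plain ZO-GD loop on a toy quadratic. The function names, hyperparameters, and the toy objective are illustrative assumptions, not the paper's code.

```python
import numpy as np

def two_point_grad(f, x, mu=1e-3, num_dirs=1, rng=None):
    """Standard two-point ZO estimator:
    g_hat = (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u, with u ~ N(0, I).
    Averaging over several random directions reduces the estimator's variance."""
    rng = np.random.default_rng() if rng is None else rng
    g_hat = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        g_hat += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g_hat / num_dirs

def zo_gd(f, x0, step_size, iters=500, mu=1e-3, rng=None):
    """Plain ZO-GD: gradient descent driven by the two-point estimate."""
    x = x0.copy()
    for _ in range(iters):
        x = x - step_size * two_point_grad(f, x, mu=mu, rng=rng)
    return x

# Toy quadratic f(x) = 0.5 * x^T H x. For first-order GD, stability hinges on
# the largest eigenvalue of H alone; the paper's point is that mean-square
# stability of ZO-GD depends on the whole spectrum (e.g. via the trace of H).
H = np.diag([10.0, 1.0, 0.1])
f = lambda x: 0.5 * x @ H @ x
x_final = zo_gd(f, x0=np.ones(3), step_size=0.05)
print(x_final)
```

Varying `step_size` in this toy example is one way to probe the stability boundary the paper characterizes, though the exact ZO condition is not reproduced here.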

Why it matters

This paper provides crucial insights into the stability and dynamics of zeroth-order optimization, a tool that is vital for black-box learning and memory-efficient fine-tuning of large models. Understanding its distinct implicit regularization effect can lead to more robust and efficient training.

Original Abstract

Zeroth-order (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, including black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across a range of deep learning training problems. Our results highlight an implicit regularization effect specific to ZO methods, where large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.
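For background, the two-point estimator referenced in the abstract and the classical first-order stability threshold on a quadratic can be written as follows; this is a sketch of standard definitions, not the paper's exact ZO stability condition.

```latex
\[
  \hat{g}_\mu(x) \;=\; \frac{f(x + \mu u) - f(x - \mu u)}{2\mu}\, u,
  \qquad u \sim \mathcal{N}(0, I_d),
\]
\[
  \text{GD with step size } \eta \text{ on } f(x) = \tfrac{1}{2} x^\top H x
  \text{ is linearly stable when } \eta\, \lambda_{\max}(H) < 2.
\]
```

By contrast, the paper's result is that the analogous mean-square condition for ZO methods involves the entire spectrum of H, with tractable bounds expressed through the largest eigenvalue and the trace of H.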
