ArXiv TLDR

Gradient Regularized Newton Boosting Trees with Global Convergence

arXiv: 2605.00581

Nikita Zozoulenko, Daniel Falkowski, Thomas Cass, Lukas Gonon

stat.ML · cs.LG · math.OC

TLDR

This paper introduces a gradient-regularized Newton boosting scheme for GBDTs, achieving global convergence at an O(1/k^2) rate.

Key contributions

  • Introduces Restricted Newton Descent, a framework for Newton's method on Hilbert spaces with inexact iterates.
  • Proves vanilla Newton boosting achieves linear convergence for smooth, strongly convex losses satisfying a Hessian-dominance condition.
  • Proposes a gradient-regularized Newton scheme for general convex losses with Lipschitz Hessians (update rule sketched after this list).
  • Achieves a global O(1/k^2) convergence rate, matching first-order boosting with Nesterov momentum.
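Reading the abstract, the update behind the last two bullets can be sketched as a regularized Newton step

$$f_{k+1} = f_k - \big(\nabla^2 L(f_k) + \lambda_k\,\mathrm{Id}\big)^{-1} \nabla L(f_k), \qquad \lambda_k \propto \sqrt{\|\nabla L(f_k)\|},$$

with the step approximated within the span of the weak learners (decision trees). The proportionality constant and the exact form of the restriction are assumptions here; the summary states only that the $\ell_2$ term scales with the square root of the gradient norm.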

Why it matters

This work provides a theoretical foundation for Newton boosting, the second-order update at the core of modern GBDT implementations. By introducing a globally convergent scheme, it improves the stability and reliability of these widely used models and closes a long-standing gap between the convergence theory of first-order and second-order boosting.

Original Abstract

Gradient Boosting Decision Trees (GBDTs) dominate tabular machine learning, with modern implementations like XGBoost, LightGBM, and CatBoost being based on Newton boosting: a second-order descent step in the space of decision trees. Despite its empirical success, the global convergence of Newton boosting is poorly understood compared to first-order boosting. In this paper, we introduce Restricted Newton Descent, which studies convex optimization with Newton's method on Hilbert spaces with inexact iterates, based on the concepts of cosine angle and weak gradient edge. Within this framework, we recover Newton boosting with GBDTs and classical finite-dimensional theory as special cases. We first prove that vanilla Newton boosting achieves a linear rate of convergence for smooth, strongly convex losses that satisfy a Hessian-dominance condition. To handle general convex losses with Lipschitz Hessians, we extend a recent gradient regularized Newton scheme to the restricted weak learner setting. This scheme minimally modifies the classical algorithm by introducing an adaptive $\ell_2$-regularization term proportional to the square root of the gradient norm at each iteration. We establish a $\mathcal{O}(\frac{1}{k^2})$ rate for this scheme, thereby obtaining a globally convergent second-order GBDT algorithm with a rate matching that of first-order boosting with Nesterov momentum. In numerical experiments, we show that our scheme converges while vanilla Newton boosting may diverge.
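For concreteness, here is a minimal, hypothetical sketch of one such boosting round in Python. It assumes per-sample gradients and Hessians of a smooth loss, uses an sklearn regression tree as the weak learner, and sets each leaf value with a regularized Newton step whose $\ell_2$ term scales with the square root of the gradient norm. The names (newton_boost_round, reg_const) and the exact scaling are illustrative, not taken from the paper.

```python
# Hypothetical sketch of one gradient-regularized Newton boosting round.
# The paper works in a restricted-weak-learner Hilbert-space setting,
# which this finite-sample version only approximates.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def newton_boost_round(X, y, pred, grad_fn, hess_fn, reg_const=1.0, max_depth=3):
    g = grad_fn(y, pred)   # per-sample gradients of the loss at `pred`
    h = hess_fn(y, pred)   # per-sample (diagonal) Hessian entries
    # Adaptive l2 term proportional to the square root of the gradient
    # norm, per the abstract; reg_const is an assumed constant.
    lam = np.sqrt(reg_const * np.linalg.norm(g))
    # Fit the tree structure on the negative gradient, then set each leaf
    # value with a regularized Newton step: -sum(g) / (sum(h) + lambda_k).
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, -g)
    leaf = tree.apply(X)
    update = np.zeros_like(pred)
    for j in np.unique(leaf):
        idx = leaf == j
        update[idx] = -g[idx].sum() / (h[idx].sum() + lam)
    return pred + update

# Usage with squared error: gradient f - y, unit Hessian.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)
pred = np.zeros(200)
for _ in range(50):
    pred = newton_boost_round(
        X, y, pred,
        grad_fn=lambda y, f: f - y,
        hess_fn=lambda y, f: np.ones_like(f),
    )
```

Note that lam shrinks as the gradient norm decreases, so the step approaches a plain Newton step near the optimum; this vanishing regularization is, presumably, what lets the scheme combine global convergence with second-order behavior.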
