ArXiv TLDR

Linear-Core Surrogates: Smooth Loss Functions with Linear Rates for Classification and Structured Prediction

arXiv: 2604.27742

Mehryar Mohri, Yutao Zhong

cs.LG · stat.ML

TLDR

Linear-Core (LC) Surrogates are smooth, convex loss functions that combine the fast optimization of smooth losses with the linear consistency bounds of margin-based losses, for both classification and structured prediction.

Key contributions

  • Introduces Linear-Core (LC) Surrogates, a family of smooth, convex loss functions built by stitching a linear core to a smooth tail (sketched below this list).
  • Combines the fast optimization rates of smooth losses with the linear $H$-consistency bounds of margin-based losses.
  • Enables unbiased stochastic gradient estimation in structured prediction, bypassing the $O(|\mathscr{Y}|^2)$ cost of exact inference.
  • Achieves a 23× speedup over Structured SVMs on large-vocabulary sequence tagging and improved robustness to instance-dependent label noise.
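To make the "linear core stitched to a smooth tail" construction concrete, here is a minimal sketch in Python/NumPy. The threshold tau, the scale c, and the exponential tail are our own illustrative assumptions, not the paper's exact functional form; the sketch only demonstrates the stitching idea, with value and slope matched at the junction so the loss is convex and differentiable everywhere.

```python
import numpy as np

def lc_surrogate(margin, tau=1.0, c=1.0):
    """Illustrative LC-style surrogate of the margin m (hypothetical form).

    Linear core for m <= tau (slope -1), exponential tail for m > tau,
    joined so that value and slope agree at m = tau. The result is convex
    and differentiable everywhere, unlike the hinge loss.
    """
    m = np.asarray(margin, dtype=float)
    excess = np.maximum(m - tau, 0.0)      # zero on the linear core
    linear = (tau + c) - m                 # linear core: value c, slope -1 at tau
    tail = c * np.exp(-excess / c)         # smooth tail: value c, slope -1 at tau
    return np.where(m <= tau, linear, tail)

def lc_surrogate_grad(margin, tau=1.0, c=1.0):
    """Derivative w.r.t. the margin; continuous across the stitch at m = tau."""
    m = np.asarray(margin, dtype=float)
    excess = np.maximum(m - tau, 0.0)
    return np.where(m <= tau, -1.0, -np.exp(-excess / c))
```

At m = tau both branches evaluate to c with slope -1, so the gradient is continuous; away from the core the tail decays smoothly toward zero, which is what gives gradient methods their fast optimization rates.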

Why it matters

This paper resolves a fundamental trade-off in loss function design: it combines the fast optimization of smooth losses with the statistical efficiency of piecewise-linear, margin-based ones. The payoff is both statistical and computational, particularly in structured prediction, where smoothness removes the need for exact inference at each gradient step, and the demonstrated robustness to label noise makes the resulting models more reliable in practice.

Original Abstract

The choice of loss function in classification involves a fundamental trade-off: smooth losses (like Cross-Entropy) enable fast optimization rates but yield slow square-root consistency bounds, while piecewise-linear losses (like Hinge) offer fast linear consistency rates but suffer from non-differentiability. We propose Linear-Core (LC) Surrogates, a new family of convex loss functions that resolve this tension by stitching a linear core to a smooth tail. We prove that these surrogates are differentiable everywhere while retaining strict linear $H$-consistency bounds, effectively combining the optimization benefits of smoothness with the statistical efficiency of margin-based losses. In the structured prediction setting, we show that this smoothness unlocks a massive computational and energy advantage: it allows for an unbiased stochastic gradient estimator that bypasses the quadratic complexity $O(|\mathscr{Y}|^2)$ of exact inference (e.g., Viterbi). Empirically, our method achieves a 23$\times$ speedup over Structured SVMs on large-vocabulary sequence tagging tasks and demonstrates superior robustness to instance-dependent label noise, outperforming Cross-Entropy by 2.6% on corrupted CIFAR-10.
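The unbiased gradient estimator is the source of the claimed computational advantage, so here is a toy Python/NumPy sketch of the underlying idea under strong simplifying assumptions: a flat enumeration of candidate outputs with linear scores, a generic smooth surrogate derivative phi_grad, and uniform sampling of competitors. The paper's actual estimator for sequence models, and its sampling scheme, may well differ; the sketch only illustrates why smoothness lets a few sampled competitors stand in for exact inference over all of $\mathscr{Y}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: linear scores s(y) = w @ features[y] over a large output set.
n_outputs, dim = 10_000, 32
features = rng.normal(size=(n_outputs, dim))
y_true = 0

def phi_grad(m, tau=1.0, c=1.0):
    # Derivative of a smooth LC-style surrogate of the margin (illustrative).
    return np.where(m <= tau, -1.0, -np.exp(-np.maximum(m - tau, 0.0) / c))

def full_grad(w):
    # Exact gradient of sum_y phi(s(y_true) - s(y)): touches every competitor.
    margins = features[y_true] @ w - features @ w
    g = phi_grad(margins)[:, None] * (features[y_true] - features)
    g[y_true] = 0.0                      # the true output contributes nothing
    return g.sum(axis=0)

def sampled_grad(w, n_samples=64):
    # Unbiased Monte Carlo estimate: draw a few competitors uniformly and
    # importance-weight by n_outputs / n_samples. Its expectation equals
    # full_grad(w), so SGD can use it without enumerating all outputs.
    idx = rng.integers(0, n_outputs, size=n_samples)
    margins = features[y_true] @ w - features[idx] @ w
    g = phi_grad(margins)[:, None] * (features[y_true] - features[idx])
    return g.sum(axis=0) * (n_outputs / n_samples)

w = rng.normal(size=dim)
print(np.linalg.norm(full_grad(w) - sampled_grad(w, n_samples=4096)))
```

A non-smooth loss such as the structured hinge does not admit this trick directly: its subgradient is determined by an argmax over $\mathscr{Y}$ (e.g., via Viterbi), which must be computed exactly rather than sampled. That exact-inference cost is what the smoothness of LC surrogates removes.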
