ArXiv TLDR

Conformalized Super Learner

arXiv:2604.22391

Zhanli Wu, Fabrizio Leisen, Miguel-Angel Luque-Fernandez, F. Javier Rubio

stat.ML, cs.LG, stat.CO, stat.ME

TLDR

The Conformalized Super Learner combines ensemble methods with conformal prediction to provide prediction intervals with finite-sample coverage guarantees.

Key contributions

  • Introduces Conformalized Super Learner (CSL) by integrating Conformal Prediction with Super Learner.
  • Constructs prediction intervals using a weighted majority vote over learner-specific conformity scores.
  • Achieves valid finite-sample coverage and competitive performance across diverse data conditions.
  • Applies CSL to creatinine level prediction, demonstrating benefits for complex regression functions.
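
The idea of combining learner-specific conformal sets by a weighted majority vote can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes split conformal prediction with absolute-residual conformity scores, two toy learners, and made-up weights standing in for Super Learner weights. The function names `split_conformal_interval` and `weighted_majority_vote` are hypothetical, and the level at which each learner should be run to reach a target combined coverage is a detail of the paper that this sketch does not reproduce.

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, x_new, alpha):
    """Split-conformal interval built from absolute-residual conformity scores."""
    scores = np.sort(np.abs(y_cal - predict(X_cal)))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the conformal quantile
    q = scores[k - 1] if k <= n else np.inf
    center = predict(x_new[None, :])[0]
    return center - q, center + q

def weighted_majority_vote(intervals, weights, grid):
    """Keep grid points covered by learners holding more than half the total weight."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    votes = np.zeros_like(grid)
    for (lo, hi), w in zip(intervals, weights):
        votes += w * ((grid >= lo) & (grid <= hi))
    return grid[votes > 0.5]

# Synthetic data (an assumption for illustration): linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=400)
X_train, y_train = X[:200], y[:200]
X_cal, y_cal = X[200:], y[200:]

# Two toy learners: least-squares line and a constant (mean-only) predictor.
beta = np.linalg.lstsq(np.c_[np.ones(200), X_train], y_train, rcond=None)[0]
linear = lambda Z: beta[0] + Z @ beta[1:]
mean_only = lambda Z: np.full(len(Z), y_train.mean())

weights = [0.8, 0.2]  # hypothetical stand-ins for Super Learner weights
x_new = np.array([1.0])
intervals = [
    split_conformal_interval(f, X_cal, y_cal, x_new, alpha=0.1)
    for f in (linear, mean_only)
]

# Evaluate the vote on a grid of candidate outcome values.
grid = np.linspace(y.min() - 5, y.max() + 5, 2001)
kept = weighted_majority_vote(intervals, weights, grid)
print("per-learner intervals:", intervals)
print("voted set range:", kept.min(), kept.max())
```

With these weights the linear learner alone holds the majority, so the voted set coincides with its interval on the grid. With more balanced weights the voted set need not be a single interval, which is one reason the sketch returns grid points rather than two endpoints.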

Why it matters

Existing Super Learner interval methods often rely on asymptotic arguments or on computationally intensive procedures such as the bootstrap. This paper offers an approach to uncertainty quantification for ensemble predictions that carries finite-sample coverage guarantees, which matters for reliable interval estimates in fields like medicine.

Original Abstract

The Super Learner (SL) is a widely used ensemble method that combines predictions from a library of learners based on their predictive performance. Interval predictions are of considerable practical interest because they allow uncertainty in predictions produced by an individual learner or an ensemble to be quantified. Several methods have been proposed for constructing interval predictions based on the SL, however, these approaches are typically justified using asymptotic arguments or rely on computationally intensive procedures such as the bootstrap. Conformal prediction (CP) is a machine learning framework for constructing prediction intervals with finite-sample and asymptotic coverage guarantees under mild conditions. We propose coupling CP with the SL through a natural construction that mirrors the original SL framework, using individual learner weights and combining learner-specific conformity scores via a weighted majority vote. We characterize the properties of the resulting SL-based prediction intervals for continuous outcomes. We cover settings under exchangeability, potential violations of exchangeability, and data-generating mechanisms exhibiting heteroscedasticity, sparsity, and other forms of distributional heterogeneity. A comprehensive simulation study shows that the conformalized SL achieves valid finite-sample coverage with competitive performance relative to the true data-generating mechanism. A central contribution of this work is an application to predicting creatinine levels using socio-demographic, biometric, and laboratory measurements. This example demonstrates the benefits of an ensemble with carefully selected learners designed to capture key aspects of complex regression functions, including non-linear effects, interactions, sparsity, heteroscedasticity, and robustness to outliers.
