ArXiv TLDR

Virtual Dummies: Enabling Scalable FDR-Controlled Variable Selection via Sequential Sampling of Null Features

2604.07464

Taulant Koka, Jasin Machkour, Daniel P. Palomar, Michael Muma

stat.ME, stat.ML

TLDR

Virtual Dummies enables scalable FDR-controlled variable selection by sequentially sampling projections of null features instead of materializing the dummy matrix, reducing memory and runtime by several orders of magnitude.

Key contributions

  • Eliminates the terabyte-scale memory bottleneck of explicit dummy augmentation in FDR-controlled variable selectors such as T-Rex.
  • Introduces adaptive stick-breaking to sample null feature projections, avoiding explicit dummy matrix materialization.
  • Proves pathwise universality: under mild delocalization conditions, selection paths driven by generic standardized i.i.d. dummies converge to the same Gaussian limit.
  • VD-T-Rex reduces memory and runtime by orders of magnitude while preserving FDR guarantees and power.
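
The idea of working with projections instead of full dummy vectors can be illustrated with a minimal sketch. It covers only the Gaussian special case and is not the authors' VD-LARS implementation. It relies on a standard fact: for a dummy d ~ N(0, I_n) and orthonormal directions q_1, ..., q_k, the projections q_j^T d are i.i.d. N(0, 1), so a new coordinate can be drawn when a new direction appears without ever storing d. The function names `extend_virtual_projections` and `dummy_inner_products` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def extend_virtual_projections(coords, rng):
    """Append one coordinate per virtual dummy for a newly added orthonormal direction.

    coords has shape (num_dummies, k) and stands for Q^T d for each dummy
    d ~ N(0, I_n). In the Gaussian case the new coordinate is an independent
    N(0, 1) draw, so the n-dimensional dummies are never materialized.
    """
    new_col = rng.standard_normal((coords.shape[0], 1))
    return np.hstack([coords, new_col])

def dummy_inner_products(coords, Q, v):
    """Inner products <d, v> for every virtual dummy; valid only if v lies in span(Q)."""
    return coords @ (Q.T @ v)

rng = np.random.default_rng(0)
n, num_dummies = 200, 5000

# Orthonormal basis of a 3-dimensional subspace (illustrative stand-in for the
# subspace that a forward selector would build up step by step).
Q, _ = np.linalg.qr(rng.standard_normal((n, 3)))

coords = np.empty((num_dummies, 0))
for _ in range(Q.shape[1]):
    coords = extend_virtual_projections(coords, rng)

# A vector inside the subspace, e.g. a residual the selector would correlate against.
v = Q @ np.array([1.0, -2.0, 0.5])
ip = dummy_inner_products(coords, Q, v)
```

Here the stored state is a 5000 x 3 array rather than a 200 x 5000 dummy matrix; the saving grows with n, since storage depends on the subspace dimension and not on the sample size.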

Why it matters

This paper removes a key scalability bottleneck in high-dimensional variable selection, particularly for genomics data. It makes FDR-controlled analysis feasible on biobank-scale datasets where explicit dummy augmentation would require terabytes of memory, so significant predictors can be identified at scales that were previously intractable.

Original Abstract

High-dimensional variable selection, particularly in genomics, requires error-controlling procedures that scale to millions of predictors. The Terminating-Random Experiments (T-Rex) selector achieves false discovery rate (FDR) control by aggregating results of early terminated random experiments, each combining original predictors with i.i.d. synthetic null variables (dummies). At biobank scales, however, explicit dummy augmentation requires terabytes of memory. We demonstrate that this bottleneck is not fundamental. Formalizing the information flow of forward selection through a filtration, we show that compatible selectors interact with unselected dummies solely through projections onto an adaptively evolving low-dimensional subspace. For rotationally invariant dummy distributions, we derive an adaptive stick-breaking construction sampling these projections from their exact conditional distribution given the selection history, thereby eliminating dummy matrix materialization. We prove a pathwise universality theorem: under mild delocalization conditions, selection paths driven by generic standardized i.i.d. dummies converge to the same Gaussian limit. We instantiate the theory through Virtual Dummy LARS (VD-LARS), reducing memory and runtime by several orders of magnitude while preserving the exact selection law and FDR guarantees of the T-Rex selector. Experiments on realistic genome-wide association study data confirm that VD-T-Rex controls FDR and achieves power at scales where all competing methods either fail or time out.
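
As a rough illustration of what a stick-breaking construction for a rotationally invariant distribution can look like, the sketch below samples the leading coordinates of a point uniform on the unit sphere in R^n, one coordinate at a time. This is the textbook sphere case and is assumed here only for illustration; the paper's adaptive construction may differ in its details. It uses the known fact that, given the first j coordinates, the next squared coordinate is a Beta(1/2, (n - j - 1)/2) fraction of the squared norm not yet allocated. The function name `stick_breaking_sphere_coords` is hypothetical.

```python
import numpy as np

def stick_breaking_sphere_coords(n, k, num_dummies, rng):
    """Sample the first k coordinates of points uniform on the unit sphere in R^n.

    Coordinates are drawn sequentially, each conditional on the earlier ones,
    so the remaining n - k coordinates are never generated. Requires k < n.
    Returns (coords, remaining), where remaining is the unallocated squared norm.
    """
    if not 0 <= k < n:
        raise ValueError("need 0 <= k < n")
    coords = np.empty((num_dummies, k))
    remaining = np.ones(num_dummies)  # squared norm not yet allocated
    for j in range(k):
        # Fraction of the remaining squared norm taken by coordinate j.
        frac = rng.beta(0.5, 0.5 * (n - j - 1), size=num_dummies)
        # The sign is symmetric and independent of the magnitude.
        sign = rng.choice([-1.0, 1.0], size=num_dummies)
        coords[:, j] = sign * np.sqrt(remaining * frac)
        remaining = remaining * (1.0 - frac)
    return coords, remaining

rng = np.random.default_rng(1)
coords, remaining = stick_breaking_sphere_coords(n=50, k=5, num_dummies=20000, rng=rng)
```

Each squared coordinate has mean 1/n, matching the marginal of a uniform point on the sphere, and the sampled squared coordinates plus `remaining` always sum to one.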
