ArXiv TLDR

Price of Quality: Sufficient Conditions for Sparse Recovery using Mixed-Quality Data

2605.10713

Youssef Chaabouni, David Gamarnik

stat.ML, cs.IT, cs.LG, math.ST

TLDR

This paper gives the first sufficient conditions for sparse recovery from mixed-quality data, revealing a fundamental difference between the information-theoretic and algorithmic recovery thresholds.

Key contributions

  • Establishes sample-size conditions for sparse recovery with mixed-quality data, defining the 'Price of Quality': the number of low-quality samples needed to replace one high-quality sample.
  • Shows the information-theoretic price of quality is uniformly bounded for an agnostic decoder (one high-quality sample is never worth more than two low-quality ones) but can grow arbitrarily large for an informed decoder that knows per-sample variances.
  • Analyzes the LASSO in the agnostic setting and shows its recovery threshold matches the homogeneous-noise case, depending only on the average noise level, so computational recovery is robust to heterogeneity.
  • Exposes a fundamental difference in how the information-theoretic and algorithmic thresholds adapt to changes in data quality.
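One way to build intuition for why the informed price of quality can blow up (a back-of-envelope sketch, not the paper's analysis): if an informed decoder values each sample in proportion to its inverse noise variance, as in classical inverse-variance weighting, then one high-quality sample is worth (σ₂/σ₁)² low-quality ones, which grows without bound as σ₁ → 0. The agnostic bound of two is the paper's result and is not derived here.

```python
# Illustrative inverse-variance-weighting intuition for the informed setting.
# This heuristic is an assumption for illustration, not the paper's formula.
def informed_price_of_quality(sigma_hi: float, sigma_lo: float) -> float:
    """Low-quality samples needed to match one high-quality sample,
    if samples are weighted by inverse noise variance (1 / sigma^2)."""
    return (sigma_lo / sigma_hi) ** 2

# The ratio grows without bound as the high-quality noise level shrinks.
for sigma_hi in (0.5, 0.1, 0.01):
    print(sigma_hi, informed_price_of_quality(sigma_hi, 1.0))
```

Under this heuristic, halving the high-quality noise level quadruples the price of quality, matching the claim that the informed price can be made arbitrarily large.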

Why it matters

This research provides the first sufficient conditions for sparse recovery with mixed-quality data, a common practical setting. It highlights a key contrast: the information-theoretic limits exhibit a 'Price of Quality,' while algorithms such as the LASSO are surprisingly robust to data heterogeneity. These results can inform data acquisition and processing strategies.

Original Abstract

We study sparse recovery when observations come from mixed-quality sources: a small collection of high-quality measurements with small noise variance and a larger collection of lower-quality measurements with higher variance. For this heterogeneous-noise setting, we establish sample-size conditions for information-theoretic and algorithmic recovery. On the information-theoretic side, we show that it is sufficient for $(n_1, n_2)$ to satisfy a linear trade-off defining the Price of Quality: the number of low-quality samples needed to replace one high-quality sample. In the agnostic setting, where the decoder is completely agnostic to the quality of the data, it is uniformly bounded, and in particular one high-quality sample is never worth more than two low-quality samples for this sufficient condition to hold. In the informed setting, where the decoder is informed of per-sample variances, the price of quality can grow arbitrarily large. On the algorithmic side, we analyze the LASSO in the agnostic setting and show that the recovery threshold matches the homogeneous-noise case and only depends on the average noise level, revealing a striking robustness of computational recovery to data heterogeneity. Together, these results give the first conditions for sparse recovery with mixed-quality data and expose a fundamental difference between how the information-theoretic and algorithmic thresholds adapt to changes in data quality.
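The agnostic setting described in the abstract can be simulated in a few lines: generate a sparse signal, observe it through a Gaussian design with two noise levels, and run a LASSO that is blind to which samples are high quality. The sketch below is illustrative only (not the paper's experiments); the problem sizes, noise levels, and regularization choice are assumptions, and the LASSO is solved with a plain ISTA loop to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem sizes and noise levels (illustrative assumptions, not from the paper)
n1, n2, p, k = 50, 200, 400, 5        # high-/low-quality counts, dimension, sparsity
sigma1, sigma2 = 0.1, 1.0             # per-group noise standard deviations
n = n1 + n2

# Sparse ground-truth signal
beta = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta[support] = 4.0

# Mixed-quality measurements: same Gaussian design, two noise levels
X = rng.standard_normal((n, p))
noise = np.concatenate([sigma1 * rng.standard_normal(n1),
                        sigma2 * rng.standard_normal(n2)])
y = X @ beta + noise

# Agnostic LASSO via ISTA: min_w (1/2n)||y - Xw||^2 + lam * ||w||_1,
# blind to which rows of X are high quality
lam = sigma2 * np.sqrt(2 * np.log(p) / n)   # standard universal-threshold choice
L = np.linalg.norm(X, ord=2) ** 2 / n       # Lipschitz constant of the gradient
w = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n
    z = w - grad / L
    w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding

# Estimated support: coordinates with non-negligible magnitude
recovered = np.flatnonzero(np.abs(w) > 1.0)
```

With these sizes the LASSO recovers the support despite never being told which samples are noisy, consistent with the robustness to heterogeneity described in the abstract.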
