Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

2604.20763

Andrew Klearman, Radu Revutchi, Rohin Garg, Rishav Chakravarti, Samuel Marc Denton + 1 more

cs.IR, cs.AI, cs.LG

TLDR

This paper introduces semantic stratification for evaluating retrieval in RAG systems. Compared with heuristically constructed query sets, it offers broader semantic coverage and makes retrieval failure modes easier to identify.

Key contributions

  • Formalizes retrieval evaluation as a statistical estimation problem, highlighting limitations of current methods.
  • Introduces semantic stratification, organizing documents into entity-based clusters for evaluation.
  • Provides formal semantic coverage guarantees across retrieval regimes.
  • Provides interpretable visibility into specific retrieval failure modes.

Why it matters

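Retrieval evaluation built on heuristically constructed query sets carries a hidden bias, which makes the resulting metrics unreliable. Semantic stratification grounds evaluation in the structure of the corpus instead, so results show where retrieval succeeds and where it fails rather than a single aggregate score. That visibility supports more trustworthy decisions when building RAG systems.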

Original Abstract

Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce semantic stratification, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.
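
To make the contrast between aggregate and stratified reporting concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `stratified_report`, the stratum labels, and the recall values are all hypothetical, and the sketch assumes each evaluation query has already been assigned to one stratum (cluster) and scored with a per-query recall.

```python
from collections import defaultdict
from statistics import mean


def stratified_report(query_results, all_strata):
    """Summarize retrieval scores per stratum instead of one global average.

    query_results: list of (stratum_id, recall) pairs, one per evaluated query.
    all_strata: every stratum id found in the corpus (assumed known up front).
    """
    by_stratum = defaultdict(list)
    for stratum, recall in query_results:
        by_stratum[stratum].append(recall)

    # Mean recall within each stratum that has at least one query.
    per_stratum = {s: mean(scores) for s, scores in by_stratum.items()}

    # Strata with no queries at all: the coverage gaps a single average hides.
    missing = sorted(set(all_strata) - set(per_stratum))

    return {
        "aggregate": mean(recall for _, recall in query_results),
        "per_stratum": per_stratum,
        "missing": missing,
        "coverage": len(per_stratum) / len(set(all_strata)),
    }


# Hypothetical data: three queries hit one stratum, one hits another,
# and a third stratum has no queries.
results = [
    ("drug_interactions", 1.0),
    ("drug_interactions", 1.0),
    ("drug_interactions", 1.0),
    ("dosage", 0.0),
]
strata = ["drug_interactions", "dosage", "contraindications"]
report = stratified_report(results, strata)
print(report)
```

In this toy data the aggregate recall is 0.75, which looks acceptable, yet one stratum scores 0.0 and another has no queries at all. Under the abstract's framing, the empty stratum is where new queries would be generated.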
