ArXiv TLDR

Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines

2604.22661

Negar Arabzadeh, Andrew Drozdov, Michael Bendersky, Matei Zaharia

cs.IR, cs.CL

TLDR

This paper evaluates Query Performance Prediction (QPP) for selecting optimal query variants in RAG pipelines, revealing a "utility gap" between retrieval and generation.

Key contributions

  • Investigates QPP for intra-topic query variant selection in RAG pipelines.
  • Identifies a "utility gap" where retrieval-optimal variants don't always yield best generations.
  • Shows QPP can reliably improve end-to-end RAG quality over original queries.
  • Finds that lightweight pre-retrieval QPP often matches or outperforms more expensive post-retrieval methods.
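
To make the idea of pre-retrieval variant selection concrete, here is a minimal sketch. It scores each query variant by its average inverse document frequency (avgIDF), a classic pre-retrieval predictor, and keeps the top-scoring variant. This summary does not say which predictors the paper evaluates, so avgIDF, the toy corpus, and the helper names (`build_idf`, `avg_idf`, `select_variant`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Return (idf table, number of docs) for a list of plain-text documents.

    Uses a smoothed IDF: log((N + 1) / (df + 1)) + 1.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    idf = {t: math.log((n_docs + 1) / (df + 1)) + 1.0 for t, df in doc_freq.items()}
    return idf, n_docs


def avg_idf(query, idf, n_docs):
    """Average IDF of the query terms (hypothetical pre-retrieval predictor)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    # Terms never seen in the corpus get the IDF of a term with df = 0.
    unseen = math.log(n_docs + 1) + 1.0
    return sum(idf.get(t, unseen) for t in terms) / len(terms)


def select_variant(variants, idf, n_docs):
    """Pick the variant with the highest predicted performance."""
    return max(variants, key=lambda q: avg_idf(q, idf, n_docs))


# Toy data, invented for this example only.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "solar panels convert sunlight into electricity",
    "the panels on the roof",
]
variants = ["the panels", "solar panels electricity"]

idf, n_docs = build_idf(corpus)
best = select_variant(variants, idf, n_docs)
print(best)
```

Because the predictor only needs corpus statistics and the query text, selection happens before any retrieval or generation call, which is where the latency saving comes from. Note that this simple version rewards out-of-vocabulary terms, so a production predictor would likely need a more careful treatment of unseen words.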

Why it matters

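Running the full RAG pipeline for every LLM-generated query reformulation is computationally expensive. This paper studies whether QPP can pick the best variant before those retrieval and generation costs are incurred, and finds that it can reliably improve end-to-end RAG quality over the original query. It also reveals a "utility gap": variants that maximize retrieval metrics such as nDCG often do not yield the best generated answers. Lightweight pre-retrieval predictors frequently match or outperform costlier post-retrieval ones, which makes low-latency variant selection practical.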

Original Abstract

Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
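
The abstract's distinction between correlation-based and decision-based evaluation, and the "utility gap" itself, can be illustrated with a small per-topic sketch. The code below is an assumption-laden illustration, not the paper's protocol: it uses Kendall's tau as the correlation measure, "did the predictor pick the best-generating variant" and "gain over the original query" as decision measures, and made-up scores for three variants of one topic. The function names and the convention that index 0 is the original query are hypothetical.

```python
def kendall_tau(xs, ys):
    """Kendall tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    pairs = n * (n - 1) / 2
    return (concordant - discordant) / pairs if pairs else 0.0


def argmax(scores):
    """Index of the highest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def evaluate_topic(qpp, ndcg, gen, original_idx=0):
    """Compare QPP scores with retrieval (nDCG) and generation quality
    for the variants of a single topic."""
    chosen = argmax(qpp)
    return {
        # Correlation-based view: does QPP rank variants like the target metric?
        "tau_vs_ndcg": kendall_tau(qpp, ndcg),
        "tau_vs_gen": kendall_tau(qpp, gen),
        # Utility gap: retrieval-optimal and generation-optimal variants differ.
        "utility_gap": argmax(ndcg) != argmax(gen),
        # Decision-based view: was the selected variant actually the best one,
        # and did it beat the original query?
        "picked_best_gen": chosen == argmax(gen),
        "gain_over_original": gen[chosen] - gen[original_idx],
    }


# Invented scores for three variants of one topic (index 0 = original query).
qpp_scores = [0.2, 0.9, 0.5]
ndcg_scores = [0.40, 0.55, 0.70]
gen_scores = [0.50, 0.80, 0.60]

result = evaluate_topic(qpp_scores, ndcg_scores, gen_scores)
print(result)
```

In this toy topic the variant with the highest nDCG is not the one with the best generated answer, so a selector tuned purely for retrieval would choose differently from one tuned for end-to-end quality. A predictor can also correlate weakly with nDCG while still making the right selection decision, which is one reason to report both kinds of metric.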
