ArXiv TLDR

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

arXiv:2605.04018

Yilun Zhao, Jinbiao Wei, Tingyu Song, Siyue Zhang, Chen Zhao + 1 more

cs.CL · cs.IR

TLDR

This paper introduces BRIGHT-Pro, a new benchmark, and RTriever-Synth, a training corpus, to advance reasoning-intensive retrieval for agentic search systems.

Key contributions

  • Introduced BRIGHT-Pro, an expert-annotated benchmark for reasoning-intensive retrieval in agentic search.
  • Developed RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives for training.
  • Fine-tuned RTriever-4B using RTriever-Synth, achieving substantial performance gains over its base model.
  • Showed that aspect-aware and agentic evaluations reveal retriever behaviors missed by standard metrics.
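To make the "aspect-aware evaluation" idea concrete, here is a minimal sketch of an aspect-coverage metric: each query's gold evidence is split into aspects, and a retriever is credited only for the fraction of aspects its top-k results touch. The function name, data layout, and k are illustrative assumptions, not the benchmark's actual metric definition.

```python
def aspect_coverage(retrieved_ids, gold_aspects, k=10):
    """Fraction of evidence aspects covered by the top-k results.

    retrieved_ids: ranked list of passage ids from the retriever.
    gold_aspects: list of sets; each set holds the passage ids that
    satisfy one aspect of the query (hypothetical data layout).
    """
    top_k = set(retrieved_ids[:k])
    # an aspect counts as covered if any of its gold passages was retrieved
    covered = sum(1 for aspect in gold_aspects if aspect & top_k)
    return covered / len(gold_aspects)
```

Unlike single-gold recall, this rewards retrievers that assemble a complementary evidence portfolio rather than piling up near-duplicates of one relevant passage.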

Why it matters

Reasoning-intensive retrieval is crucial for advanced agentic search systems, but current evaluation and training methods are insufficient. This work provides a robust new benchmark (BRIGHT-Pro) and a novel training corpus (RTriever-Synth) to significantly advance retriever capabilities. The resulting RTriever-4B model demonstrates improved performance, paving the way for more effective AI agents.

Original Abstract

Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.
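The training recipe the abstract describes (complementary positives plus positive-conditioned hard negatives) is typically realized with a contrastive objective. Below is a hypothetical InfoNCE-style sketch in pure Python: each aspect's positive competes against the shared hard negatives. The function, temperature, and normalization assumptions are ours for illustration; the paper's actual loss and LoRA fine-tuning setup may differ.

```python
import math

def info_nce_loss(q, pos_list, neg_list, tau=0.05):
    """InfoNCE-style contrastive loss for one query (illustrative).

    q: query embedding; pos_list: complementary positive embeddings,
    one per evidence aspect; neg_list: hard-negative embeddings.
    All vectors are assumed L2-normalized, so dot product = cosine sim.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    losses = []
    for p in pos_list:
        pos_logit = math.exp(dot(q, p) / tau)
        neg_logits = sum(math.exp(dot(q, n) / tau) for n in neg_list)
        # each positive is scored against the full pool of hard negatives
        losses.append(-math.log(pos_logit / (pos_logit + neg_logits)))
    return sum(losses) / len(losses)
```

In practice this would run batched on GPU (e.g. with PyTorch) over embeddings from the LoRA-adapted model; the averaging over positives reflects the multi-aspect framing, where every aspect's evidence must be pulled toward the query.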
