Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
Tz-Huan Hsu, Jheng-Hong Yang, Jimmy Lin
TLDR
Pi-Serini demonstrates that well-tuned lexical retrieval paired with capable LLMs can effectively support deep agentic search, outperforming released search agents that rely on dense retrievers.
Key contributions
- Introduces Pi-Serini, a search agent combining BM25 with frontier LLMs for deep research.
- Shows lexical retrieval (BM25) is sufficient for effective agentic search when paired with capable LLMs.
- Pi-Serini achieves 83.1% answer accuracy and 94.7% surfaced evidence recall on BrowseComp-Plus, surpassing agents built on dense retrievers.
- BM25 tuning and increased retrieval depth significantly improve agentic search performance.
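The BM25 tuning and retrieval-depth findings can be illustrated with a minimal, self-contained BM25 scorer. This is a sketch over a toy corpus, not Pi-Serini's implementation (which builds on Lucene-backed indexes); the `k1`, `b`, and `depth` parameters shown are the standard BM25 knobs the paper's ablations vary.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=0.9, b=0.4, depth=10):
    """Rank `docs` (strings) against `query` with BM25; return top-`depth` doc indices.

    k1 controls term-frequency saturation, b controls length normalization,
    and depth is how many hits the agent gets to see per retrieval call.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    q_terms = query.lower().split()
    # Document frequency for each query term.
    df = {t: sum(1 for d in tokenized if t in d) for t in q_terms}
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order[:depth]
```

Tuning here means sweeping `k1` and `b` on held-out queries rather than keeping library defaults, and increasing `depth` surfaces more candidate evidence per call for the LLM to sift through.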
Why it matters
This paper challenges the common belief that dense retrievers are essential for advanced agentic search. It provides strong evidence that well-configured lexical methods, when paired with powerful LLMs, can be highly effective for complex research tasks. This insight could simplify agentic system design and reduce computational overhead.
Original Abstract
Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lexical retriever with sufficient retrieval depth can support effective deep research when paired with more capable LLMs. Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, while increasing retrieval depth further improves surfaced evidence recall by 25.3% over the shallow-retrieval setting. Source code is available at https://github.com/justram/pi-serini.
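The abstract describes a search agent equipped with three tools for retrieving, browsing, and reading documents. Below is a minimal sketch of what such a tool interface might look like; the function names, signatures, and the scripted stand-in policy are illustrative assumptions and not Pi-Serini's actual API, which drives these tools with a frontier LLM.

```python
def make_tools(corpus):
    """Build retrieve/browse/read tools over a list of {'id', 'title', 'text'} docs."""
    by_id = {d["id"]: d for d in corpus}

    def retrieve(query, k=5):
        # Toy lexical match: rank docs by query-term overlap (BM25 in a real system).
        terms = set(query.lower().split())
        scored = sorted(
            corpus,
            key=lambda d: len(terms & set(d["text"].lower().split())),
            reverse=True,
        )
        return [d["id"] for d in scored[:k]]

    def browse(doc_id):
        # Short preview so the agent can decide whether a full read is worthwhile.
        d = by_id[doc_id]
        return f"{d['title']}: {d['text'][:40]}"

    def read(doc_id):
        return by_id[doc_id]["text"]

    return retrieve, browse, read


def run_agent(question, corpus, max_steps=3):
    """Scripted stand-in for the LLM loop: retrieve, browse the hits, read the best."""
    retrieve, browse, read = make_tools(corpus)
    hits = retrieve(question, k=max_steps)
    _previews = [browse(h) for h in hits]  # an LLM would reason over these
    return read(hits[0]) if hits else None
```

In the real system, the loop between tool calls is driven by the LLM's reasoning rather than a fixed script, which is exactly where the paper argues model capability compensates for the simplicity of lexical retrieval.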