ArXiv TLDR

Reproducing Complex Set-Compositional Information Retrieval

arXiv: 2605.03824

Vincent Degenhart, Dewi Timman, Arjen P. de Vries, Faegheh Hasibi, Mohanna Hoveyda

cs.CL

TLDR

This study benchmarks IR methods on complex set-compositional queries, revealing that neural models struggle with true constraint satisfaction on a new controlled benchmark.

Key contributions

  • Introduces LIMIT+, a controlled benchmark for set-compositional IR where relevance depends on arbitrary attribute predicates and constraint satisfaction rather than pretrained knowledge.
  • Neural retrievers more than double BM25's effectiveness on QUEST (Recall@100 >0.41 vs. 0.20; see the metric sketch after this list) but collapse on LIMIT+, falling from ≈0.42 to below 0.02.
  • Lexical methods perform strongly on LIMIT+ (~0.96 Recall@100) and stay more stable as compositional depth grows.
  • All methods degrade as compositional depth increases; dense approaches collapse most sharply.
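
A note on the metric: Recall@100 is the fraction of a query's relevant documents that appear among the top 100 retrieved results. A minimal sketch (the doc IDs here are hypothetical, not from the paper's data):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 100) -> float:
    """Fraction of the relevant documents found in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Toy example: 2 of the 4 relevant docs appear in the ranking -> 0.5
print(recall_at_k(["d3", "d7", "d1", "d9"], {"d1", "d3", "d5", "d8"}))
```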

Why it matters

This paper shows that current neural IR models may rely on "semantic shortcuts" rather than true constraint satisfaction for complex queries. Its controlled benchmark, LIMIT+, tests this rigorously and reveals significant limitations of dense retrievers. The findings motivate the development of more robust, compositionally aware retrieval systems.

Original Abstract

Complex information needs may involve set-compositional queries using conjunction, disjunction, and exclusion, yet it remains unclear whether current retrieval paradigms genuinely satisfy such constraints or exploit 'semantic shortcuts'. We conduct a reproducibility study to benchmark major retrieval families and reasoning-targeted methods on QUEST and QUEST+Variants, and introduce LIMIT+, a controlled benchmark where relevance depends on arbitrary attribute predicates and constraint satisfaction, and less on pretrained knowledge. Our findings show that (i) on QUEST, the best neural retrievers achieve an effectiveness that is more than double what can be achieved with BM25 (Recall@100 >0.41 vs. 0.20), but reasoning-targeted methods like ReasonIR and Search-R1 do not outperform general-purpose retrievers uniformly; (ii) on LIMIT+, gains fail to transfer, where the strongest QUEST method collapses from Recall@100 ≈0.42 to below 0.02, while classic lexical retrieval gains to ~0.96. Lastly, (iii) stratifying by compositional depth reveals a consistent degradation across all methods, where algebraic sparse and lexical methods show more stable performance while dense approaches collapse. We release code and LIMIT+ data generation scripts to support future reproducibility and controlled evaluation.
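
To make the set-compositional setting concrete, here is an illustrative brute-force "oracle" that satisfies such constraints exactly via set algebra, the behavior LIMIT+ is designed to probe. The corpus, attributes, and query below are hypothetical examples, not the paper's data or its generation scripts:

```python
# Hypothetical toy corpus: each document is a set of attribute values.
corpus = {
    "d1": {"color": "red", "shape": "cube", "material": "wood"},
    "d2": {"color": "red", "shape": "sphere", "material": "metal"},
    "d3": {"color": "blue", "shape": "cube", "material": "wood"},
    "d4": {"color": "red", "shape": "cube", "material": "metal"},
}

def matching(attr: str, value: str) -> set[str]:
    """IDs of all documents whose attribute equals the given value."""
    return {doc_id for doc_id, attrs in corpus.items() if attrs.get(attr) == value}

# Conjunction plus exclusion: "red AND cube, NOT metal" as set intersection and difference.
print((matching("color", "red") & matching("shape", "cube")) - matching("material", "metal"))
# -> {'d1'}

# Disjunction is set union: "sphere OR blue".
print(matching("shape", "sphere") | matching("color", "blue"))  # -> {'d2', 'd3'}
```

A retriever that genuinely satisfies constraints should reproduce exactly these result sets; one relying on semantic similarity alone has no mechanism to enforce the exclusion.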
