SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions

2604.27878

Saber Zerhoudi

cs.IR

TLDR

SimEval-IR is a new toolkit and benchmark suite for evaluating user simulators in IR, distinguishing behavioral realism from tester reliability.

Key contributions

  • Enables standardized evaluation of user simulators by distinguishing behavioral realism from tester reliability.
  • Defines a canonical session schema for both search and conversational interactions (see the sketch after this list).
  • Includes three executable benchmarks for behavioral realism, tester reliability, and their interrelation.
  • Reveals that common 'human-likeness' tests poorly predict system-ranking validity, unlike click-depth distance and Fréchet distance over session embeddings.
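
This summary doesn't spell out the schema's fields, but a minimal sketch of what a canonical session record could look like, with hypothetical names (Turn, Session, and their attributes are illustrative, not SimEval-IR's actual API), might be:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Turn:
    """One interaction step; field names are hypothetical, not SimEval-IR's API."""
    role: str                         # "user" or "system"
    utterance: Optional[str] = None   # issued query or conversational message
    clicked_ranks: list = field(default_factory=list)  # click depths, if any
    timestamp: Optional[float] = None

@dataclass
class Session:
    """A search or conversational session in one shared representation."""
    session_id: str
    source_dataset: str               # filled in by a dataset adapter
    turns: list = field(default_factory=list)
```

Representing both modalities in one record type is what would let the same realism and reliability benchmarks run over heterogeneous logs via the toolkit's dataset adapters.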

Why it matters

User simulators are crucial for interactive IR, but their evaluation lacks standardization. SimEval-IR provides a unified toolkit that clarifies the distinct, sometimes conflicting goals of behavioral realism and tester reliability. It also shows that the dominant 'human-likeness' check barely predicts system-ranking validity, pointing future simulator evaluation toward distributional metrics such as click-depth distance and Fréchet distance.

Original Abstract

User simulators are increasingly central to interactive information retrieval, yet the community lacks standardized evaluation tools. Simulators serve two objectives, behavioral realism (matching real user behavior) and tester reliability (producing valid system rankings), and these are often conflated despite being distinct and sometimes conflicting. We present SimEval-IR, an open-source toolkit and benchmark suite that makes this distinction measurable. SimEval-IR provides: (1) a canonical session schema unifying session search and conversational interactions, with validated dataset adapters and explicit loss accounting; (2) three executable benchmarks covering behavioral realism, tester reliability with RATE-style estimation, and an analysis linking the two; and (3) baseline results across four real datasets in two languages and four simulator families. Our key finding: the classifier-discriminator "human-likeness" check, the dominant realism test in the literature, has essentially no pooled predictive power for system-ranking validity (r = +0.09, n = 48), while marginal click-depth distance and Fréchet distance over session embeddings give a much stronger signal (|r| = 0.43 and 0.40, p ≤ 0.005). SimEval-IR is released with all configurations and scripts to reproduce the reported analysis.
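
For concreteness, the two metrics the abstract finds predictive can be sketched as below. The exact formulations in SimEval-IR aren't given here, so this is a sketch assuming the standard Gaussian (FID-style) Fréchet distance over session embeddings and a total-variation distance between marginal click-depth histograms; `frechet_distance` and `click_depth_distance` are illustrative names, not the toolkit's API:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_emb: np.ndarray, sim_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two session-embedding sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}), as used for FID."""
    mu1, mu2 = real_emb.mean(axis=0), sim_emb.mean(axis=0)
    s1 = np.cov(real_emb, rowvar=False)
    s2 = np.cov(sim_emb, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))

def click_depth_distance(real_depths, sim_depths, max_depth: int = 10) -> float:
    """Total-variation distance between marginal click-depth distributions,
    truncated at max_depth (one plausible instantiation of the metric)."""
    bins = np.arange(1, max_depth + 2)
    p, _ = np.histogram(np.clip(real_depths, 1, max_depth), bins=bins)
    q, _ = np.histogram(np.clip(sim_depths, 1, max_depth), bins=bins)
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * float(np.abs(p - q).sum())
```

Both measures compare a simulator's output distribution to real logs directly, which may explain why they track system-ranking validity better than a learned discriminator that only checks whether individual sessions look human.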
