SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions

2604.27878

Saber Zerhoudi

cs.IR

TLDR

SimEval-IR is a new toolkit and benchmark suite for evaluating user simulators in IR, distinguishing behavioral realism from tester reliability.

Key contributions

  • Enables standardized evaluation of user simulators by distinguishing behavioral realism from tester reliability.
  • Defines a canonical session schema for both search and conversational interactions (see the sketch after this list).
  • Includes three executable benchmarks for behavioral realism, tester reliability, and their interrelation.
  • Reveals that common 'human-likeness' tests poorly predict system-ranking validity, unlike click-depth distance and Fréchet distance over session embeddings.
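
This summary doesn't spell out the schema's fields, but a minimal sketch of what a canonical session record could look like, with hypothetical names (Turn, Session, and their attributes are illustrative, not SimEval-IR's actual API), might be:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Turn:
    """One interaction step; field names are hypothetical, not SimEval-IR's API."""
    role: str                         # "user" or "system"
    utterance: Optional[str] = None   # issued query or conversational message
    clicked_ranks: list = field(default_factory=list)  # click depths, if any
    timestamp: Optional[float] = None

@dataclass
class Session:
    """A search or conversational session in one shared representation."""
    session_id: str
    source_dataset: str               # filled in by a dataset adapter
    turns: list = field(default_factory=list)
```

Representing both modalities in one record type is what would let the same realism and reliability benchmarks run over heterogeneous logs via the toolkit's dataset adapters.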

Why it matters

User simulators are crucial for interactive IR, but their evaluation lacks standardization. SimEval-IR provides a unified toolkit that clarifies the distinct, sometimes conflicting goals of behavioral realism and tester reliability. It also shows that the dominant 'human-likeness' check barely predicts system-ranking validity, pointing future simulator evaluation toward distributional metrics such as click-depth distance and Fréchet distance.

Original Abstract

User simulators are increasingly central to interactive information retrieval, yet the community lacks standardized evaluation tools. Simulators serve two objectives, behavioral realism (matching real user behavior) and tester reliability (producing valid system rankings), and these are often conflated despite being distinct and sometimes conflicting. We present SimEval-IR, an open-source toolkit and benchmark suite that makes this distinction measurable. SimEval-IR provides: (1) a canonical session schema unifying session search and conversational interactions, with validated dataset adapters and explicit loss accounting; (2) three executable benchmarks covering behavioral realism, tester reliability with RATE-style estimation, and an analysis linking the two; and (3) baseline results across four real datasets in two languages and four simulator families. Our key finding: the classifier-discriminator "human-likeness" check, the dominant realism test in the literature, has essentially no pooled predictive power for system-ranking validity (r = +0.09, n = 48), while marginal click-depth distance and Fréchet distance over session embeddings give a much stronger signal (|r| = 0.43 and 0.40, p ≤ 0.005). SimEval-IR is released with all configurations and scripts to reproduce the reported analysis.
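
For concreteness, the two metrics the abstract finds predictive can be sketched as below. The exact formulations in SimEval-IR aren't given here, so this is a sketch assuming the standard Gaussian (FID-style) Fréchet distance over session embeddings and a total-variation distance between marginal click-depth histograms; `frechet_distance` and `click_depth_distance` are illustrative names, not the toolkit's API:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_emb: np.ndarray, sim_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two session-embedding sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}), as used for FID."""
    mu1, mu2 = real_emb.mean(axis=0), sim_emb.mean(axis=0)
    s1 = np.cov(real_emb, rowvar=False)
    s2 = np.cov(sim_emb, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))

def click_depth_distance(real_depths, sim_depths, max_depth: int = 10) -> float:
    """Total-variation distance between marginal click-depth distributions,
    truncated at max_depth (one plausible instantiation of the metric)."""
    bins = np.arange(1, max_depth + 2)
    p, _ = np.histogram(np.clip(real_depths, 1, max_depth), bins=bins)
    q, _ = np.histogram(np.clip(sim_depths, 1, max_depth), bins=bins)
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * float(np.abs(p - q).sum())
```

Both measures compare a simulator's output distribution to real logs directly, which may explain why they track system-ranking validity better than a learned discriminator that only checks whether individual sessions look human.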
