ArXiv TLDR

Measuring and Mitigating the Distributional Gap Between Real and Simulated User Behaviors

2605.07847

Shuhaib Mehri, Philippe Laban, Sumuk Shashidhar, Marwa Abdulhai, Sergey Levine, et al.

cs.CL

TLDR

This paper introduces a clustering-based method to measure the distributional gap between real and simulated user behaviors, and uses it to evaluate 24 LLM-based user simulators.

Key contributions

  • Developed a method to quantify the distributional gap between real and simulated user behaviors using clustering.
  • Systematically evaluated 24 LLM-based user simulators on coding and writing tasks, revealing a large distributional gap.
  • Showed that combining behaviorally complementary simulators significantly reduces the gap to real user distributions.
  • Used TF-IDF analysis to surface interpretable patterns of behaviors captured, missed, and hallucinated by simulators.
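The measurement idea behind the first two contributions can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each conversation has already been mapped to a numeric behavior embedding, and the cluster count `k`, the NumPy-only k-means, and the Jensen-Shannon divergence are illustrative choices (the paper's exact representations and divergence metrics may differ).

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means over behavior embeddings (illustrative, NumPy-only)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

def quantize(X, centers):
    """Assign embeddings to clusters; return the discrete cluster distribution."""
    dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
    counts = np.bincount(dists.argmin(1), minlength=len(centers))
    return counts / counts.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * np.log2(a / b)).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy data standing in for real vs. simulated behavior embeddings.
rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, (200, 8))
sim = rng.normal(0.5, 1.0, (200, 8))

centers = kmeans(np.vstack([real, sim]), k=6)  # shared codebook
gap = js_divergence(quantize(real, centers), quantize(sim, centers))
```

Combining two simulators, as in the third contribution, would correspond to pooling their conversations (or averaging their cluster distributions) before computing the divergence against the real-user distribution.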

Why it matters

As AI assistants increasingly rely on user simulators for training and evaluation, ensuring that simulators accurately reflect diverse real user behaviors is crucial. This work provides a framework for measuring and understanding the limitations of current simulators: it quantifies the distributional gap and proposes a mitigation strategy, paving the way for more robust AI assistant development.

Original Abstract

As user simulators are increasingly used for interactive training and evaluation of AI assistants, it is essential that they represent the diverse behaviors of real users. While existing works train user simulators to generate human-like responses, whether they capture the broad and heterogeneous distribution of real user behaviors remains an open question. In this work, we introduce a method to measure the distributional gap between real and simulated user behaviors, validated through a human study and ablations. Given a dataset of real and simulated conversations, our method extracts representations of user behavior from each conversation, quantizes them into discrete distributions via clustering, then computes divergence metrics. We provide the first systematic evaluation of 24 LLM-based user simulators on coding and writing tasks, and reveal a large distributional gap from real users that varies across model families, scales, and behavioral facets. Pairwise comparisons show that most simulators behave similarly, while a few stand apart. Combining behaviorally complementary simulators brings the resulting distribution closer to real users compared to either simulator on its own. Finally, a TF-IDF analysis of the clusters surfaces interpretable patterns of behaviors that simulators capture, miss, and hallucinate.
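The abstract's closing step, surfacing interpretable behavior patterns per cluster via TF-IDF, can be sketched as below. This is an illustrative stand-in, not the paper's exact weighting scheme: it assumes each cluster's behavior descriptions have been concatenated into one "document", and uses a basic tf × log(N/df) score with whitespace tokenization.

```python
import math
from collections import Counter

def tfidf_top_terms(cluster_texts, top_n=3):
    """For each cluster 'document', return its highest-TF-IDF terms.

    cluster_texts: one string per cluster (all behavior descriptions
    assigned to that cluster, concatenated). Terms appearing in every
    cluster get an IDF of zero and are filtered out of the top terms.
    """
    docs = [text.lower().split() for text in cluster_texts]
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency per term
    n = len(docs)
    top_terms = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()}
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        top_terms.append([w for w, s in ranked[:top_n] if s > 0])
    return top_terms

# Hypothetical cluster summaries of user behaviors (made-up examples).
clusters = [
    "asks for clarification asks follow-up questions about requirements",
    "pastes error traceback pastes stack trace and asks for a fix",
    "requests shorter rewrite requests tone change for the draft",
]
top = tfidf_top_terms(clusters)
```

Terms that score highly only for real-user clusters point at behaviors simulators miss; terms distinctive to simulator-only clusters point at hallucinated behaviors.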

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.