ArXiv TLDR

AgentSim: A Platform for Verifiable Agent-Trace Simulation

arXiv: 2604.26653

Saber Zerhoudi, Michael Granitzer, Jelena Mitrovic

cs.IR

TLDR

AgentSim is an open-source platform that generates verifiable, stepwise reasoning traces for RAG agents, creating a new corpus for training trustworthy LLMs.

Key contributions

  • Introduces AgentSim, an open-source platform for simulating RAG agents and generating verifiable, stepwise reasoning traces.
  • Features Corpus-Aware Seeding and Active Validation to enhance trace diversity and quality.
  • Releases the Agent-Trace Corpus (ATC) with over 103,000 grounded reasoning steps across three IR benchmarks.
  • Provides a behavioral analysis revealing systematic differences in how SOTA models approach information seeking.

Why it matters

This paper addresses the critical lack of grounded reasoning data for training trustworthy agentic LLMs. By generating verifiable, stepwise traces, AgentSim enables deeper insight into agent behavior and supports the development of more reliable AI systems. The released corpus is a significant resource for future research.

Original Abstract

Training trustworthy agentic LLMs requires data that shows the grounded reasoning process, not just the final answer. Existing datasets fall short: question-answering data is outcome-only, chain-of-thought data is not tied to specific documents, and web-agent datasets track interface actions rather than the core retrieval and synthesis steps of a RAG workflow. We introduce AgentSim, an open-source platform for simulating RAG agents. It generates verifiable, stepwise traces of agent reasoning over any document collection. AgentSim uses a policy to ensure the agent widely explores the document set. It combines a multi-model validation pipeline with an active human-in-the-loop process. This approach focuses human effort on difficult steps where models disagree. Using AgentSim, we construct and release the Agent-Trace Corpus (ATC), a large collection of grounded reasoning trajectories spanning three established IR benchmarks. We make three contributions: (1) the AgentSim platform with two mechanisms, Corpus-Aware Seeding and Active Validation, that improve trace diversity and quality; (2) the Agent-Trace Corpus (ATC), over 103,000 verifiable reasoning steps spanning three IR benchmarks, with 100% grounding rate on substantive answers; and (3) a comparative behavioral analysis revealing systematic differences in how state-of-the-art models approach information seeking. Platform, toolkit, and corpus are publicly available.
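The abstract says AgentSim "uses a policy to ensure the agent widely explores the document set." The paper does not spell out the policy, but one common way to realize corpus-wide seeding is to weight seed-document sampling inversely to how often each document has already been used. The function and parameter names below are illustrative assumptions, not the platform's actual API:

```python
import random

def pick_seed(doc_ids, visit_counts, rng=random):
    """Choose the next seed document, favoring under-explored ones.

    Each document's weight is 1 / (1 + times it has already seeded
    a trace), so sampling pressure shifts toward untouched parts of
    the corpus and trace diversity increases over time.
    """
    weights = [1.0 / (1 + visit_counts.get(d, 0)) for d in doc_ids]
    return rng.choices(doc_ids, weights=weights, k=1)[0]
```

With `visit_counts = {"doc_a": 50}`, a fresh `"doc_b"` is roughly fifty times more likely to be chosen next, which is the exploration-spreading behavior the abstract describes.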
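The Active Validation idea, auto-accepting steps where validator models agree and routing only disagreements to human annotators, can be sketched as a simple triage rule. The labels and function below are hypothetical illustrations, not the paper's implementation:

```python
def route_step(validator_labels):
    """Triage one reasoning step by multi-model agreement.

    validator_labels: verdicts from several validator models for the
    same step, e.g. ["grounded", "grounded", "ungrounded"].

    Returns ("accept", label) when all validators agree, and
    ("human_review", None) otherwise, so human effort is spent only
    on the contested steps.
    """
    distinct = set(validator_labels)
    if len(distinct) == 1:
        return ("accept", distinct.pop())
    return ("human_review", None)
```

Under this scheme the fraction of steps reaching humans is exactly the model-disagreement rate, which is what lets a validation pipeline scale to 100k+ steps.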
