MarketBench: Evaluating AI Agents as Market Participants
Andrey Fradkin, Rohit Krishnan
TLDR
MarketBench evaluates AI agents' self-assessment and cost estimation for market participation, revealing miscalibration as a key bottleneck.
Key contributions
- Introduces MarketBench, a benchmark for assessing whether AI agents have informative signals of their own ability to complete a task and of the cost of doing so.
- Finds that six recently released LLMs are miscalibrated on both success probability and token usage, so auctions built from their self-reports diverge from a full-information allocation.
- Shows that adding capability information from prior experiments to the context improves calibration, but only modestly narrows the gap to the full-information benchmark.
- Pinpoints AI agent self-assessment as a critical bottleneck for effective market coordination.
Why it matters
Markets can only allocate tasks well if each participant can judge its own chance of success and its own costs. MarketBench tests whether AI agents can do this, using a 93-task subset of SWE-bench Lite and six recently released LLMs as a demonstration. The LLMs are miscalibrated on both success probability and token usage, so auctions built from their self-reports diverge from a full-information allocation. The results point to self-assessment as a key bottleneck for market-style coordination of AI agents.
Original Abstract
Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to a full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.
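To make the abstract's argument concrete, here is a minimal Python sketch of how miscalibrated self-reports can change which agent wins a task. It is not the paper's auction mechanism or scoring code: the `Report` type, the expected-surplus rule, the Brier score as the calibration measure, and every number below are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One agent's estimate for one task (hypothetical structure)."""
    agent: str
    p_success: float  # probability of completing the task
    tokens: float     # tokens expected to be spent


def brier_score(predicted, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)


def expected_surplus(report, task_value, price_per_token):
    """Expected value of assigning the task to this agent, net of token cost."""
    return report.p_success * task_value - report.tokens * price_per_token


def allocate(reports, task_value, price_per_token):
    """Assign the task to the agent with the highest expected surplus."""
    best = max(reports, key=lambda r: expected_surplus(r, task_value, price_per_token))
    return best.agent


# Illustrative numbers only: agent A overstates its success rate and
# understates its token usage; agent B is close to accurate.
self_reports = [Report("A", 0.9, 1000), Report("B", 0.6, 2000)]
true_values = [Report("A", 0.4, 3000), Report("B", 0.7, 2000)]

TASK_VALUE = 10.0        # assumed payoff for a solved task
PRICE_PER_TOKEN = 0.001  # assumed cost per token

winner_from_self_reports = allocate(self_reports, TASK_VALUE, PRICE_PER_TOKEN)
winner_full_information = allocate(true_values, TASK_VALUE, PRICE_PER_TOKEN)

# Calibration of an agent that always reports 0.9 but succeeds on 2 of 4 tasks.
overconfident_brier = brier_score([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 1])

print(winner_from_self_reports, winner_full_information, overconfident_brier)
```

With these made-up inputs, the allocation based on self-reports picks agent A, while the allocation based on the true values picks agent B. That is the kind of divergence from a full-information allocation that the abstract describes, reproduced here in a toy setting.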