WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu + 12 more
TLDR
WildClawBench introduces a new benchmark for evaluating long-horizon, real-world agents using native runtimes and real tools.
Key contributions
- Native-runtime benchmark with 60 human-authored, bilingual, multimodal tasks.
- Tasks run in reproducible Docker containers with real tools and CLI harnesses (OpenClaw, Claude Code, Codex, Hermes Agent), averaging roughly 8 minutes and 20+ tool calls each.
- Hybrid grading combines rule-based checks, environment-state auditing, and LLM/VLM judges (see the sketch after this list).
- Reveals that current frontier models struggle: the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, and every other model stays below 60%.
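
To make the hybrid grading idea concrete, here is a minimal sketch of how a grader combining deterministic rules, environment-state auditing, and an LLM judge could be wired together. Every function name, container name, and scoring rule below is an illustrative assumption, not WildClawBench's actual code or API.

```python
# Hypothetical hybrid grader sketch: rule-based checks, environment-state
# auditing via `docker exec`, and an LLM judge for semantic verification.
# All helpers and names here are assumptions for illustration only.
import json
import os
import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def rule_check_file_exists(path: str) -> CheckResult:
    """Deterministic rule: the task must have produced this file."""
    return CheckResult(f"file_exists:{path}", os.path.isfile(path))


def audit_environment_state(container: str, cmd: list[str], expect: str) -> CheckResult:
    """Environment-state audit: inspect side effects inside the task container.

    Assumes the Docker container is still running after the agent finishes;
    `docker exec` is a real command, but the container name and probe are made up.
    """
    proc = subprocess.run(["docker", "exec", container, *cmd],
                          capture_output=True, text=True)
    return CheckResult("audit:" + " ".join(cmd), expect in proc.stdout, proc.stdout[:200])


def llm_judge(transcript: str, rubric: str, call_model: Callable[[str], str]) -> CheckResult:
    """Semantic verification: ask a judge model whether the rubric is satisfied.

    `call_model` is a placeholder for whatever LLM/VLM client the grader uses.
    """
    prompt = (
        "You are grading an agent transcript against a rubric.\n"
        f"Rubric:\n{rubric}\n\nTranscript:\n{transcript}\n\n"
        'Reply with JSON: {"pass": true or false, "reason": "..."}'
    )
    verdict = json.loads(call_model(prompt))
    return CheckResult("llm_judge", bool(verdict.get("pass")), verdict.get("reason", ""))


def grade_task(checks: list[CheckResult]) -> float:
    """Aggregate into a task score; equal weighting is an assumed scoring rule."""
    return sum(c.passed for c in checks) / max(len(checks), 1)
```

In practice the deterministic and audit checks catch missing or malformed artifacts cheaply, while the judge handles outcomes that have no single correct surface form, such as free-text reports or multimodal outputs.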
Why it matters
This paper matters because it provides a much-needed real-world benchmark for evaluating AI agents, moving beyond synthetic sandboxes and mock APIs. It shows that even frontier models struggle significantly with long-horizon, native-runtime tasks, underscoring how much room remains for better agent development.
Original Abstract
Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.