ArXiv TLDR

ClawBench: Can AI Agents Complete Everyday Online Tasks?

arXiv:2604.08523

Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao + 16 more

cs.CL, cs.AI

TLDR

The paper introduces ClawBench, a real-world benchmark of 153 everyday online tasks across 144 live platforms, and shows that current AI agents struggle with this kind of routine web automation.

Key contributions

  • Introduces ClawBench, an evaluation framework with 153 real-world online tasks across 144 live platforms.
  • Tasks demand capabilities such as multi-step navigation across platforms, extracting information from user-provided documents, and filling complex forms correctly.
  • Evaluates agents directly on production websites, preserving the dynamic complexity of real-world web interaction that static sandbox benchmarks miss.
  • Shows that both proprietary and open-source frontier models complete only a small portion of the tasks (e.g., Claude Sonnet 4.6 at 33.3%).

Why it matters

This paper introduces a crucial benchmark for evaluating AI agents on real-world web tasks, highlighting current limitations. It pushes the field towards developing more robust and general-purpose AI assistants capable of handling the dynamic nature of everyday online interactions.

Original Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
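The abstract's "lightweight interception layer" suggests a proxy that passes all traffic through untouched except the task's final write, which it captures for grading and blocks before it reaches the live site. As a minimal illustrative sketch only (the paper does not describe its implementation), here is how such a layer could be built with Playwright request routing; the `SUBMIT_PATTERNS` list, the `captured` log, and the choice of Playwright itself are all assumptions for illustration.

```python
# Hypothetical sketch of a submission-blocking interception layer.
# Assumption: the benchmark tooling looks something like this; the
# pattern list and capture format are illustrative, not from the paper.

from playwright.sync_api import sync_playwright

# Illustrative URL fragments that might mark a task's final submission.
SUBMIT_PATTERNS = ("/checkout", "/submit", "/apply")

captured = []  # submission payloads recorded for later evaluation

def intercept(route, request):
    # Capture and block only the final write; everything else passes
    # through, so the agent sees the site's full dynamic behavior.
    if request.method == "POST" and any(p in request.url for p in SUBMIT_PATTERNS):
        captured.append({"url": request.url, "body": request.post_data})
        route.abort()  # the live platform never receives the request
    else:
        route.continue_()

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    page.route("**/*", intercept)
    # ... the agent drives `page` through the multi-step task here ...
    browser.close()
```

The key design point this sketch captures is that only the terminal side-effecting request is suppressed: reads, logins to test accounts, and intermediate navigation all hit the real production site, which is what distinguishes this setup from static sandbox benchmarks.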
