Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Chenxin Li, Zhengyang Tang, Huangxin Lin, Yunlong Lin, Shijue Huang, et al.
TLDR
Claw-Eval-Live is a live benchmark for LLM agents, evaluating their performance on evolving real-world workflows with verifiable execution.
Key contributions
- Sources evolving tasks from a refreshable signal layer built on public workflow-demand signals.
- Separates evolving workflow demand from reproducible, time-stamped release snapshots.
- Grades agents via execution traces, audit logs, service state, and structured LLM judging (see the sketch after this list).
- Evaluates 13 frontier models on 105 tasks; the leading model passes only 66.7% and no model reaches 70%.
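
The grading pipeline is the benchmark's core verification mechanism. As a rough illustration only, the sketch below shows the hybrid pattern the contributions describe: deterministic checks run first over recorded evidence, and a structured LLM judge is consulted only for semantic dimensions. All names here (`RunEvidence`, `llm_judge_semantic`, the grader signature) are assumptions; the paper's actual grader interfaces are not public.

```python
# Minimal sketch of the hybrid grading pattern described above.
# Every name here is hypothetical, not Claw-Eval-Live's real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RunEvidence:
    execution_trace: list[dict]   # tool calls the agent issued
    audit_logs: list[dict]        # service-side records of each action
    service_state: dict           # final state of the controlled services
    workspace_artifacts: dict     # post-run files from the local workspace

def llm_judge_semantic(evidence: RunEvidence, rubric: str) -> bool:
    """Placeholder for a structured LLM judge scoring one semantic rubric."""
    raise NotImplementedError("wire an LLM judge up here")

def grade(evidence: RunEvidence,
          deterministic_checks: list[Callable[[RunEvidence], bool]],
          semantic_rubrics: list[str]) -> bool:
    # Deterministic checks run first: when traces, logs, service state,
    # and artifacts suffice as evidence, no judge call is needed.
    if not all(check(evidence) for check in deterministic_checks):
        return False
    # LLM judging is reserved for dimensions deterministic evidence
    # cannot settle, e.g. whether drafted text matches the task intent.
    return all(llm_judge_semantic(evidence, r) for r in semantic_rubrics)
```

The split matters: deterministic checks over recorded evidence keep pass/fail auditable, while confining LLM judging to semantic rubrics limits judge noise.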
Why it matters
This paper addresses the limitations of static agent benchmarks by introducing a dynamic, execution-verified evaluation system. Its results show that current LLM agents still struggle with evolving real-world workflows, with HR, management, and multi-system business tasks as persistent bottlenecks. The benchmark offers a grounded way to measure progress toward robust, reliable workflow automation.
Original Abstract
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.
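
The abstract pins each release to fixed fixtures, services, workspaces, and graders under a time stamp, while the signal layer refreshes across releases. The paper does not specify how a snapshot is recorded; purely as a hypothetical illustration, a release manifest might look like the following, where every field name and value is an assumption rather than the benchmark's real schema.

```python
# Hypothetical release-snapshot manifest; Claw-Eval-Live's actual release
# format is not published, so every field below is an assumption.
RELEASE_SNAPSHOT = {
    "release_id": "2025-XX",                      # time-stamped release tag
    "signal_source": "ClawHub Top-500 skills",    # refreshable demand signal
    "tasks": [
        {
            "task_id": "hr-onboarding-001",       # illustrative task name
            "family": "business-services",        # vs. local workspace repair
            "fixtures": ["crm_seed.sql"],         # fixed service fixtures
            "services": ["mock_hr_api"],          # controlled service endpoints
            "workspace": "workspaces/hr-001/",    # pinned local workspace
            "graders": ["deterministic", "llm_semantic"],
        },
    ],
    "pass_rule": "all graders must pass",         # shared public pass rule
}
```

Freezing a manifest like this per release is what separates the refreshable signal layer, which tracks evolving demand, from the reproducible snapshot agents are actually graded against.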