ArXiv TLDR

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

arXiv: 2604.18543

Xirui Li, Ming Li, Derry Xu, Wei-Lin Chiang, Ion Stoica + 2 more

cs.AI, cs.CL

TLDR

ClawEnvKit automates diverse environment generation for claw-like agents from natural language, enabling scalable evaluation and adaptive training.

Key contributions

  • Introduces ClawEnvKit, an autonomous pipeline for generating diverse, verified environments for claw-like agents.
  • Comprises a parser, generator, and validator to create task specifications from natural language descriptions.
  • Constructs Auto-ClawEval, a large-scale benchmark (1,040 environments) matching human quality at 13,800x lower cost.
  • Enables on-demand live evaluation and adaptive training environment generation based on agent weaknesses.
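The three-module pipeline above can be sketched as a simple chain. This is a minimal, hypothetical illustration of the parser → generator → validator flow, not the paper's actual implementation; all class and function names (`EnvSpec`, `parse`, `generate`, `validate`) and the keyword heuristics are assumptions standing in for what are presumably LLM-backed modules.

```python
from dataclasses import dataclass, field

@dataclass
class EnvSpec:
    """Hypothetical environment spec: task, tool interface, scoring config."""
    task: str
    tools: list = field(default_factory=list)
    scoring: dict = field(default_factory=dict)

def parse(description: str) -> dict:
    """Parser: extract structured generation parameters from natural language.
    Keyword matching stands in for an LLM-based parser."""
    return {
        "category": "pick-and-place" if "pick" in description else "generic",
        "difficulty": "hard" if "hard" in description else "easy",
    }

def generate(params: dict) -> EnvSpec:
    """Generator: produce the task specification, tool interface, and scoring."""
    return EnvSpec(
        task=f"{params['category']} ({params['difficulty']})",
        tools=["grip", "release", "move"],
        scoring={"success_bonus": 1.0},
    )

def validate(spec: EnvSpec) -> bool:
    """Validator: enforce structural validity and internal consistency."""
    return bool(spec.task) and len(spec.tools) > 0 and "success_bonus" in spec.scoring

def build_environment(description: str) -> EnvSpec:
    """End-to-end pipeline: natural language in, verified environment out."""
    spec = generate(parse(description))
    if not validate(spec):
        raise ValueError("generated environment failed validation")
    return spec

env = build_environment("a hard pick task")
print(env.task)  # pick-and-place (hard)
```

In the real system each stage would be far richer (feasibility and diversity checks, tool schemas, rubric-based scoring), but the control flow — parse, generate, reject anything that fails validation — is the core idea.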

Why it matters

This paper addresses the scalability bottleneck in creating training and evaluation environments for claw-like agents. By automating environment generation from natural language, it drastically reduces cost and enables large-scale benchmarking. This turns evaluation into a continuous, user-driven process and supports adaptive training that targets an agent's current weaknesses.

Original Abstract

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
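The closing idea of the abstract — generating task distributions that adapt to an agent's current weaknesses — amounts to a simple feedback loop: score the agent per category, then request new environments for the categories where it underperforms. The sketch below is an assumed illustration of that loop; the category names, scores, and threshold are invented, and `generate_batch` stands in for a call into a pipeline like ClawEnvKit.

```python
def find_weaknesses(scores: dict, threshold: float = 0.5) -> list:
    """Return the categories where the agent scores below the threshold."""
    return [category for category, score in scores.items() if score < threshold]

def generate_batch(weak_categories: list, per_category: int = 2) -> list:
    """Produce natural-language requests for new environments targeting
    weak categories (a stand-in for on-demand environment generation)."""
    return [
        f"{category} environment, variant {i}"
        for category in weak_categories
        for i in range(per_category)
    ]

# Hypothetical per-category evaluation results.
scores = {"navigation": 0.8, "fine-grip": 0.3, "sorting": 0.45}
weak = find_weaknesses(scores)        # ["fine-grip", "sorting"]
requests = generate_batch(weak)       # 4 targeted environment requests
```

The point of the design is that the training distribution is driven by measured weaknesses rather than bounded by existing user logs: each evaluation round refreshes `weak`, so the generated tasks track the agent as it improves.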
