ArXiv TLDR

ClawGym: A Scalable Framework for Building Effective Claw Agents

2604.26904

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang + 8 more

cs.CL cs.AI cs.LG

TLDR

ClawGym introduces a scalable framework for developing Claw-style agents, including a synthetic dataset, trained models, and an evaluation benchmark.

Key contributions

  • Introduces ClawGym, a scalable framework for the full lifecycle of Claw-style personal agent development.
  • Creates ClawGym-SynData, a 13.5K task dataset synthesized with mock workspaces and hybrid verification.
  • Trains ClawGym-Agents using supervised fine-tuning and explores reinforcement learning via parallelized rollouts.
  • Develops ClawGym-Bench, a 200-instance evaluation benchmark calibrated via automated filtering and human-LLM review.

Why it matters

Claw-style agents, which handle multi-step workflows over local files and tools, have lacked a systematic development framework. ClawGym fills this gap with an end-to-end pipeline covering data synthesis, agent training, and reliable evaluation, lowering the barrier to building and assessing capable personal agents.

Original Abstract

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will soon be released at https://github.com/ClawGym.
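The abstract's "parallelized rollouts across per-task sandboxes" can be pictured as each task getting a disposable, isolated workspace so rollouts never interfere with one another. The sketch below is purely illustrative and assumes nothing about ClawGym's actual API: every name (`run_rollout`, `run_parallel`, the task-dict fields) is hypothetical, and the "agent" is a trivial placeholder that edits a file so the verifier step has something to check.

```python
# Hypothetical sketch of parallel rollouts in per-task sandboxes.
# None of these names come from ClawGym; this only illustrates the
# general pattern of isolated-workspace rollout + state verification.
import shutil
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_rollout(task: dict) -> dict:
    """Run one rollout inside a throwaway sandbox directory."""
    sandbox = Path(tempfile.mkdtemp(prefix=f"task_{task['id']}_"))
    try:
        # Populate the mock workspace (file name -> contents).
        for name, content in task["workspace"].items():
            (sandbox / name).write_text(content)
        # Placeholder "agent": append a line to notes.txt as the task asks.
        target = sandbox / "notes.txt"
        target.write_text(target.read_text() + task["append"])
        # Verifier: check the final workspace state against the expectation.
        success = task["expect"] in target.read_text()
        return {"id": task["id"], "success": success}
    finally:
        # Sandboxes are disposable, so parallel rollouts never collide.
        shutil.rmtree(sandbox)

def run_parallel(tasks, workers=4):
    """Fan rollouts out across worker processes, one sandbox per task."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_rollout, tasks))

if __name__ == "__main__":
    tasks = [
        {"id": i, "workspace": {"notes.txt": "todo:\n"},
         "append": f"item {i}\n", "expect": f"item {i}"}
        for i in range(8)
    ]
    results = run_parallel(tasks)
    print(sum(r["success"] for r in results), "of", len(results), "succeeded")
```

Because each sandbox is an independent temporary directory, the number of workers can scale freely; the real pipeline presumably replaces the placeholder agent with model-driven tool calls and the substring check with the paper's hybrid verification.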
