ArXiv TLDR

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

arXiv: 2605.00416

Yi Wang, Xinchen Li, Pengwei Xie, Pu Yang, Buqing Nie + 11 more

cs.RO

TLDR

Learning While Deploying (LWD) is a fleet-scale reinforcement learning framework that continually improves generalist robot policies using real-world deployment data.

Key contributions

  • Introduces LWD, an offline-to-online RL framework for continual post-training of Vision-Language-Action (VLA) robot policies.
  • Leverages fleet-scale deployment data, including both autonomous rollouts and human interventions.
  • Stabilizes learning from sparse-reward, heterogeneous fleet data by combining Distributional Implicit Value Learning (DIVL) for value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction.
  • Achieves an average success rate of 95% across eight real-world manipulation tasks on a fleet of 16 dual-arm robots.

Why it matters

Pretrained generalist robot policies struggle with real-world distribution shifts and long-tail failures that fixed demonstration datasets cannot capture. LWD enables continual improvement by learning from fleet-scale deployment data, including human interventions, which significantly enhances policy robustness and success rates, with the largest gains on complex, long-horizon tasks.

Original Abstract

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3--5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
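The deployment loop described in the abstract (deploy, collect shared fleet experience, improve the policy, redeploy) can be sketched in toy form below. This is a minimal illustration only: the internals of DIVL and QAM are not specified in this summary, so they are stubbed with a placeholder update, and every name here (`collect_fleet_data`, `update_policy`, the `policy_skill` scalar) is a hypothetical stand-in, not the paper's actual implementation.

```python
import random

def collect_fleet_data(policy_skill, n_robots=16, rng=None):
    """Simulate one deployment round across the fleet: autonomous
    rollouts plus human interventions, each labeled with a sparse
    success/failure reward."""
    rng = rng or random.Random(0)
    episodes = []
    for _ in range(n_robots):
        success = rng.random() < policy_skill
        episodes.append({
            "reward": 1.0 if success else 0.0,  # sparse task reward
            "intervention": not success,         # humans correct failures
        })
    return episodes

def update_policy(policy_skill, episodes, lr=0.05):
    """Placeholder for the improvement step (DIVL value learning +
    QAM policy extraction in the paper): nudge the policy toward
    the observed success signal."""
    avg_reward = sum(e["reward"] for e in episodes) / len(episodes)
    return min(1.0, policy_skill + lr * (1.0 - avg_reward))

# Offline-to-online loop: start from a pretrained policy, then
# repeatedly deploy, aggregate fleet experience, and redeploy.
skill = 0.5                      # stand-in for the pretrained success rate
rng = random.Random(42)
for _ in range(20):
    data = collect_fleet_data(skill, rng=rng)
    skill = update_policy(skill, data)
print(f"final simulated success rate ~ {skill:.2f}")
```

The point of the sketch is the closed loop itself: each round's data comes from the current policy, so improvement compounds as fleet experience accumulates.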
