ArXiv TLDR

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

arXiv: 2605.05172

Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, et al. (3 additional authors)

cs.RO, cs.AI

TLDR

Q2RL extracts a Q-function from a Behavior Cloning policy to enable efficient on-robot reinforcement learning, improving the policy online without discarding learned skills.

Key contributions

  • Q2RL enables efficient offline-to-online learning by extracting Q-functions from BC policies.
  • Uses Q-Gating to switch between BC and RL actions, preventing policy degradation (see the sketch after this list).
  • Achieves state-of-the-art performance on D4RL and robomimic benchmarks.
  • Demonstrates robust on-robot learning for contact-rich, high-precision tasks in 1-2 hours of online interaction, reaching up to 100% success.
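
The gating rule at the core of Q2RL is simple enough to sketch. Below is a minimal Python sketch of the Q-Gating idea as described in the abstract; the `bc_policy`, `rl_policy`, and `q_fn` interfaces are hypothetical stand-ins for illustration, not the authors' implementation.

```python
def q_gated_action(state, bc_policy, rl_policy, q_fn):
    """Choose between the BC and RL actions by comparing their Q-values.

    Hypothetical interfaces: bc_policy and rl_policy each map a state to
    an action; q_fn(state, action) returns a scalar Q-value estimate.
    """
    a_bc = bc_policy(state)
    a_rl = rl_policy(state)
    # Q-Gating: execute the RL action only when the critic ranks it at
    # least as highly as the BC action, so online exploration cannot
    # silently replace a known-good BC behavior.
    return a_rl if q_fn(state, a_rl) >= q_fn(state, a_bc) else a_bc
```

Per the abstract, the gated action is what gets executed, and the resulting transitions are collected as samples for training the RL policy.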

Why it matters

Behavior Cloning (BC) is effective but offers no mechanism for online improvement after demonstrations are collected. Existing offline-to-online methods suffer from a distribution mismatch between offline data and online learning, which causes policies to overwrite previously learned good actions. Q2RL addresses this by seamlessly integrating BC with online RL, enabling rapid and robust policy refinement directly on robots.

Original Abstract

Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
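
As a rough illustration of step (1), the sketch below estimates a Q-function for the BC policy via TD(0) policy evaluation over a few environment interaction steps. Everything here is an assumption for illustration (a Gymnasium-style `env`, a `bc_policy` returning action tensors, a `q_net(state, action)` critic); the paper's actual Q-Estimation procedure may differ.

```python
import torch

def estimate_q_from_bc(env, bc_policy, q_net, steps=5000, gamma=0.99, lr=3e-4):
    """Fit q_net to the BC policy's value via TD(0) policy evaluation.

    Assumed interfaces (not the paper's code): env follows the Gymnasium
    reset()/step() API, bc_policy maps a state tensor to an action tensor,
    and q_net(state, action) returns a single-element tensor.
    """
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    state = torch.as_tensor(env.reset()[0], dtype=torch.float32)
    for _ in range(steps):
        with torch.no_grad():
            action = bc_policy(state)
        obs, reward, terminated, truncated, _ = env.step(action.numpy())
        next_state = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            # SARSA-style bootstrap with the BC policy's own next action,
            # so q_net estimates the value of *following BC*, not of a
            # greedy policy the data cannot support.
            boot = 0.0 if terminated else gamma * q_net(next_state, bc_policy(next_state)).item()
            target = reward + boot
        loss = (q_net(state, action) - target).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if terminated or truncated:
            state = torch.as_tensor(env.reset()[0], dtype=torch.float32)
        else:
            state = next_state
    return q_net
```

Once fitted, a critic like this can seed the Q-Gating rule sketched above, with online RL then refining both the critic and the policy from the gated rollouts.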
