Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda
TLDR
VLAJS jump-starts RL for robotic manipulation with sparse, annealed VLA guidance, improving exploration and sample efficiency beyond PPO and distillation-style baselines.
Key contributions
- Introduces VLAJS, bridging sparse VLA guidance with on-policy RL for efficient robotic manipulation.
- Augments PPO with a directional action-consistency regularization term that softly aligns the RL agent's actions with VLA guidance (sketched below).
- VLA guidance is transient, sparse, and annealed, allowing the RL agent to adapt and surpass the VLA.
- Reduces required environment interactions by over 50% on several manipulation tasks and achieves robust zero-shot sim-to-real transfer.
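The digest does not give the regularizer's exact form; below is a minimal PyTorch sketch assuming a cosine-similarity penalty between the policy's actions and the VLA's suggested actions. The function name and the cosine form are illustrative assumptions, not the paper's confirmed formula.

```python
import torch
import torch.nn.functional as F

def directional_consistency_loss(policy_actions: torch.Tensor,
                                 vla_actions: torch.Tensor) -> torch.Tensor:
    """Illustrative directional action-consistency term (assumed form).

    Penalizes the angle between the RL action and the VLA suggestion
    rather than their difference in magnitude, so the agent is nudged
    toward the VLA's direction without strict imitation.
    """
    cos_sim = F.cosine_similarity(policy_actions, vla_actions, dim=-1)
    # Zero when the actions point the same way, maximal when opposed.
    return (1.0 - cos_sim).mean()

# Example: a batch of 32 seven-dimensional arm actions.
loss = directional_consistency_loss(torch.randn(32, 7), torch.randn(32, 7))
```

A directional (angle-based) penalty is one natural way to realize "soft alignment": it leaves the action magnitude free, so the agent can move faster or more cautiously than the VLA while still heading where it suggests.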
Why it matters
This paper addresses key limitations of RL on complex robotic tasks by leveraging multimodal VLA models without inheriting their limitations in fast, precise control. By jump-starting exploration, VLAJS substantially reduces the data needed for training, making advanced robotic manipulation more practical and accessible for real-world applications.
Original Abstract
Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
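The abstract names three mechanisms: a soft consistency regularizer added to the PPO objective, sparse VLA queries, and a guidance weight annealed over training. Below is a minimal PyTorch sketch of how these pieces might compose; the names `vlajs_loss` and `annealed_guidance_weight`, the linear schedule, and the binary query mask are all illustrative assumptions, not the paper's published implementation.

```python
import torch

def annealed_guidance_weight(update_idx: int,
                             anneal_updates: int,
                             beta_init: float = 1.0) -> float:
    """Linearly decay the guidance coefficient toward zero (assumed schedule).

    After `anneal_updates` PPO updates the weight reaches zero, leaving
    the pure PPO objective so the agent is free to surpass the VLA.
    """
    frac = min(update_idx / anneal_updates, 1.0)
    return beta_init * (1.0 - frac)

def vlajs_loss(ppo_loss: torch.Tensor,
               per_step_consistency: torch.Tensor,
               guidance_mask: torch.Tensor,
               beta: float) -> torch.Tensor:
    """Augment the clipped PPO loss with sparse, annealed VLA guidance.

    `guidance_mask` is a float tensor of 0/1 flags that is 1 only at
    timesteps where the VLA was actually queried, keeping guidance
    sparse rather than continuous.
    """
    denom = guidance_mask.sum().clamp(min=1.0)
    sparse_term = (guidance_mask * per_step_consistency).sum() / denom
    return ppo_loss + beta * sparse_term
```

Because the guidance enters only as a weighted loss term, it never requires demonstrations or per-step teacher queries, and once `beta` anneals to zero the update is indistinguishable from standard PPO.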