Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
Chenyu Hui, Xiaodi Huang, Siyu Xu, Yunke Wang, Shan You, et al.
TLDR
This paper presents an efficient framework to convert simulated VLA videos into realistic training data, bridging the sim-to-real gap for better robot generalization.
Key contributions
- Introduces an efficient video augmentation framework to convert simulated VLA videos into realistic ones.
- Extracts structured conditions and rewrites captions to diversify environments for realistic video synthesis.
- Proposes a diffusion feature-reuse mechanism and a coreset sampling strategy for scalable, accelerated video generation (a sketch of the feature-reuse idea follows this list).
- Achieves consistent gains on VLA benchmarks and a real robot, e.g., +8% for RDT-1B on Robotwin 2.0 and +5.1% for $π_0$ on the more challenging LIBERO-Plus.
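The feature-reuse mechanism rests on the observation that intermediate diffusion features change little between adjacent denoising timesteps, so the expensive video-token computation can be amortized across steps. Below is a minimal Python sketch of that idea; `denoise_fn` and `step_fn` are hypothetical stand-ins for the paper's conditional video transfer model and sampler update, and the actual caching granularity may differ:

```python
# Hedged sketch of diffusion feature reuse: recompute heavy video-token
# features only every `reuse_interval` steps and reuse the cache otherwise.
import torch

@torch.no_grad()
def sample_with_feature_reuse(denoise_fn, step_fn, latents, timesteps,
                              reuse_interval=2):
    cache = None
    for i, t in enumerate(timesteps):
        if cache is None or i % reuse_interval == 0:
            # Full pass: run all blocks and cache the video tokens.
            noise_pred, cache = denoise_fn(latents, t, cached_tokens=None)
        else:
            # Cheap pass: skip the heavy blocks, reusing tokens from the
            # last full pass (features are similar at adjacent timesteps).
            noise_pred, _ = denoise_fn(latents, t, cached_tokens=cache)
        latents = step_fn(noise_pred, t, latents)
    return latents

# Toy usage with dummy functions (shapes only, no real model):
if __name__ == "__main__":
    def denoise_fn(x, t, cached_tokens=None):
        tokens = cached_tokens if cached_tokens is not None else x * 0.5
        return x - 0.01 * tokens, tokens

    def step_fn(noise_pred, t, x):
        return x - 0.1 * noise_pred

    video_latents = torch.randn(1, 16, 4, 32, 32)  # (B, frames, C, H, W)
    out = sample_with_feature_reuse(denoise_fn, step_fn, video_latents,
                                    timesteps=range(50))
```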
Why it matters
By converting cheap, parallelizable simulated videos into realistic training data, the framework closes the visual sim-to-real gap for VLA models. The result is stronger real-world robot generalization and a data-augmentation pipeline efficient enough to run at scale.
Original Abstract
Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize realistic videos. To make augmentation practical at scale, we introduce a diffusion feature-reuse mechanism that reuses video tokens across adjacent timesteps to accelerate generation, and a coreset sampling strategy that identifies a compact, non-redundant subset for augmentation under limited computation. Extensive experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform demonstrate consistent improvements. For example, our method improves RDT-1B by 8% on Robotwin 2.0, and boosts $π_0$ by 5.1% on the more challenging LIBERO-Plus benchmark. Code is available at: https://github.com/nanfangxiansheng/Seeing-Realism-from-Simulation.
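The coreset sampling strategy can be read as a subset-selection problem: under a fixed compute budget, pick the episodes whose embeddings best cover the dataset so the transfer model is not spent on near-duplicate videos. Here is a minimal sketch using greedy k-center (farthest-point) selection; the per-episode embeddings (e.g., from a pretrained video encoder) and the k-center criterion are assumptions, since the abstract does not specify the exact method:

```python
# Hedged sketch of coreset selection via greedy k-center sampling over
# per-episode embeddings; only the selected subset would be augmented.
import numpy as np

def coreset_indices(embeddings: np.ndarray, budget: int) -> list[int]:
    """Pick `budget` rows of `embeddings` that greedily maximize coverage."""
    n = embeddings.shape[0]
    selected = [int(np.random.default_rng(0).integers(n))]  # arbitrary seed point
    # Distance from every point to its nearest selected point.
    dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < min(budget, n):
        nxt = int(dist.argmax())  # farthest point from the current coreset
        selected.append(nxt)
        dist = np.minimum(dist,
                          np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Example: choose 10 of 500 simulated episodes for augmentation.
feats = np.random.rand(500, 128)
subset = coreset_indices(feats, budget=10)
```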