VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
Xiaolei Lang, Yang Wang, Yukun Zhou, Chaojun Ni, Kerui Li + 8 more
TLDR
VAG is a dual-stream flow-matching framework that jointly generates aligned video and action pairs for embodied data synthesis, improving robot policy generalization.
Key contributions
- Introduces VAG, a unified dual-stream flow-matching framework for joint video and action generation.
- Synchronizes denoising across the two branches and uses adaptive 3D pooling for strong cross-modal video-action alignment (see the sketch after this list).
- Generates high-quality, aligned video-action pairs supporting executable trajectory replay.
- Provides effective synthetic pretraining data, improving downstream robot policy generalization.
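To make the pooling mechanism concrete, here is a minimal sketch of the adaptive 3D pooling bridge described in the abstract: latent video features are pooled to a fixed-size global summary and projected into the action branch's width. All names (`VideoToActionBridge`, the pooled grid size, the dimensions) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VideoToActionBridge(nn.Module):
    """Pools video features to a compact global context and projects it
    into the action branch's hidden width. Hypothetical sketch."""
    def __init__(self, video_dim: int, action_dim: int, pooled=(2, 4, 4)):
        super().__init__()
        # AdaptiveAvgPool3d collapses (T, H, W) to a fixed small grid,
        # giving a compact summary regardless of the input video size.
        self.pool = nn.AdaptiveAvgPool3d(pooled)
        self.proj = nn.Linear(
            video_dim * pooled[0] * pooled[1] * pooled[2], action_dim
        )

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, C, T, H, W) latent video features
        ctx = self.pool(video_feats).flatten(1)  # (B, C * t * h * w)
        return self.proj(ctx)                    # (B, action_dim)

bridge = VideoToActionBridge(video_dim=64, action_dim=256)
ctx = bridge(torch.randn(2, 64, 8, 32, 32))  # -> shape (2, 256)
```

The design intuition, per the abstract, is that the action branch needs only a compact global view of the generated video rather than full spatiotemporal detail, which keeps the transfer cheap while still coupling the two streams.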
Why it matters
Robot foundation models need vast amounts of demonstration data, which are costly and labor-intensive to collect. VAG offers a practical alternative by synthesizing high-quality, aligned video-action pairs. Pretraining on this synthetic data improves downstream robot policy generalization, accelerating the development of more capable robots.
Original Abstract
Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.
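Since the abstract describes synchronizing denoising in both branches under flow matching, the following sketch shows what a joint sampling loop could look like: one shared flow time per step, with both latents Euler-integrated in lockstep. The `model` callable, the Euler integrator, and all variable names are assumptions for illustration, not the paper's actual sampler.

```python
import torch

@torch.no_grad()
def sample_joint(model, video_noise, action_noise, cond, steps: int = 50):
    """Jointly denoise video and action latents on a shared schedule.
    Hypothetical sketch of synchronized dual-stream flow matching."""
    v_lat, a_lat = video_noise, action_noise
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        # One forward pass predicts velocities for BOTH streams at the
        # same flow time t, so the action trajectory is denoised in step
        # with the video rather than inferred from it afterward.
        v_vel, a_vel = model(v_lat, a_lat, t, cond)
        v_lat = v_lat + dt * v_vel
        a_lat = a_lat + dt * a_vel
    return v_lat, a_lat
```

Keeping both streams on one schedule is what distinguishes this from the two-stage pipelines the abstract criticizes, where a video is generated first and actions are inferred in a second pass that can accumulate error.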