RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models
Hao Wu, Yuqi Li, Yuan Gao, Fan Xu, Fan Zhang + 8 more
TLDR
RoboAlign-R1 improves robot video world models by using reward-aligned post-training and stabilized long-horizon inference, boosting task consistency and realism.
Key contributions
- Introduces RoboAlign-R1, a framework for reward-aligned post-training and stabilized long-horizon inference.
- Develops RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs, and RoboAlign-Judge, a multimodal teacher that scores generated videos along six dimensions.
- Distills RoboAlign-Judge into a lightweight student reward model for efficient RL-based post-training (see the sketch after this list).
- Proposes Sliding Window Re-encoding (SWR), a training-free inference strategy for stable long-horizon prediction at only about 1% extra latency.
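As referenced above, the distillation step can be pictured as a small student network regressing the teacher judge's six per-dimension scores. The following is a minimal sketch, assuming pooled video features as input and an MSE objective; `StudentReward`, the feature dimension, and all other names are illustrative, since the digest does not specify the paper's actual architecture or loss.

```python
# Hypothetical sketch of judge-to-reward distillation: a lightweight
# student regresses the teacher's six per-dimension scores.
import torch
import torch.nn as nn

class StudentReward(nn.Module):
    """Small reward head predicting six-dimensional quality scores."""
    def __init__(self, feat_dim: int = 512, num_dims: int = 6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_dims),
        )

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, feat_dim) pooled features of a rollout
        return self.mlp(video_feats)

def distill_step(student, optimizer, video_feats, teacher_scores):
    """One distillation step: match the teacher judge's scores (MSE)."""
    pred = student(video_feats)                       # (batch, 6)
    loss = nn.functional.mse_loss(pred, teacher_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy data standing in for encoded rollouts and teacher labels:
student = StudentReward()
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
feats = torch.randn(8, 512)
targets = torch.rand(8, 6)
distill_step(student, opt, feats, targets)
```

Once trained, the cheap student replaces the heavyweight teacher inside the RL post-training loop, where the reward must be evaluated on every sampled rollout.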
Why it matters
This paper improves robot video world models by aligning training objectives with the capabilities that matter for robot decision making, such as instruction following, manipulation success, and physical plausibility. Through reward-aligned post-training and a stabilized inference strategy, it improves task consistency, physical realism, and long-horizon prediction quality, which are prerequisites for world models that autonomous robots can reliably plan against.
Original Abstract
Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce long-horizon rollout drift, we further introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM-based cross-check and a blinded human study. Meanwhile, SWR improves long-horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward-aligned post-training and stabilized long-horizon decoding improve task consistency, physical realism, and long-horizon prediction quality in robot video world models.
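The abstract's description of SWR maps onto a simple decoding loop: run the model autoregressively, but periodically rebuild the generation context by re-encoding the most recent frames instead of carrying accumulated latent state forward indefinitely. Below is a minimal sketch under the assumption of a world model exposing `encode` and `step` calls; this interface, the dummy `WorldModel`, and the window sizes are assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch of Sliding Window Re-encoding (SWR) as a
# training-free decoding loop.
from typing import List, Tuple

class WorldModel:
    """Stand-in interface; the paper's real model is not specified here."""
    def encode(self, frames: List[list]) -> list:
        # Dummy encoder: the context is just a copy of the frame window.
        return list(frames)

    def step(self, context: list) -> Tuple[list, list]:
        # Dummy predictor: repeat the last frame and extend the context.
        nxt = list(context[-1])
        return nxt, context + [nxt]

def rollout_with_swr(model: WorldModel, init_frames: List[list],
                     horizon: int, window: int = 16, refresh_every: int = 8):
    """Autoregressive rollout that periodically re-encodes a sliding
    window of recent frames to curb long-horizon drift."""
    frames = list(init_frames)
    context = model.encode(frames[-window:])
    for t in range(horizon):
        nxt, context = model.step(context)    # ordinary autoregressive step
        frames.append(nxt)
        if (t + 1) % refresh_every == 0:
            # Periodic refresh: discard accumulated state and rebuild the
            # context from the latest `window` frames, at the cost of one
            # extra encode call (the paper reports ~1% added latency).
            context = model.encode(frames[-window:])
    return frames

# Usage with a dummy one-frame seed (each "frame" is a flat pixel list):
model = WorldModel()
video = rollout_with_swr(model, init_frames=[[0.0] * 4], horizon=32)
```

The trade-off is explicit in the loop: each refresh adds one encoding pass, which is why the reported overhead is small relative to the per-step generation cost.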