RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models
Hao Wu, Yuqi Li, Yuan Gao, Fan Xu, Fan Zhang + 8 more
TLDR
RoboAlign-R1 improves robot video world models by using reward-aligned post-training and stabilized long-horizon inference, boosting task consistency and realism.
Key contributions
- Introduces RoboAlign-R1, a framework for reward-aligned post-training and stabilized long-horizon inference.
- Develops RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs, and RoboAlign-Judge, a multimodal teacher that scores generated videos along six dimensions.
- Distills RoboAlign-Judge into a lightweight student reward model for efficient RL-based post-training (see the sketch after this list).
- Proposes Sliding Window Re-encoding (SWR), a training-free inference strategy for stable long-horizon prediction at only about 1% extra latency.
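As referenced above, the distillation step can be pictured as a small student network regressing the teacher judge's six per-dimension scores. The following is a minimal sketch, assuming pooled video features as input and an MSE objective; `StudentReward`, the feature dimension, and all other names are illustrative, since the digest does not specify the paper's actual architecture or loss.

```python
# Hypothetical sketch of judge-to-reward distillation: a lightweight
# student regresses the teacher's six per-dimension scores.
import torch
import torch.nn as nn

class StudentReward(nn.Module):
    """Small reward head predicting six-dimensional quality scores."""
    def __init__(self, feat_dim: int = 512, num_dims: int = 6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_dims),
        )

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, feat_dim) pooled features of a rollout
        return self.mlp(video_feats)

def distill_step(student, optimizer, video_feats, teacher_scores):
    """One distillation step: match the teacher judge's scores (MSE)."""
    pred = student(video_feats)                       # (batch, 6)
    loss = nn.functional.mse_loss(pred, teacher_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy data standing in for encoded rollouts and teacher labels:
student = StudentReward()
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
feats = torch.randn(8, 512)
targets = torch.rand(8, 6)
distill_step(student, opt, feats, targets)
```

Once trained, the cheap student replaces the heavyweight teacher inside the RL post-training loop, where the reward must be evaluated on every sampled rollout.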
Why it matters
This paper improves robot video world models by aligning training objectives with the capabilities that matter for robot decision making, such as instruction following, manipulation success, and physical plausibility. Through reward-aligned post-training and a stabilized inference strategy, it improves task consistency, physical realism, and long-horizon prediction quality, which are prerequisites for world models that autonomous robots can reliably plan against.
Original Abstract
Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce long-horizon rollout drift, we further introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM-based cross-check and a blinded human study. Meanwhile, SWR improves long-horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward-aligned post-training and stabilized long-horizon decoding improve task consistency, physical realism, and long-horizon prediction quality in robot video world models.
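The abstract's description of SWR maps onto a simple decoding loop: run the model autoregressively, but periodically rebuild the generation context by re-encoding the most recent frames instead of carrying accumulated latent state forward indefinitely. Below is a minimal sketch under the assumption of a world model exposing `encode` and `step` calls; this interface, the dummy `WorldModel`, and the window sizes are assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch of Sliding Window Re-encoding (SWR) as a
# training-free decoding loop.
from typing import List, Tuple

class WorldModel:
    """Stand-in interface; the paper's real model is not specified here."""
    def encode(self, frames: List[list]) -> list:
        # Dummy encoder: the context is just a copy of the frame window.
        return list(frames)

    def step(self, context: list) -> Tuple[list, list]:
        # Dummy predictor: repeat the last frame and extend the context.
        nxt = list(context[-1])
        return nxt, context + [nxt]

def rollout_with_swr(model: WorldModel, init_frames: List[list],
                     horizon: int, window: int = 16, refresh_every: int = 8):
    """Autoregressive rollout that periodically re-encodes a sliding
    window of recent frames to curb long-horizon drift."""
    frames = list(init_frames)
    context = model.encode(frames[-window:])
    for t in range(horizon):
        nxt, context = model.step(context)    # ordinary autoregressive step
        frames.append(nxt)
        if (t + 1) % refresh_every == 0:
            # Periodic refresh: discard accumulated state and rebuild the
            # context from the latest `window` frames, at the cost of one
            # extra encode call (the paper reports ~1% added latency).
            context = model.encode(frames[-window:])
    return frames

# Usage with a dummy one-frame seed (each "frame" is a flat pixel list):
model = WorldModel()
video = rollout_with_swr(model, init_frames=[[0.0] * 4], horizon=32)
```

The trade-off is explicit in the loop: each refresh adds one encoding pass, which is why the reported overhead is small relative to the per-step generation cost.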