ArXiv TLDR

Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

arXiv:2604.21741

Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yanjiang Guo, Jiaming Liu, and 3 others

cs.RO

TLDR

Hi-WM enables scalable robot post-training by letting humans intervene directly inside a learned world model, reducing the need for real-world robot execution.

Key contributions

  • Proposes Hi-WM, a framework using a learned world model as a reusable corrective substrate for policy improvement.
  • Humans intervene directly in the world model to provide short corrective actions for policy failures.
  • Supports state caching, rollback, and branching for efficient reuse of failure states and dense supervision (see the sketch after this list).
  • Improves real-world success by 37.9 points on average over the base policies and by 19.0 points over a world-model closed-loop baseline.
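The third contribution, the caching/rollback/branching mechanism, is the part that turns one failure into many training signals. Below is a minimal Python sketch of that loop under stated assumptions: `world_model`, `policy`, `is_failure`, and `human_correction` are hypothetical interfaces (stand-ins for the learned dynamics model, the base policy, a failure detector, and the human teleoperation step), not the paper's actual API.

```python
# Minimal sketch of a Hi-WM-style correction loop. All interfaces here
# (world_model, policy, is_failure, human_correction) are hypothetical
# stand-ins, not the paper's actual API.

class StateCache:
    """Caches intermediate world-model states so a pre-failure state can
    be rolled back to and branched from multiple times."""

    def __init__(self):
        self._states = []

    def push(self, state):
        self._states.append(state)

    def rollback(self, steps_back):
        # Return the state `steps_back` steps before the most recent one,
        # clamped to the start of the rollout.
        idx = max(0, len(self._states) - 1 - steps_back)
        return self._states[idx]


def collect_corrections(world_model, policy, is_failure, human_correction,
                        initial_obs, horizon=50, branches_per_failure=3,
                        rollback_steps=5):
    """Roll the policy out in closed loop inside the world model; on
    failure, roll back and gather several human corrective continuations
    from the same cached pre-failure state."""
    cache = StateCache()
    state = world_model.reset(initial_obs)
    cache.push(state)
    corrections = []

    for _ in range(horizon):
        action = policy.act(world_model.observe(state))
        state = world_model.step(state, action)
        cache.push(state)

        if is_failure(state):
            # Reuse one failure: branch several short corrective
            # continuations from the same pre-failure state.
            branch_root = cache.rollback(rollback_steps)
            for _ in range(branches_per_failure):
                corrections.append(human_correction(world_model, branch_root))
            break

    return corrections  # appended to the dataset for post-training
```

The design choice this sketch highlights is that the expensive part (rolling a policy out to a failure) happens once, while the cheap part (branching corrective continuations from a cached pre-failure state) is repeated, which is what yields the dense supervision the authors describe.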

Why it matters

Current robot post-training is slow because it relies on physical execution: every correction consumes robot time, scene setup, and human supervision. Hi-WM offers a scalable alternative by shifting human corrections into a learned world model, substantially speeding up policy improvement and reducing real-world resource demands.

Original Abstract

Post-training is essential for turning pretrained generalist robot policies into reliable task-specific controllers, but existing human-in-the-loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action-conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose Human-in-the-World-Model (Hi-WM), a post-training framework that uses a learned world model as a reusable corrective substrate for failure-targeted policy improvement. A policy is first rolled out in closed loop inside the world model; when the rollout becomes incorrect or failure-prone, a human intervenes directly in the model to provide short corrective actions. Hi-WM caches intermediate states and supports rollback and branching, allowing a single failure state to be reused for multiple corrective continuations and yielding dense supervision around behaviors that the base policy handles poorly. The resulting corrective trajectories are then added back to the training set for post-training. We evaluate Hi-WM on three real-world manipulation tasks spanning both rigid and deformable object interaction, and on two policy backbones. Hi-WM improves real-world success by 37.9 points on average over the base policy and by 19.0 points over a world-model closed-loop baseline, while world-model evaluation correlates strongly with real-world performance (r = 0.953). These results suggest that world models can serve not only as generators or evaluators, but also as effective corrective substrates for scalable robot post-training.
