ArXiv TLDR

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

arXiv: 2604.26934

Wanyue Zhang, Wenxiang Wu, Wang Xu, Jiaxin Luo, Helu Zhi + 4 more

cs.CV

TLDR

World2VLM distills spatial imagination from generative world models into VLMs, improving dynamic spatial reasoning without inference-time overhead.

Key contributions

  • Introduces World2VLM, a framework to distill world model imagination into VLMs.
  • Synthesizes geometrically aligned future views and derives structured supervision for forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning (see the sketch after this list).
  • Achieves consistent improvements on dynamic spatial reasoning benchmarks.
  • Outperforms test-time world-model-coupled methods while avoiding their inference-time generation cost.
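
The data-generation pipeline summarized above can be pictured with a short sketch. Everything below is illustrative only: `CameraStep`, `WorldModel`, `sample_trajectory`, and the QA templates are hypothetical stand-ins assumed for exposition, not the paper's actual interfaces or prompts.

```python
# Illustrative sketch of the supervision pipeline described in the paper's abstract.
# All names here are hypothetical stand-ins, not World2VLM's real API.
from dataclasses import dataclass
import random


@dataclass
class CameraStep:
    """A single parameterized egocentric motion (assumed parameterization)."""
    action: str        # e.g. "move_forward", "turn_left"
    magnitude: float   # meters or degrees


def sample_trajectory(num_steps: int = 3) -> list[CameraStep]:
    """Sample a simple parameterized camera trajectory (placeholder logic)."""
    actions = ["move_forward", "turn_left", "turn_right"]
    return [CameraStep(random.choice(actions), round(random.uniform(0.5, 2.0), 1))
            for _ in range(num_steps)]


class WorldModel:
    """Stand-in for a view-consistent generative world model."""
    def rollout(self, initial_view: str, trajectory: list[CameraStep]) -> list[str]:
        # In the real pipeline this synthesizes geometrically aligned future views;
        # here we only return placeholder identifiers.
        return [f"synthesized_view_{i}" for i, _ in enumerate(trajectory, start=1)]


def build_supervision(initial_view: str, trajectory: list[CameraStep], future_views: list[str]):
    """Derive forward (action-to-outcome) and inverse (outcome-to-action) QA pairs."""
    motion_text = ", then ".join(f"{s.action} by {s.magnitude}" for s in trajectory)
    forward = {
        "images": [initial_view],
        "question": f"If the camera performs: {motion_text}, what will the scene look like?",
        "answer": future_views[-1],   # outcome grounded in the synthesized view
    }
    inverse = {
        "images": [initial_view, future_views[-1]],
        "question": "What camera motion transforms the first view into the second?",
        "answer": motion_text,        # action grounded in the sampled trajectory
    }
    return forward, inverse


if __name__ == "__main__":
    traj = sample_trajectory()
    views = WorldModel().rollout("initial_view.png", traj)
    fwd, inv = build_supervision("initial_view.png", traj, views)
    print(fwd["question"])
    print(inv["answer"])
```

Pairs like these, produced in a compact dataset, are what the VLM is post-trained on, so the imagination step happens once at data-generation time rather than at every query.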

Why it matters

VLMs struggle with dynamic spatial reasoning, i.e., imagining how a scene evolves under egocentric motion. This paper introduces an efficient way to teach VLMs this skill by distilling knowledge from world models during training rather than invoking them at inference time. It shows that world models can serve as effective training-time teachers, making VLMs more capable on dynamic spatial tasks without added inference cost.

Original Abstract

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms the test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.
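
To make the efficiency claim in the abstract concrete, the sketch below contrasts a test-time world-model-coupled pipeline with a distilled VLM. The `vlm.answer` and `world_model.rollout` interfaces are assumptions made for illustration, not the actual methods compared in the paper.

```python
# Hypothetical comparison of the two inference paths; interfaces are assumed.

def coupled_inference(vlm, world_model, image, question, trajectory):
    """Baseline style: roll out the world model at test time, then reason over generated views."""
    imagined_views = world_model.rollout(image, trajectory)  # expensive generation on every query
    return vlm.answer(images=[image, *imagined_views], question=question)


def distilled_inference(vlm, image, question):
    """World2VLM style: the post-trained VLM answers directly, having internalized
    spatial imagination during training, so no views are generated at inference time."""
    return vlm.answer(images=[image], question=question)
```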
