ArXiv TLDR

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

arXiv:2604.18486

Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li + 45 more

cs.CV · cs.CL · cs.RO

TLDR

OneVL introduces a unified VLA and World Model framework for autonomous driving, achieving state-of-the-art latent Chain-of-Thought reasoning at real-time, answer-only speed.

Key contributions

  • Unifies VLA and World Model for efficient latent reasoning in autonomous driving.
  • Uses dual decoders (language and visual world model) to supervise compact latent tokens; a sketch follows this list.
  • Forces the latent space to internalize the causal dynamics of road geometry and agent motion.
  • Achieves state-of-the-art accuracy, surpassing explicit CoT, at answer-only inference latency.
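
To make the dual-decoder idea concrete, here is a minimal PyTorch sketch of how compact latent tokens might be supervised by a language decoder (reconstructing the text CoT), a visual world-model decoder (predicting future-frame tokens), and a trajectory head. All module names, dimensions, and loss weights here are hypothetical illustrations of the mechanism, not OneVL's actual implementation.

```python
import torch
import torch.nn as nn


class LatentReasoner(nn.Module):
    """Hypothetical sketch of dual-decoder latent supervision.

    Learned latent query tokens are appended to the observation
    sequence; during training a language decoder reconstructs the text
    CoT and a visual world-model decoder predicts future-frame tokens
    from those same latents, forcing them to carry both linguistic and
    causal-dynamics information.
    """

    def __init__(self, d_model=512, n_latent=16, text_vocab=32000, frame_vocab=8192):
        super().__init__()
        self.n_latent = n_latent
        self.latent_queries = nn.Parameter(torch.randn(n_latent, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Auxiliary decoders: used only during training, discarded at inference.
        self.lang_decoder = nn.Linear(d_model, text_vocab)    # reconstructs text CoT
        self.world_decoder = nn.Linear(d_model, frame_vocab)  # future-frame tokens
        self.traj_head = nn.Linear(d_model, 2)                # one (x, y) waypoint per latent

    def forward(self, obs_tokens):
        # obs_tokens: (batch, n_obs, d_model) embedded camera/state tokens.
        b = obs_tokens.size(0)
        queries = self.latent_queries.unsqueeze(0).expand(b, -1, -1)
        h = self.backbone(torch.cat([obs_tokens, queries], dim=1))
        z = h[:, -self.n_latent:]  # the compact latent reasoning tokens
        return {
            "cot_logits": self.lang_decoder(z),      # language supervision target
            "frame_logits": self.world_decoder(z),   # world-model supervision target
            "traj": self.traj_head(z),               # planned waypoints
        }


def training_loss(out, cot_ids, frame_ids, traj_gt, w_lang=0.5, w_world=0.5):
    """Joint objective: trajectory regression plus the two auxiliary
    reconstruction losses that shape the latent space. For simplicity
    this sketch assumes the CoT and frame targets are aligned to the
    latent length and the ground-truth trajectory is resampled to one
    waypoint per latent token; weights are invented."""
    ce = nn.functional.cross_entropy
    l_traj = nn.functional.mse_loss(out["traj"], traj_gt)
    l_lang = ce(out["cot_logits"].flatten(0, 1), cot_ids.flatten())
    l_world = ce(out["frame_logits"].flatten(0, 1), frame_ids.flatten())
    return l_traj + w_lang * l_lang + w_world * l_world
```

The three-stage pipeline described in the abstract would then schedule these objectives (trajectory, language, visual) progressively rather than applying them all at full weight from the start.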

Why it matters

This paper solves the critical latency issue of Chain-of-Thought reasoning in autonomous driving without sacrificing accuracy. By integrating a world model with language supervision, OneVL creates more robust and generalizable latent representations, paving the way for safer and more efficient real-time VLA systems.

Original Abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided by both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
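
To make the latency claim concrete, here is a hypothetical inference sketch using the `LatentReasoner` module above: the auxiliary decoders are simply never called, and all latent tokens come out of one parallel forward pass, so the cost matches answer-only prediction rather than token-by-token CoT decoding.

```python
import torch


@torch.no_grad()
def plan(model, obs_tokens):
    """One-step planning with the LatentReasoner sketch above.

    Every latent token is prefilled in a single parallel pass; the
    language and world-model decoders are ignored, so inference pays
    only the answer-only cost. An explicit-CoT baseline would instead
    run one autoregressive forward pass per reasoning token before the
    trajectory could be decoded.
    """
    model.eval()
    out = model(obs_tokens)  # single forward pass over observations + latents
    return out["traj"]       # waypoints; the text CoT is never generated


# Illustrative usage with random observation tokens (shapes are invented):
model = LatentReasoner()
waypoints = plan(model, torch.randn(1, 64, 512))  # 64 observation tokens
```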
