DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

May 11, 2026 | arXiv:2605.10564

Lingjun Zhang, Changjie Wu, Linzhe Shi, Jiangyang Li, Jiaxin Liu + 4 more

cs.CV, cs.RO

TLDR

DeepSight improves end-to-end autonomous driving by combining a world model that predicts long-horizon latent states with adaptive text reasoning.

Key contributions

Proposes DeepSight, a driving world model for end-to-end autonomous driving.
Performs parallel prediction of latent semantic features in BEV space for long-horizon future world states.
Introduces an efficient, adaptive text reasoning mechanism using social knowledge for long-tail scenarios.
Achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark.

Why it matters

Most VLM-based autonomous driving systems adapt their reasoning mechanisms directly from general domains rather than tailoring them to driving, particularly in their visual reasoning modules. DeepSight addresses this with long-horizon world modeling in BEV space and adaptive text reasoning, both aimed at improving driving performance in challenging long-tail scenarios.

Original Abstract

End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Codes are available at: https://github.com/hotdogcheesewhite/DeepSight.
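To make "parallel prediction of latent semantic features for consecutive future frames" concrete, here is a toy NumPy sketch that contrasts a one-pass prediction of all future BEV latents with a frame-by-frame rollout. It is not DeepSight's architecture: the function names, the per-frame query embeddings, the shared projection matrix, and all tensor shapes are assumptions chosen for illustration only.

```python
import numpy as np


def predict_future_bev_parallel(bev_latent, future_queries, w_proj):
    """Predict latent BEV features for all future frames in one pass.

    bev_latent:     (H, W, C) current BEV latent feature map
    future_queries: (T, C)    hypothetical embedding, one per future frame
    w_proj:         (C, C)    hypothetical shared projection weights
    returns:        (T, H, W, C)
    """
    # Broadcast the current latent against every frame query: (T, H, W, C).
    conditioned = bev_latent[None, :, :, :] + future_queries[:, None, None, :]
    # One batched matmul covers all T frames; no frame waits on another.
    return np.tanh(conditioned @ w_proj)


def predict_future_bev_autoregressive(bev_latent, future_queries, w_proj):
    """Contrast case: each frame is predicted from the previous prediction."""
    frames = []
    state = bev_latent
    for query in future_queries:
        state = np.tanh((state + query) @ w_proj)
        frames.append(state)
    return np.stack(frames)


# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
H, W, C, T = 4, 4, 8, 6
bev = rng.standard_normal((H, W, C))
queries = rng.standard_normal((T, C))
w = rng.standard_normal((C, C)) * 0.1

parallel = predict_future_bev_parallel(bev, queries, w)
sequential = predict_future_bev_autoregressive(bev, queries, w)
print(parallel.shape, sequential.shape)
```

In this toy setup the two variants agree on the first future frame and diverge afterwards, because the rollout feeds its own predictions back in while the parallel version conditions every frame directly on the current latent.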
