ArXiv TLDR

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

arXiv:2604.28196

Xin Zhou, Dingkang Liang, Xiwu Chen, Feiyang Tan, Dingyuan Zhang + 2 more

cs.CV

TLDR

HERMES++ unifies 3D scene understanding and future geometry prediction in a single driving world model. The authors report that it outperforms specialist methods on both future point cloud prediction and 3D scene understanding.

Key contributions

  • BEV representation consolidates multi-view spatial information for LLM compatibility.
  • LLM-enhanced world queries facilitate knowledge transfer from the understanding branch.
  • Current-to-Future Link bridges the temporal gap, conditioning geometry on semantic context.
  • Joint Geometric Optimization enforces structural integrity with explicit and implicit constraints.
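
To make the first two contributions concrete, here is a minimal NumPy sketch of how a BEV feature grid can be flattened into a token sequence, and how a small set of query vectors can gather information from those tokens through cross-attention. This illustrates the general idea only and is not the authors' implementation: the shapes, the function names `bev_to_tokens` and `cross_attend`, and the use of projection-free single-head attention are all assumptions made for the example.

```python
import numpy as np

def bev_to_tokens(bev: np.ndarray) -> np.ndarray:
    """Flatten a (C, H, W) BEV feature grid into (H*W, C) tokens, row-major."""
    c, h, w = bev.shape
    return bev.reshape(c, h * w).T

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries: np.ndarray, context: np.ndarray):
    """Single-head scaled dot-product cross-attention.

    Hypothetical simplification: no learned projections, so queries and
    context must share the same feature dimension.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (num_queries, num_tokens)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context, weights

rng = np.random.default_rng(0)
bev = rng.standard_normal((8, 4, 4))            # assumed C=8, H=W=4
tokens = bev_to_tokens(bev)                     # (16, 8) token sequence
world_queries = rng.standard_normal((3, 8))     # assumed 3 query vectors
updated, weights = cross_attend(world_queries, tokens)
```

In a real system the token sequence would pass through learned projections before reaching the LLM, and the queries would be trainable parameters. The sketch only shows the data flow from a spatial grid to a sequence that attention can consume.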

Why it matters

HERMES++ targets the gap between semantic understanding and physical simulation in autonomous driving. Existing driving world models mostly generate future scenes without comprehensive 3D scene understanding, while LLMs reason well but cannot predict how scene geometry evolves. Handling both tasks in one framework lets future geometry prediction draw on semantic context, and the authors report that the unified model outperforms specialist approaches on both tasks.

Original Abstract

Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current-to-Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks. The model and code will be publicly released at https://github.com/H-EmbodVis/HERMESV2.
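
The abstract describes Joint Geometric Optimization as a combination of explicit geometric constraints and implicit latent regularization, without giving the loss terms. As a rough sketch of what such a combined objective can look like, the example below pairs a Chamfer distance between predicted and target point clouds (a common choice for point cloud prediction, assumed here) with a cosine-based alignment term between latent features and a geometry-aware prior. The function names, the `weight` parameter, and both loss choices are hypothetical and are not taken from the paper.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    This variant averages squared nearest-neighbor distances in both
    directions; other definitions use unsquared distances.
    """
    diff = a[:, None, :] - b[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)               # (N, M) squared distances
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def latent_alignment(z: np.ndarray, prior: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between matching rows of z and prior."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    pn = prior / np.linalg.norm(prior, axis=1, keepdims=True)
    return float(1.0 - (zn * pn).sum(axis=1).mean())

def joint_loss(pred_pts, gt_pts, z, prior, weight: float = 0.5) -> float:
    """Explicit geometric term plus a weighted implicit latent term (assumed form)."""
    return chamfer_distance(pred_pts, gt_pts) + weight * latent_alignment(z, prior)

# Toy usage with made-up data.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
z = np.array([[1.0, 0.0]])
prior = np.array([[0.0, 1.0]])
loss = joint_loss(pred, gt, z, prior)
```

The explicit term penalizes geometric error directly in 3D space, while the latent term only constrains feature directions, which is one way to keep the two objectives from competing over feature magnitude.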
