ArXiv TLDR

Is Your Driving World Model an All-Around Player?

arXiv:2605.10858

Lingdong Kong, Ao Liang, Tianyi Yan, Hongsi Liu, Wesley Yang + 18 more

cs.CV cs.RO

TLDR

WorldLens is a new benchmark, dataset, and agent for evaluating driving world models beyond visual realism, focusing on physical and behavioral fidelity.

Key contributions

  • Introduced WorldLens, a unified benchmark with 24 dimensions for evaluating driving world models' fidelity.
  • Revealed that no existing model excels universally; each shows weaknesses in physics, geometry, or closed-loop planning.
  • Created WorldLens-26K, a large human-annotated dataset for perceptual alignment.
  • Developed WorldLens-Agent, a vision-language evaluator for scalable, explainable auto-assessment.

Why it matters

This paper addresses a critical gap in driving world model evaluation by moving beyond visual realism to assess physical and behavioral fidelity. It provides a comprehensive benchmark, a large human preference dataset, and an automated evaluator. This work is crucial for developing safer and more reliable autonomous driving systems.

Original Abstract

Today's driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.
