ArXiv TLDR

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

arXiv:2604.08457

Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen + 2 more

cs.CV cs.AI cs.RO

TLDR

CrashSight is a new vision-language benchmark using roadside camera data to evaluate AI models' understanding of traffic crashes.

Key contributions

  • Introduces CrashSight, a large-scale vision-language benchmark for traffic crash understanding.
  • Comprises 250 real-world crash videos from roadside cameras with 13K Q&A pairs.
  • Features a two-tier taxonomy evaluating visual grounding and complex causal/temporal reasoning.
  • Benchmarks 8 VLMs, highlighting their limitations in safety-critical temporal and causal reasoning.
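To make the evaluation setup concrete, here is a minimal sketch of how multiple-choice accuracy might be scored per taxonomy tier. The record fields (`video`, `tier`, `category`, `answer`, `prediction`) are assumptions for illustration; the summary does not specify CrashSight's actual schema.

```python
from collections import defaultdict

# Hypothetical layout for multiple-choice Q&A records; field names are
# illustrative, not CrashSight's actual schema.
qa_pairs = [
    {"video": "crash_001", "tier": 1, "category": "scene_context",
     "answer": "B", "prediction": "B"},
    {"video": "crash_001", "tier": 2, "category": "causal_attribution",
     "answer": "C", "prediction": "A"},
    {"video": "crash_002", "tier": 2, "category": "temporal_progression",
     "answer": "D", "prediction": "D"},
]

def accuracy_by_tier(pairs):
    """Fraction of correct multiple-choice answers per taxonomy tier."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qa in pairs:
        total[qa["tier"]] += 1
        correct[qa["tier"]] += qa["prediction"] == qa["answer"]
    return {tier: correct[tier] / total[tier] for tier in total}

print(accuracy_by_tier(qa_pairs))  # → {1: 1.0, 2: 0.5}
```

Reporting accuracy separately per tier is what surfaces the paper's headline finding: models can score well on Tier 1 visual grounding while lagging on Tier 2 causal and temporal reasoning.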

Why it matters

Existing VLM benchmarks focus on ego-vehicle data, leaving a gap in evaluating infrastructure-centric crash understanding. CrashSight provides a critical tool for developing and testing AI models for cooperative autonomous driving, and its results show that current VLMs still struggle with complex temporal and causal reasoning in these safety-critical scenarios.

Original Abstract

Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
