Fail2Drive: Benchmarking Closed-Loop Driving Generalization

2604.08535

Simon Gerstenecker, Andreas Geiger, Katrin Renz

cs.RO, cs.CV

TLDR

Fail2Drive is a new CARLA benchmark with paired routes and 17 scenario classes to rigorously test and diagnose closed-loop driving generalization under distribution shifts.

Key contributions

  • Introduces Fail2Drive, a novel paired-route benchmark for closed-loop driving generalization in CARLA.
  • Features 200 routes and 17 new scenario classes covering appearance, layout, behavioral, and robustness shifts.
  • Paired-route design isolates shift effects, enabling quantitative diagnosis of generalization failures.
  • Reveals consistent degradation across multiple state-of-the-art models (average success-rate drop of 22.8%) and unexpected failure modes, such as ignoring objects clearly visible in the LiDAR.
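
The paired-route idea can be illustrated with a small sketch. This is not the authors' toolbox or evaluation code: the `PairResult` record, the `success_drop_by_class` helper, and the example data are all hypothetical, and the drop is assumed here to be the absolute difference between the in-distribution and shifted success rates. Because each shifted route has a matched in-distribution counterpart, the drop per scenario class can be attributed to the shift rather than to route difficulty.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Outcome of one route pair (hypothetical record, not the benchmark's format)."""
    scenario_class: str    # e.g. "appearance" or "layout"; labels are illustrative
    base_success: bool     # result on the in-distribution route
    shifted_success: bool  # result on the matched shifted route


def success_drop_by_class(pairs):
    """Return base rate, shifted rate, and their absolute difference per class."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair.scenario_class].append(pair)

    report = {}
    for name, items in grouped.items():
        n = len(items)
        base_rate = sum(p.base_success for p in items) / n
        shifted_rate = sum(p.shifted_success for p in items) / n
        report[name] = {
            "base": base_rate,
            "shifted": shifted_rate,
            "drop": base_rate - shifted_rate,  # assumed: absolute difference
        }
    return report


# Made-up outcomes for illustration only; these are not results from the paper.
example = [
    PairResult("appearance", True, True),
    PairResult("appearance", True, False),
    PairResult("layout", True, False),
    PairResult("layout", False, False),
]
report = success_drop_by_class(example)
for name, stats in report.items():
    print(name, stats)
```

Grouping by scenario class keeps the comparison diagnostic: a large drop in one class points at a specific kind of shift, while a pair that already fails on the in-distribution route contributes nothing to the drop.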

Why it matters

Existing CARLA benchmarks typically reuse training scenarios at test time, so a high score can reflect memorization rather than robust driving. Fail2Drive matches each shifted route with an in-distribution counterpart, so any performance drop can be attributed to the shift itself. This turns qualitative failures of state-of-the-art models into quantitative diagnostics that follow-up work can build on.

Original Abstract

Generalization under distribution shift remains a central bottleneck for closed-loop autonomous driving. Although simulators like CARLA enable safe and scalable testing, existing benchmarks rarely measure true generalization: they typically reuse training scenarios at test time. Success can therefore reflect memorization rather than robust driving behavior. We introduce Fail2Drive, the first paired-route benchmark for closed-loop generalization in CARLA, with 200 routes and 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts. Each shifted route is matched with an in-distribution counterpart, isolating the effect of the shift and turning qualitative failures into quantitative diagnostics. Evaluating multiple state-of-the-art models reveals consistent degradation, with an average success-rate drop of 22.8%. Our analysis uncovers unexpected failure modes, such as ignoring objects clearly visible in the LiDAR and failing to learn the fundamental concepts of free and occupied space. To accelerate follow-up work, Fail2Drive includes an open-source toolbox for creating new scenarios and validating solvability via a privileged expert policy. Together, these components establish a reproducible foundation for benchmarking and improving closed-loop driving generalization. We open-source all code, data, and tools at https://github.com/autonomousvision/fail2drive.