ArXiv TLDR

Leveraging Previous-Traversal Point Cloud Map Priors for Camera-Based 3D Object Detection and Tracking

arXiv:2604.25405

Markus Käppeler, Özgün Çiçek, Yakov Miron, Abhinav Valada

cs.CV, cs.RO

TLDR

DualViewMapDet enhances camera-only 3D object detection and tracking by fusing prior point cloud map data in both perspective and bird's-eye views.

Key contributions

  • Introduces DualViewMapDet, a camera-only framework leveraging prior point cloud maps for 3D object detection.
  • Proposes a dual-space camera-map fusion: the map is projected to perspective view (PV) to enrich image features and encoded directly in bird's-eye view (BEV); see the sketch after this list.
  • Fuses PV-enriched image features and BEV map features in a shared metric space to mitigate depth ambiguity.
  • Achieves significant improvements in 3D object localization over camera-only baselines on nuScenes and Argoverse 2.
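
To make the PV side concrete, here is a minimal sketch (not the authors' released code) of projecting a prior point cloud map into a camera's perspective view to rasterize multi-channel geometric cues. The pinhole projection, the depth/height channel choice, and all names (`project_map_to_pv`, `T_cam_map`) are illustrative assumptions.

```python
# Hypothetical sketch of map-to-PV projection; not the paper's implementation.
import numpy as np

def project_map_to_pv(map_points, T_cam_map, K, img_h, img_w):
    """Rasterize (N, 3) map points into a (2, H, W) cue image: depth and height."""
    # Transform homogeneous map points into the camera frame.
    pts_h = np.concatenate([map_points, np.ones((len(map_points), 1))], axis=1)
    pts_cam = (T_cam_map @ pts_h.T).T[:, :3]
    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 0.1
    pts_cam = pts_cam[in_front]
    heights = map_points[in_front, 2]  # height taken from the map frame
    # Pinhole projection onto the image plane.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    cues = np.zeros((2, img_h, img_w), dtype=np.float32)
    # Z-buffer by writing far-to-near so closer points overwrite farther ones.
    order = np.argsort(-pts_cam[valid, 2])
    uu, vv = u[valid][order], v[valid][order]
    cues[0, vv, uu] = pts_cam[valid, 2][order]  # depth channel
    cues[1, vv, uu] = heights[valid][order]     # height channel
    return cues
```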

Why it matters

This paper tackles a major limitation in camera-based 3D perception by effectively using prior point cloud maps. Its novel dual-view fusion significantly improves object localization without expensive online LiDAR, making robust 3D perception more accessible and practical for autonomous vehicles.

Original Abstract

Camera-based 3D object detection and tracking are central to autonomous driving, yet precise 3D object localization remains fundamentally constrained by depth ambiguity when no expensive, depth-rich online LiDAR is available at inference. In many deployments, however, vehicles repeatedly traverse the same environments, making static point cloud maps from prior traversals a practical source of geometric priors. We propose DualViewMapDet, a camera-only inference framework that retrieves such map priors online and leverages them to mitigate the absence of a LiDAR sensor during deployment. The key idea is a dual-space camera-map fusion strategy that avoids one-sided view conversion. Specifically, we (i) project the map into perspective view (PV) and encode multi-channel geometric cues to enrich image features and support BEV lifting, and (ii) encode the map directly in bird's-eye view (BEV) with a sparse voxel backbone and fuse it with lifted camera features in a shared metric space. Extensive evaluations on nuScenes and Argoverse 2 demonstrate consistent improvements over strong camera-only baselines, with particularly strong gains in object localization. Ablations further validate the contributions of PV/BEV fusion and prior-map coverage. We make the code and pre-trained models available at https://dualviewmapdet.cs.uni-freiburg.de.
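
As a companion to step (ii) of the abstract, the following is a hedged sketch of encoding the retrieved map directly in BEV and fusing it with lifted camera features on a shared metric grid. A dense occupancy scatter plus 2D convolutions stands in for the paper's sparse voxel backbone; the class name `BEVMapFusion`, the grid resolution, and the concat-then-conv fusion are assumptions, not the paper's implementation.

```python
# Hypothetical BEV map encoding and fusion; illustrative only.
import torch
import torch.nn as nn

class BEVMapFusion(nn.Module):
    def __init__(self, cam_ch=128, map_ch=32, out_ch=128, cell=0.5):
        super().__init__()
        self.cell = cell  # metric size of one BEV cell (meters)
        self.map_encoder = nn.Sequential(  # stand-in for a sparse voxel backbone
            nn.Conv2d(1, map_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(map_ch, map_ch, 3, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + map_ch, out_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, cam_bev, map_points_ego):
        """cam_bev: (B, C, H, W) lifted camera features;
        map_points_ego: list of (N_i, 3) map points in the ego frame."""
        B, _, H, W = cam_bev.shape
        occ = cam_bev.new_zeros(B, 1, H, W)
        for b, pts in enumerate(map_points_ego):
            # Scatter map points into an ego-centered occupancy grid.
            x = (pts[:, 0] / self.cell + W / 2).long()
            y = (pts[:, 1] / self.cell + H / 2).long()
            m = (x >= 0) & (x < W) & (y >= 0) & (y < H)
            occ[b, 0, y[m], x[m]] = 1.0
        map_bev = self.map_encoder(occ)
        # Fuse in a shared metric space: concatenate channels, then convolve.
        return self.fuse(torch.cat([cam_bev, map_bev], dim=1))
```

Because both feature maps live on the same ego-centered metric grid, the fusion needs no view conversion at this stage; the geometric prior directly constrains where objects can plausibly sit in depth.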
