ArXiv TLDR

Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

arXiv:2605.07978

Qiwei Wang, Zhongyao Tuo, Xianghui Ze, Yujiao Shi

cs.CV

TLDR

Cross3R reconstructs 3D scenes and estimates camera poses from satellite, drone, and ground images, overcoming the 3-DoF limitation of traditional cross-view localization.

Key contributions

  • Introduces Cross3R, a feed-forward model for 3D reconstruction and 6-DoF camera pose estimation.
  • Uses a single UAV image as an intermediate viewpoint, supplying the 3D structure, roll, pitch, and altitude cues that nadir satellite imagery lacks.
  • Constructs CrossGeo, a large-scale tri-view dataset (278K images, 85 scenes) for training and evaluation.
  • Cross3R outperforms feed-forward baselines in 3D reconstruction and cross-view localization, and generalizes zero-shot to KITTI.

Why it matters

Traditional cross-view localization is limited to 3-DoF estimates (planar position and yaw) because nadir satellite imagery offers no cues for roll, pitch, or altitude. This paper addresses that gap by integrating drone imagery as an intermediate viewpoint, enabling full 3D reconstruction and 6-DoF pose estimation. This advance matters for robust navigation and mapping on real-world terrain with slopes, ramps, and tilted camera mounts.

Original Abstract

Cross-view localization classically asks: where does this ground image lie on the satellite tile? Existing methods are typically limited to 3-DoF estimates -- an $(x,y)$ position and a yaw angle -- because nadir satellite imagery provides no direct cues for roll, pitch, or altitude, forcing a reliance on planar-motion and zero-tilt assumptions. These assumptions break on real terrain with slopes, ramps, and tilted camera mounts. To overcome this, we introduce a single UAV image as an intermediate viewpoint: it reveals the 3D structure invisible from nadir, supplies the cues for roll, pitch, and altitude that the satellite alone cannot provide, and needs only spatial overlap with the ground camera -- no known relative pose is required. Building on this insight, we propose **Cross3R**, a flexible feed-forward model that ingests a satellite tile together with a UAV image, a ground image, or both, and, in a single forward pass, recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile $(x,y)$ position and yaw of each perspective camera. For training and evaluation, we also construct **CrossGeo**, a 278K-image tri-view dataset spanning 85 scenes across every continent except Antarctica. On CrossGeo, Cross3R consistently outperforms feed-forward 3D baselines in point-cloud reconstruction, 6-DoF camera-pose estimation, and cross-view localization. On KITTI, it outperforms dedicated cross-view methods trained on KITTI on most metrics, despite having no KITTI training itself.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers of the week, summarized, scored, and delivered to your inbox every Monday.