# Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
Qiwei Wang, Zhongyao Tuo, Xianghui Ze, Yujiao Shi
## TLDR
Cross3R reconstructs 3D scenes and estimates camera poses from satellite, drone, and ground images in a single feed-forward pass, using the drone view to overcome the 3-DoF limits of traditional cross-view localization.
## Key contributions
- Introduces Cross3R, a feed-forward model for 3D reconstruction and 6-DoF camera pose estimation.
- Uses a single UAV image as an intermediate viewpoint that reveals 3D structure invisible from nadir, supplying cues for roll, pitch, and altitude.
- Develops CrossGeo, a large-scale (278K images) tri-view dataset for training and evaluation.
- Cross3R outperforms feed-forward baselines in 3D reconstruction, 6-DoF pose estimation, and cross-view localization, and generalizes to KITTI without any KITTI training.
## Why it matters
Traditional cross-view localization is limited to 3-DoF estimates (planar position and yaw) because nadir satellite imagery provides no cues for roll, pitch, or altitude. By integrating drone imagery as an intermediate viewpoint, this work enables full 3D reconstruction and 6-DoF pose estimation, which is crucial for robust navigation and mapping on real terrain with slopes, ramps, and tilted camera mounts.
## Original Abstract
Cross-view localization classically asks: where does this ground image lie on the satellite tile? Existing methods are typically limited to 3-DoF estimates -- an $(x,y)$ position and a yaw angle -- because nadir satellite imagery provides no direct cues for roll, pitch, or altitude, forcing a reliance on planar-motion and zero-tilt assumptions. These assumptions break on real terrain with slopes, ramps, and tilted camera mounts. To overcome this, we introduce a single UAV image as an intermediate viewpoint: it reveals the 3D structure invisible from nadir, supplies the cues for roll, pitch, and altitude that the satellite alone cannot provide, and needs only spatial overlap with the ground camera -- no known relative pose is required. Building on this insight, we propose **Cross3R**, a flexible feed-forward model that ingests a satellite tile together with a UAV image, a ground image, or both, and, in a single forward pass, recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile $(x,y)$ position and yaw of each perspective camera. For training and evaluation, we also construct **CrossGeo**, a 278K-image tri-view dataset spanning 85 scenes across every continent except Antarctica. On CrossGeo, Cross3R consistently outperforms feed-forward 3D baselines in point-cloud reconstruction, 6-DoF camera-pose estimation, and cross-view localization. On KITTI, it outperforms dedicated cross-view methods trained on KITTI on most metrics, despite having no KITTI training itself.
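To make the model's flexible input/output contract concrete, here is a minimal sketch of the single-pass interface the abstract describes. Everything here is hypothetical: the function name `cross3r_forward`, the output keys, and the tensor shapes are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def cross3r_forward(satellite, uav=None, ground=None):
    """Hypothetical sketch of Cross3R's single forward pass.

    The satellite tile is required; the UAV and ground views are
    optional, matching the paper's flexible input configurations
    (satellite + UAV, satellite + ground, or all three).
    """
    perspective_views = [v for v in (uav, ground) if v is not None]
    n_persp = len(perspective_views)
    h, w, _ = satellite.shape
    # Placeholder arrays standing in for the network's predictions:
    return {
        # cross-view 3D point cloud (here one dummy point per satellite pixel)
        "points": np.zeros((h, w, 3)),
        # 6-DoF pose (4x4 rigid transform) for every input camera
        "poses": np.tile(np.eye(4), (1 + n_persp, 1, 1)),
        # on-tile (x, y) position and yaw for each perspective camera
        "xy_yaw": np.zeros((n_persp, 3)),
    }

sat = np.zeros((256, 256, 3))
uav = np.zeros((256, 256, 3))
gnd = np.zeros((256, 256, 3))
out = cross3r_forward(sat, uav=uav, ground=gnd)
print(out["poses"].shape)  # one pose per input camera: satellite, UAV, ground
```

The point of the sketch is the asymmetry the abstract emphasizes: all input cameras get 6-DoF poses, but only the perspective (UAV/ground) cameras get an on-tile position and yaw, since the satellite tile itself defines that reference frame.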