Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun + 2 more
TLDR
This paper ablates the factors that drive feed-forward 3D visual geometry estimation (data, losses, and supervision) and introduces CARVE, a resolution-enhanced model that performs strongly and robustly across diverse benchmarks.
Key contributions
- Identified critical factors for 3D visual geometry estimation, including data scaling, loss mechanisms, and supervision strategies.
- Found that data diversity and quality are crucial, while the commonly adopted confidence-aware and gradient-based losses may unintentionally hinder performance.
- Proposed a consistency loss that enforces alignment between depth maps, camera parameters, and point maps, along with an efficient architecture for leveraging high-resolution inputs.
- Introduced CARVE, a resolution-enhanced model, achieving strong performance across diverse benchmarks.
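The consistency loss in the third contribution is described here only at a high level. As a minimal sketch of what agreement between depth, camera parameters, and a point map could look like, the block below back-projects a depth map through pinhole intrinsics and measures the mean L1 gap to a predicted point map. The pinhole model, the camera-frame point map, the L1 penalty, and the names `unproject` and `consistency_loss` are all illustrative assumptions, not the paper's definition.

```python
def unproject(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) into camera-space 3D points.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); pixel (u, v) is column u, row v.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def consistency_loss(depth, intrinsics, point_map):
    """Mean L1 gap between unprojected depth and a predicted point map.

    Hypothetical formulation: point_map is a flat list of (x, y, z)
    tuples in row-major pixel order, in the same camera frame.
    """
    fx, fy, cx, cy = intrinsics
    derived = unproject(depth, fx, fy, cx, cy)
    total = sum(
        abs(a - b)
        for p, q in zip(derived, point_map)
        for a, b in zip(p, q)
    )
    return total / (3 * len(derived))
```

When the three predictions agree exactly the term is zero; any disagreement among depth, intrinsics, and point map raises it, which is the kind of coupling the contribution describes.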
Why it matters
This paper systematically investigates the critical factors in 3D visual geometry estimation, targeting the gap between multi-frame models, which usually produce better cross-frame consistency, and strong per-frame methods, which often lead on single-frame accuracy. It introduces CARVE, a model that combines a consistency loss with an efficient high-resolution architecture and achieves strong, robust performance across diverse benchmarks. The work contributes both the ablation insights and a high-performing model.
Original Abstract
Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
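The third insight in the abstract contrasts per-sequence and per-frame alignment without stating how alignment is computed. The sketch below assumes alignment means a least-squares scale fit between predicted and ground-truth depths: the per-sequence term fits one scale across all frames, the per-frame term fits one scale per frame, and the two are summed with equal weights. The scale-only alignment, the equal weighting, and every function name are assumptions made for illustration.

```python
def lstsq_scale(pred, gt):
    """Scale s minimizing sum((s * p - g)^2) over paired values."""
    denom = sum(p * p for p in pred)
    if denom == 0:
        return 1.0  # degenerate prediction; fall back to no rescaling
    return sum(p * g for p, g in zip(pred, gt)) / denom


def mean_abs_error(pred, gt, scale):
    """Mean absolute error after rescaling the prediction."""
    return sum(abs(scale * p - g) for p, g in zip(pred, gt)) / len(pred)


def joint_alignment_loss(pred_frames, gt_frames):
    """Per-sequence term plus per-frame term (equal weights assumed)."""
    flat_pred = [p for frame in pred_frames for p in frame]
    flat_gt = [g for frame in gt_frames for g in frame]

    # One scale shared by the whole sequence.
    seq_scale = lstsq_scale(flat_pred, flat_gt)
    per_sequence = mean_abs_error(flat_pred, flat_gt, seq_scale)

    # An independent scale for each frame, averaged over frames.
    per_frame = sum(
        mean_abs_error(p, g, lstsq_scale(p, g))
        for p, g in zip(pred_frames, gt_frames)
    ) / len(pred_frames)

    return per_sequence + per_frame
```

In this sketch the per-frame term ignores scale drift between frames while the per-sequence term penalizes it, so a sequence whose frames are each correct up to a different scale still incurs a nonzero joint loss.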