Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

April 23, 2026

arXiv:2604.21713

Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun + 2 more

cs.CV

TLDR

This paper ablates the factors that drive 3D visual geometry estimation (data, loss design, and supervision) and introduces CARVE, a resolution-enhanced feed-forward model that performs robustly across diverse benchmarks.

Key contributions

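Identifies the factors that most affect 3D visual geometry estimation: data scaling, loss design, and supervision strategy.
Finds that scaling data diversity and quality improves even state-of-the-art methods, while the commonly used confidence-aware and gradient-based losses can hinder performance.
Proposes a consistency loss that enforces alignment between depth maps, camera parameters, and point maps, plus an efficient architecture that leverages high-resolution inputs.
Introduces CARVE, a resolution-enhanced feed-forward model with strong results on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation.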

Why it matters

Multi-frame models usually produce better cross-frame consistency, yet they often trail strong per-frame methods on single-frame accuracy. This paper uses ablation studies to identify what drives that gap, then combines the findings with a consistency loss and a high-resolution architecture in CARVE, which performs strongly on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation benchmarks.

Original Abstract

Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
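The abstract does not give the form of the consistency loss, so the following is only a rough sketch of the general idea: if a model predicts a depth map, camera parameters, and a point map, then unprojecting the depth through the camera should reproduce the point map. The pinhole camera model, the camera-to-world pose convention, the L1 penalty, and the function names `unproject_depth` and `consistency_loss` are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H, W, 3).

    Assumes a pinhole camera with intrinsics K (3x3) and a 4x4
    camera-to-world pose; both conventions are illustrative choices.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T               # camera-frame rays with z = 1
    cam_points = rays * depth[..., None]             # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return cam_points @ R.T + t                      # rotate and translate to world

def consistency_loss(depth, K, cam_to_world, point_map):
    """Mean L1 gap between the point map and the depth-derived points."""
    derived = unproject_depth(depth, K, cam_to_world)
    return float(np.abs(derived - point_map).mean())

# Toy example: a flat surface 2 units in front of a camera shifted along x.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.5, 0.0, 0.0]
depth = np.full((3, 4), 2.0)

points = unproject_depth(depth, K, pose)
loss_consistent = consistency_loss(depth, K, pose, points)
loss_shifted = consistency_loss(depth, K, pose, points + 0.1)
print(loss_consistent, loss_shifted)
```

In this toy setup the loss vanishes when the three predictions agree and grows once the point map is offset from the depth-derived geometry. A training loss built on this idea would presumably operate on batched tensors in an autodiff framework and mask invalid pixels; those details are left out here.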

