Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, and 6 more authors
TLDR
Scal3R introduces a scalable test-time training approach for large-scale 3D reconstruction, using a neural global context to improve accuracy and consistency.
Key contributions
- Proposes Scal3R, a novel neural global context representation for large-scale 3D reconstruction.
- Efficiently compresses long-range scene information, enhancing accuracy and consistency over long sequences.
- Utilizes lightweight neural sub-networks adapted at test-time via self-supervision, boosting memory capacity.
- Achieves state-of-the-art 3D reconstruction and leading pose accuracy on ultra-large scenes, e.g. the KITTI Odometry benchmark.
Why it matters
This paper addresses a key challenge in 3D reconstruction: maintaining accuracy and consistency over very long video sequences. By introducing a scalable global context representation, Scal3R enables models to leverage extensive scene information. This significantly advances the state-of-the-art for large-scale 3D mapping and autonomous driving applications.
Original Abstract
This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit a global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. Experiments on multiple large-scale benchmarks, including the KITTI Odometry and Oxford Spires datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.
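The core mechanism described above, storing context in the weights of small sub-networks that are updated by gradient descent at test time, can be illustrated with a toy sketch. This is not the authors' implementation: the `NeuralMemory` class, its linear form, and all hyperparameters below are illustrative assumptions, showing only how a self-supervised reconstruction objective can "write" a stream of features into network parameters rather than into an explicit token buffer.

```python
# Toy sketch (assumed, not from the paper): a linear "neural memory" whose
# weights are adapted at test time by gradient descent on a self-supervised
# reconstruction loss, so long-range context is compressed into parameters.
import numpy as np

rng = np.random.default_rng(0)


class NeuralMemory:
    """Tiny sub-network W that stores key -> value associations in its weights."""

    def __init__(self, dim: int, lr: float = 0.5):
        self.W = np.zeros((dim, dim))
        self.lr = lr

    def write(self, key: np.ndarray, value: np.ndarray, steps: int = 50) -> None:
        # Test-time training: minimize ||W @ key - value||^2 by gradient descent.
        # Step size is normalized by ||key||^2 for stable convergence.
        for _ in range(steps):
            err = self.W @ key - value
            self.W -= (self.lr / (key @ key)) * np.outer(err, key)

    def read(self, key: np.ndarray) -> np.ndarray:
        return self.W @ key


dim = 16
mem = NeuralMemory(dim)
keys = rng.standard_normal((4, dim))
vals = rng.standard_normal((4, dim))

# "Observe" a stream of feature pairs; each write is a burst of test-time updates.
for k, v in zip(keys, vals):
    mem.write(k, v)

# The most recent association is recalled almost exactly; capacity scales with
# the number of parameters, not with the length of an explicit context buffer.
recall_err = float(np.linalg.norm(mem.read(keys[-1]) - vals[-1]))
print(f"recall error for last write: {recall_err:.2e}")
```

Each `write` call mimics the rapid test-time adaptation described in the abstract; earlier writes are gradually interfered with, which is the capacity/forgetting trade-off that the paper's lightweight sub-network design is meant to manage.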