AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai + 5 more
TLDR
AnyRecon is a novel video diffusion model for scalable, robust 3D reconstruction from arbitrary sparse views, using global scene memory and geometry-aware conditioning.
Key contributions
- Introduces AnyRecon, a scalable framework for 3D reconstruction from arbitrary and unordered sparse inputs.
- Uses a persistent global scene memory with a capture view cache for long-range conditioning and consistency.
- Employs a geometry-aware conditioning strategy, coupling generation and reconstruction via 3D geometric memory.
- Achieves efficiency with 4-step diffusion distillation and context-window sparse attention.
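The geometry-driven capture-view retrieval mentioned above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: it assumes retrieval ranks cached capture views by camera-center distance to the query pose, whereas the paper's actual criterion operates on its 3D geometric memory.

```python
import numpy as np

def retrieve_capture_views(query_pose, capture_poses, k=2):
    """Hypothetical sketch of geometry-driven capture-view retrieval:
    rank cached capture views by camera-center distance to the query
    pose and return the k nearest indices. (Assumed criterion; the
    paper's retrieval uses its explicit 3D geometric memory.)"""
    # Camera centers are the translation columns of 4x4 camera-to-world poses.
    query_center = query_pose[:3, 3]
    centers = capture_poses[:, :3, 3]
    dists = np.linalg.norm(centers - query_center, axis=1)
    return np.argsort(dists)[:k]

# Toy usage: four cached poses along the x-axis, query at x = 1.2.
poses = np.stack([np.eye(4) for _ in range(4)])
for i, x in enumerate([0.0, 1.0, 1.5, 5.0]):
    poses[i, 0, 3] = x
query = np.eye(4)
query[0, 3] = 1.2
print(retrieve_capture_views(query, poses, k=2))  # → [1 2]
```

The design point is that conditioning views are selected by geometric proximity rather than capture order, which is what lets the model handle arbitrary and unordered sparse inputs.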
Why it matters
This paper addresses critical limitations in sparse-view 3D reconstruction, enabling more robust and scalable scene modeling from casual captures. By integrating explicit geometric control with advanced diffusion techniques, AnyRecon significantly improves consistency and efficiency for large, diverse scenes.
Original Abstract
Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
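The combination of a prepended capture view cache with context-window sparse attention can be sketched as an attention mask. This is a hypothetical illustration under stated assumptions, not the paper's actual mask: it assumes cache tokens are globally attendable while generated frames attend only within a local window, which is one common way such sparsity is realized.

```python
import numpy as np

def sparse_attention_mask(n_cache, n_frames, window):
    """Hypothetical sketch: boolean attention mask combining a prepended
    capture-view cache (attendable from every position) with a local
    context window over generated frames, so attention cost grows
    roughly linearly in sequence length instead of quadratically."""
    n = n_cache + n_frames
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_cache] = True  # all tokens attend to the cache (assumption)
    mask[:n_cache, :] = True  # cache tokens attend everywhere (assumption)
    for i in range(n_frames):
        lo = max(0, i - window)
        hi = min(n_frames, i + window + 1)
        # Frame i attends to frames within +/- `window` of itself.
        mask[n_cache + i, n_cache + lo:n_cache + hi] = True
    return mask

m = sparse_attention_mask(n_cache=2, n_frames=5, window=1)
print(m.sum(), "attended pairs vs", m.size, "for dense attention")
```

Under this assumed layout, the cache provides the long-range conditioning while the window bounds the per-frame attention cost.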