Predicting 3D structure by latent posterior sampling
TLDR
This paper introduces a method for 3D structure prediction that combines a NeRF-based scene representation with diffusion models, performing probabilistic posterior sampling over latent scene variables.
Key contributions
- Combines NeRF-based 3D scene representation with diffusion models for probabilistic inference.
- Treats 3D reconstruction as a perception problem, explicitly modeling inherent uncertainty.
- Learns a prior over stochastic latent 3D scene variables using a two-stage training process (sketched after this list).
- Performs accurate 3D reconstruction from diverse observations, including single-view, multi-view, and noisy images, sparse pixels, and sparse depth data.
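The two-stage training mentioned above can be pictured with a small, self-contained sketch. Everything here is an illustrative assumption rather than the authors' implementation: toy MLPs stand in for the latent-conditioned NeRF and the denoising network, the data is synthetic, and the cosine noise schedule is assumed. Only the overall structure is taken from the paper: stage one auto-decodes one latent per scene while fitting the reconstruction model, stage two fits a diffusion prior on those latents.

```python
# Minimal sketch of the two-stage training scheme (toy networks, synthetic data,
# assumed noise schedule; not the authors' code).
import torch
import torch.nn as nn

latent_dim, num_scenes, ray_dim = 64, 100, 6

# Toy stand-in for a latent-conditioned NeRF: maps (latent, ray) -> RGB.
recon_model = nn.Sequential(nn.Linear(latent_dim + ray_dim, 128), nn.ReLU(), nn.Linear(128, 3))

# Stage 1: auto-decode one latent per scene while training the reconstruction model.
latents = nn.Parameter(0.01 * torch.randn(num_scenes, latent_dim))
opt = torch.optim.Adam([latents, *recon_model.parameters()], lr=1e-3)

for step in range(1000):
    ids = torch.randint(0, num_scenes, (32,))
    rays = torch.randn(32, ray_dim)                     # placeholder ray parameterization
    target = torch.rand(32, 3)                          # placeholder observed pixel colors
    pred = recon_model(torch.cat([latents[ids], rays], dim=-1))
    loss = ((pred - target) ** 2).mean()                # photometric reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fit a diffusion (denoising) prior over the frozen latents.
prior = nn.Sequential(nn.Linear(latent_dim + 1, 128), nn.ReLU(), nn.Linear(128, latent_dim))
opt_p = torch.optim.Adam(prior.parameters(), lr=1e-3)
z_data = latents.detach()

for step in range(1000):
    z0 = z_data[torch.randint(0, num_scenes, (32,))]
    t = torch.rand(32, 1)                               # diffusion time in [0, 1]
    alpha, sigma = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)  # assumed schedule
    eps = torch.randn_like(z0)
    zt = alpha * z0 + sigma * eps                       # forward noising
    loss = ((prior(torch.cat([zt, t], dim=-1)) - eps) ** 2).mean()  # epsilon-prediction loss
    opt_p.zero_grad(); loss.backward(); opt_p.step()
```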
Why it matters
This paper offers a novel approach to 3D reconstruction by explicitly modeling uncertainty, which is crucial for robust perception. It demonstrates strong performance across various challenging input conditions, making it a versatile tool for 3D vision tasks.
Original Abstract
The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.
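The posterior sampling described in the abstract can be illustrated with a rough, guidance-style sketch: a reverse-diffusion update from the learned latent prior, nudged by the gradient of a rendering-based likelihood. The networks below are untrained stand-ins, the schedule and update rule follow a generic denoising-with-likelihood-guidance recipe, and the guidance weight is an arbitrary assumption; the authors' exact score-based formulation may differ.

```python
# Rough sketch of posterior sampling over the scene latent: a reverse-diffusion
# update whose step is nudged by the gradient of an observation likelihood
# computed through a differentiable renderer. All networks, the schedule, and
# the guidance weight are illustrative assumptions.
import torch
import torch.nn as nn

latent_dim, ray_dim, steps = 64, 6, 50

prior = nn.Sequential(nn.Linear(latent_dim + 1, 128), nn.ReLU(), nn.Linear(128, latent_dim))
renderer = nn.Sequential(nn.Linear(latent_dim + ray_dim, 128), nn.ReLU(), nn.Linear(128, 3))

obs_rays = torch.randn(16, ray_dim)      # placeholder observed rays (e.g. one view, sparse pixels)
obs_rgb = torch.rand(16, 3)              # placeholder observed colors
guidance_weight = 1.0                    # assumed strength of the likelihood term

z = torch.randn(1, latent_dim)           # start from pure noise
for i in reversed(range(steps)):
    t = torch.full((1, 1), (i + 1) / steps)
    alpha, sigma = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)

    # Prior step: predict the noise and the corresponding clean latent.
    with torch.no_grad():
        eps_hat = prior(torch.cat([z, t], dim=-1))
    z0_hat = (z - sigma * eps_hat) / alpha.clamp(min=1e-3)

    # Likelihood term: differentiate a rendering loss w.r.t. the predicted latent.
    z0_hat = z0_hat.detach().requires_grad_(True)
    rendered = renderer(torch.cat([z0_hat.expand(len(obs_rays), -1), obs_rays], dim=-1))
    log_lik = -((rendered - obs_rgb) ** 2).sum()
    grad = torch.autograd.grad(log_lik, z0_hat)[0]

    # Move to the next (lower) noise level, nudged toward the observations.
    t_next = torch.full((1, 1), i / steps)
    a_next, s_next = torch.cos(t_next * torch.pi / 2), torch.sin(t_next * torch.pi / 2)
    z = a_next * z0_hat.detach() + s_next * eps_hat + guidance_weight * grad
```

Running the loop with different random seeds would yield different latents, which is how a sampler of this kind expresses the varying uncertainty the paper emphasizes for under-constrained inputs.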