ArXiv TLDR

Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

arXiv: 2604.09473

Zhengxian Yang, Shengqi Wang, Shi Pan, Hongshuai Li, Haoxiang Wang + 6 more

cs.CV

TLDR

This paper introduces Immersive Volumetric Videos (IVV), a new format and framework for creating 6-DoF VR experiences from real-world captured video with synchronized audio.

Key contributions

  • Introduces Immersive Volumetric Videos (IVV), a new format for 6-DoF VR with dynamic content and audiovisual feedback.
  • Presents ImViD, a multi-view, multi-modal dataset with 5K@60FPS videos for IVV construction.
  • Develops a dynamic light field reconstruction framework using a Gaussian-based spatio-temporal representation.
  • Proposes, to the authors' knowledge, the first method for sound field reconstruction from multi-view audiovisual data.
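To build intuition for a Gaussian-based spatio-temporal representation, here is a minimal sketch of a time-varying 3D Gaussian primitive. The linear motion model and all names (`DynamicGaussian`, `mean_at`, `density`) are illustrative assumptions for this digest, not the paper's actual formulation, which additionally uses flow-guided initialization and multi-term supervision.

```python
# Toy time-varying 3D Gaussian: the mean drifts linearly over time (an
# assumed motion model), and density is an anisotropic Gaussian falloff.
from dataclasses import dataclass
import numpy as np

@dataclass
class DynamicGaussian:
    mu0: np.ndarray        # (3,) mean position at t = 0
    velocity: np.ndarray   # (3,) per-second linear drift (toy motion model)
    scale: np.ndarray      # (3,) per-axis standard deviations
    opacity: float         # blending weight in [0, 1]

    def mean_at(self, t: float) -> np.ndarray:
        """Position of the Gaussian's center at time t."""
        return self.mu0 + t * self.velocity

    def density(self, x: np.ndarray, t: float) -> float:
        """Unnormalized density contribution at point x, time t."""
        d = (x - self.mean_at(t)) / self.scale
        return self.opacity * float(np.exp(-0.5 * d @ d))

g = DynamicGaussian(
    mu0=np.zeros(3),
    velocity=np.array([1.0, 0.0, 0.0]),
    scale=np.ones(3),
    opacity=0.8,
)
# At t = 0.5 the center has moved to [0.5, 0, 0], so a query there
# returns the full opacity value.
print(g.density(np.array([0.5, 0.0, 0.0]), t=0.5))
```

A real renderer would splat millions of such primitives per frame with learned covariances and view-dependent color; this sketch only shows how time enters the representation.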

Why it matters

This work defines and provides a practical methodology for Immersive Volumetric Videos, enabling high-quality, 6-DoF VR experiences from real-world captures. It addresses a gap in creating truly immersive content beyond computer-generated scenes, pushing the boundaries of VR/AR realism.

Original Abstract

Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos (IVV), a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground--background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.
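The abstract's sound field reconstruction from synchronized multi-view microphones can be pictured with a deliberately naive baseline: mixing the captured signals with inverse-distance weights at a virtual listener position. This is purely an intuition aid under assumed names (`render_listener_audio`, `mic_positions`); the paper's actual method is not specified in this digest.

```python
# Naive 6-DoF audio rendering: distance-weighted mixing of time-aligned
# microphone signals. A toy baseline, not the paper's reconstruction method.
import numpy as np

def render_listener_audio(mic_positions, mic_signals, listener_pos, eps=1e-6):
    """Mix multi-view mic signals with normalized inverse-distance weights.

    mic_positions : (M, 3) microphone locations
    mic_signals   : (M, N) synchronized mono signals
    listener_pos  : (3,)   virtual listener position
    """
    dists = np.linalg.norm(mic_positions - listener_pos, axis=1)
    weights = 1.0 / (dists + eps)      # nearer mics contribute more
    weights /= weights.sum()           # normalize to preserve loudness
    return weights @ mic_signals       # (N,) rendered signal

mics = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
sigs = np.stack([np.ones(4), np.zeros(4)])   # mic 0 hears a tone, mic 1 silence
out = render_listener_audio(mics, sigs, np.array([0.0, 0.0, 0.0]))
# Listener standing at mic 0: output is dominated by mic 0's signal.
```

A genuine reconstruction would also model propagation delay, directivity, and source localization; the point here is only that a 6-DoF listener pose parameterizes the rendered audio, just as a camera pose parameterizes the rendered view.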
