Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, et al.
TLDR
Rays as Pixels is a Video Diffusion Model that jointly learns to generate videos and predict camera trajectories, improving robustness when image coverage is sparse or poses are ambiguous.
Key contributions
- Introduces "Rays as Pixels," a Video Diffusion Model that learns a joint distribution over videos and camera trajectories.
- Represents each camera as dense ray pixels ("raxels") and denoises them jointly with video frames via Decoupled Self-Cross Attention (see the raxel sketch after this list).
- A single trained model handles three tasks: camera trajectory prediction from video, joint video-and-trajectory generation from input images, and camera-controlled video generation.
- Demonstrates closed-loop self-consistency between forward and inverse predictions, with trajectory estimation requiring far fewer denoising steps than video generation.
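The digest does not spell out the raxel parameterization, but a plausible reading of "dense ray pixels" is a per-pixel ray map built from camera intrinsics and extrinsics, giving each camera an image-like tensor the model can denoise alongside RGB frames. The sketch below illustrates that reading; `camera_to_raxels` and the origin-plus-direction encoding are assumptions, not the paper's confirmed method.

```python
# Minimal sketch (assumption): encode a camera as a dense H x W x 6 "raxel"
# map of world-space rays (origin + unit direction per pixel). The paper's
# exact parameterization (e.g., Plucker coordinates) may differ.
import torch

def camera_to_raxels(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose."""
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Unproject pixel centers to camera-space ray directions.
    dirs_cam = torch.stack(
        [(xs + 0.5 - K[0, 2]) / K[0, 0],
         (ys + 0.5 - K[1, 2]) / K[1, 1],
         torch.ones_like(xs)],
        dim=-1,
    )  # (H, W, 3)
    # Rotate into world space and normalize to unit length.
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    # Every pixel's ray shares the camera center as its origin.
    origins = c2w[:3, 3].expand(H, W, 3)
    return torch.cat([origins, dirs_world], dim=-1)  # (H, W, 6)
```

Under this reading, the trajectory of a video is simply a sequence of such raxel maps, one per frame, in the same spatial layout as the frames themselves.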
Why it matters
This paper unifies camera parameter recovery and scene rendering, tasks traditionally treated separately. By learning a joint distribution over videos and camera trajectories, it remains robust when image coverage is sparse or poses are ambiguous, where each task needs what the other produces. A single model thereby supports joint generation, camera-controlled synthesis, and efficient trajectory prediction.
Original Abstract
Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation; even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.
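The abstract's closed-loop self-consistency test can be made concrete with a short sketch. The model interface below (`predict_trajectory`, `generate_video`) and the step counts and error metric are hypothetical, chosen only to illustrate the round trip the abstract describes: estimate a trajectory, regenerate the video along it, re-estimate, and check agreement.

```python
# Hedged sketch of the closed-loop self-consistency test, assuming a
# hypothetical `model` exposing the paper's forward and inverse directions.
# Method names, step counts, and the L1 metric are illustrative only.
import torch

@torch.no_grad()
def self_consistency_error(model, video: torch.Tensor, ref_frames: torch.Tensor) -> torch.Tensor:
    """Round trip: video -> trajectory -> video -> trajectory."""
    # Inverse task: denoise raxels only. Per the abstract, this needs far
    # fewer denoising steps than video generation.
    traj = model.predict_trajectory(video, num_steps=4)               # hypothetical
    # Forward task: regenerate the video along the predicted trajectory.
    video_hat = model.generate_video(ref_frames, traj, num_steps=50)  # hypothetical
    # Re-estimate the trajectory from the regenerated video and compare.
    traj_hat = model.predict_trajectory(video_hat, num_steps=4)       # hypothetical
    return (traj - traj_hat).abs().mean()  # forward/inverse agreement
```

A small value of this error is what the paper means by its forward and inverse predictions agreeing; the test requires no ground-truth poses, since the model checks itself against its own outputs.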