UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
Houyuan Chen, Hong Li, Xianghao Kong, Tianrui Zhu, Shaocong Xu + 6 more
TLDR
UniVidX is a unified multimodal framework that leverages video diffusion priors to handle diverse pixel-aligned video generation tasks in a single model, achieving competitive performance even with limited training data.
Key contributions
- UniVidX unifies multimodal video generation, formulating pixel-aligned tasks as conditional generation in a shared space.
- Introduces Stochastic Condition Masking (SCM), which randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed input-output mappings (see the first sketch after this list).
- Uses Decoupled Gated LoRA (DGL): per-modality LoRA branches activated only when a modality serves as the generation target, adapting to modality-specific distributions while preserving VDM priors (second sketch below).
- Employs Cross-Modal Self-Attention (CMSA), which shares keys and values across modalities while keeping modality-specific queries, for information exchange and inter-modal alignment (third sketch below).
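A minimal PyTorch-style sketch of what Stochastic Condition Masking could look like, under some assumptions: modality latents are stacked along a leading axis as an (M, B, C, T, H, W) tensor, and the 0.5 masking probability plus the function names (`stochastic_condition_mask`, `apply_scm`) are illustrative, not taken from the paper.

```python
import torch

def stochastic_condition_mask(num_modalities: int) -> torch.Tensor:
    """Randomly split modalities into clean conditions (False) and noisy
    targets (True), re-sampling until at least one target exists."""
    while True:
        is_target = torch.rand(num_modalities) < 0.5     # illustrative 50/50 split
        if is_target.any():
            return is_target

def apply_scm(latents: torch.Tensor, noise: torch.Tensor,
              alpha_bar_t: torch.Tensor, is_target: torch.Tensor) -> torch.Tensor:
    """Noise only the target modalities; condition modalities stay clean.

    latents, noise: (M, B, C, T, H, W) per-modality video latents (assumed layout).
    alpha_bar_t:    (B,) cumulative noise schedule at the sampled timestep.
    """
    a = alpha_bar_t.view(1, -1, 1, 1, 1, 1).sqrt()
    s = (1.0 - alpha_bar_t).view(1, -1, 1, 1, 1, 1).sqrt()
    noisy = a * latents + s * noise                      # standard forward diffusion
    mask = is_target.view(-1, 1, 1, 1, 1, 1).to(latents.dtype)
    return mask * noisy + (1.0 - mask) * latents
```

In principle, fixing the mask at inference selects a task: condition on a clean RGB latent to generate intrinsic maps, or mark every modality as a target to sample all of them jointly.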
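Decoupled Gated LoRA can be sketched as a frozen base projection plus one low-rank branch per modality, where a branch is switched on only when its modality is a noisy generation target under SCM. The rank, layer placement, and class name below are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DecoupledGatedLoRA(nn.Module):
    """Frozen base linear layer plus one gated LoRA branch per modality.

    A modality's branch is active only when that modality is a noisy
    generation target, so clean conditions pass through the unmodified
    backbone weights and the pretrained prior is preserved.
    """
    def __init__(self, base: nn.Linear, num_modalities: int, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # keep the VDM backbone frozen
        self.down = nn.ModuleList(
            nn.Linear(base.in_features, rank, bias=False)
            for _ in range(num_modalities))
        self.up = nn.ModuleList(
            nn.Linear(rank, base.out_features, bias=False)
            for _ in range(num_modalities))
        for up in self.up:
            nn.init.zeros_(up.weight)                    # each branch starts as identity

    def forward(self, x: torch.Tensor, is_target: torch.Tensor) -> torch.Tensor:
        """x: (M, B, N, D) tokens per modality; is_target: (M,) boolean gates."""
        out = self.base(x)
        deltas = []
        for m in range(x.shape[0]):
            if is_target[m]:                             # gate: target modalities only
                deltas.append(self.up[m](self.down[m](x[m])))
            else:
                deltas.append(torch.zeros_like(out[m]))
        return out + torch.stack(deltas)
```

Because condition modalities bypass every LoRA branch and the zero-initialized up-projections start as an identity mapping, the backbone's native priors stay untouched for any modality serving purely as a condition.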
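Cross-Modal Self-Attention can be read as: each modality projects its own queries, but keys and values are computed over the concatenation of all modalities' tokens. The sketch below assumes the same (M, B, N, D) token layout as above and per-modality query projections; how the paper actually shares the projections may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalSelfAttention(nn.Module):
    """Attention with per-modality query projections but keys/values pooled
    from all modalities' tokens, so each modality can read from the others
    (information exchange and inter-modal alignment)."""
    def __init__(self, dim: int, num_modalities: int, num_heads: int = 8):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.q = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (M, B, N, D) — M modalities, B batch, N tokens, D channels."""
        M, B, N, D = x.shape
        # Shared keys/values over the concatenation of all modalities' tokens.
        kv = x.permute(1, 0, 2, 3).reshape(B, M * N, D)
        k = self.k(kv).view(B, M * N, self.h, self.d).transpose(1, 2)  # (B,h,MN,d)
        v = self.v(kv).view(B, M * N, self.h, self.d).transpose(1, 2)
        outs = []
        for m in range(M):
            # Modality-specific queries attend over every modality's tokens.
            q = self.q[m](x[m]).view(B, N, self.h, self.d).transpose(1, 2)
            outs.append(F.scaled_dot_product_attention(q, k, v))       # (B,h,N,d)
        out = torch.stack(outs).transpose(2, 3).reshape(M, B, N, D)
        return self.proj(out)
```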
Why it matters
Existing methods train a separate model for each video generation task, which fixes the input-output mapping and limits the modeling of correlations across modalities. UniVidX offers a unified alternative, improving efficiency and cross-modal consistency, and achieves performance competitive with state-of-the-art methods even when trained on fewer than 1,000 videos.
Original Abstract
Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/