OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu + 7 more
TLDR
OmniNFT is a novel online diffusion RL framework that improves joint audio-video generation by addressing multi-modal challenges such as inconsistent advantages and gradient imbalance.
Key contributions
- Modality-wise advantage routing to direct independent per-reward advantages to their respective modality generation branches.
- Layer-wise gradient surgery to selectively detach video-branch gradients from shallow audio layers.
- Region-wise loss reweighting to modulate policy optimization toward critical synchronization regions.
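To make the first contribution concrete, here is a minimal sketch of modality-wise advantage routing. It assumes a GRPO-style group baseline (the paper does not spell out the estimator here); the function name, reward values, and the `1e-8` stabilizer are illustrative, not from the paper. The key point is that each modality branch receives an advantage normalized only over its own reward, rather than one global advantage shared across modalities.

```python
from statistics import mean, pstdev

def modality_wise_advantages(rewards):
    """Hypothetical sketch: one group-normalized advantage per modality.

    `rewards` maps a modality name (e.g. "audio", "video") to the list of
    per-sample rewards for one sampling group. Each modality is normalized
    against its own group statistics, so an inconsistent ranking in one
    modality cannot dilute the learning signal of the other.
    """
    advantages = {}
    for modality, r in rewards.items():
        mu, sigma = mean(r), pstdev(r)
        # Group-relative, variance-normalized advantage for this modality only
        advantages[modality] = [(x - mu) / (sigma + 1e-8) for x in r]
    return advantages

group = {
    "audio": [0.2, 0.8, 0.5, 0.9],  # e.g. audio-fidelity reward per sample
    "video": [0.7, 0.3, 0.6, 0.4],  # e.g. video-fidelity reward per sample
}
adv = modality_wise_advantages(group)
# adv["audio"] is routed only to the audio branch, adv["video"] only to
# the video branch, instead of collapsing both into one global advantage.
```

Under this routing, a sample that scores well on audio but poorly on video still provides a clean positive signal to the audio branch, which is exactly the multi-objective inconsistency the paper targets.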
Why it matters
Existing RL approaches to joint audio-video generation suffer from multi-objective inconsistency and gradient imbalance, limiting fidelity and synchronization. OmniNFT addresses these issues with modality-aware RL, significantly improving audio-video quality and alignment and advancing high-fidelity multi-modal content creation.
Original Abstract
Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantage inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions receive insufficient exploration. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
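The third innovation, region-wise loss reweighting, can be sketched in a few lines. This is an assumed implementation, not the paper's: the function name, the binary `sync_mask`, and the `boost` hyperparameter are all illustrative. The idea it demonstrates is upweighting the per-region policy loss wherever the latent region is flagged as synchronization-critical, while normalizing so the overall loss scale is preserved.

```python
def region_reweighted_loss(per_region_loss, sync_mask, boost=2.0):
    """Hypothetical sketch of region-wise loss reweighting.

    Regions flagged in `sync_mask` (e.g. frames where a sound onset must
    align with a visual event) get weight `boost`; all others get weight 1.
    Dividing by the weight sum keeps the loss magnitude comparable, so the
    reweighting shifts optimization emphasis rather than the learning rate.
    """
    weights = [boost if m else 1.0 for m in sync_mask]
    weighted = sum(l * w for l, w in zip(per_region_loss, weights))
    return weighted / sum(weights)

loss = region_reweighted_loss(
    per_region_loss=[0.5, 0.2, 0.8, 0.1],
    sync_mask=[1, 0, 1, 0],  # 1 = synchronization-critical region
)
```

With `boost > 1`, gradient pressure concentrates on the alignment-critical regions that uniform credit assignment would otherwise under-explore.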