OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu + 7 more
TLDR
OmniNFT is a novel online diffusion RL framework that improves joint audio-video generation by addressing multi-modal challenges such as inconsistent advantages and gradient imbalance.
Key contributions
- Modality-wise advantage routing to direct independent per-reward advantages to their respective modality generation branches.
- Layer-wise gradient surgery to selectively detach video-branch gradients from shallow audio layers.
- Region-wise loss reweighting to modulate policy optimization toward critical synchronization regions.
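To make the first contribution concrete, here is a minimal sketch of modality-wise advantage routing. It assumes a GRPO-style group baseline (the paper does not spell out the estimator here); the function name, reward values, and the `1e-8` stabilizer are illustrative, not from the paper. The key point is that each modality branch receives an advantage normalized only over its own reward, rather than one global advantage shared across modalities.

```python
from statistics import mean, pstdev

def modality_wise_advantages(rewards):
    """Hypothetical sketch: one group-normalized advantage per modality.

    `rewards` maps a modality name (e.g. "audio", "video") to the list of
    per-sample rewards for one sampling group. Each modality is normalized
    against its own group statistics, so an inconsistent ranking in one
    modality cannot dilute the learning signal of the other.
    """
    advantages = {}
    for modality, r in rewards.items():
        mu, sigma = mean(r), pstdev(r)
        # Group-relative, variance-normalized advantage for this modality only
        advantages[modality] = [(x - mu) / (sigma + 1e-8) for x in r]
    return advantages

group = {
    "audio": [0.2, 0.8, 0.5, 0.9],  # e.g. audio-fidelity reward per sample
    "video": [0.7, 0.3, 0.6, 0.4],  # e.g. video-fidelity reward per sample
}
adv = modality_wise_advantages(group)
# adv["audio"] is routed only to the audio branch, adv["video"] only to
# the video branch, instead of collapsing both into one global advantage.
```

Under this routing, a sample that scores well on audio but poorly on video still provides a clean positive signal to the audio branch, which is exactly the multi-objective inconsistency the paper targets.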
Why it matters
Existing RL approaches to joint audio-video generation suffer from multi-objective inconsistency and gradient imbalance, limiting fidelity and synchronization. OmniNFT addresses these issues with modality-aware RL, significantly improving audio-video quality and alignment and advancing high-fidelity multi-modal content creation.
Original Abstract
Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantage inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions receive insufficient exploration. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
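The third innovation, region-wise loss reweighting, can be sketched in a few lines. This is an assumed implementation, not the paper's: the function name, the binary `sync_mask`, and the `boost` hyperparameter are all illustrative. The idea it demonstrates is upweighting the per-region policy loss wherever the latent region is flagged as synchronization-critical, while normalizing so the overall loss scale is preserved.

```python
def region_reweighted_loss(per_region_loss, sync_mask, boost=2.0):
    """Hypothetical sketch of region-wise loss reweighting.

    Regions flagged in `sync_mask` (e.g. frames where a sound onset must
    align with a visual event) get weight `boost`; all others get weight 1.
    Dividing by the weight sum keeps the loss magnitude comparable, so the
    reweighting shifts optimization emphasis rather than the learning rate.
    """
    weights = [boost if m else 1.0 for m in sync_mask]
    weighted = sum(l * w for l, w in zip(per_region_loss, weights))
    return weighted / sum(weights)

loss = region_reweighted_loss(
    per_region_loss=[0.5, 0.2, 0.8, 0.1],
    sync_mask=[1, 0, 1, 0],  # 1 = synchronization-critical region
)
```

With `boost > 1`, gradient pressure concentrates on the alignment-critical regions that uniform credit assignment would otherwise under-explore.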