MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation
Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao, et al.
TLDR
MMControl enables fine-grained multi-modal control for synchronized joint audio-video generation using a dual-stream diffusion transformer.
Key contributions
- Introduces dual-stream conditional injection for visual and acoustic controls in joint generation.
- Supports diverse conditions: images, audio, depth maps, and pose sequences.
- Allows dynamic, independent scaling of visual and audio guidance at inference.
- Achieves identity-consistent video and timbre-consistent audio with structural constraints.
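The "dynamic, independent scaling" contribution can be read as a per-modality extension of classifier-free guidance. The digest gives no formula, so the sketch below is a hypothetical interpretation: each modality's conditional noise prediction gets its own user-set scale, combined against a shared unconditional prediction.

```python
def modality_guided_noise(eps_uncond, eps_vis, eps_aud, s_vis, s_aud):
    """Hypothetical modality-specific guidance combination.

    eps_uncond: noise prediction with all conditions dropped
    eps_vis:    prediction with only visual conditions active
    eps_aud:    prediction with only acoustic conditions active
    s_vis/s_aud: user-chosen guidance scales, tunable per sampling step
    """
    # Each modality contributes its own guidance direction, scaled
    # independently; s_* = 0 disables that modality's influence.
    return (eps_uncond
            + s_vis * (eps_vis - eps_uncond)
            + s_aud * (eps_aud - eps_uncond))
```

Setting `s_vis` high while lowering `s_aud` would, under this reading, tighten identity/layout adherence without over-constraining timbre, and vice versa.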
Why it matters
MMControl advances joint audio-video generation by enabling comprehensive, multi-modal control, improving cross-modal alignment and user-driven customization. This enhances applications in media creation and interactive content.
Original Abstract
Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
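The abstract describes conditions being "injected through bypass branches" into the two streams of a joint DiT. A minimal sketch of that idea, assuming ControlNet-style residual injection (gated addition of encoded condition features into each stream's hidden states; function and variable names here are illustrative, not from the paper):

```python
def inject(hidden, cond_feat, gate):
    """Add condition features into a stream's hidden states,
    scaled by a gate (learned or user-controlled)."""
    return [h + gate * c for h, c in zip(hidden, cond_feat)]

def dual_stream_inject(video_h, audio_h, vis_conds, aud_conds):
    """Route each (features, gate) pair to its modality's stream.

    Visual conditions (e.g. reference image, depth, pose) modulate the
    video stream; acoustic conditions (e.g. reference audio) modulate
    the audio stream. Cross-modal fusion inside the joint DiT blocks
    is not sketched here.
    """
    for feat, gate in vis_conds:
        video_h = inject(video_h, feat, gate)
    for feat, gate in aud_conds:
        audio_h = inject(audio_h, feat, gate)
    return video_h, audio_h
```

Keeping each condition in its own bypass branch is what makes the controls composable: any subset can be attached, and each gate can be adjusted independently at inference.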