MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation
Xinyan Ye, Jiankang Deng, Abbas Edalat
TLDR
MoCoTalk is a multi-conditional diffusion framework that unifies four control signals for state-of-the-art, controllable talking head generation.
Key contributions
- Unifies four complementary control signals (reference image, facial keypoints, 3DMM-rendered shading meshes, speech audio) for comprehensive talking head generation.
- Introduces an Adaptive Multi-Condition Router for channel-wise, timestep-aware fusion of heterogeneous control signals (see the sketch after this list).
- Proposes a Mouth-Augmented Shading Mesh to decouple and flexibly recombine head motion, mouth motion, expression, and lighting.
- Achieves state-of-the-art talking head generation with fine-grained attribute-level controllability.
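To make the router idea concrete, here is a minimal sketch of channel-wise, timestep-aware gating over four condition streams, as described in the abstract. It assumes PyTorch; the softmax gate driven by the timestep embedding, and all names such as `AdaptiveConditionRouter` and `gate_mlp`, are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class AdaptiveConditionRouter(nn.Module):
    """Channel-wise, timestep-aware gating over several condition streams (sketch)."""

    def __init__(self, channels: int, num_conditions: int = 4, t_dim: int = 256):
        super().__init__()
        self.num_conditions = num_conditions
        self.channels = channels
        # Map the diffusion timestep embedding to one gate per
        # (condition, channel) pair.
        self.gate_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(t_dim, num_conditions * channels),
        )

    def forward(self, cond_feats: list[torch.Tensor], t_emb: torch.Tensor) -> torch.Tensor:
        # cond_feats: one [B, C, H, W] feature map per condition stream.
        B, C, H, W = cond_feats[0].shape
        gates = self.gate_mlp(t_emb).view(B, self.num_conditions, C)
        gates = gates.softmax(dim=1)               # normalize over conditions, per channel
        stacked = torch.stack(cond_feats, dim=1)   # [B, K, C, H, W]
        return (gates[..., None, None] * stacked).sum(dim=1)


# Example: fuse four condition feature maps at one denoising step.
router = AdaptiveConditionRouter(channels=64, num_conditions=4, t_dim=256)
feats = [torch.randn(2, 64, 32, 32) for _ in range(4)]
t_emb = torch.randn(2, 256)
fused = router(feats, t_emb)                       # [2, 64, 32, 32]
```

Because the gates depend on the timestep embedding, the mixing of conditions can shift with the noise level, which is the behavior the paper attributes to its router; the exact gating form used by the authors may differ.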
Why it matters
This paper tackles the challenge of generating realistic talking heads by jointly modeling identity, head pose, facial expression, and mouth dynamics under four control signals. Its timestep-aware adaptive fusion and decoupled 3DMM representation deliver state-of-the-art structural, motion, and perceptual quality while keeping individual attributes controllable, enabling more directable and lifelike virtual characters for a range of applications.
Original Abstract
Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.
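As a rough illustration of how the decoupled Mouth-Augmented Shading Mesh enables attribute-level control at inference, the following is a hypothetical sketch of recombining coefficient groups. The grouping into head pose, mouth, expression, and lighting follows the abstract; the data structure and the `render_shading_mesh` helper are assumptions, not the authors' actual API.

```python
from dataclasses import dataclass, replace

import torch


@dataclass
class ShadingMeshParams:
    head_pose: torch.Tensor    # per-frame rigid head motion (e.g. rotation + translation)
    mouth: torch.Tensor        # mouth-region deformation coefficients
    expression: torch.Tensor   # non-mouth expression coefficients
    lighting: torch.Tensor     # lighting coefficients (e.g. spherical harmonics)


def recombine(source: ShadingMeshParams, driver: ShadingMeshParams) -> ShadingMeshParams:
    """Keep the source's expression and lighting, but borrow head pose and
    mouth motion from a driving sequence - one example of the attribute-level
    recombination the abstract describes."""
    return replace(source, head_pose=driver.head_pose, mouth=driver.mouth)


# The recombined coefficients would then be rendered into a temporally
# consistent shading-mesh video and supplied to the diffusion model as one
# of the four conditions (render_shading_mesh is a hypothetical helper):
# mesh_frames = render_shading_mesh(recombine(source_params, driver_params))
```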