Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation
Deli Cai, Haoyang Ma, Changxing Ding
TLDR
CMC is a decoupled framework that generates human motions from both text and spatial trajectories, resolving conflicts between the two conditions and improving control accuracy.
Key contributions
- Introduces CMC, a decoupled framework for text and trajectory-controlled human motion generation.
- Uses a two-stage "divide-and-conquer" approach for stable and accurate trajectory following.
- Employs a Selective Inpainting Mechanism (SIM) that alternates between text-to-motion generation and motion-inpainting tasks during training to prevent overfitting in motion completion (see the sketch after this list).
- Achieves state-of-the-art performance in control accuracy and motion quality on HumanML3D and KIT.
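A minimal sketch of the SIM idea follows. It is not the authors' code: the 0.5 alternation probability, the assumption that the first few feature channels form the observed part, and the helper name `build_sim_batch` are illustrative choices only.

```python
# Hypothetical sketch of the Selective Inpainting Mechanism (SIM): during
# training, randomly alternate between plain text-to-motion batches and
# motion-inpainting batches so the inpainting model does not overfit to the
# limited partial-observation data.
import torch

def build_sim_batch(motion, p_inpaint=0.5, observed_channels=4):
    """Return (observation mask, partial observation) for one training step.

    motion: (B, T, D) ground-truth motion features.
    With probability p_inpaint, expose the first `observed_channels` features
    as the partial observation (inpainting task); otherwise expose nothing,
    which reduces the step to ordinary text-to-motion training.
    """
    B, T, D = motion.shape
    mask = torch.zeros(B, T, D, dtype=torch.bool)
    if torch.rand(()) < p_inpaint:
        mask[..., :observed_channels] = True          # inpainting step
    observation = torch.where(mask, motion, torch.zeros_like(motion))
    return mask, observation

# Usage: inside the training loop, feed (observation, mask) to the
# text-conditioned denoiser as extra inputs.
motion = torch.randn(8, 60, 263)
mask, obs = build_sim_batch(motion)
print(mask.any().item(), obs.shape)
```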
Why it matters
This paper addresses critical issues in human motion generation by resolving conflicts between text and trajectory conditions and improving representation stability. The proposed CMC framework significantly enhances motion quality and control accuracy, advancing realistic human animation and virtual character control.
Original Abstract
Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that coordinates text and trajectory conditions through a divide-and-conquer strategy comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints from the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on the HumanML3D and KIT datasets show that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.
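To make the two-stage cascade concrete, here is a hedged sketch of the inference flow under stated assumptions: the toy denoisers, the crude sampling schedule, and the feature dimensions (a 3-D trajectory, a 4-D simplified joint state, 263-D HumanML3D-style features, a 512-D text embedding) are illustrative stand-ins, not the paper's actual models.

```python
# Hypothetical sketch of CMC's two cascaded stages: Trajectory Control, then
# text-conditioned Motion Completion via diffusion inpainting.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in denoiser: predicts the clean signal x0 from a noisy input."""
    def __init__(self, feat_dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, 128), nn.SiLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, x_t, cond, t):
        # Broadcast the per-sample timestep along the time axis.
        t_feat = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def sample(denoiser, cond, shape, steps=50):
    """Heavily simplified denoising loop (illustrative, not a real DDPM)."""
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        t_b = torch.full((shape[0],), t)
        x0_pred = denoiser(x, cond, t_b)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        blend = 1.0 - t / steps                  # crude schedule for the sketch
        x = blend * x0_pred + (1.0 - blend) * noise
    return x

B, T = 1, 60
traj = torch.randn(B, T, 3)          # given spatial trajectory (assumed 3-D)
text_emb = torch.randn(B, T, 512)    # text feature broadcast over time (assumed)

# Stage 1: Trajectory Control -- denoise a simplified representation of the
# controlled joints conditioned only on the trajectory.
stage1 = ToyDenoiser(feat_dim=4, cond_dim=3)
simple_repr = sample(stage1, traj, (B, T, 4))

# Stage 2: Motion Completion -- a text-conditioned inpainting model generates
# the full body, re-imposing the stage-1 output on the observed channels at
# every denoising step.
stage2 = ToyDenoiser(feat_dim=263, cond_dim=512)
mask = torch.zeros(B, T, 263, dtype=torch.bool)
mask[..., :4] = True                 # channels fixed by stage 1 (assumed layout)
known = torch.zeros(B, T, 263)
known[..., :4] = simple_repr

x = torch.randn(B, T, 263)
steps = 50
for t in reversed(range(steps)):
    t_b = torch.full((B,), t)
    x0_pred = stage2(x, text_emb, t_b)
    x0_pred = torch.where(mask, known, x0_pred)   # keep the partial observation
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    blend = 1.0 - t / steps
    x = blend * x0_pred + (1.0 - blend) * noise

print(x.shape)  # (1, 60, 263) full-body motion features
```

Re-imposing the stage-1 output on the known channels at each denoising step is the standard diffusion-inpainting pattern; the paper's actual conditioning and representation details may differ.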