Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation
Deli Cai, Haoyang Ma, Changxing Ding
TLDR
CMC is a decoupled framework that generates human motions from both text and spatial trajectories, resolving conflicts between the two conditions and improving control accuracy.
Key contributions
- Introduces CMC, a decoupled framework for text and trajectory-controlled human motion generation.
- Uses a two-stage "divide-and-conquer" approach for stable and accurate trajectory following.
- Employs a Selective Inpainting Mechanism (SIM) that alternates between text-to-motion generation and motion-inpainting tasks during training to prevent overfitting in motion completion (see the sketch after this list).
- Achieves state-of-the-art performance in control accuracy and motion quality on HumanML3D and KIT.
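A minimal sketch of the SIM idea follows. It is not the authors' code: the 0.5 alternation probability, the assumption that the first few feature channels form the observed part, and the helper name `build_sim_batch` are illustrative choices only.

```python
# Hypothetical sketch of the Selective Inpainting Mechanism (SIM): during
# training, randomly alternate between plain text-to-motion batches and
# motion-inpainting batches so the inpainting model does not overfit to the
# limited partial-observation data.
import torch

def build_sim_batch(motion, p_inpaint=0.5, observed_channels=4):
    """Return (observation mask, partial observation) for one training step.

    motion: (B, T, D) ground-truth motion features.
    With probability p_inpaint, expose the first `observed_channels` features
    as the partial observation (inpainting task); otherwise expose nothing,
    which reduces the step to ordinary text-to-motion training.
    """
    B, T, D = motion.shape
    mask = torch.zeros(B, T, D, dtype=torch.bool)
    if torch.rand(()) < p_inpaint:
        mask[..., :observed_channels] = True          # inpainting step
    observation = torch.where(mask, motion, torch.zeros_like(motion))
    return mask, observation

# Usage: inside the training loop, feed (observation, mask) to the
# text-conditioned denoiser as extra inputs.
motion = torch.randn(8, 60, 263)
mask, obs = build_sim_batch(motion)
print(mask.any().item(), obs.shape)
```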
Why it matters
This paper addresses critical issues in human motion generation by resolving conflicts between text and trajectory conditions and improving representation stability. The proposed CMC framework significantly enhances motion quality and control accuracy, advancing realistic human animation and virtual character control.
Original Abstract
Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that coordinates text and trajectory conditions through a divide-and-conquer strategy comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints from the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on the HumanML3D and KIT datasets show that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.
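To make the two-stage cascade concrete, here is a hedged sketch of the inference flow under stated assumptions: the toy denoisers, the crude sampling schedule, and the feature dimensions (a 3-D trajectory, a 4-D simplified joint state, 263-D HumanML3D-style features, a 512-D text embedding) are illustrative stand-ins, not the paper's actual models.

```python
# Hypothetical sketch of CMC's two cascaded stages: Trajectory Control, then
# text-conditioned Motion Completion via diffusion inpainting.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in denoiser: predicts the clean signal x0 from a noisy input."""
    def __init__(self, feat_dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, 128), nn.SiLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, x_t, cond, t):
        # Broadcast the per-sample timestep along the time axis.
        t_feat = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def sample(denoiser, cond, shape, steps=50):
    """Heavily simplified denoising loop (illustrative, not a real DDPM)."""
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        t_b = torch.full((shape[0],), t)
        x0_pred = denoiser(x, cond, t_b)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        blend = 1.0 - t / steps                  # crude schedule for the sketch
        x = blend * x0_pred + (1.0 - blend) * noise
    return x

B, T = 1, 60
traj = torch.randn(B, T, 3)          # given spatial trajectory (assumed 3-D)
text_emb = torch.randn(B, T, 512)    # text feature broadcast over time (assumed)

# Stage 1: Trajectory Control -- denoise a simplified representation of the
# controlled joints conditioned only on the trajectory.
stage1 = ToyDenoiser(feat_dim=4, cond_dim=3)
simple_repr = sample(stage1, traj, (B, T, 4))

# Stage 2: Motion Completion -- a text-conditioned inpainting model generates
# the full body, re-imposing the stage-1 output on the observed channels at
# every denoising step.
stage2 = ToyDenoiser(feat_dim=263, cond_dim=512)
mask = torch.zeros(B, T, 263, dtype=torch.bool)
mask[..., :4] = True                 # channels fixed by stage 1 (assumed layout)
known = torch.zeros(B, T, 263)
known[..., :4] = simple_repr

x = torch.randn(B, T, 263)
steps = 50
for t in reversed(range(steps)):
    t_b = torch.full((B,), t)
    x0_pred = stage2(x, text_emb, t_b)
    x0_pred = torch.where(mask, known, x0_pred)   # keep the partial observation
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    blend = 1.0 - t / steps
    x = blend * x0_pred + (1.0 - blend) * noise

print(x.shape)  # (1, 60, 263) full-body motion features
```

Re-imposing the stage-1 output on the known channels at each denoising step is the standard diffusion-inpainting pattern; the paper's actual conditioning and representation details may differ.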