CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin + 1 more
TLDR
CoInteract synthesizes physically consistent human-object interaction videos, improving structural stability and contact realism with a Diffusion Transformer (DiT) augmented by region-specialized experts.
Key contributions
- Introduces a Human-Aware Mixture-of-Experts (MoE) for improved fine-grained structural fidelity in sensitive regions such as hands and faces.
- Proposes Spatially-Structured Co-Generation, a dual-stream training for injecting interaction geometry priors.
- Achieves physically plausible contact and structural stability, avoiding common issues like interpenetration.
- Significantly outperforms existing methods in interaction realism, logical consistency, and structural stability.
Why it matters
Human-object interaction video synthesis is crucial for e-commerce and digital marketing. Existing diffusion models often fail on physical consistency and structural stability. CoInteract overcomes these limitations, enabling highly realistic and stable HOI videos for practical applications.
Original Abstract
Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
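The routing idea in the Human-Aware MoE can be sketched in a few lines: each token carries a spatial region label (produced, in the paper, by spatially supervised routing) and is dispatched to a lightweight expert for that region. The sketch below is illustrative only, not the authors' implementation; the region ids, expert count, and linear experts are all assumptions standing in for the learned router and expert networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical region labels, e.g. 0 = background, 1 = hands, 2 = face.
# In CoInteract these would come from a spatially supervised router.
D = 8  # token feature dimension (illustrative)
expert_weights = {rid: rng.standard_normal((D, D)) * 0.1 for rid in (0, 1, 2)}

def route_tokens(tokens: np.ndarray, region_ids: np.ndarray) -> np.ndarray:
    """Send each token through the expert assigned to its spatial region.

    tokens:     (N, D) array of token features
    region_ids: (N,)   integer region label per token
    """
    out = np.empty_like(tokens)
    for rid, W in expert_weights.items():
        mask = region_ids == rid
        if mask.any():
            # Each expert here is a single linear projection; the paper's
            # experts are lightweight sub-networks with low parameter overhead.
            out[mask] = tokens[mask] @ W
    return out

tokens = rng.standard_normal((16, D))
region_ids = rng.integers(0, 3, size=16)
mixed = route_tokens(tokens, region_ids)
print(mixed.shape)  # (16, 8)
```

Because routing is hard (one expert per token by region) rather than soft, only the expert for each region touches its tokens, which matches the paper's goal of specializing capacity on structure-sensitive areas at minimal cost.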