Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
Zhiyuan Li, Wenyan Yang, Wenshuai Zhao, Yue Ma, Yuanpeng Tu, et al.
TLDR
This paper introduces a generative framework for disentangled cross-embodiment video editing that turns a single human demonstration video into a coherent robot execution video, letting robots learn manipulation from human demonstrations.
Key contributions
- Addresses the human-robot embodiment gap that arises when robots learn manipulation from human videos.
- Proposes a generative video-editing framework that learns explicitly disentangled task and embodiment representations.
- Uses a dual contrastive objective to create orthogonal latent spaces for task and embodiment: it minimizes mutual information between the spaces while maximizing intra-space consistency (see the sketch after this list).
- Synthesizes temporally consistent robot execution videos from a single human demonstration, without paired cross-embodiment data.
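The dual contrastive objective can be pictured as two InfoNCE consistency terms plus an independence penalty between the two spaces. Below is a minimal PyTorch sketch, not the paper's implementation: the inputs (two augmented views per space), the `mi_weight` hyperparameter, and the batch cross-correlation penalty standing in for mutual-information minimization are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """Intra-space consistency: two views of the same clip should agree."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

def cross_space_penalty(z_task, z_embod):
    """Independence proxy: push the batch cross-correlation between the
    task and embodiment spaces toward zero (a stand-in for minimizing
    mutual information; the paper's estimator is not given in this digest)."""
    z_t = (z_task - z_task.mean(0)) / (z_task.std(0) + 1e-6)
    z_e = (z_embod - z_embod.mean(0)) / (z_embod.std(0) + 1e-6)
    corr = z_t.T @ z_e / z_t.size(0)             # (D_task, D_embod)
    return (corr ** 2).mean()

def dual_contrastive_loss(task_a, task_b, embod_a, embod_b, mi_weight=1.0):
    """Dual objective: maximize intra-space consistency while minimizing
    cross-space dependence. `mi_weight` is a hypothetical hyperparameter."""
    consistency = info_nce(task_a, task_b) + info_nce(embod_a, embod_b)
    independence = cross_space_penalty(task_a, embod_a)
    return consistency + mi_weight * independence
```

In practice the two views fed to each consistency term would come from augmented copies of the same demonstration clip, so each space learns a stable code while the penalty keeps the two codes decorrelated.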
Why it matters
This paper offers a scalable solution to the data bottleneck in robotics by enabling robots to learn from abundant human videos. By bridging the embodiment gap, it makes internet-scale human video a practical training source for robot autonomy.
Original Abstract
Learning robotic manipulation from human videos is a promising solution to the data bottleneck in robotics, but the distribution shift between humans and robots remains a critical challenge. Existing approaches often produce entangled representations, where task-relevant information is coupled with human-specific kinematics, limiting their adaptability. We propose a generative framework for cross-embodiment video editing that directly addresses this by learning explicitly disentangled task and embodiment representations. Our method factorizes a demonstration video into two orthogonal latent spaces by enforcing a dual contrastive objective: it minimizes mutual information between the spaces to ensure independence while maximizing intra-space consistency to create stable representations. A parameter-efficient adapter injects these latent codes into a frozen video diffusion model, enabling the synthesis of a coherent robot execution video from a single human demonstration, without requiring paired cross-embodiment data. Experiments show our approach generates temporally consistent and morphologically accurate robot demonstrations, offering a scalable solution to leverage internet-scale human video for robot learning.
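To make the conditioning mechanism concrete, here is a hedged sketch of a parameter-efficient adapter that injects the disentangled latent codes into one block of a frozen video diffusion backbone. The FiLM-style scale/shift modulation with a zero-initialized gate is an assumption about the design; the abstract only states that a parameter-efficient adapter injects the latent codes into the frozen model.

```python
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    """Trainable adapter for one block of a frozen video diffusion backbone.
    Maps the concatenated [task | embodiment] codes to a per-channel
    scale/shift; the zero-initialized gate leaves the frozen model's
    behavior unchanged at the start of training."""

    def __init__(self, latent_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),
        )
        self.gate = nn.Parameter(torch.zeros(1))  # start as the identity map

    def forward(self, h, z_task, z_embod):
        # h: (batch, time, tokens, channels) features from a frozen block.
        z = torch.cat([z_task, z_embod], dim=-1)
        scale, shift = self.proj(z).chunk(2, dim=-1)
        scale = scale[:, None, None, :]           # broadcast over time/space
        shift = shift[:, None, None, :]
        return h + self.gate * (h * scale + shift)

# Quick shape check; only the adapters would be trained, e.g.:
#   for p in backbone.parameters(): p.requires_grad_(False)
adapter = LatentAdapter(latent_dim=256, hidden_dim=512)
h = torch.randn(2, 8, 64, 512)                    # (batch, time, tokens, channels)
z_task, z_embod = torch.randn(2, 128), torch.randn(2, 128)
print(adapter(h, z_task, z_embod).shape)          # torch.Size([2, 8, 64, 512])
```

Starting from a zero-initialized gate means training begins from the frozen model's unmodified behavior, a common trick in adapter-based fine-tuning for keeping early updates stable.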