ArXiv TLDR

Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

arXiv: 2605.03637

Zhiyuan Li, Wenyan Yang, Wenshuai Zhao, Yue Ma, Yuanpeng Tu + 2 more

cs.RO

TLDR

This paper introduces a generative framework for disentangled cross-embodiment video editing that turns a single human demonstration into a coherent robot execution video, letting robots learn manipulation from human videos.

Key contributions

  • Addresses the human-robot embodiment gap in learning manipulation from videos.
  • Proposes a generative framework for disentangled task and embodiment representations.
  • Uses a dual contrastive objective to keep the task and embodiment latent spaces orthogonal (see the sketch after this list).
  • Synthesizes coherent robot execution videos from a single human demo, without paired cross-embodiment data.
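
To make the disentanglement concrete, here is a minimal PyTorch-style sketch of one way such a dual contrastive objective could be written: an InfoNCE-style consistency term within each latent space plus a cross-correlation penalty as a mutual-information proxy between the spaces. The encoders are omitted, and the specific loss forms (`info_nce`, `decorrelation_penalty`, the weight `lam`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a dual contrastive objective: intra-space InfoNCE
# consistency plus a cross-correlation penalty as a mutual-information proxy.
# All loss choices here are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def info_nce(anchor, positive, temperature=0.07):
    """Intra-space consistency: each anchor should match its positive
    (e.g., another clip of the same task) against the rest of the batch."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


def decorrelation_penalty(z_task, z_embod):
    """Crude mutual-information proxy: drive the cross-correlation between
    task and embodiment latents toward zero to keep the spaces independent."""
    zt = (z_task - z_task.mean(0)) / (z_task.std(0) + 1e-6)
    ze = (z_embod - z_embod.mean(0)) / (z_embod.std(0) + 1e-6)
    cross_corr = zt.t() @ ze / zt.size(0)             # (D_task, D_embod)
    return cross_corr.pow(2).mean()


def dual_contrastive_loss(z_task, z_task_pos, z_embod, z_embod_pos, lam=1.0):
    """Maximize consistency within each latent space; minimize dependence across them."""
    consistency = info_nce(z_task, z_task_pos) + info_nce(z_embod, z_embod_pos)
    independence = decorrelation_penalty(z_task, z_embod)
    return consistency + lam * independence
```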

Why it matters

This paper offers a scalable route around the data bottleneck in robotics by letting robots learn from abundant human videos. By bridging the embodiment gap, it brings internet-scale human video within reach as training data for robot learning.

Original Abstract

Learning robotic manipulation from human videos is a promising solution to the data bottleneck in robotics, but the distribution shift between humans and robots remains a critical challenge. Existing approaches often produce entangled representations, where task-relevant information is coupled with human-specific kinematics, limiting their adaptability. We propose a generative framework for cross-embodiment video editing that directly addresses this by learning explicitly disentangled task and embodiment representations. Our method factorizes a demonstration video into two orthogonal latent spaces by enforcing a dual contrastive objective: it minimizes mutual information between the spaces to ensure independence while maximizing intra-space consistency to create stable representations. A parameter-efficient adapter injects these latent codes into a frozen video diffusion model, enabling the synthesis of a coherent robot execution video from a single human demonstration, without requiring paired cross-embodiment data. Experiments show our approach generates temporally consistent and morphologically accurate robot demonstrations, offering a scalable solution to leverage internet-scale human video for robot learning.
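
For intuition on the parameter-efficient conditioning described in the abstract, below is a small illustrative sketch, assuming a PyTorch-style setup, of an adapter that maps the task and embodiment codes to a conditioning vector while the pretrained video diffusion backbone stays frozen. The module names, sizes, and additive injection scheme are placeholders, not the paper's architecture.

```python
# Illustrative sketch of a parameter-efficient adapter that injects the two
# latent codes into a frozen video diffusion backbone. Names and dimensions
# are placeholders, not the authors' architecture.
import torch
import torch.nn as nn


class LatentAdapter(nn.Module):
    """Small trainable MLP mapping [task; embodiment] codes to a conditioning
    vector that can be added to the backbone's hidden features."""

    def __init__(self, task_dim, embod_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(task_dim + embod_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, z_task, z_embod):
        return self.proj(torch.cat([z_task, z_embod], dim=-1))


def trainable_parameters(backbone: nn.Module, adapter: LatentAdapter):
    """Freeze the pretrained video diffusion backbone; only the adapter learns."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    return list(adapter.parameters())
```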
