ArXiv TLDR

Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

arXiv: 2604.24681

Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu, Yu Sun + 1 more

cs.RO

TLDR

MoT-HRA learns human-intention priors from HA-2.2M, a 2.2M-episode dataset reconstructed from human videos, to enable robust robotic manipulation through a hierarchical vision-language-action framework.

Key contributions

  • Introduces MoT-HRA, a hierarchical framework for learning human-intention priors from videos.
  • Curates HA-2.2M, a 2.2M-episode action-language dataset from heterogeneous human videos.
  • Factorizes manipulation into vision-language, intention (human motion prior), and fine robot action experts (see the sketch after this list).
  • Achieves improved motion plausibility and robust control under distribution shift in simulated and real-world robot tasks.
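
To make the factorization concrete, here is a minimal sketch of how such a three-expert hierarchy might be wired. All module names, dimensions, and heads (MoTHRASketch, traj_head, the 7-DoF-plus-gripper action space, etc.) are illustrative assumptions, not the paper's actual architecture:

```python
# Illustrative sketch of the three-expert factorization; names,
# dimensions, and interfaces are assumptions, not the paper's API.
import torch
import torch.nn as nn


class MoTHRASketch(nn.Module):
    def __init__(self, d_model: int = 512, horizon: int = 16):
        super().__init__()
        # Vision-language expert: predicts an embodiment-agnostic 3D
        # trajectory from fused vision-language tokens (fusion omitted).
        self.vl_expert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.traj_head = nn.Linear(d_model, 3)  # per-step 3D waypoint

        # Intention expert: models hand motion as a latent human-motion
        # prior (MANO-style pose parameters in the paper; generic here).
        self.intention_expert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )

        # Fine expert: maps the intention-aware representation to a chunk
        # of robot actions (7-DoF arm + gripper assumed for illustration).
        self.fine_expert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.action_head = nn.Linear(d_model, 8)
        self.horizon = horizon

    def forward(self, vl_tokens: torch.Tensor) -> dict:
        # vl_tokens: (batch, seq, d_model) fused vision-language features.
        h_vl = self.vl_expert(vl_tokens)
        trajectory = self.traj_head(h_vl[:, : self.horizon])  # (B, T, 3)

        # Refine the shared representation into a human-motion prior,
        # then map it to an action chunk for the robot embodiment.
        h_int = self.intention_expert(h_vl)
        h_fine = self.fine_expert(h_int)
        actions = self.action_head(h_fine[:, : self.horizon])  # (B, T, 8)
        return {"trajectory": trajectory, "actions": actions}


if __name__ == "__main__":
    model = MoTHRASketch()
    out = model(torch.randn(2, 32, 512))
    print(out["trajectory"].shape, out["actions"].shape)
```

The point of the hierarchy is that the trajectory and hand-motion stages are embodiment-agnostic (learnable from human video), while only the fine expert is specific to the robot's action space.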

Why it matters

This paper addresses the challenge of leveraging human videos for robot learning, where raw observations entangle scene understanding, human motion, and embodiment-specific action. By disentangling these factors into separate experts and training on a large-scale curated dataset, MoT-HRA offers a path to more human-like and robust robot control across diverse manipulation tasks.

Original Abstract

Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.
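
The abstract's "read-only key-value transfer" suggests that the downstream action expert attends over upstream trunk features without back-propagating into them. A minimal sketch of that idea, assuming a standard cross-attention layer and using detach() to model the read-only constraint; the paper's exact mechanism may differ, and ReadOnlyKVTransfer is a hypothetical name:

```python
# Hypothetical sketch of read-only key-value transfer: the downstream
# expert cross-attends to upstream features whose keys/values are
# detached, so gradients from the control loss cannot interfere with
# the upstream representation. Names and shapes are assumptions.
import torch
import torch.nn as nn


class ReadOnlyKVTransfer(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, downstream: torch.Tensor, upstream: torch.Tensor) -> torch.Tensor:
        # downstream: (B, T_d, d) queries from the fine (action) expert.
        # upstream:   (B, T_u, d) features from the shared trunk / human prior.
        kv = upstream.detach()  # "read-only": no gradient flows upstream
        attended, _ = self.cross_attn(query=downstream, key=kv, value=kv)
        return self.norm(downstream + attended)


if __name__ == "__main__":
    layer = ReadOnlyKVTransfer()
    up = torch.randn(2, 32, 512, requires_grad=True)
    down = torch.randn(2, 16, 512, requires_grad=True)
    layer(down, up).sum().backward()
    print(up.grad)  # None: the downstream loss never touches upstream
```

This would let downstream control exploit the human-motion prior while limiting interference with upstream representations, matching the motivation stated in the abstract.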
