ArXiv TLDR

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

2604.19734

Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge + 1 more

cs.RO · cs.AI

TLDR

UniT creates a unified physical language for human-to-humanoid transfer, enabling efficient policy learning and world modeling from human data.

Key contributions

  • The UniT framework uses a tri-branch cross-reconstruction mechanism to produce embodiment-agnostic physical intents.
  • Policy Learning (VLA-UniT) achieves state-of-the-art data efficiency and zero-shot transfer on humanoids.
  • World Modeling (WM-UniT) enables direct human-to-humanoid action transfer for enhanced video generation.
  • Induces a highly aligned cross-embodiment representation, verified by feature convergence into a shared manifold.

Why it matters

Humanoid foundation models are limited by scarce robotic data. UniT addresses this by bridging the gap between abundant human data and humanoid robots, offering a scalable path to distill human knowledge into general-purpose humanoid capabilities. This enables more efficient and robust robot learning.

Original Abstract

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
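
The tri-branch idea described above can be sketched schematically: an action branch that predicts visual outcomes, a vision branch that reconstructs actions, and a fusion branch whose output is quantized against a codebook to yield a discrete token. The minimal NumPy sketch below is purely illustrative — all dimensions, the random linear "encoders," and the function names (`tokenize`, `cross_reconstruct`) are assumptions, not the paper's actual architecture or training objective (which uses learned networks and reconstruction losses).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
D_ACT, D_VIS, D_LAT, CODEBOOK = 16, 32, 8, 64

# Random linear maps stand in for the learned encoder/decoder networks.
W_act = rng.normal(size=(D_ACT, D_LAT))       # action-branch encoder
W_vis = rng.normal(size=(D_VIS, D_LAT))       # vision-branch encoder
W_fuse = rng.normal(size=(2 * D_LAT, D_LAT))  # fusion branch
W_vis_dec = rng.normal(size=(D_LAT, D_VIS))   # actions -> predicted vision
W_act_dec = rng.normal(size=(D_LAT, D_ACT))   # vision -> reconstructed actions
codebook = rng.normal(size=(CODEBOOK, D_LAT)) # shared discrete latent space

def cross_reconstruct(action, vision):
    """Cross-reconstruction: actions predict visual consequences
    (anchoring kinematics to physical outcomes), while vision
    reconstructs actions (filtering visual confounders)."""
    vis_from_act = (action @ W_act) @ W_vis_dec
    act_from_vis = (vision @ W_vis) @ W_act_dec
    return vis_from_act, act_from_vis

def tokenize(action, vision):
    """Fusion branch: merge both modalities, then quantize to the
    nearest codebook entry -- one discrete 'physical intent' token."""
    z = np.concatenate([action @ W_act, vision @ W_vis]) @ W_fuse
    token = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return token, codebook[token]

action = rng.normal(size=D_ACT)  # e.g. joint deltas or hand-pose features
vision = rng.normal(size=D_VIS)  # e.g. pooled egocentric frame features
vis_hat, act_hat = cross_reconstruct(action, vision)
token, z_q = tokenize(action, vision)
print(token, z_q.shape, vis_hat.shape, act_hat.shape)
```

Because the token lives in a codebook shared across embodiments, a downstream policy (VLA-UniT) or world model (WM-UniT) can consume the same discrete vocabulary whether the source clip shows a human or a humanoid.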
