ArXiv TLDR

UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

arXiv: 2604.14089

Ziming Wang

cs.RO · cs.AI

TLDR

UMI-3D extends the Universal Manipulation Interface with a LiDAR sensor for robust 3D spatial perception, enabling reliable data collection in challenging real-world environments.

Key contributions

  • Integrates a lightweight, low-cost LiDAR for robust 3D spatial perception and accurate metric-scale pose estimation.
  • Develops a hardware-synchronized multimodal sensing pipeline with unified spatiotemporal calibration (see the sketches after this list and after the abstract).
  • Significantly improves collected data quality and policy performance for embodied manipulation tasks.
  • Enables learning of tasks that are challenging or infeasible for the vision-only UMI, such as large deformable object manipulation and articulated object operation.
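
The spatial half of that calibration amounts to expressing LiDAR returns in the wrist camera's frame. The sketch below is a generic illustration under stated assumptions, not the paper's implementation: `T_cam_lidar`, `K`, and all numeric values are placeholders standing in for the calibrated extrinsics and intrinsics a real UMI-3D setup would produce.

```python
# Illustrative sketch (not the paper's code): projecting LiDAR points into
# the wrist camera image once a spatial calibration T_cam_lidar is known.
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Map Nx3 LiDAR points into pixel coordinates.

    points_lidar : (N, 3) points in the LiDAR frame (meters).
    T_cam_lidar  : (4, 4) homogeneous extrinsic transform, LiDAR -> camera.
    K            : (3, 3) camera intrinsic matrix.
    Returns (M, 2) pixel coordinates for points in front of the camera.
    """
    # Lift to homogeneous coordinates and move into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points with positive depth (in front of the camera).
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Perspective projection through the intrinsics.
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

# Example with placeholder calibration values.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)  # identity stands in for a real extrinsic calibration
pts = np.random.rand(100, 3) * np.array([1.0, 1.0, 3.0])
print(project_lidar_to_image(pts, T, K).shape)
```

Once points land in pixel coordinates, visual observations and LiDAR geometry can be fused into the consistent 3D demonstration representations the contributions above describe.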

Why it matters

UMI-3D overcomes the occlusion and tracking-failure limitations of vision-only systems, making embodied manipulation data collection more robust and scalable. By integrating LiDAR, it operates reliably in complex real-world settings, which is crucial for advancing robot learning and opens the door to more challenging manipulation tasks.

Original Abstract

We present UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist-mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real-world environments. UMI-3D addresses these limitations by introducing a lightweight and low-cost LiDAR sensor tightly integrated into the wrist-mounted interface, enabling LiDAR-centric SLAM with accurate metric-scale pose estimation under challenging conditions. We further develop a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI-3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. Extensive real-world experiments demonstrate that UMI-3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision-only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end-to-end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open-sourced to facilitate large-scale data collection and accelerate research in embodied intelligence: https://umi-3d.github.io.
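
To make the "unified spatiotemporal calibration" described above concrete, here is a hedged sketch of its temporal side: resampling a LiDAR-SLAM trajectory at camera frame timestamps. Everything here, the function name, the linear-plus-SLERP interpolation scheme, and the clock `offset` parameter, is an illustrative assumption rather than the authors' implementation.

```python
# Hedged sketch of the temporal half of the alignment problem: resampling
# LiDAR-SLAM poses onto camera frame timestamps. Variable names and the
# fixed clock offset are assumptions for illustration, not UMI-3D values.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def resample_poses(pose_times, positions, rotations, cam_times, offset=0.0):
    """Interpolate SLAM poses at (offset-corrected) camera timestamps.

    pose_times : (N,) sorted pose timestamps in seconds.
    positions  : (N, 3) translations from the SLAM trajectory.
    rotations  : scipy Rotation object holding N orientations.
    cam_times  : (M,) camera timestamps to query.
    offset     : estimated clock offset between the two sensors.
    """
    t_query = np.clip(cam_times + offset, pose_times[0], pose_times[-1])
    # Linear interpolation for translation, spherical for rotation.
    pos = np.stack([np.interp(t_query, pose_times, positions[:, i])
                    for i in range(3)], axis=1)
    rot = Slerp(pose_times, rotations)(t_query)
    return pos, rot

# Toy usage with synthetic data.
t = np.linspace(0.0, 1.0, 11)
p = np.outer(t, np.array([1.0, 0.0, 0.0]))          # straight-line motion
r = Rotation.from_euler("z", 90 * t, degrees=True)  # slow yaw
pos, rot = resample_poses(t, p, r, np.array([0.25, 0.5]), offset=0.01)
print(pos, rot.as_euler("z", degrees=True))
```

SLERP keeps interpolated orientations on the rotation manifold, which naive componentwise interpolation of quaternions or Euler angles would not guarantee.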
