SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision

May 5, 2026 · arXiv:2605.03846

Shiyi Chen, Haiyi Liu, Mingye Yang, Jiaqi Zhang, Debing Zhang

cs.RO

TLDR

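SigLoMa is a fully onboard, ego-centric vision framework for quadrupedal pick-and-place. It addresses perception latency and the sim-to-real gap with a lightweight Sigma Points representation, an ego-centric Kalman Filter, and an Active Sampling Curriculum.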

Key contributions

Fully onboard, ego-centric vision-based system for quadrupedal loco-manipulation.
Introduces Sigma Points, a lightweight geometric representation for scalable exteroception and sim-to-real alignment.
Ego-centric Kalman Filter bridges slow perception and fast control for robust, high-rate state estimation.
Active Sampling Curriculum and temporal encoding improve sample efficiency and handle visual blind spots.
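The idea of bridging slow perception and fast control can be illustrated with a small predict/update loop. The following is a minimal sketch, not the authors' filter: it assumes a static target tracked as a 2D position in the robot's body frame, an isotropic covariance, and odometry increments available at the control rate. The class name `EgoCentricTargetFilter`, the 50 Hz control rate, and the noise values are all hypothetical. It also ignores the 200 ms detector latency, which a real system would need to compensate for.

```python
import math


class EgoCentricTargetFilter:
    """Illustrative isotropic Kalman filter for a static target in the body frame."""

    def __init__(self, x, y, var=1.0, process_var=1e-4, meas_var=4e-4):
        self.x, self.y = x, y            # target position in the body frame (m)
        self.var = var                   # shared variance for both axes (m^2)
        self.process_var = process_var   # assumed odometry noise added per step
        self.meas_var = meas_var         # assumed detector noise

    def predict(self, dx, dy, dyaw):
        """High-rate step: shift the estimate by the robot's own motion.

        (dx, dy) is the body translation expressed in the old body frame,
        dyaw the heading change, both since the previous step.
        """
        px, py = self.x - dx, self.y - dy
        c, s = math.cos(dyaw), math.sin(dyaw)
        # Rotate by -dyaw to express the target in the new body frame.
        self.x = c * px + s * py
        self.y = -s * px + c * py
        self.var += self.process_var

    def update(self, zx, zy):
        """Low-rate step: fuse a detector measurement of the target."""
        k = self.var / (self.var + self.meas_var)  # Kalman gain
        self.x += k * (zx - self.x)
        self.y += k * (zy - self.y)
        self.var *= 1.0 - k


def run_demo():
    """Walk straight at 0.5 m/s for 1 s toward a target 2 m ahead."""
    control_hz, detector_hz = 50, 5      # control rate is an assumption
    steps_per_detection = control_hz // detector_hz
    dt, speed = 1.0 / control_hz, 0.5
    true_x = 2.0
    f = EgoCentricTargetFilter(x=2.0, y=0.0)
    for step in range(control_hz):
        dx = speed * dt
        true_x -= dx
        f.predict(dx, 0.0, 0.0)          # runs every control tick
        if (step + 1) % steps_per_detection == 0:
            f.update(true_x, 0.0)        # runs only when a detection arrives
    return f.x, f.y, f.var
```

Between detections the controller still receives a fresh estimate on every tick, because `predict` propagates the last fused position through the robot's own motion. The variance grows during those gaps and shrinks again on each `update`.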

Why it matters

Existing quadrupedal loco-manipulation systems rely heavily on external motion capture and off-board computation, and they struggle with vision latency and sim-to-real gaps. SigLoMa removes these dependencies with a fully onboard, ego-centric vision pipeline. According to the authors, this enables robust, dynamic loco-manipulation in open-world settings, a step toward practical robot autonomy.

Original Abstract

Designing an open-world quadrupedal loco-manipulation system is highly challenging. Traditional reinforcement learning frameworks utilizing exteroception often suffer from extreme sample inefficiency and massive sim-to-real gaps. Furthermore, the inherent latency of visual tracking fundamentally conflicts with the high-frequency demands of precise floating-base control. Consequently, existing systems lean heavily on expensive external motion capture and off-board computation. To eliminate these dependencies, we present SigLoMa, a fully onboard, ego-centric vision-based pick-and-place framework. At the core of SigLoMa is the introduction of Sigma Points, a lightweight geometric representation for exteroception that guarantees high scalability and native sim-to-real alignment. To bridge the frequency divide between slow perception and fast control, we design an ego-centric Kalman Filter to provide robust, high-rate state estimation. On the learning front, we alleviate sample inefficiency via an Active Sampling Curriculum guided by Hint Poses, and tackle the robot's structural visual blind spots using temporal encoding coupled with simulated random-walk drift. Real-world experiments validate that, relying solely on a 5Hz (200 ms latency) open-vocabulary detector, SigLoMa successfully executes dynamic loco-manipulation across multiple tasks, achieving performance comparable to expert human teleoperation.
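The abstract mentions simulated random-walk drift for handling visual blind spots but gives no details. One plausible reading, sketched below as an assumption rather than the paper's procedure, is that while the target is out of view during training, the last-seen estimate fed to the policy is perturbed by an accumulating Gaussian random walk. The function names, the per-step standard deviation, and the 2D position format are hypothetical.

```python
import random


def random_walk_drift(n_steps, step_std, seed=0):
    """Return cumulative 1D drift offsets, one per step (assumed Gaussian increments)."""
    rng = random.Random(seed)
    offset = 0.0
    offsets = []
    for _ in range(n_steps):
        offset += rng.gauss(0.0, step_std)  # each step adds a small independent error
        offsets.append(offset)
    return offsets


def corrupt_blind_observation(last_seen_xy, n_steps, step_std=0.005, seed=0):
    """Perturb a last-seen 2D target position over n_steps without detections."""
    drift_x = random_walk_drift(n_steps, step_std, seed)
    drift_y = random_walk_drift(n_steps, step_std, seed + 1)  # independent axis
    return [(last_seen_xy[0] + ax, last_seen_xy[1] + ay)
            for ax, ay in zip(drift_x, drift_y)]
```

A policy with temporal encoding that is trained on such drifting observations sees estimates that become less trustworthy the longer the target stays out of view. In this sketch, that degradation is what the drift is meant to imitate.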

