Stereo Multistage Spatial Attention for Real-Time Mobile Manipulation Under Visual Scale Variation and Disturbances

May 1, 2026 | arXiv:2605.00471

Xianbo Cai, Hideyuki Ichiwara, Hyogo Hiruma, Masaki Yoshikawa, Hiroshi Ito + 1 more

cs.RO

TLDR

A new stereo spatial attention method enables robust mobile manipulation in dynamic environments, handling visual scale variations and disturbances.

Key contributions

Introduces a stereo multistage spatial attention method for real-time mobile manipulation.
Extracts task-relevant spatial attention from stereo images, integrating with robot states.
Employs a hierarchical recurrent architecture for robust, closed-loop action prediction.
Achieves improved robustness and success rates over baselines in diverse real-world tasks.
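
To make the second contribution concrete, here is a minimal sketch of one common way to turn feature maps into attention points: a spatial softmax that returns the expected (x, y) location of each channel, applied to both stereo views and concatenated with the robot state. This is an illustrative assumption, not the authors' implementation; the summary does not specify the extraction mechanism, and the names spatial_softmax_point, stereo_attention_vector, and temperature are hypothetical. In the paper, a vector like this would feed a hierarchical recurrent model, which is omitted here.

```python
import math
from typing import List, Sequence, Tuple

FeatureMap = List[List[float]]  # one channel: rows x cols, assumed at least 2 x 2


def spatial_softmax_point(fmap: FeatureMap, temperature: float = 1.0) -> Tuple[float, float]:
    """Expected (x, y) position under a softmax over the map, normalized to [0, 1]."""
    rows, cols = len(fmap), len(fmap[0])
    peak = max(max(row) for row in fmap)  # subtract the max for numerical stability
    weights = [[math.exp((v - peak) / temperature) for v in row] for row in fmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * c / (cols - 1) for row in weights for c, w in enumerate(row)) / total
    y = sum(w * r / (rows - 1) for r, row in enumerate(weights) for w in row) / total
    return x, y


def stereo_attention_vector(
    left: Sequence[FeatureMap],
    right: Sequence[FeatureMap],
    robot_state: Sequence[float],
    temperature: float = 1.0,
) -> List[float]:
    """Concatenate per-channel attention points from both views with the robot state."""
    features: List[float] = []
    for fmap in list(left) + list(right):
        features.extend(spatial_softmax_point(fmap, temperature))
    return features + list(robot_state)


if __name__ == "__main__":
    # Hypothetical 3 x 3 single-channel maps with one strong activation each.
    left_maps = [[[0.0, 0.0, 9.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
    right_maps = [[[0.0, 9.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
    print(stereo_attention_vector(left_maps, right_maps, robot_state=[0.1, 0.2]))
```

A lower temperature sharpens the softmax so the point moves closer to the strongest activation. The horizontal offset between matching left and right points acts as a rough disparity cue, which is one reason a stereo pair can help under scale variation.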

Why it matters

Mobile robots struggle with visual scale variations and disturbances in unstructured environments. This paper presents a stereo spatial attention method that, across the four real-world tasks evaluated, improves robustness and task success rates over imitation learning and vision-language-action baselines.

Original Abstract

Robots operating in open, unstructured real-world environments must rely on onboard visual perception while autonomously moving across different locations. Continuous changes in onboard camera viewpoints cause significant visual scale variations in target objects, affecting vision-based motion generation. In this work, we present a stereo multistage spatial attention-based deep predictive learning method for real-time mobile manipulation. The proposed method extracts task-relevant spatial attention points from stereo images and integrates them with robot states through a hierarchical recurrent architecture for closed-loop action prediction. We evaluate the system on four real-world mobile manipulation tasks using a mobile manipulator, including rigid placement, articulated object manipulation, and deformable object interaction. Experiments under randomized initial positions and visual disturbance conditions demonstrate improved robustness and task success rates compared to representative imitation learning and vision-language-action baselines under identical control settings. The results indicate that structured stereo spatial attention combined with predictive temporal modeling provides an effective solution within the evaluated mobile manipulation scenarios.

