ArXiv TLDR

MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation

arXiv: 2605.00475

Xianbo Cai, Hideyuki Ichiwara, Masaki Yoshikawa, Tetsuya Ogata

cs.RO, cs.CV

TLDR

MSACT augments ACT with a multistage spatial attention module that extracts stable 2D attention points and predicts future attention sequences under a temporal alignment loss, improving fine manipulation while keeping inference latency low.

Key contributions

  • Introduces MSACT, a multistage spatial attention module that extracts stable 2D attention points from visual features (see the sketch after this list).
  • Jointly predicts future attention sequences, trained with a self-supervised temporal alignment loss that matches predicted attention to visual features from future frames, suppressing drift without keypoint annotations.
  • Improves localization stability and task performance in low-latency fine manipulation on the ALOHA bimanual platform.
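
The digest does not include code, but the attention-point extraction reads like a standard spatial-softmax (soft-argmax) head over CNN features. The sketch below is a minimal PyTorch illustration under assumed names and shapes (`SpatialAttentionStage`, `num_points`); the paper's actual multistage module may differ.

```python
# Minimal sketch of 2D attention-point extraction via spatial softmax
# (soft-argmax). All names and shapes here are assumptions; the paper's
# actual multistage module is not published in this digest.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionStage(nn.Module):
    """One stage: K attention heatmaps over CNN features -> K soft 2D points."""
    def __init__(self, in_channels: int, num_points: int):
        super().__init__()
        self.to_heatmap = nn.Conv2d(in_channels, num_points, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) -> points: (B, K, 2), coords normalized to [-1, 1]
        b, _, h, w = feats.shape
        logits = self.to_heatmap(feats).flatten(2)            # (B, K, H*W)
        probs = F.softmax(logits, dim=-1).view(b, -1, h, w)   # per-point heatmap
        xs = torch.linspace(-1, 1, w, device=feats.device)
        ys = torch.linspace(-1, 1, h, device=feats.device)
        x = (probs.sum(dim=2) * xs).sum(dim=-1)               # expected x: (B, K)
        y = (probs.sum(dim=3) * ys).sum(dim=-1)               # expected y: (B, K)
        return torch.stack([x, y], dim=-1)

# "Multistage" is read here as one such head per backbone stage, e.g. applied
# to features from successive ResNet blocks and concatenated along K.
```

Soft-argmax keeps the point coordinates differentiable, so they can feed into the action predictor as an extra low-dimensional spatial modality alongside the dense visual features.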

Why it matters

Fine manipulation demands stable visual localization and low-latency control, and both tend to break down when demonstration data is limited. MSACT suppresses localization drift while keeping inference fast, making robots more robust on complex real-world tasks.

Original Abstract

Real-world fine manipulation, particularly in bimanual manipulation, typically requires low-latency control and stable visual localization, while collecting large-scale data is costly and limited demonstrations may lead to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency; generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency; vision-language-action and voxel-based methods enhance generalization and geometric grounding but require higher computational cost and system complexity. We introduce a multistage spatial attention module that extracts stable 2D attention points and jointly predicts future attention sequences with a temporal alignment loss. Built upon ACT with a pretrained ResNet visual prior, a multistage attention module extracts task-relevant 2D attention points as a local spatial modality for action prediction. To maintain consistent object tracking, we introduce a self-supervised objective that aligns predicted attention sequences with visual features from future frames, suppressing drift without keypoint annotations and improving stability of the vision-to-action mapping under limited data. Experiments on simulated and real-world fine manipulation tasks, conducted on the ALOHA bimanual platform, evaluate task success, attention drift, inference latency, and robustness to visual disturbances. Results indicate improvements in localization stability and task performance while maintaining low-latency inference under the tested conditions.
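
One plausible reading of the self-supervised alignment objective (hypothetical; the exact loss is not given in this digest) is to sample the future frame's features at the predicted point locations and require them to match each point's descriptor from the current frame, so that drift shows up as a similarity penalty:

```python
# Hypothetical temporal alignment loss: sample future-frame features at the
# predicted attention points and pull them toward the current frame's point
# descriptors. Names, shapes, and the cosine form are assumptions.
import torch
import torch.nn.functional as F

def temporal_alignment_loss(anchor_desc: torch.Tensor,
                            pred_points: torch.Tensor,
                            future_feats: torch.Tensor) -> torch.Tensor:
    # anchor_desc:  (B, K, C) feature descriptors at the current attention points
    # pred_points:  (B, T, K, 2) predicted future point coords in [-1, 1], (x, y)
    # future_feats: (B, T, C, H, W) encoder features of the next T frames
    b, t, c, h, w = future_feats.shape
    grid = pred_points.reshape(b * t, 1, -1, 2)               # (B*T, 1, K, 2)
    sampled = F.grid_sample(future_feats.reshape(b * t, c, h, w),
                            grid, align_corners=True)         # (B*T, C, 1, K)
    sampled = sampled.squeeze(2).transpose(1, 2).reshape(b, t, -1, c)
    anchor = anchor_desc.unsqueeze(1).expand_as(sampled)      # broadcast over T
    # Low cosine similarity means the predicted points no longer sit on the
    # object's features in the future frame, i.e. the attention has drifted.
    sim = F.cosine_similarity(sampled, anchor.detach(), dim=-1)
    return (1.0 - sim).mean()
```

Because the targets come from the model's own features rather than labels, a loss of this form matches the abstract's claim of suppressing drift "without keypoint annotations".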
