A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

April 16, 2026 | arXiv:2604.15215

Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani, Andrey Konin + 2 more

cs.RO

TLDR

HiST-AT is a hierarchical spatiotemporal action tokenizer that reports state-of-the-art results for in-context imitation learning on simulated and real robotic manipulation benchmarks.

Key contributions

Proposes HiST-AT, a hierarchical spatiotemporal action tokenizer for in-context imitation learning.
Uses two levels of vector quantization to cluster actions, exploiting both spatial and temporal cues.
Simultaneously recovers input actions and their associated timestamps for improved learning.
Reports state-of-the-art performance on multiple simulated and real robotic manipulation benchmarks.
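
The two-level clustering in the contributions above can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain nearest-neighbor lookup with Euclidean distance, and the names `nearest`, `tokenize`, `fine_codebook`, and `coarse_codebook`, as well as the toy 2-D codebooks, are hypothetical. In a learned tokenizer the codebooks would be trained and would likely live in an encoder's latent space rather than in raw action space.

```python
import math

def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))

def tokenize(action, fine_codebook, coarse_codebook):
    """Two successive quantization steps: action -> subcluster -> cluster.

    Assumption: the higher level quantizes the chosen fine-grained code,
    so every subcluster maps to exactly one cluster.
    """
    fine_id = nearest(action, fine_codebook)
    coarse_id = nearest(fine_codebook[fine_id], coarse_codebook)
    return fine_id, coarse_id

# Hypothetical toy codebooks for 2-D actions (illustration only).
fine_codebook = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (1.1, 0.9)]
coarse_codebook = [(0.05, 0.05), (1.05, 0.95)]

tokens = tokenize((0.12, 0.08), fine_codebook, coarse_codebook)
print(tokens)  # (subcluster index, cluster index)
```

Because the coarse index is derived from the fine code rather than from the raw action, two actions that share a subcluster always share a cluster, which is the nesting that "hierarchical" implies here.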

Why it matters

HiST-AT tokenizes robot actions using both spatial and temporal information: it clusters actions at two levels of granularity while reconstructing the actions and their timestamps. The authors report that the hierarchical design outperforms a non-hierarchical counterpart and that the full method sets a new state of the art for in-context imitation learning on simulated and real manipulation benchmarks. If these results hold up, they point to action tokenization as a practical lever for building more adaptable robotic systems.

Original Abstract

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.
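
The abstract states that the tokenizer recovers both the input actions and their timestamps, but it does not give the objective. One plausible form, shown here only as a sketch, is a weighted sum of two mean-squared reconstruction errors. The function names, the use of mean squared error, and the `time_weight` parameter are all assumptions and are not taken from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of numbers."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def joint_reconstruction_loss(actions, actions_hat, times, times_hat, time_weight=1.0):
    """Spatial term (action reconstruction) plus weighted temporal term (timestamps).

    Assumption: both terms are MSE and are combined linearly; the paper may
    use a different formulation.
    """
    spatial = sum(mse(a_hat, a) for a_hat, a in zip(actions_hat, actions)) / len(actions)
    temporal = mse(times_hat, times)
    return spatial + time_weight * temporal

# Toy example: perfect action reconstruction, one timestamp off by 0.5.
actions = [(0.0, 0.0), (1.0, 1.0)]
actions_hat = [(0.0, 0.0), (1.0, 1.0)]
times = [0.0, 1.0]
times_hat = [0.0, 0.5]

loss = joint_reconstruction_loss(actions, actions_hat, times, times_hat, time_weight=2.0)
print(loss)
```

In this sketch, setting `time_weight` to zero reduces the objective to the spatial-only case, which corresponds to the first variant the abstract describes before the temporal extension.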

