ArXiv TLDR

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

arXiv: 2605.13548

Daojie Peng, Fulong Ma, Jiahang Cao, Qiang Zhang, Xupeng Xie + 5 more

cs.RO, cs.AI

TLDR

AttenA+ rectifies action inequality in robotic foundation models by prioritizing kinematically critical, low-velocity segments for improved manipulation.

Key contributions

  • Addresses the "flat" training paradigm in robotic foundation models, which implicitly assumes temporal homogeneity across action sequences.
  • Introduces AttenA+, an architecture-agnostic framework using velocity-driven action attention.
  • Prioritizes low-velocity, kinematically critical segments by reweighting the training objective.
  • Significantly improves SOTA models (e.g., OpenVLA-OFT to 98.6%) and shows real-world robustness.
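The abstract describes the core mechanism as "reweighting the training objective based on the inverse velocity field." A minimal sketch of what such a reweighting could look like is below; the exact weighting function, smoothing, and normalization used by AttenA+ are not specified in this summary, so the `eps` constant and mean-1 normalization here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def velocity_weights(actions: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Per-timestep weights inversely proportional to action velocity.

    actions: (T, D) array of joint or end-effector actions.
    Returns a (T,) weight vector normalized to mean 1 so the overall
    loss scale is unchanged (normalization scheme is an assumption).
    """
    # Finite-difference velocity magnitude; repeat the first value so the
    # weight vector matches the sequence length.
    speed = np.linalg.norm(np.diff(actions, axis=0), axis=1)
    speed = np.concatenate([speed[:1], speed])
    w = 1.0 / (speed + eps)  # inverse velocity field: slow segments weigh more
    return w / w.mean()


def reweighted_loss(per_step_loss: np.ndarray, actions: np.ndarray) -> float:
    """Weighted mean of per-timestep losses (e.g., per-step action MSE)."""
    return float(np.mean(velocity_weights(actions) * per_step_loss))
```

With this weighting, slow precision-critical segments (e.g., grasping) dominate the objective while fast error-tolerant transitions are down-weighted, which matches the heterogeneity argument in the abstract.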

Why it matters

Current robotic foundation models struggle with complex tasks because they weight all actions uniformly during training. AttenA+ offers an efficient, physics-aware, plug-and-play way to align learning capacity with the physical demands of manipulation. By exploiting the intrinsic structural priors of action sequences, it opens a new path toward general-purpose robotic control.

Original Abstract

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.
