ArXiv TLDR

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

2604.27472

Yang Zhang, Jiangyuan Zhao, Chenyou Fan, Fangzheng Yan, Tian Li + 9 more

cs.AI · cs.LG · cs.RO

TLDR

PRTS is a VLA model that uses contrastive Goal-Conditioned RL to learn goal-reachability, significantly improving robot task execution and long-horizon planning.

Key contributions

  • Introduces PRTS, a VLA foundation model using Goal-Conditioned Reinforcement Learning for pretraining.
  • Learns a unified embedding space where state-action and goal similarity predicts goal reachability.
  • Extracts dense goal-reachability supervision directly from offline trajectories without reward annotations.
  • Achieves state-of-the-art results on LIBERO, LIBERO-Pro, LIBERO-Plus, SimplerEnv, and a 14-task real-world suite, with especially large gains on long-horizon and zero-shot settings.
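The unified embedding space in the second bullet can be sketched as follows. This is a toy illustration, not the paper's architecture: in PRTS the VLM backbone produces the embeddings, while here simple random linear maps stand in, and all names and dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in encoders mapping state-actions and goals into one
# shared embedding space (illustrative weights, not trained).
W_sa = rng.normal(size=(10, 64))  # state-action encoder weights
W_g = rng.normal(size=(8, 64))    # goal (language-instruction) encoder weights

def embed_state_action(sa):
    return sa @ W_sa

def embed_goal(g):
    return g @ W_g

sa_batch = rng.normal(size=(4, 10))   # batch of state-action features
goal_batch = rng.normal(size=(4, 8))  # batch of goal features

# Pairwise scores: after contrastive training, entry [i, j] is meant to
# approximate the log-discounted probability of reaching goal j from
# state-action i, i.e. a quantitative reachability score rather than
# static semantic similarity.
scores = embed_state_action(sa_batch) @ embed_goal(goal_batch).T
print(scores.shape)  # (4, 4)
```

With trained encoders, ranking goals by a row of `scores` would order them by estimated physical reachability from that state, which is the signal the high-level reasoner exploits.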

Why it matters

This paper introduces a novel pretraining paradigm for VLA models, shifting from behavior cloning to goal-conditioned reinforcement learning. By embedding goal-reachability awareness, PRTS enables robots to better understand temporal task progress and plan for complex, long-horizon tasks. This significantly advances general-purpose robotic foundation policies.

Original Abstract

Vision-Language-Action (VLA) models advance robotic control via strong visual-linguistic priors. However, existing VLAs predominantly frame pretraining as supervised behavior cloning, overlooking the fundamental nature of robot learning as a goal-reaching process that requires understanding temporal task progress. We present PRTS (Primitive Reasoning and Tasking System), a VLA foundation model that reformulates pretraining through Goal-Conditioned Reinforcement Learning. By treating language instructions as goals and employing contrastive reinforcement learning, PRTS learns a unified embedding space where the inner product of state-action and goal embeddings approximates the log-discounted goal occupancy, i.e., the probability of reaching the language-specified goal from the current state-action, quantitatively assessing physical feasibility beyond static semantic matching. PRTS draws this dense goal-reachability supervision directly from offline trajectories without reward annotations, and folds it into the VLM backbone via a role-aware causal mask, incurring negligible overhead over vanilla behavior cloning. This paradigm endows the high-level reasoning system with intrinsic goal reachability awareness, bridging semantic reasoning and temporal task progress, and further benefits goal-conditioned action prediction. Pretrained on 167B tokens of diverse manipulation and embodied-reasoning data, PRTS reaches state-of-the-art performance on LIBERO, LIBERO-Pro, LIBERO-Plus, SimplerEnv, and a real-world suite of 14 complex tasks, with particularly substantial gains on long-horizon, contact-rich, and zero-shot novel-instruction settings, confirming that injecting goal-reachability awareness significantly improves both execution success and long-horizon planning of general-purpose robotic foundation policies.
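The abstract's claim of dense, reward-free supervision from offline trajectories follows a construction common in contrastive RL, which can be sketched as below. The geometric sampling scheme and all names are a hypothetical illustration of that standard construction, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.95  # discount factor (illustrative value)

def sample_positive_index(t, horizon):
    """Sample a future timestep t + k with k ~ Geometric(1 - gamma).

    Sampling future states of the SAME trajectory geometrically in the
    discount yields positive goals whose empirical frequency matches the
    discounted occupancy, so no reward labels are needed.
    """
    k = rng.geometric(1.0 - gamma)  # k >= 1
    return min(t + k, horizon - 1)

trajectory = rng.normal(size=(50, 10))  # one offline trajectory of state features
t = 3
pos = trajectory[sample_positive_index(t, len(trajectory))]  # positive goal
neg = rng.normal(size=(10,))  # state drawn from an unrelated trajectory: negative

# A contrastive (InfoNCE-style) loss then pulls the embedding of
# (state_t, action_t) toward `pos` and pushes it away from `neg`, so the
# learned inner product converges toward the log-discounted goal occupancy.
print(pos.shape, neg.shape)
```

The key point of the construction is that positives and negatives come for free from the trajectory data itself, which is why the abstract can claim negligible overhead over vanilla behavior cloning.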
