RLDX-1 Technical Report
Dongyoung Kim, Huiwon Jang, Myungkyu Koo, Suhyeok Jang, Taeyoung Kim + 63 more
TLDR
RLDX-1 is a general-purpose robotic policy built on a Multi-Stream Action Transformer (MSAT). It outperforms recent VLAs such as $π_{0.5}$ and GR00T N1.6 on dexterous manipulation tasks in both simulation and the real world.
Key contributions
- Introduces RLDX-1, a general-purpose robotic policy for dexterous manipulation.
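- Uses the Multi-Stream Action Transformer (MSAT), which integrates heterogeneous modalities through modality-specific streams with cross-modal joint self-attention.
- Adds system-level design choices: synthesized training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment.
- Outperforms recent frontier VLAs such as $π_{0.5}$ and GR00T N1.6 on simulation benchmarks and real-world tasks, reaching an 86.8% success rate on ALLEX humanoid tasks versus around 40% for both baselines.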
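The abstract describes MSAT as modality-specific streams joined by cross-modal joint self-attention. The sketch below is one possible reading of that phrase, not the authors' implementation: each stream keeps its own projection weights, while attention runs over the concatenated tokens of all streams. The stream names (`vision`, `touch`, `action`), the dimensions, the single attention head, and the function `joint_self_attention` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(streams, params):
    """Single-head joint self-attention over several token streams.

    streams: dict name -> array of shape (n_tokens, d)
    params:  dict name -> dict with "Wq", "Wk", "Wv", "Wo", each (d, d)
    Both layouts are hypothetical, chosen only for this sketch.
    """
    names = list(streams)
    # Modality-specific projections: each stream uses its own weights.
    q = np.concatenate([streams[n] @ params[n]["Wq"] for n in names])
    k = np.concatenate([streams[n] @ params[n]["Wk"] for n in names])
    v = np.concatenate([streams[n] @ params[n]["Wv"] for n in names])
    # Joint attention: every token attends to tokens from all streams.
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    mixed = weights @ v
    # Split back per stream; apply each stream's output projection
    # and a residual connection.
    out, start = {}, 0
    for n in names:
        end = start + streams[n].shape[0]
        out[n] = streams[n] + mixed[start:end] @ params[n]["Wo"]
        start = end
    return out, weights

# Toy usage with random tokens for three hypothetical modalities.
rng = np.random.default_rng(0)
d = 8
streams = {
    "vision": rng.normal(size=(4, d)),
    "touch": rng.normal(size=(2, d)),
    "action": rng.normal(size=(3, d)),
}
params = {
    n: {w: rng.normal(size=(d, d)) / np.sqrt(d) for w in ("Wq", "Wk", "Wv", "Wo")}
    for n in streams
}
out, weights = joint_self_attention(streams, params)
```

Here `weights` has one row per token across all streams, so an action token can draw on vision and touch tokens in a single attention step while each modality keeps its own parameters. A real model would add multiple heads, normalization, and feed-forward layers, none of which are specified in the abstract.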
Why it matters
RLDX-1 targets a gap in current VLAs: they struggle with complex real-world tasks that require motion awareness, memory-aware decision making, and physical sensing. By combining the MSAT architecture with system-level design choices, it reports large gains over recent VLAs, especially when controlling a high-DoF humanoid robot. The authors position it as a promising step toward reliable VLAs for contact-rich, dynamic dexterous manipulation.
Original Abstract
While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e. broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g. motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesizing training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g. $π_{0.5}$ and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows superiority in ALLEX humanoid tasks by achieving success rates of 86.8% while $π_{0.5}$ and GR00T N1.6 achieve around 40%, highlighting the ability of RLDX-1 to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.