ArXiv TLDR

WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

2604.08958

Mintae Kim, Koushil Sreenath

cs.LG, cs.AI, cs.RO

TLDR

WOMBET is a new RL framework that generates and transfers reliable prior data using world models for robust and sample-efficient learning.

Key contributions

  • Introduces WOMBET, a framework for jointly generating and utilizing prior data for RL transfer.
  • Learns a world model in the source task to generate offline data via uncertainty-penalized planning.
  • Filters generated trajectories for high return and low uncertainty, then fine-tunes with adaptive sampling.
  • Theoretically shows uncertainty-penalized planning provides a lower bound on true return.
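The generation and filtering steps listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Trajectory` container, the `penalty` weight, the two thresholds, and the use of a per-step scalar uncertainty (for example, ensemble disagreement) are all assumptions introduced here, since the summary does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """Hypothetical container for one world-model rollout."""
    rewards: List[float]        # per-step rewards predicted by the world model
    uncertainties: List[float]  # per-step epistemic uncertainty (assumed given)


def penalized_return(traj: Trajectory, penalty: float = 1.0) -> float:
    """Predicted return minus a penalty proportional to model uncertainty.

    A planner that maximizes this score is pushed away from regions
    where the world model is unsure of its own predictions.
    """
    return sum(r - penalty * u for r, u in zip(traj.rewards, traj.uncertainties))


def filter_trajectories(
    trajs: List[Trajectory],
    min_return: float,
    max_mean_uncertainty: float,
) -> List[Trajectory]:
    """Keep rollouts with high predicted return and low mean uncertainty."""
    kept = []
    for traj in trajs:
        if not traj.rewards:
            continue  # skip empty rollouts to avoid dividing by zero
        predicted_return = sum(traj.rewards)
        mean_uncertainty = sum(traj.uncertainties) / len(traj.uncertainties)
        if predicted_return >= min_return and mean_uncertainty <= max_mean_uncertainty:
            kept.append(traj)
    return kept
```

In this sketch the filter uses the unpenalized predicted return and the mean uncertainty as two separate criteria; a single threshold on `penalized_return` would be another reasonable choice.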

Why it matters

This paper addresses a key challenge in RL for robotics: how to generate reliable prior data for transfer, rather than assuming a fixed dataset is given. By jointly handling data generation and utilization, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks. The approach could accelerate real-world RL applications by reducing data collection costs.

Original Abstract

Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
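The adaptive sampling between offline and online data mentioned in the abstract might look roughly like the sketch below. The abstract does not state the actual adaptation rule, so the linear decay schedule, the `start` and `end` fractions, and both function names are assumptions chosen only to illustrate a shift from prior-driven to task-specific data.

```python
import random
from typing import List, Sequence


def offline_fraction(step: int, decay_steps: int,
                     start: float = 0.9, end: float = 0.1) -> float:
    """Share of each batch drawn from offline data (assumed linear decay)."""
    progress = min(max(step / decay_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_mixed_batch(offline: Sequence, online: Sequence, batch_size: int,
                       frac_offline: float, rng: random.Random) -> List:
    """Draw a batch mixing offline (source) and online (target) transitions.

    Assumes `offline` is non-empty. If no online data has been collected
    yet, the whole batch comes from the offline buffer.
    """
    n_offline = round(batch_size * frac_offline)
    if not online:
        n_offline = batch_size
    n_online = batch_size - n_offline
    batch = [rng.choice(offline) for _ in range(n_offline)]
    batch += [rng.choice(online) for _ in range(n_online)]
    return batch


# Usage: early in fine-tuning most samples come from the prior data.
rng = random.Random(0)
frac = offline_fraction(step=0, decay_steps=1000)
batch = sample_mixed_batch(["src"] * 50, ["tgt"] * 5, batch_size=10,
                           frac_offline=frac, rng=rng)
```

A fixed schedule is the simplest option; an adaptive rule could instead tie the fraction to a measured quantity such as online return, which this sketch does not attempt.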
