Offline-Online Reinforcement Learning for Linear Mixture MDPs
Zhongjun Zhang, Sean R. Sinclair
TLDR
An adaptive offline-online RL algorithm for linear mixture MDPs leverages offline data only when it is beneficial: it improves over purely online learning when the data are informative and matches online-only performance otherwise.
Key contributions
- Proposes an adaptive algorithm for offline-online RL in linear mixture MDPs with environment shift.
- The algorithm intelligently leverages informative offline data to improve over purely online learning.
- It safely ignores uninformative offline data, matching online-only performance without degradation (both behaviors are illustrated in the sketch after this list).
- Provides regret upper and lower bounds characterizing when offline data is beneficial.
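To make the adaptivity concrete, here is a minimal Python sketch of the general idea: estimate the linear mixture parameter by ridge regression and pool the offline data only when a simple shift test suggests they are consistent with the online data. Everything here, the function names, the Mahalanobis-style shift statistic, and the threshold `tau`, is an illustrative assumption, not the paper's actual algorithm or confidence sets.

```python
import numpy as np


def ridge_fit(X, y, lam=1.0):
    """Ridge regression: solve (X^T X + lam*I) theta = X^T y."""
    d = X.shape[1]
    cov = X.T @ X + lam * np.eye(d)
    theta = np.linalg.solve(cov, X.T @ y)
    return theta, cov


def adaptive_estimate(X_off, y_off, X_on, y_on, lam=1.0, tau=1.0):
    """Estimate the (linear mixture) parameter, pooling offline data only
    when a crude shift test passes. `tau` and the test statistic are
    hypothetical placeholders, not the paper's actual criterion."""
    theta_on, cov_on = ridge_fit(X_on, y_on, lam)
    theta_off, _ = ridge_fit(X_off, y_off, lam)

    # Illustrative shift statistic: squared Mahalanobis distance between
    # the two estimates under the online design matrix.
    diff = theta_on - theta_off
    shift_stat = float(diff @ cov_on @ diff)

    if shift_stat <= tau:
        # Offline data look consistent with the target environment: pool.
        X = np.vstack([X_off, X_on])
        y = np.concatenate([y_off, y_on])
        theta, _ = ridge_fit(X, y, lam)
        return theta, "pooled"
    # Offline data look mismatched: fall back to online-only estimation.
    return theta_on, "online-only"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, noise = 5, 0.1
    theta_true = rng.normal(size=d)
    X_on = rng.normal(size=(50, d))
    y_on = X_on @ theta_true + noise * rng.normal(size=50)
    X_off = rng.normal(size=(200, d))
    y_off = X_off @ theta_true + noise * rng.normal(size=200)
    theta, mode = adaptive_estimate(X_off, y_off, X_on, y_on)
    print(mode)  # expect "pooled": offline data come from the same environment
```

A real implementation would replace the hard threshold with the paper's data-driven criterion and couple the pooled estimate with optimistic exploration.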
Why it matters
Offline data are abundant but often come from a mismatched environment or behavior policy, which makes them risky to use naively in RL. This work provides a principled way to integrate such data into online learning, guaranteeing performance gains when the data are useful and no degradation when they are not.
Original Abstract
We study offline-online reinforcement learning in linear mixture Markov decision processes (MDPs) under environment shift. In the offline phase, data are collected by an unknown behavior policy and may come from a mismatched environment, while in the online phase the learner interacts with the target environment. We propose an algorithm that adaptively leverages offline data. When the offline data are informative, either due to sufficient coverage or small environment shift, the algorithm provably improves over purely online learning. When the offline data are uninformative, it safely ignores them and matches the online-only performance. We establish regret upper bounds that explicitly characterize when offline data are beneficial, together with nearly matching lower bounds. Numerical experiments further corroborate our theoretical findings.
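For context, here are standard definitions often used in this setting (a generic formalization, not taken from the paper, whose exact notation may differ): $V_1$ denotes the value at the first step of an episode and $\pi_k$ the policy deployed in online episode $k$.

```latex
% Linear mixture MDP: the transition kernel is linear in a known feature map
P(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^{*} \rangle,
\qquad \theta^{*} \in \mathbb{R}^{d} \text{ unknown}.

% Cumulative regret over K online episodes in the target environment
\mathrm{Regret}(K) = \sum_{k=1}^{K}
  \left( V_{1}^{*}(s_{1}^{k}) - V_{1}^{\pi_{k}}(s_{1}^{k}) \right).
```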