Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching
TLDR
Q-MMR is a novel off-policy evaluation framework using recursive reweighting and moment matching, offering dimension-free finite-sample guarantees.
Key contributions
- Introduces Q-MMR, a novel framework for off-policy evaluation in finite-horizon MDPs.
- Learns a scalar weight for each data point via recursive moment matching so that the reweighted rewards approximate the target policy's expected return (see the sketch after this list).
- Provides dimension-free, finite-sample guarantees for general function approximation, requiring only realizability of $Q^\pi$.
- Connects to existing OPE methods and clarifies the concept of coverage in offline RL.
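To make the recursion concrete, here is a minimal sketch of one reweighting step and the resulting estimator, assuming a linear discriminator class $f(s,a) = \theta^\top \phi(s,a)$ so the inner maximization has a closed form. The feature map `phi`, the trajectory-style dataset layout, the ridge regularizer, and the uniform step-0 initialization are all illustrative assumptions, not the paper's actual algorithm or API.

```python
import numpy as np

def moment_match_step(phi_h, phi_prev_next_pi, w_prev, reg=1e-6):
    """One recursive reweighting step (illustrative sketch, not the paper's code).

    Finds weights w for step h whose feature moments approximately match the
    target moment implied by the previous step's weights, i.e. for every
    linear discriminator f(s, a) = theta @ phi(s, a):
        sum_i w_i * phi_h[i]  ~=  sum_i w_prev_i * phi_prev_next_pi[i]

    phi_h:            (n, d) features of (s_h, a_h) pairs in the data
    phi_prev_next_pi: (n, d) E_{a'~pi}[phi(s_h, a')], computed from the
                      step-(h-1) transitions' next states
    w_prev:           (n,)   weights learned at step h-1
    """
    target = phi_prev_next_pi.T @ w_prev                   # (d,) target moment
    gram = phi_h.T @ phi_h + reg * np.eye(phi_h.shape[1])  # ridge-regularized
    # Minimum-norm weights whose feature moments approximately hit the target:
    return phi_h @ np.linalg.solve(gram, target)           # (n,)

def qmmr_estimate(phi, phi_next_pi, rewards):
    """Top-down recursion over the horizon; glosses over the step-0 action
    correction for simplicity. phi[h], phi_next_pi[h], and rewards[h] are the
    per-step arrays described above."""
    n = phi[0].shape[0]
    w = np.full(n, 1.0 / n)       # assumes data shares pi's initial state dist.
    value = w @ rewards[0]
    for h in range(1, len(rewards)):
        w = moment_match_step(phi[h], phi_next_pi[h - 1], w)
        value += w @ rewards[h]   # reweighted rewards approximate E_pi[r_h]
    return value
```

With richer discriminator classes the inner supremum no longer has this closed form and the weights would be found by solving a minimax problem; the linear case is used here only because it collapses to a least-squares solve.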
Why it matters
This paper introduces a robust off-policy evaluation method with strong theoretical guarantees. Its dimension-free error bound, which does not grow with the statistical complexity of the function class, simplifies analysis and could make OPE more reliable when rich function approximators are used in offline RL. The accompanying theory also deepens our understanding of coverage.
Original Abstract
We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^\pi$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
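Reading the abstract literally, the recursion might be written as follows; the notation ($\mathcal{F}_h$ for the step-$h$ discriminator class, $\hat{w}^h$ for the step-$h$ weights, and trajectory data indexed so $s_i^h$ is both the next state of step $h-1$ and the current state of step $h$) is assumed for illustration rather than taken from the paper:

$$\hat{w}^{h} \;=\; \arg\min_{w \in \mathbb{R}^n}\; \sup_{f \in \mathcal{F}_h} \Bigg( \sum_{i=1}^{n} w_i\, f\big(s_i^{h}, a_i^{h}\big) \;-\; \sum_{i=1}^{n} \hat{w}_i^{\,h-1}\, \mathbb{E}_{a' \sim \pi(\cdot \mid s_i^{h})} \big[ f\big(s_i^{h}, a'\big) \big] \Bigg)^{2},$$

with the reweighted-reward estimate of the return then given by $\hat{J}(\pi) = \sum_{h=1}^{H} \sum_{i=1}^{n} \hat{w}_i^{h}\, r_i^{h}$.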