ArXiv TLDR

Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

arXiv: 2605.00654

Andrzej Ruszczynski, Tiangang Zhang

cs.LG, cs.AI, math.OC, stat.ML

TLDR

This paper introduces mini-batch Markov risk measures and a multipattern Q-learning method for risk-averse RL, proving a high-probability regret bound of $\mathcal{O}\big(H^2 N^H \sqrt{K}\big)$.

Key contributions

  • Introduces "mini-batch measures," a new class of Markov coherent risk measures for risk-averse MDPs.
  • Defines "multipattern risk-averse problems," generalizing linear systems in risk-averse RL.
  • Proposes a feature-based Q-learning method using these concepts with a regret bound.
  • Presents an economical version of the Q-learning method that streamlines the policy-evaluation (backward) step.
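To make "Markov coherent risk measure" concrete: coherent risk measures replace the plain expectation of future costs with a functional that penalizes bad outcomes. The paper's mini-batch measures are evaluated on a batch of $N$ samples; as a hedged illustration (not the paper's exact construction), here are two standard coherent measures, CVaR and mean-upper-semideviation, estimated on a mini-batch of sampled costs:

```python
import numpy as np

def cvar(costs, alpha=0.2):
    """Conditional Value-at-Risk at level alpha: the mean of the
    worst ceil(alpha * N) costs in the mini-batch (coherent)."""
    z = np.sort(np.asarray(costs, dtype=float))[::-1]  # worst first
    k = max(1, int(np.ceil(alpha * len(z))))
    return z[:k].mean()

def mean_semideviation(costs, kappa=0.5):
    """Mean-upper-semideviation: E[Z] + kappa * E[(Z - E[Z])_+],
    another standard coherent risk measure, here estimated on a
    mini-batch of N sampled costs."""
    z = np.asarray(costs, dtype=float)
    m = z.mean()
    return m + kappa * np.maximum(z - m, 0.0).mean()
```

Both functions reduce to the sample mean when the risk parameter is turned off (alpha = 1 or kappa = 0) and are strictly more pessimistic otherwise, which is the basic effect a risk-averse MDP formulation exploits.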

Why it matters

This work advances risk-averse reinforcement learning by introducing novel risk measures and a Q-learning approach with strong theoretical guarantees. It provides a more robust framework for decision-making under uncertainty, which the authors illustrate on a stochastic assignment problem and a short-horizon multi-armed bandit problem.

Original Abstract

For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in a feature-based $Q$-learning method with multipattern $Q$-factor approximation and we prove a high-probability regret bound of $\mathcal{O}\big(H^2 N^H \sqrt{K}\big)$, where $H$ is the horizon, $N$ is the mini-batch size, and $K$ is the number of episodes. We also propose an economical version of the $Q$-learning method that streamlines the policy evaluation (backward) step. The theoretical results are illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.
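The core idea of a risk-averse finite-horizon backup can be sketched in tabular form: instead of averaging the next-stage value over successor states, the backup applies a coherent risk measure to a mini-batch of $N$ sampled successor values. The sketch below uses CVaR as a stand-in risk measure on a randomly generated toy MDP; it is a minimal tabular illustration under those assumptions, not the paper's feature-based multipattern $Q$-factor approximation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cvar(costs, alpha=0.2):
    # Coherent risk: mean of the worst ceil(alpha * N) mini-batch costs.
    z = np.sort(np.asarray(costs, dtype=float))[::-1]
    k = max(1, int(np.ceil(alpha * len(z))))
    return z[:k].mean()

# Toy finite-horizon MDP with random costs and transitions (assumed data).
S, A, H, N = 4, 2, 3, 16                    # N plays the mini-batch role
P = rng.dirichlet(np.ones(S), size=(S, A))  # transition kernel P[s, a, s']
C = rng.uniform(0.0, 1.0, size=(S, A))      # stage costs C[s, a]

# Backward induction with a risk-averse backup: the expectation over
# successor states is replaced by CVaR over N sampled successor values.
Q = np.zeros((H + 1, S, A))                 # terminal Q-factors are zero
for h in range(H - 1, -1, -1):
    for s in range(S):
        for a in range(A):
            nxt = rng.choice(S, size=N, p=P[s, a])   # mini-batch of s'
            vals = Q[h + 1, nxt].min(axis=1)         # V_{h+1}(s') samples
            Q[h, s, a] = C[s, a] + cvar(vals)        # risk-averse backup

policy = Q[0].argmin(axis=1)                # greedy first-stage policy
```

Replacing `cvar` with `np.mean` recovers the ordinary (risk-neutral) sampled backup, which makes the extra pessimism injected by the risk measure easy to see side by side.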
