Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

2605.05102

Harin Lee, Min-hwan Oh

cs.LG, stat.ML

TLDR

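This paper presents a unified framework for distributional regret in multi-armed bandits (MAB) and episodic reinforcement learning (RL). Its bounds achieve optimal trade-offs between expected and distributional regret, and the MAB special case confirms a conjecture of Lattimore & Szepesvári (2020).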

Key contributions

  • Proposes a unified framework to analyze distributional regret in MAB and episodic RL.
  • Introduces a simple UCBVI-style algorithm whose exploration bonus, $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, is set by user-specified parameters $(c_{1,k}, c_{2,k})$.
  • Derives general gap-independent and gap-dependent distributional regret bounds.
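  • Confirms a conjecture of Lattimore & Szepesvári (2020, Section 17.1) by proving a distributional regret bound of order $\mathcal{O}(\sqrt{AT}\log(1/\delta))$ for MAB with $A$ arms and horizon $T$.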

Why it matters

This work gives a principled way to understand and control the full distribution of regret, not just its expectation, in sequential decision-making. Its bounds achieve optimal trade-offs between expected regret and tail risk in both minimax and instance-dependent regimes, strengthening the theoretical foundations of MAB and RL.

Original Abstract

We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $\delta\in (0,1]$, thereby characterizing the regret distribution across the full range of $\delta$. We present a simple UCBVI-style algorithm with exploration bonus $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}(\sqrt{AT}\log(1/\delta))$, confirming the conjecture of Lattimore & Szepesvári (2020, Section 17.1) for the first time.
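
To make the shape of the exploration bonus concrete, here is a minimal Python sketch of an optimistic bandit rule that uses $\min\{c_1/N, c_2/\sqrt{N}\}$. It is an illustration under stated assumptions, not the authors' algorithm: the parameters are held constant instead of following a sequence $(c_{1,k}, c_{2,k})$, rewards are assumed Bernoulli, each arm is pulled once before the bonus is used, and the names `exploration_bonus` and `run_bandit` are hypothetical.

```python
import math
import random


def exploration_bonus(n_visits, c1, c2):
    """Bonus min{c1/N, c2/sqrt(N)} for a visit count N >= 1."""
    if n_visits < 1:
        raise ValueError("n_visits must be at least 1")
    return min(c1 / n_visits, c2 / math.sqrt(n_visits))


def run_bandit(means, horizon, c1, c2, seed=0):
    """Simulate an optimistic index rule on Bernoulli arms.

    Assumptions (not from the paper): constant c1 and c2, Bernoulli
    rewards, and one initial pull per arm so every count is >= 1.
    Returns the pseudo-regret and the per-arm pull counts.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms
    reward_sums = [0.0] * n_arms
    best_mean = max(means)
    regret = 0.0

    for t in range(horizon):
        if t < n_arms:
            # Initialization: pull each arm once.
            arm = t
        else:
            # Optimistic index: empirical mean plus the exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: reward_sums[a] / counts[a]
                + exploration_bonus(counts[a], c1, c2),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        reward_sums[arm] += reward
        regret += best_mean - means[arm]

    return regret, counts


if __name__ == "__main__":
    regret, counts = run_bandit([0.3, 0.5, 0.7], horizon=2000, c1=4.0, c2=2.0)
    print("pseudo-regret:", regret)
    print("pull counts:", counts)
```

With constant parameters, the two terms of the bonus cross at $N = (c_1/c_2)^2$: below that count the $c_2/\sqrt{N}$ term is the smaller one, and above it the faster-decaying $c_1/N$ term takes over. For example, with $c_1 = 4$ and $c_2 = 2$ the crossover is at $N = 4$.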
