A single algorithm for both restless and rested rotting bandits
Julien Seznec, Pierre Ménard, Alessandro Lazaric, Michal Valko
TLDR
RAW-UCB is a novel algorithm that achieves near-optimal regret in both restless and rested rotting bandit settings, unifying previously distinct problems.
Key contributions
- Introduces RAW-UCB (Rotting Adaptive Window UCB), a novel algorithm for rotting bandit problems (see the sketch after this list).
- Achieves near-optimal regret in both restless and rested rotting bandit settings.
- Requires no prior knowledge of the specific bandit setting or non-stationarity.
- Unifies the rested and restless rotting bandit problems, previously thought to be significantly different, in contrast with prior negative results showing this is impossible once rewards may increase.
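The core of RAW-UCB is an adaptive-window index: for each arm, it averages the last h rewards for every window size h, adds a confidence width that shrinks with h, and keeps the smallest resulting upper bound; the arm with the largest index is pulled. Below is a minimal Python sketch of this idea. The width sqrt(2·α·σ²·log t / h) follows the form described in the paper, but the parameter values (`sigma`, `alpha`) and helper names are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def raw_ucb_index(rewards, t, sigma=1.0, alpha=1.4):
    """Adaptive-window index for one arm: the tightest (minimum) upper
    confidence bound over all windows of the arm's most recent pulls.

    rewards: rewards observed for this arm, in pull order.
    t: current round (enters the confidence width).
    sigma, alpha: noise scale and confidence parameter (assumed values).
    """
    n = len(rewards)
    if n == 0:
        return np.inf  # force at least one pull of every arm
    best = np.inf
    tail_sum = 0.0
    for h in range(1, n + 1):             # window = last h pulls
        tail_sum += rewards[n - h]
        mean = tail_sum / h
        width = np.sqrt(2 * alpha * sigma**2 * np.log(t) / h)
        best = min(best, mean + width)
    return best

def raw_ucb(pull_env, n_arms, horizon, sigma=1.0, alpha=1.4):
    """Run RAW-UCB: each round, pull the arm with the largest index."""
    history = [[] for _ in range(n_arms)]
    for t in range(1, horizon + 1):
        idx = [raw_ucb_index(history[i], t, sigma, alpha)
               for i in range(n_arms)]
        arm = int(np.argmax(idx))
        history[arm].append(pull_env(arm, t))
    return history
```

Taking the minimum over window sizes is what lets one index serve both settings: long windows reduce noise, while short windows limit the bias introduced when an arm's value has recently decayed.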
Why it matters
This paper addresses a long-standing challenge in bandit theory by providing a single algorithm that is near-optimal for both restless and rested rotting bandits. Reward decay is common in real-world applications such as recommender systems, where content grows stale or users grow bored; an algorithm that needs no prior knowledge of the setting or the type of non-stationarity can be deployed without first diagnosing which kind of decay is at play.
Original Abstract
In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated with the actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their value decreases over time). These problems were thought to be significantly different, since Levine et al. (2017) showed that state-of-the-art algorithms for restless bandits perform poorly in the rested rotting setting. In this paper, we introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both the rotting rested and restless bandit settings, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar results as soon as rewards are allowed to increase. We confirm our theoretical findings on a number of synthetic and dataset-based experiments.
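The rested/restless distinction the abstract draws is easy to state in code. Below is a hypothetical sketch of the two reward models; the decay rates and noise level are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rested_reward(n_pulls, decay=0.01):
    # Rested rotting: an arm's mean decays with its own pull count,
    # i.e., the decay is caused by the actions executed in the past.
    return max(0.0, 1.0 - decay * n_pulls) + rng.normal(scale=0.1)

def restless_reward(t, decay=0.001):
    # Restless rotting: an arm's mean decays with global time,
    # whether the arm is pulled or not (an external factor).
    return max(0.0, 1.0 - decay * t) + rng.normal(scale=0.1)

# Either model plugs into the raw_ucb sketch above, e.g. (restless case):
# history = raw_ucb(lambda arm, t: restless_reward(t), n_arms=3, horizon=1000)
```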