ArXiv TLDR

Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?

2604.07096

Changkun Guan, Mengfan Xu

cs.LG, stat.ML

TLDR

This paper shows stochastic multi-objective bandits are not harder than single-objective ones, achieving optimal Pareto regret with a new algorithm.

Key contributions

  • Shows stochastic multi-objective bandit Pareto regret is governed by the maximum sub-optimality gap.
  • Proves a Pareto regret lower bound of Ω(K log T / g†), matching the classical single-objective rate.
  • Develops an optimal algorithm achieving O(K log T / g†) Pareto regret.
  • Algorithm uses nested two-layer uncertainty quantification and a top-two racing strategy.

Why it matters

This paper resolves a long-standing question about the inherent difficulty of stochastic multi-objective bandits. It demonstrates that they are not fundamentally harder to optimize than single-objective bandits, contrary to prior suggestions that Pareto regret grows with the number of objectives. This provides a crucial theoretical foundation and an optimal algorithm for practical applications.

Original Abstract

Multi-objective bandits have attracted increasing attention because of their broad applicability and mathematical elegance, where the reward of each arm is a multi-dimensional vector rather than a scalar. This naturally introduces Pareto order relations and Pareto regret. A long-standing question in this area is whether performance is fundamentally harder to optimize because of this added complexity. A recent surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; however, in the stochastic setting, where the regret notion is different, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the dimensionality. This controversial yet subtle phenomenon motivates our central question: \emph{are multi-objective bandits actually harder than single-objective ones?} We answer this question in full by showing that, in the stochastic setting, Pareto regret is in fact governed by the maximum sub-optimality gap \(g^\dagger\), and hence by the minimum marginal regret of order \(\Omega(\frac{K\log T}{g^\dagger})\). We further develop a new algorithm that achieves Pareto regret of order \(O(\frac{K\log T}{g^\dagger})\), and is therefore optimal. The algorithm leverages a nested two-layer uncertainty quantification over both arms and objectives through upper and lower confidence bound estimators. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection. Together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments to validate the proposed algorithm, showing the desired regret guarantee and significant gains over benchmark methods.
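The abstract's algorithm description (a two-layer UCB/LCB uncertainty quantification over arms and objectives, a top-two racing strategy for arm selection, and an uncertainty-greedy rule for dimension selection) can be sketched roughly as follows. This is a minimal illustration only: the confidence radius, the Pareto-elimination rule, and the names `run_sketch` and `radius` are our own choices for exposition, not the paper's exact algorithm or constants.

```python
import numpy as np

def radius(n, t):
    # Hoeffding-style confidence radius (our choice of constants, not the paper's).
    return np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(n, 1))

def run_sketch(means, T, rng):
    """Illustrative two-layer multi-objective bandit loop.

    means: (K, D) array of true per-dimension Bernoulli reward means in [0, 1].
    Returns the pull count of each arm after T rounds.
    """
    K, D = means.shape
    sums = np.zeros((K, D))
    pulls = np.zeros(K, dtype=int)
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # pull each arm once to initialise estimates
        else:
            mu = sums / pulls[:, None]
            r = radius(pulls, t)[:, None]      # per-arm radius, shared across dims
            ucb, lcb = mu + r, mu - r
            # Layer 1 (arms): keep arms not confidently Pareto-dominated, i.e.
            # drop arm i only if some arm j satisfies lcb[j] > ucb[i] in every dim.
            alive = [i for i in range(K)
                     if not any((lcb[j] > ucb[i]).all()
                                for j in range(K) if j != i)]
            # Layer 2 (objectives): uncertainty-greedy pick of the dimension with
            # the widest confidence interval among surviving arms.
            d = int(np.argmax((ucb[alive] - lcb[alive]).max(axis=0)))
            # Top-two racing on that dimension: race the two best-UCB arms,
            # playing the less-explored of the pair.
            order = sorted(alive, key=lambda i: -ucb[i, d])
            top_two = order[:2] if len(order) >= 2 else order
            arm = min(top_two, key=lambda i: pulls[i])
        reward = (rng.random(D) < means[arm]).astype(float)  # Bernoulli per dim
        sums[arm] += reward
        pulls[arm] += 1
    return pulls
```

Note the arm with the highest UCB in any single dimension can never be eliminated, so the surviving set is always non-empty; this mirrors how confidence-bound elimination keeps at least one Pareto-plausible arm in play.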
