Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat
TLDR
This paper introduces Q-value-style algorithms for reinforcement learning with exponential utility in discounted MDPs and proves their convergence.
Key contributions
- Derived two Q-value extensions of the Bellman-type equation for exponential utility and proved the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively (a hedged sketch of such operators follows this list).
- Showed that the fixed points of these operators induce greedy stationary policies that are optimal for the exponential-utility objective among stationary policies.
- Introduced a two-timescale Q-learning-style algorithm with almost-sure convergence and finite-time convergence rates via timescale separation.
- Developed a one-timescale algorithm governed by a sublinear power-law operator, proving its convergence via local Lipschitzness, monotonicity, homogeneity, and Dini-derivative arguments.
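To make the contraction claims concrete, here is one plausible form the two Q-extensions could take, reconstructed from the abstract's description of the metrics and the power-law structure; the paper's exact definitions may differ. Here $\beta > 0$ is the risk parameter and $\gamma \in (0,1)$ the discount factor.

```latex
% Hypothetical reconstruction, not quoted from the paper:
% T_log acts on log-scale values and is a gamma-contraction in L_infinity;
% T_x acts on positive values, is sublinear (homogeneous of degree gamma),
% and contracts in the sup-log/Thompson metric.
\[
  (T_{\log} Q)(s,a) \;=\; \tfrac{1}{\beta}\,
    \log \mathbb{E}\!\left[\, e^{\beta \left( r(s,a,s') + \gamma \max_{b} Q(s',b) \right)} \;\middle|\; s,a \right],
\]
\[
  (T_{\times} \widetilde{Q})(s,a) \;=\;
    \mathbb{E}\!\left[\, e^{\beta\, r(s,a,s')} \bigl(\max_{b} \widetilde{Q}(s',b)\bigr)^{\gamma} \;\middle|\; s,a \right].
\]
```

Under this reconstruction, the two operators are linked by the change of variables $\widetilde{Q} = e^{\beta Q}$, which also maps the sup norm on $Q$ to the sup-log (Thompson) metric on $\widetilde{Q}$.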
Why it matters
This work provides the first principled value-based RL algorithms for exponential utility, addressing a significant gap in risk-sensitive RL. It offers a strong theoretical foundation with convergence guarantees, enabling robust, practical use in applications where risk aversion is critical.
Original Abstract
Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in Porteus (1975), we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning-style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.
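For intuition, below is a minimal tabular sketch of what a two-timescale Q-learning-style scheme for this objective might look like, assuming the log-space operator sketched above. Everything here is illustrative and not the paper's algorithm: the auxiliary iterate `Z`, the step-size schedules, and the `env` interface (`reset`/`step` returning `(state, reward, done)`) are our assumptions.

```python
import numpy as np

# Illustrative two-timescale tabular sketch for exponential-utility
# Q-learning. Hypothetical: the auxiliary iterate Z, step sizes, and
# the env interface are assumptions, not the paper's exact algorithm.
def two_timescale_q_learning(env, n_states, n_actions,
                             beta=0.5, gamma=0.9, n_steps=100_000):
    Q = np.zeros((n_states, n_actions))   # slow iterate: value estimates
    Z = np.ones((n_states, n_actions))    # fast iterate: tracks E[exp(...)]
    counts = np.zeros((n_states, n_actions), dtype=int)

    s = env.reset()
    for _ in range(n_steps):
        a = np.random.randint(n_actions)   # exploratory behavior policy
        s_next, r, done = env.step(a)      # assumed environment interface
        counts[s, a] += 1
        n = counts[s, a]
        alpha = 1.0 / n                        # fast step size
        eta = 1.0 / (1 + n * np.log(1 + n))    # slower step size: eta/alpha -> 0

        # Fast timescale: stochastic fixed-point iteration for
        # Z(s,a) ~ E[ exp(beta * (r + gamma * max_b Q(s',b))) | s,a ].
        target = np.exp(beta * (r + gamma * Q[s_next].max()))
        Z[s, a] += alpha * (target - Z[s, a])

        # Slow timescale: move Q toward the certainty equivalent (1/beta) log Z.
        Q[s, a] += eta * ((1.0 / beta) * np.log(Z[s, a]) - Q[s, a])

        s = env.reset() if done else s_next
    return Q
```

The design point illustrated here is the timescale separation itself: the expectation inside the logarithm cannot be sampled unbiasedly in one step, so a fast iterate tracks it while the slow iterate chases its certainty equivalent.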