Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing
Leon Halgryn, Sophie Langer, Janusz M. Meylahn, E. Moritz Hahn
TLDR
This paper provides finite-sample guarantees for Deep Q-learning by explicitly modeling temporally dependent replay data as $\tau$-mixing.
Key contributions
- Extends DQN statistical analysis to dependent data using a $\tau$-mixing model.
- Derives finite-sample risk bounds for DQN updates under temporal dependence.
- Shows temporal dependence degrades statistical rates, adding a dimensionality penalty.
- Empirically shows that replay-sampled data exhibits approximately exponentially decaying correlations, consistent with $\tau$-mixing (see the sketch after this list).
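The following is a minimal sketch of how one might probe the last point empirically; the environment (CartPole-v1), the random behaviour policy, and the pole-angle feature used for the correlation estimate are illustrative assumptions, not the paper's exact experimental protocol.

```python
# Hypothetical sketch: estimate how correlations between transitions decay with
# temporal lag in a Gymnasium environment under a random behaviour policy.
# Lag-k autocorrelation of a scalar state feature is used as a crude proxy for
# the dependence that tau-mixing quantifies between replayed transitions.
import numpy as np
import gymnasium as gym

env = gym.make("CartPole-v1")

# Collect a trajectory of states (temporally dependent by construction).
states = []
obs, _ = env.reset(seed=0)
for _ in range(5000):
    obs, _, terminated, truncated, _ = env.step(env.action_space.sample())
    states.append(obs)
    if terminated or truncated:
        obs, _ = env.reset()
states = np.array(states)

# Standardise a scalar feature of the state (here: pole angle).
feature = states[:, 2]
feature = (feature - feature.mean()) / feature.std()

# Lag-k autocorrelations up to max_lag.
max_lag = 50
autocorr = np.array(
    [1.0] + [np.mean(feature[:-k] * feature[k:]) for k in range(1, max_lag)]
)

# Fit log|autocorr(k)| ~ slope * k; an approximately linear fit with negative
# slope supports the exponential-decay picture described in the paper.
lags = np.arange(1, max_lag)
mask = np.abs(autocorr[1:]) > 1e-3  # avoid taking the log of near-zero values
slope, _ = np.polyfit(lags[mask], np.log(np.abs(autocorr[1:][mask])), 1)
print(f"estimated correlation decay time: {-1.0 / slope:.1f} steps")
```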
Why it matters
This paper challenges the common independence assumption in DQN analysis and provides finite-sample guarantees under realistic temporal dependence, giving a more accurate picture of DQN's statistical limits and informing the design of robust deep reinforcement learning algorithms.
Original Abstract
Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-networks (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $\tau$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with $\tau$-mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of $\tau$-mixing data. Moreover, we derive the sample complexity of DQN under $\tau$-mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.
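For readers unfamiliar with the dependence measure, a common formulation of the $\tau$-mixing coefficient (in the sense of Dedecker and Prieur) is sketched below; the paper's exact definition and normalisation may differ.

$$\tau(\mathcal{M}, X) = \mathbb{E}\Big[\sup_{f \in \Lambda_1}\Big|\int f(x)\,\mathbb{P}_{X\mid\mathcal{M}}(\mathrm{d}x) - \int f(x)\,\mathbb{P}_{X}(\mathrm{d}x)\Big|\Big],$$

where $\Lambda_1$ is the class of $1$-Lipschitz functions and $\mathcal{M}$ is a $\sigma$-algebra representing the past. A process is called $\tau$-mixing when the lag-$k$ coefficients $\tau(k)$, formed from past $\sigma$-algebras and observations $k$ steps ahead, decay to zero as $k \to \infty$; exponential decay of $\tau(k)$ is the regime the abstract's empirical findings point to.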