Interpreting Reinforcement Learning Agents with Susceptibilities
Chris Elliott, Einar Urdshals, David Quarel, Daniel Murfet
TLDR
This paper adapts neural network susceptibilities to deep reinforcement learning to interpret agent behavior and reveal internal model development.
Key contributions
- Generalizes neural network "susceptibilities" for interpretability to deep reinforcement learning.
- Applies susceptibilities to study agent regret in a gridworld, revealing complex stagewise development.
- Shows susceptibilities uncover internal model features in parameter space, beyond just policy changes.
- Validates findings using activation-steering and proposes extension to RLHF post-training.
Why it matters
Interpreting complex RL agents is vital for trust and for improving them. This work introduces a method that reveals internal learning dynamics in parameter space, offering insights beyond observing policy changes alone, and could support the development of more robust and explainable RL systems.
Original Abstract
Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate the utility of susceptibilities in a simple gridworld model that nevertheless exhibits non-trivial stagewise development. We argue that susceptibilities reveal internal features of the development of the model in parameter space that one cannot detect purely by studying the development of the learned policy. We validate these results with activation-steering, and discuss the framework's extension to RLHF post-training.
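The abstract defines a susceptibility as the response of a posterior expectation value of an observable to a perturbation of the loss. A minimal numerical sketch of this idea on a toy one-dimensional Gibbs posterior (the quadratic loss, observable, and perturbation direction below are illustrative choices, not taken from the paper): the finite-difference derivative of the perturbed expectation should match the standard identity that the susceptibility equals minus the inverse temperature times the covariance of the observable with the perturbation.

```python
import numpy as np

# Toy susceptibility: how does E[O] under the Gibbs posterior
# p_eps(w) ∝ exp(-beta * (L(w) + eps * g(w))) respond to eps?
# L, O, g below are illustrative, not from the paper.

beta = 1.0
w = np.linspace(-10, 10, 200001)   # parameter grid
L = 0.5 * w**2                     # base loss (posterior is standard normal)
g = np.cos(w)                      # perturbation direction on the loss
O = w**2                           # observable

def posterior_expectation(eps):
    """E[O] under the Gibbs posterior for the perturbed loss L + eps*g."""
    logp = -beta * (L + eps * g)
    p = np.exp(logp - logp.max())
    p /= p.sum()                   # discrete probability weights on the grid
    return (O * p).sum()

# Finite-difference susceptibility: d/d(eps) E[O] at eps = 0
eps = 1e-4
chi_fd = (posterior_expectation(eps) - posterior_expectation(-eps)) / (2 * eps)

# Equivalent covariance form: chi = -beta * Cov(O, g) under the
# unperturbed posterior (differentiate the Gibbs expectation in eps).
logp0 = -beta * L
p0 = np.exp(logp0 - logp0.max())
p0 /= p0.sum()
EO = (O * p0).sum()
Eg = (g * p0).sum()
chi_cov = -beta * ((O - EO) * (g - Eg) * p0).sum()

print(chi_fd, chi_cov)  # the two estimates agree closely
```

For this particular toy posterior the answer is also known in closed form (the susceptibility comes out to e^{-1/2}), which makes the sketch easy to sanity-check; the paper's contribution is generalizing this construction from a loss to the regret in deep RL, where no such closed form exists.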