Interpreting Reinforcement Learning Agents with Susceptibilities
Chris Elliott, Einar Urdshals, David Quarel, Daniel Murfet
TLDR
This paper adapts neural network susceptibilities to deep reinforcement learning to interpret agent behavior and reveal internal model development.
Key contributions
- Generalizes neural network "susceptibilities" for interpretability to deep reinforcement learning.
- Applies susceptibilities to study agent regret in a gridworld, revealing complex stagewise development.
- Shows susceptibilities uncover internal model features in parameter space, beyond just policy changes.
- Validates findings using activation-steering and proposes extension to RLHF post-training.
Why it matters
Interpreting complex RL agents is vital for trust and for improving them. This work introduces a method that reveals internal learning dynamics in parameter space, offering insights beyond observing policy changes alone, and could support the development of more robust and explainable RL systems.
Original Abstract
Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate the utility of susceptibilities in a simple gridworld model that nevertheless exhibits non-trivial stagewise development. We argue that susceptibilities reveal internal features of the development of the model in parameter space that one cannot detect purely by studying the development of the learned policy. We validate these results with activation-steering, and discuss the framework's extension to RLHF post-training.
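The abstract defines a susceptibility as the response of a posterior expectation value of an observable to a perturbation of the loss. A minimal numerical sketch of this idea on a toy one-dimensional Gibbs posterior (the quadratic loss, observable, and perturbation direction below are illustrative choices, not taken from the paper): the finite-difference derivative of the perturbed expectation should match the standard identity that the susceptibility equals minus the inverse temperature times the covariance of the observable with the perturbation.

```python
import numpy as np

# Toy susceptibility: how does E[O] under the Gibbs posterior
# p_eps(w) ∝ exp(-beta * (L(w) + eps * g(w))) respond to eps?
# L, O, g below are illustrative, not from the paper.

beta = 1.0
w = np.linspace(-10, 10, 200001)   # parameter grid
L = 0.5 * w**2                     # base loss (posterior is standard normal)
g = np.cos(w)                      # perturbation direction on the loss
O = w**2                           # observable

def posterior_expectation(eps):
    """E[O] under the Gibbs posterior for the perturbed loss L + eps*g."""
    logp = -beta * (L + eps * g)
    p = np.exp(logp - logp.max())
    p /= p.sum()                   # discrete probability weights on the grid
    return (O * p).sum()

# Finite-difference susceptibility: d/d(eps) E[O] at eps = 0
eps = 1e-4
chi_fd = (posterior_expectation(eps) - posterior_expectation(-eps)) / (2 * eps)

# Equivalent covariance form: chi = -beta * Cov(O, g) under the
# unperturbed posterior (differentiate the Gibbs expectation in eps).
logp0 = -beta * L
p0 = np.exp(logp0 - logp0.max())
p0 /= p0.sum()
EO = (O * p0).sum()
Eg = (g * p0).sum()
chi_cov = -beta * ((O - EO) * (g - Eg) * p0).sum()

print(chi_fd, chi_cov)  # the two estimates agree closely
```

For this particular toy posterior the answer is also known in closed form (the susceptibility comes out to e^{-1/2}), which makes the sketch easy to sanity-check; the paper's contribution is generalizing this construction from a loss to the regret in deep RL, where no such closed form exists.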