
Attributions All the Way Down? The Metagame of Interpretability

arXiv:2605.06295

Hubert Baniecki, Przemyslaw Biecek, Fabian Fumagalli

cs.LG, cs.AI, stat.ML

TLDR

The 'metagame' framework quantifies second-order interaction effects in model explanations by treating the attribution method itself as a cooperative game and computing its Shapley value.

Key contributions

  • Introduces "metagame" framework to quantify second-order interaction effects in model explanations.
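  • Defines the "meta-attribution" as the directional influence of feature j on the attribution of feature i, computed as a Shapley value of the attribution method treated as a cooperative game.
  • Proves that attributions hierarchically decompose into meta-attributions, and establishes meta-attributions as directional extensions of existing interaction indices.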
  • Applies framework to LMs, vision-language encoders, and multimodal diffusion transformers.

Why it matters

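First-order attributions assign each feature a score but do not show how features influence one another's scores. The "metagame" framework quantifies that directional influence, and the authors demonstrate it on token interactions in instruction-tuned language models, cross-modal similarity in vision-language encoders, and text-to-image concepts in multimodal diffusion transformers.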

Original Abstract

We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $φ(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
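
To make the construction in the abstract concrete, here is a minimal Python sketch of one way to treat an attribution as a cooperative game. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the attribution is evaluated on a feature subset, so the sketch assumes the restriction v_S(T) = v(T ∩ S), uses exact Shapley values as the first-order attribution, and works on a hypothetical three-feature toy game. The names shapley, meta_attribution, and toy_model are invented for this example.

```python
from itertools import combinations
from math import factorial


def shapley(value, players):
    """Exact Shapley values of `value` (a function on frozensets) over `players`."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_model(s):
    # Hypothetical game: features 0 and 1 interact, feature 2 is purely additive.
    return 2.0 * (0 in s and 1 in s) + 1.0 * (2 in s)


def meta_attribution(value, players, i):
    """Influence of each feature j on the attribution of feature i.

    ASSUMPTION: the metagame maps a coalition S to the attribution of i
    in the restricted game T -> value(T & S). The paper may define the
    restriction differently.
    """
    def metagame(s):
        restricted = lambda t: value(t & s)
        return shapley(restricted, players)[i]

    # Second-order step: Shapley values of the attribution itself.
    return shapley(metagame, players)


players = [0, 1, 2]
phi = shapley(toy_model, players)
meta = meta_attribution(toy_model, players, 0)

print("attribution of feature 0:", phi[0])
print("meta-attributions j -> 0:", meta)
print("sum of meta-attributions:", sum(meta.values()))
```

Under this assumed restriction, the meta-attributions for feature 0 sum to its first-order attribution, because the Shapley value is efficient and the metagame of the empty coalition is zero. This is a toy analogue of the hierarchical decomposition stated in the abstract. Exact enumeration is exponential in the number of features, so the sketch is only practical for very small games.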
