Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

May 8, 2026 · arXiv:2605.08012

cs.LG · cs.AI · cs.CL

TLDR

Mechanistic interpretability papers often make causal claims without disclosing the identification assumptions those claims require. This position paper proposes a disclosure norm to close that gap.

Key contributions

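Mechanistic interpretability (MI) papers frequently make causal claims (e.g., about circuits or mediators) without stating explicit identification assumptions.
A purposive audit of 10 papers across four methodological strands found no dedicated identification-assumptions section.
Validation metrics (faithfulness, completeness, monosemanticity, alignment, ablation effects) are often reported as causal support without the assumptions that would make them identifying.
A two-human-coder audit on n=30 reproduces the direction of the main finding, though exact counts are sensitive to coding rules.
Proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail.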

Why it matters

The paper targets a methodological gap in mechanistic interpretability: causal claims are often backed by validation metrics rather than by stated identification assumptions. The proposed disclosure norm aims to make those assumptions explicit, so readers can judge how far a causal conclusion actually extends. If adopted, it could make the field's conclusions more robust and easier to trust.

Original Abstract

Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.
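The five-part disclosure norm in the abstract can be read as a checklist. The sketch below is one possible encoding, not something taken from the paper: the class `CausalDisclosure`, its field names, the function `missing_items`, and the example values are all hypothetical, and the rule that a stressed assumption must appear among the enumerated ones is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CausalDisclosure:
    """Hypothetical record of the five disclosure items listed in the abstract."""
    is_causal: Optional[bool] = None        # 1. is the claim causal? (None = not stated)
    identification_strategy: str = ""       # 2. named identification strategy
    assumptions: list = field(default_factory=list)   # 3. enumerated assumptions
    stressed: list = field(default_factory=list)      # 4. assumptions that were stress-tested
    if_violated: dict = field(default_factory=dict)   # 5. assumption -> how conclusions shift


def missing_items(d: CausalDisclosure) -> list:
    """Return the disclosure items that are still missing (empty list = complete)."""
    if d.is_causal is None:
        return ["causal status"]
    if not d.is_causal:
        # Assumption of this sketch: non-causal claims need no further disclosure.
        return []
    missing = []
    if not d.identification_strategy.strip():
        missing.append("identification strategy")
    if not d.assumptions:
        missing.append("assumptions")
    # At least one enumerated assumption must have been stressed.
    if not any(a in d.assumptions for a in d.stressed):
        missing.append("stressed assumption")
    if not d.if_violated:
        missing.append("consequences if assumptions fail")
    return missing


# Illustrative values only; they do not describe any audited paper.
example = CausalDisclosure(
    is_causal=True,
    identification_strategy="activation patching (illustrative)",
    assumptions=["patched activations stay on-distribution"],
)
print(missing_items(example))
```

In this sketch a validation metric alone never completes the record: faithfulness or ablation scores have nowhere to go unless an identification strategy and its assumptions are also filled in, which mirrors the abstract's closing line that validation is not identification.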
