Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs
TLDR
Transformer LMs replicate human gradient judgments on extraction from syntactic islands, and causal interventions suggest "and" is represented differently in extractable versus non-extractable contexts.
Key contributions
- Transformers replicate human gradient judgments on syntactic island extraction.
- Causal interventions reveal filler-gap mechanisms are selectively blocked in coordination islands.
- Proposes "and" is represented differently in extractable vs. non-extractable syntactic constructions.
- Illustrates how mechanistic interpretability can generate novel hypotheses about linguistic representation.
Why it matters
This paper bridges mechanistic interpretability and linguistics by using causal interventions in Transformers to explain complex syntactic phenomena. It offers a novel hypothesis about how "and" is represented, potentially advancing our understanding of both LMs and human language processing.
Original Abstract
We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., "I know what he hates art and loves" vs. "I know what he looked down and saw"). We show that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction "and" is represented differently in extractable versus non-extractable constructions, corresponding to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.
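The abstract's core method is a causal (interchange-style) intervention: identify a low-dimensional subspace of model activations, then swap only that subspace component between a source context and a base context. A minimal numpy sketch of that operation follows; the basis, dimensions, and random states are illustrative placeholders, not the paper's actual model or code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2  # hidden size and subspace rank (toy values)

# Orthonormal basis for a hypothetical "causally identified" subspace
B, _ = np.linalg.qr(rng.normal(size=(d, k)))
P = B @ B.T  # orthogonal projector onto the subspace

h_src = rng.normal(size=d)   # hidden state from a source context (e.g., extractable)
h_base = rng.normal(size=d)  # hidden state from a base context (e.g., island)

# Interchange intervention: keep the base state outside the subspace,
# but replace its subspace component with the source's.
h_patched = h_base - P @ h_base + P @ h_src

# Inside the subspace the patched state matches the source;
# outside it, it still matches the base.
assert np.allclose(P @ h_patched, P @ h_src)
assert np.allclose((np.eye(d) - P) @ h_patched, (np.eye(d) - P) @ h_base)
```

Projecting corpus activations onto the same basis (`h @ B`) is the analogous step the abstract describes for deriving the hypothesis about "and": activations are compared within the causally identified subspace rather than in the full hidden space.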