Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe
Gaofei Shen, Martijn Bentum, Tom Lentz, Afra Alishahi, Grzegorz Chrupała
TLDR
This paper introduces an Encoding Probe to reconstruct language model representations, offering a new way to understand feature contributions beyond decodability.
Key contributions
- Presents an 'Encoding Probe' to reconstruct language model representations from interpretable features.
- Addresses two limitations of decoding probes: feature contributions cannot be directly compared, and feature correlations can confound results.
- Evaluated on text and speech transformers using acoustic, phonetic, syntactic, lexical, and speaker-identity features.
- Finds that speaker-related effects vary strongly with training objective and dataset, while syntactic and lexical features contribute independently to reconstruction.
Why it matters
Understanding how language models represent information is crucial for their development. This new Encoding Probe offers a more nuanced way to analyze feature contributions, moving beyond simple decodability. It provides insights into how different features are encoded, which can guide future model design and interpretability efforts.
Original Abstract
Probing is widely used to study which features can be decoded from language model representations. However, the common decoding probe approach has two limitations that we aim to solve with our new encoding probe approach: contributions of different features to model representations cannot be directly compared, and feature correlations can affect probing results. We present an Encoding Probe that reverses this direction and reconstructs internal representations of models using interpretable features. We evaluate this method on text and speech transformer models, using feature sets spanning acoustics, phonetics, syntax, lexicon, and speaker identity. Our results suggest that speaker-related effects vary strongly across different training objectives and datasets, while syntactic and lexical features contribute independently to reconstruction. These results show that the Encoding Probe provides a complementary perspective on interpreting model representations beyond decodability.
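The core idea of an encoding probe can be illustrated with a minimal sketch: fit a regularized linear map from interpretable feature vectors to model representations and score how well the representations are reconstructed. The data here is synthetic, and the ridge-regression formulation and hyperparameters are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

# Synthetic stand-ins: X holds interpretable features (e.g. phonetic or
# syntactic annotations), Y holds hidden-layer representations. In the
# paper these would come from a real model and annotated corpus.
rng = np.random.default_rng(0)
n, d_feat, d_repr = 500, 10, 64

X = rng.normal(size=(n, d_feat))                      # interpretable features
W_true = rng.normal(size=(d_feat, d_repr))            # hidden linear map
Y = X @ W_true + 0.1 * rng.normal(size=(n, d_repr))   # noisy representations

# Encoding direction: predict representations FROM features.
# Closed-form ridge regression; lam is an assumed hyperparameter.
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(d_feat), X.T @ Y)
Y_hat = X @ W

# Reconstruction quality: overall R^2 across representation dimensions.
ss_res = ((Y - Y_hat) ** 2).sum()
ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot
print(f"reconstruction R^2 = {r2:.3f}")
```

Comparing the drop in R^2 when a feature set is held out is one way such a probe could quantify that set's independent contribution, in contrast to a decoding probe, which only asks whether the feature is recoverable.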