ArXiv TLDR

Do Sparse Autoencoders Capture Concept Manifolds?

arXiv:2604.28119

Usha Bhalla, Thomas Fel, Can Rager, Sheridan Feucht, Tal Haklay + 7 more

cs.LG, cs.AI

TLDR

This paper develops a theory of how Sparse Autoencoders (SAEs) capture concept manifolds and finds that, in practice, SAEs mix global-subspace and local-tiling solutions in a fragmented regime the authors call "dilution."

Key contributions

  • Developed a theoretical framework for what it means for a Sparse Autoencoder (SAE) to capture a concept manifold, and when and how existing architectures do so.
  • Identified two fundamentally different ways SAEs can capture a manifold: globally, via a compact group of atoms whose linear span contains the entire manifold, or locally, via features that each tile a restricted region of its geometry (see the toy sketch after this list).
  • Found empirically that SAEs recover continuous structures suboptimally, mixing the global and local solutions in a fragmented "dilution" regime.
  • Argued that future interpretability methods should treat geometric objects, not just isolated directions, as the basic units of analysis.
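To make the global-versus-local distinction concrete, here is a minimal toy sketch (not the paper's code): it trains a small ReLU SAE with an L1 penalty on points sampled from a circle and measures the fraction of the manifold on which each atom fires. Atoms active on most of the circle behave like a global subspace; atoms active only on narrow arcs tile the manifold locally. All hyperparameters (atom count, penalty weight, activation threshold) are illustrative assumptions.

```python
import math
import torch

torch.manual_seed(0)
theta = torch.rand(4096) * 2 * math.pi
X = torch.stack([theta.cos(), theta.sin()], dim=1)  # points on a circle in R^2

n_atoms, l1_weight = 8, 3e-3  # assumed settings, not from the paper
W_enc = torch.nn.Parameter(0.1 * torch.randn(2, n_atoms))
b_enc = torch.nn.Parameter(torch.zeros(n_atoms))
W_dec = torch.nn.Parameter(0.1 * torch.randn(n_atoms, 2))
opt = torch.optim.Adam([W_enc, b_enc, W_dec], lr=1e-2)

for _ in range(3000):
    z = torch.relu(X @ W_enc + b_enc)  # sparse, nonnegative codes
    loss = ((z @ W_dec - X) ** 2).mean() + l1_weight * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    z = torch.relu(X @ W_enc + b_enc)
    # Fraction of the manifold on which each atom fires: values near 1.0
    # suggest a "global" atom; small fractions suggest "local" arc tiling.
    coverage = (z > 1e-4).float().mean(dim=0)
    print("per-atom manifold coverage:", coverage.numpy().round(2))
```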

Why it matters

This paper challenges the assumption that SAEs capture concepts as independent linear directions, showing that concepts can instead live on geometric manifolds that SAEs capture only imperfectly. It explains why manifold structure is often hidden in individual SAE features and motivates new interpretability methods that operate on coherent groups of atoms and geometric objects rather than isolated directions.
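As a hedged illustration of what searching for coherent groups of atoms might look like (the paper's actual method may differ), the sketch below clusters atoms by the overlap of their activation patterns: atoms tiling adjacent arcs of the same manifold co-activate and chain together under single-linkage clustering, while an unrelated atom stays separate. The Jaccard-overlap criterion, the simulated activations, and all thresholds are our assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 2000)  # sample points along a circle

def arc_active(center, width):
    """Binary activations of an atom that fires on one arc of the circle."""
    d = np.abs((theta - center + np.pi) % (2 * np.pi) - np.pi)  # circular dist
    return d < width / 2

K = 8  # atoms tiling the circle; each arc overlaps its neighbours
acts = [arc_active(2 * np.pi * k / K, 2 * (2 * np.pi / K)) for k in range(K)]
acts.append(rng.random(theta.size) < 0.1)  # one unrelated, off-manifold atom
A = np.stack(acts).astype(float)

# Jaccard distance between atoms' active sets: overlapping arcs are close.
inter = A @ A.T
union = A.sum(1)[:, None] + A.sum(1)[None, :] - inter
dist = 1.0 - inter / union

# Single linkage chains overlapping arcs into one group covering the manifold.
labels = AgglomerativeClustering(
    n_clusters=None, metric="precomputed",
    linkage="single", distance_threshold=0.8,
).fit_predict(dist)
print(labels)  # the eight circle atoms share a label; the random atom does not
```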

Original Abstract

Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime we call dilution. This explains why manifold structure is rarely visible at the level of individual concepts and motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions. More broadly, our results suggest that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.
