ArXiv TLDR

SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

2605.06610

Jakub Stępień, Marcin Mazur, Jacek Tabor, Przemysław Spurek

cs.LG cs.CV

TLDR

SoftSAE introduces a dynamic Top-K selection mechanism for sparse autoencoders, adapting feature sparsity to input complexity for better interpretability.

Key contributions

  • Addresses fixed sparsity in SAEs, which limits their ability to adapt to varying data complexity.
  • Proposes SoftSAE, using a differentiable Soft Top-K operator to learn input-dependent sparsity.
  • Dynamically adjusts the number of active features based on each input's complexity.
  • Improves feature representations and ties explanation length to the amount of information in each input.
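The paper's exact Soft Top-K operator is not spelled out in this summary, but the idea of a differentiable relaxation of hard Top-K selection can be sketched with a sigmoid gate around a data-dependent threshold. Everything below (the function name `soft_topk_mask`, the midpoint threshold, the `temperature` parameter) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def soft_topk_mask(z, k, temperature=0.1):
    """Soft relaxation of hard Top-K selection (illustrative sketch).

    Instead of keeping exactly the k largest entries of z and zeroing
    the rest, place a sigmoid gate at a threshold midway between the
    k-th and (k+1)-th largest activations. As temperature -> 0 this
    approaches hard Top-K; at finite temperature the gate is smooth,
    so gradients can flow through the selection.
    """
    z = np.asarray(z, dtype=float)
    sorted_z = np.sort(z)[::-1]                      # descending order
    tau = 0.5 * (sorted_z[k - 1] + sorted_z[k])      # midpoint threshold
    gate = 1.0 / (1.0 + np.exp(-(z - tau) / temperature))
    return z * gate                                  # softly gated activations

# Toy activation vector: three clearly large entries, three small ones.
acts = np.array([3.0, 0.1, 2.5, 0.05, 1.8, 0.2])
sparse = soft_topk_mask(acts, k=3)
```

With a small temperature the three largest activations pass through almost unchanged while the rest are suppressed toward zero, mimicking a hard Top-K but remaining differentiable.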

Why it matters

This paper introduces a practical improvement to Sparse Autoencoders, making them more adaptive and accurate. By letting the model adjust its sparsity to each input, SoftSAE yields more precise and interpretable feature representations, an advance that matters for understanding complex models such as LLMs and ViTs.

Original Abstract

Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real-world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive representations. Using a constant K can therefore introduce noise in simple cases or miss important structure in more complex ones. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top-K selection mechanism. Our method uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features, but also selects the right number of features for each concept. The source code is available at: https://anonymous.4open.science/r/SoftSAE-8F71/.
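The abstract says the model learns an input-dependent sparsity level k. One way a learned, continuous k could drive selection is by interpolating between the thresholds that hard Top-floor(k) and Top-ceil(k) would use; the sketch below shows only that interpolation idea, and the function name and parameterisation are assumptions rather than the paper's method:

```python
import numpy as np

def continuous_topk_threshold(z, k):
    """Selection threshold for a continuous sparsity level k (sketch).

    For non-integer k, linearly interpolate between the midpoint
    thresholds that a hard Top-floor(k) and a hard Top-ceil(k) would
    use, so the effective number of active features varies smoothly
    as a learned k changes.
    """
    sorted_z = np.sort(np.asarray(z, dtype=float))[::-1]  # descending
    lo = int(np.floor(k))
    frac = k - lo
    t_lo = 0.5 * (sorted_z[lo - 1] + sorted_z[lo])        # Top-lo threshold
    t_hi = 0.5 * (sorted_z[lo] + sorted_z[lo + 1])        # Top-(lo+1) threshold
    return (1 - frac) * t_lo + frac * t_hi

acts = np.array([3.0, 2.5, 1.8, 0.2, 0.1, 0.05])
t3 = continuous_topk_threshold(acts, 3.0)   # exactly the hard Top-3 threshold
t35 = continuous_topk_threshold(acts, 3.5)  # lower threshold: more features pass
```

At integer k this reduces to the usual hard Top-K cutoff; between integers the threshold moves smoothly, which is what makes an input-dependent, gradient-learned k feasible.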
