ArXiv TLDR

Transformers with Selective Access to Early Representations

2605.03953

Skye Gunasekaran, Téa Wright, Rui-Jie Zhu, Jason Eshraghian

cs.LG, cs.CL

TLDR

SATFormer introduces a context-dependent gating mechanism for Transformers to selectively access early representations, improving performance and efficiency.

Key contributions

  • Proposes Selective Access Transformer (SATFormer) for controlled, context-dependent reuse of early-layer features (a minimal code sketch follows this list).
  • Achieves consistent improvements in validation loss and zero-shot accuracy across models from 130M to 1.3B parameters.
  • Outperforms static value residuals by ~1.5 average points on retrieval-intensive benchmarks while keeping throughput and memory close to the baseline Transformer.
  • Gate analysis reveals sparse, depth-dependent, and head-specific access patterns, confirming selective reuse.
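
To make the gating idea concrete, here is a minimal PyTorch sketch of a context-dependent gate over the first-layer values, assuming a per-token, per-head sigmoid gate predicted from the hidden state. The class name `GatedValueResidual`, the `gate_proj` parameterization, and the tensor layout are illustrative assumptions, not the paper's published implementation.

```python
import torch
import torch.nn as nn

class GatedValueResidual(nn.Module):
    """Context-dependent value gate (illustrative sketch).

    A static value residual mixes the cached first-layer values V_1 into
    the current layer's values with a single learned coefficient alpha,
    uniformly across tokens and heads:

        V = (1 - alpha) * V_l + alpha * V_1

    The sketch below instead predicts a per-token, per-head gate g from
    the hidden state, so access to V_1 can vary with context.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Hypothetical gate predictor: one scalar in (0, 1) per head, per token.
        self.gate_proj = nn.Linear(d_model, n_heads)

    def forward(self, x, v_l, v_1):
        # x:   (batch, seq, d_model)           hidden state entering this layer
        # v_l: (batch, heads, seq, head_dim)   this layer's value projection
        # v_1: (batch, heads, seq, head_dim)   cached first-layer values
        g = torch.sigmoid(self.gate_proj(x))   # (batch, seq, heads)
        g = g.transpose(1, 2).unsqueeze(-1)    # (batch, heads, seq, 1)
        # Convex mix: g -> 0 ignores V_1, g -> 1 retrieves it.
        return (1.0 - g) * v_l + g * v_1
```

One design note: because the gate reuses values already computed in layer 1, the only extra state is the cached V_1 tensor and a small linear head per layer, which is consistent with the paper's claim of near-baseline throughput and memory.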

Why it matters

This paper addresses the challenge of retaining low-level features in deep Transformers without high computational cost. By treating early representation reuse as a retrieval problem, SATFormer offers a more efficient and effective solution than prior methods. This approach could lead to more robust and accurate large language models.

Original Abstract

Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low-level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest among these methods add static value residuals: learned mixing coefficients that expose the first-layer value projection V_1 uniformly across tokens and heads. More expressive dense or dynamic alternatives recover finer-grained access, but at higher memory cost and lower throughput. The usefulness of V_1 is unlikely to be constant across tokens, heads, and contexts; different positions plausibly require different amounts of access to early lexical or semantic information. We therefore treat early-representation reuse as a retrieval problem rather than a connectivity problem, and introduce Selective Access Transformer (SATFormer), which preserves the first-layer value pathway while controlling access with a context-dependent gate. Across models from 130M to 1.3B parameters, SATFormer consistently improves validation loss and zero-shot accuracy over the static value-residual and Transformer baselines. Its strongest gains appear on retrieval-intensive benchmarks, where it improves over static value residuals by approximately 1.5 average points, while maintaining throughput and memory usage close to the baseline Transformer. Gate analyses suggest sparse, depth-dependent, head-specific, and category-sensitive access patterns, supporting the interpretation that SATFormer learns selective reuse of early representations rather than uniform residual copying. Our code is available at https://github.com/SkyeGunasekaran/SATFormer.
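
The gate analyses mentioned in the abstract report sparse, depth-dependent, head-specific patterns. As a rough illustration of what such an analysis might compute, here is a short sketch, assuming gate activations have been logged into a `(layers, heads, tokens)` tensor; the `summarize_gates` function, the threshold, and the logging format are illustrative assumptions, not the paper's analysis code.

```python
import torch

def summarize_gates(gates: torch.Tensor, threshold: float = 0.1) -> dict:
    """Summarize logged gate activations.

    Assumes `gates` has shape (layers, heads, tokens) with values in
    [0, 1]. Computes the kinds of statistics a gate analysis would use:
    sparsity (fraction of near-zero gates), per-depth means, and
    per-head means, which would expose depth-dependent and
    head-specific access patterns.
    """
    sparsity = (gates < threshold).float().mean().item()
    per_depth = gates.mean(dim=(1, 2))  # mean gate value at each layer
    per_head = gates.mean(dim=(0, 2))   # mean gate value for each head index
    return {"sparsity": sparsity, "per_depth": per_depth, "per_head": per_head}
```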
