Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers
Gabriel Smithline, Chris Mascioli
TLDR
FFN architecture, and sparsity in particular, reshapes how Transformers compute, shifting work from the FFN into the attention mechanism.
Key contributions
- Sparse MoE routing shifts computation from the FFN to attention, most visibly on digit addition with carry.
- Architectural sparsity, not learned routing, primarily drives this redistribution: frozen random routing nearly matches learned routing (see the sketch after this list).
- GLU-style multiplicative gating rotates task-relevant Fourier structure out of the per-neuron basis into distributed subspaces, making neuron-level interpretability less informative.
- Demonstrates that local FFN design choices have non-local consequences for Transformer computation.
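To make the comparison concrete, here is a minimal PyTorch-style sketch (not the authors' code; class, flag, and parameter names are illustrative) of the three FFN variants contrasted above: a dense FFN, a GLU-gated FFN, and a top-1 MoE FFN whose frozen_random_router flag approximates the frozen random-routing control. The per-token top-1 routing is the architectural sparsity the contributions refer to.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Standard two-layer FFN block."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w_out(F.relu(self.w_in(x)))

class GLUFFN(nn.Module):
    """GLU-style FFN: a multiplicative gate over the hidden activations."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_value = nn.Linear(d_model, d_ff)
        self.w_gate = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

class Top1MoEFFN(nn.Module):
    """Top-1 MoE FFN. frozen_random_router=True mimics a frozen
    random-routing control: router weights fixed at init, never trained."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int,
                 frozen_random_router: bool = False):
        super().__init__()
        self.experts = nn.ModuleList(
            [DenseFFN(d_model, d_ff) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts, bias=False)
        if frozen_random_router:
            self.router.weight.requires_grad_(False)

    def forward(self, x):
        # x: (batch, seq, d_model); each token goes to exactly one expert.
        probs = self.router(x).softmax(dim=-1)   # (B, T, E)
        expert_idx = probs.argmax(dim=-1)        # (B, T)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e               # (B, T) boolean
            if mask.any():
                # Scale by the gate probability so a learned router gets gradient.
                gate = probs[mask][:, e].unsqueeze(-1)
                out[mask] = gate * expert(x[mask])
        return out
```

Swapping one of these modules into an otherwise identical one-layer Transformer is the kind of local change whose non-local effects the paper studies.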
Why it matters
This paper shows how FFN architecture alters Transformer computation beyond the FFN block itself: local design choices have global effects. Understanding this interaction matters for designing more efficient and interpretable Transformer models, and it points to architectural sparsity as a lever for controlling where computation happens.
Original Abstract
Architectural choices inside the Transformer feedforward network (FFN) block do not merely affect the block itself; they reshape the computations learned by the rest of the model. We study this effect in one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting. Comparing dense FFNs, gated linear units (GLUs), mixture-of-experts (MoE), and MoE-GLUs, we find that sparse MoE routing can shift computation from FFN to attention, with the strongest ablation-visible effect on carry-based addition. We decompose this redistribution into reduced per-token FFN capacity and sparse partitioning across experts. Critically, frozen random routing nearly matches learned routing, suggesting that redistribution is driven largely by architectural sparsity rather than router-learned specialization. As a secondary finding, GLU-style multiplicative gating rotates task-relevant Fourier structure out of the per-neuron basis and into distributed subspaces, making neuron-level interpretability less informative while preserving structured computation. We validate these conclusions with random-routing, narrow-FFN, and top-2 MoE controls, plus parameter-matching, activation-function, and width-scaling analyses. Together, these results show that local FFN design choices can have nonlocal consequences for Transformer computation.
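As one way to picture the "ablation-visible" comparison mentioned in the abstract, the sketch below zeroes a sublayer's output with a forward hook at evaluation time and reports accuracy. The model.ffn / model.attention attributes, the batch keys, and the helper name are assumptions for illustration, not the paper's actual evaluation code.

```python
import torch

@torch.no_grad()
def accuracy_with_ablation(model, batch, ablate="ffn"):
    """Run the model with one sublayer's output zeroed via a forward hook."""
    def zero_output(module, inputs, output):
        # Returning a tensor from a forward hook replaces the module's output,
        # so the residual stream sees a zero contribution from this sublayer.
        return torch.zeros_like(output)

    target = model.ffn if ablate == "ffn" else model.attention  # assumed attrs
    handle = target.register_forward_hook(zero_output)
    try:
        logits = model(batch["tokens"])          # assumed batch layout
        preds = logits.argmax(dim=-1)
        correct = (preds == batch["targets"]).float().mean().item()
    finally:
        handle.remove()
    return correct

# If attention absorbs the computation under sparse MoE routing, ablating the
# FFN should hurt the MoE model less than the dense model on carry addition:
# acc_dense = accuracy_with_ablation(dense_model, batch, ablate="ffn")
# acc_moe   = accuracy_with_ablation(moe_model, batch, ablate="ffn")
```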