INCRT: An Incremental Transformer That Determines Its Own Architecture
TLDR
INCRT is an incremental Transformer that dynamically determines its architecture during training, reducing structural redundancy and parameter count.
Key contributions
- Dynamically determines Transformer architecture by adding/pruning attention heads during training.
- Uses an online-computable geometric quantity to guide architectural growth and pruning decisions (see the sketch after this list).
- Achieves minimal and sufficient configurations with theoretical guarantees (homeostatic convergence).
- Matches or exceeds BERT-base performance with 3-7x fewer parameters and no pre-training.
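To make the mechanism concrete, here is a minimal sketch of what such a grow/prune training loop could look like. The `uncaptured_energy` proxy, both thresholds, and the `train_step` / `head_energy` callbacks are hypothetical placeholders; the paper instead drives both decisions with its own online-computable geometric quantity.

```python
import numpy as np

# Illustrative thresholds -- placeholders, not the paper's actual criterion.
GROW_THRESHOLD = 0.10   # grow if this much directional energy is still uncaptured
PRUNE_THRESHOLD = 0.01  # prune a head whose captured energy falls below this

def uncaptured_energy(residual: np.ndarray) -> np.ndarray:
    """Proxy score: normalized singular values of the residual signal that the
    current heads fail to explain (hypothetical stand-in)."""
    s = np.linalg.svd(residual, compute_uv=False)
    return s / s.sum()

def train_incrementally(batches, train_step, head_energy, max_heads=16) -> int:
    """Grow from a single head and prune redundant ones until homeostasis.

    train_step(batch, num_heads) -> residual matrix not yet captured
    head_energy(num_heads)       -> per-head captured-energy fractions
    Both callbacks are hypothetical stand-ins for the paper's estimators.
    """
    num_heads = 1
    for batch in batches:
        residual = train_step(batch, num_heads)

        # Grow: the current configuration looks insufficient
        # (too much uncaptured directional energy remains).
        if uncaptured_energy(residual).max() > GROW_THRESHOLD and num_heads < max_heads:
            num_heads += 1

        # Prune: drop heads that have become redundant.
        redundant = int((head_energy(num_heads) < PRUNE_THRESHOLD).sum())
        num_heads = max(1, num_heads - redundant)
    return num_heads
```

The loop mirrors the paper's homeostatic idea: the head count stops changing once no direction carries enough uncaptured energy to justify growth and no existing head is redundant enough to prune.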
Why it matters
Existing Transformers are often over-parameterized because their architectures are fixed by trial and error before training. This paper offers a principled method for optimizing Transformer structure dynamically during training, yielding significantly more compact and efficient models with lower parameter counts and computational overhead.
Original Abstract
Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle to guide the choice. The result is systematic structural redundancy -- between half and four-fifths of all heads in a trained model can be removed without measurable loss -- because the architecture allocates capacity without reference to the actual requirements of the task. This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task's directional structure, requiring no separate validation phase and no hand-tuned schedule. Two theorems form the theoretical backbone. The first (homeostatic convergence) establishes that the system always reaches a finite stopping configuration that is simultaneously minimal (no redundant heads) and sufficient (no uncaptured directional energy above the threshold). The second (compressed-sensing analogy) provides a geometric upper bound on the number of heads that this configuration can contain, as a function of the spectral complexity of the task. Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis confirm both results: the predicted and observed head counts agree within 12% across all benchmarks, and the final architectures match or exceed BERT-base on distribution-specific tasks while using between three and seven times fewer parameters and no pre-training.
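The "compressed-sensing analogy" theorem bounds the final head count by the spectral complexity of the task. The snippet below is only a toy proxy for that idea, assuming a hypothetical task matrix whose leading singular directions stand in for the task's directional structure; the paper's actual bound is the one stated in its second theorem.

```python
import numpy as np

def head_bound_proxy(task_matrix: np.ndarray, energy_budget: float = 0.95) -> int:
    """Toy proxy for a spectral-complexity bound on head count: the number of
    leading singular directions needed to cover `energy_budget` of the task
    matrix's total energy. Purely illustrative, not the paper's actual bound."""
    s = np.linalg.svd(task_matrix, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, energy_budget)) + 1

# Example: a task whose directional structure is effectively rank 4.
rng = np.random.default_rng(0)
signal = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 64))
noise = 0.01 * rng.normal(size=(256, 64))
print(head_bound_proxy(signal + noise))  # prints a value close to the planted rank (4)
```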