ArXiv TLDR

Adaptive Head Budgeting for Efficient Multi-Head Attention

arXiv: 2604.22583

Bilal Faye, Abdoulaye Mbaye, Hanane Azzag, Mustapha Lebbah

cs.LG

TLDR

BudgetFormer learns, per input, how many attention heads to use and which ones, cutting inference FLOPs and memory while matching or surpassing standard full multi-head attention on text classification.

Key contributions

  • Introduces BudgetFormer, a Transformer with an adaptive multi-head attention mechanism.
  • Dynamically learns a head budget and selects the most informative attention heads per input (see the sketch after this list).
  • Employs an exploration-exploitation training strategy to optimize head configurations.
  • Reduces inference cost (FLOPs, memory) on text classification while achieving performance that can surpass full multi-head attention.
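
Below is a minimal PyTorch sketch of how a per-input head budget and relevance distribution could gate standard multi-head attention. It is an illustration of the idea, not the authors' code: the controller design, the soft top-k mask, and all names (`AdaptiveHeadAttention`, `controller`, `budget_frac`) are assumptions.

```python
# Hedged sketch (not the paper's implementation): per-input head budget plus
# relevance gating layered on top of standard multi-head attention.
import torch
import torch.nn as nn


class AdaptiveHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Small controller predicting a head budget and per-head relevance scores.
        self.controller = nn.Linear(d_model, n_heads + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, H, T, d_head) for per-head attention.
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))

        # A pooled summary of the input drives the budget and relevance distribution.
        summary = x.mean(dim=1)                           # (B, D)
        ctrl = self.controller(summary)                   # (B, H + 1)
        budget_frac = torch.sigmoid(ctrl[:, :1])          # fraction of heads to keep
        relevance = torch.softmax(ctrl[:, 1:], dim=-1)    # (B, H) relevance over heads

        # Keep the top-k heads per input, with k derived from the learned budget.
        k_heads = (budget_frac * self.n_heads).round().clamp(min=1).long()   # (B, 1)
        ranks = relevance.argsort(dim=-1, descending=True).argsort(dim=-1)   # head ranks
        hard_mask = (ranks < k_heads).float()                                # (B, H)
        # Straight-through-style mask: hard selection forward, gradient via relevance.
        mask = hard_mask + relevance - relevance.detach()

        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = (attn @ v) * mask[:, :, None, None]        # zero out unselected heads
        return self.out(heads.transpose(1, 2).reshape(B, T, D))
```

In this sketch the straight-through-style mask keeps head selection differentiable with respect to the relevance scores; the budget itself passes through a non-differentiable rounding step and would need its own relaxation in a full implementation.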

Why it matters

Transformers are powerful but can be wasteful: standard multi-head attention activates the same fixed set of heads for every input, regardless of task requirements or input complexity. This work offers a principled way to allocate heads dynamically, cutting inference cost while maintaining, and sometimes surpassing, the accuracy of full multi-head attention.

Original Abstract

Transformers have become the dominant architecture across a wide range of domains, largely due to the effectiveness of multi-head attention in capturing diverse representation subspaces. However, standard multi-head attention activates all heads uniformly for every input, regardless of task requirements or input complexity. In many scenarios, particularly for coarse-grained tasks such as text classification, the relevant information is often global and does not require the full diversity of attention heads. As a consequence, using a fixed number of heads can introduce unnecessary computational cost or lead to suboptimal performance when the allocation does not match the input. To address this limitation, we introduce BudgetFormer, a Transformer architecture equipped with an adaptive multi-head attention mechanism that dynamically allocates computational resources. Our approach learns, for each input, both a head budget corresponding to the number of attention heads required, and a relevance distribution that selects the most informative heads. We also propose a training strategy based on an exploration and exploitation trade-off, allowing the model to discover effective head configurations before converging to efficient usage patterns. Experiments on text classification tasks of varying complexity show that our method reduces inference cost in terms of FLOPs and memory, while also achieving performance that can surpass standard full multi-head attention. These results highlight the potential of adaptive head allocation as a principled approach to improving both efficiency and effectiveness in Transformer models.
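
The abstract describes an exploration-exploitation training strategy without spelling out its mechanics. The sketch below shows one plausible reading, in which head subsets are sampled uniformly at random early in training and the learned relevance ranking takes over as an exploration rate decays. The function and parameter names (`select_heads`, `explore_steps`) are hypothetical, not taken from the paper.

```python
# Hedged sketch of a possible exploration/exploitation schedule for head selection
# during training; the paper's actual strategy may differ.
import torch


def select_heads(relevance: torch.Tensor, k: int, step: int,
                 explore_steps: int = 10_000) -> torch.Tensor:
    """Return a (B, H) binary mask selecting k attention heads per input.

    Early in training, heads are picked at random (exploration) so many
    configurations are tried; the probability of doing so decays linearly,
    after which the top-k heads by learned relevance are used (exploitation).
    """
    B, H = relevance.shape
    epsilon = max(0.0, 1.0 - step / explore_steps)           # decaying exploration rate
    if torch.rand(()) < epsilon:
        scores = torch.rand(B, H, device=relevance.device)   # explore: random ranking
    else:
        scores = relevance                                    # exploit: learned relevance
    topk = scores.topk(k, dim=-1).indices
    mask = torch.zeros(B, H, device=relevance.device)
    return mask.scatter_(-1, topk, 1.0)
```

An epsilon-greedy schedule like this is only one option; a temperature-annealed softmax over heads would serve the same purpose of exploring configurations before converging to efficient usage patterns.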
