Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
TLDR
Hierarchical Kernel Transformer (HKT) introduces a multi-scale attention mechanism with trainable downsampling, achieving consistent performance gains at a modest computational cost (1.3125x standard attention at L = 3).
Key contributions
- HKT processes sequences at multiple resolution levels via trainable causal downsampling.
- Total computational cost is bounded by 4/3 that of standard attention, reaching 1.3125x at L = 3.
- Provides theoretical analysis on kernel properties, attention decomposition, and approximation error.
- Achieves consistent gains over retrained standard attention baselines: +4.77pp on synthetic ListOps, +1.44pp on sequential CIFAR-10, and +7.47pp on character-level IMDB.
Why it matters
This paper introduces a novel attention mechanism that improves performance on various tasks without a large increase in computational overhead. Its strong theoretical foundation and empirical gains make it a promising advancement for sequence processing models.
Original Abstract
The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling, combining level-specific score matrices through learned convex weights. The total computational cost is bounded by 4/3 times that of standard attention, reaching 1.3125x for L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs across scales, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components with an explicit non-Gaussian correction and a geometric decay bound in L (Theorem 4.3, Proposition 4.4). (iv) HKT strictly subsumes single-head standard attention and causal convolution (Proposition 3.4). Experiments over 3 random seeds show consistent gains over retrained standard attention baselines: +4.77pp on synthetic ListOps (55.10±0.29% vs 50.33±0.12%, T = 512), +1.44pp on sequential CIFAR-10 (35.45±0.09% vs 34.01±0.19%, T = 1,024), and +7.47pp on IMDB character-level sentiment (70.19±0.57% vs 62.72±0.40%, T = 1,024), all at 1.31x overhead.
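Result (ii) rests on the standard unique decomposition of any square matrix into symmetric and antisymmetric parts. A minimal NumPy illustration (the score matrix here is random, standing in for one of HKT's level-specific score matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 4))        # stand-in for one level's score matrix

S_sym = 0.5 * (S + S.T)                # symmetric part: reciprocal attention
S_anti = 0.5 * (S - S.T)               # antisymmetric part: directional attention

assert np.allclose(S, S_sym + S_anti)  # decomposition reconstructs S exactly
assert np.allclose(S_sym, S_sym.T)     # symmetric component
assert np.allclose(S_anti, -S_anti.T)  # antisymmetric component
```

HKT supplies one such (symmetric, antisymmetric) pair per resolution level, giving L independent pairs across scales.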