Linearizing Vision Transformer with Test-Time Training
Yining Li, Dongchen Han, Zeyu Liu, Hanyi Wang, Yulin Wang, et al.
TLDR
A new method linearizes Vision Transformers using Test-Time Training, enabling efficient conversion of pretrained models for faster inference with comparable quality.
Key contributions
- Identifies Test-Time Training (TTT) as a linear-complexity architecture structurally aligned with Softmax attention.
- Enables direct inheritance of pretrained Softmax attention weights for efficient conversion.
- Introduces key instance normalization and a locality enhancement module for representational alignment.
- Linearizes Stable Diffusion 3.5 as SD3.5-T^5, matching the fine-tuned Softmax model's quality while accelerating inference by 1.32× at 1K and 1.47× at 2K resolution.
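One of the representational gaps the paper names is key shift-invariance: Softmax attention is unchanged when a constant vector is added to every key (it only shifts all logits for a query by the same amount), while linear-complexity mechanisms are not. The sketch below illustrates the general idea of an instance normalization applied over the keys; the function name and the exact normalization details are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def key_instance_norm(K, eps=1e-6):
    """Normalize keys across the token (instance) dimension.

    Illustrative sketch: subtracting the per-instance mean removes any
    constant shift of the keys, mimicking the key shift-invariance that
    Softmax attention has for free. `eps` guards against zero variance.
    K has shape (num_tokens, key_dim).
    """
    mu = K.mean(axis=0, keepdims=True)      # per-dimension mean over tokens
    sigma = K.std(axis=0, keepdims=True)    # per-dimension std over tokens
    return (K - mu) / (sigma + eps)
```

After this normalization, adding the same offset vector to every key leaves the normalized keys (and hence the downstream attention) unchanged.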
Why it matters
Training linear-attention models from scratch is prohibitively expensive. This work converts pretrained Softmax Transformers into linear-complexity models with only brief fine-tuning, accelerating inference in large models such as Stable Diffusion 3.5 without sacrificing quality, making them far more practical to deploy.
Original Abstract
While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T$^5$ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4$\times$H20 GPUs, SD3.5-T$^5$ achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32$\times$ and 1.47$\times$ at 1K and 2K resolutions.
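The abstract's claim that TTT is structurally aligned with Softmax attention rests on its formulation as a learnable state updated at test time: instead of a growing KV cache, the layer keeps a fixed-size weight matrix that takes one gradient step per token, so cost grows linearly with sequence length. The sketch below shows the standard linear TTT update (one gradient step on a per-token reconstruction loss); the learning rate and the single-matrix state are simplifying assumptions, not the paper's exact two-layer formulation.

```python
import numpy as np

def ttt_linear(Q, K, V, lr=0.1):
    """Linear-complexity TTT-style attention sketch.

    The 'hidden state' is a weight matrix W, updated online for each
    token by one gradient step on the reconstruction loss
    0.5 * ||W k_t - v_t||^2, then read out with the query q_t.
    Q, K have shape (T, d_k); V has shape (T, d_v).
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    W = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        k, v, q = K[t], V[t], Q[t]
        grad = (W @ k - v)[:, None] * k[None, :]  # grad of 0.5||Wk - v||^2 wrt W
        W = W - lr * grad                         # one test-time gradient step
        out[t] = W @ q                            # readout with the query
    return out
```

Because W has fixed size, each token costs O(d_k * d_v) regardless of T, in contrast to the O(T) per-token cost of Softmax attention over a full context.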