Linearizing Vision Transformer with Test-Time Training
Yining Li, Dongchen Han, Zeyu Liu, Hanyi Wang, Yulin Wang, et al.
TLDR
A new method linearizes Vision Transformers using Test-Time Training, enabling efficient conversion of pretrained models for faster inference with comparable quality.
Key contributions
- Identifies Test-Time Training (TTT) as a linear-complexity architecture structurally aligned with Softmax attention.
- Enables direct inheritance of pretrained Softmax attention weights for efficient conversion.
- Introduces key instance normalization and a locality enhancement module for representational alignment.
- Linearizes Stable Diffusion 3.5 as SD3.5-T^5, matching the fine-tuned Softmax model's quality while accelerating inference by 1.32× at 1K and 1.47× at 2K resolution.
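One of the representational gaps the paper names is key shift-invariance: Softmax attention is unchanged when a constant vector is added to every key (it only shifts all logits for a query by the same amount), while linear-complexity mechanisms are not. The sketch below illustrates the general idea of an instance normalization applied over the keys; the function name and the exact normalization details are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def key_instance_norm(K, eps=1e-6):
    """Normalize keys across the token (instance) dimension.

    Illustrative sketch: subtracting the per-instance mean removes any
    constant shift of the keys, mimicking the key shift-invariance that
    Softmax attention has for free. `eps` guards against zero variance.
    K has shape (num_tokens, key_dim).
    """
    mu = K.mean(axis=0, keepdims=True)      # per-dimension mean over tokens
    sigma = K.std(axis=0, keepdims=True)    # per-dimension std over tokens
    return (K - mu) / (sigma + eps)
```

After this normalization, adding the same offset vector to every key leaves the normalized keys (and hence the downstream attention) unchanged.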
Why it matters
Training linear-attention models from scratch is prohibitively expensive. This work converts pretrained Softmax Transformers into linear-complexity models with only brief fine-tuning, accelerating inference in large models such as Stable Diffusion 3.5 without sacrificing quality, making them far more practical to deploy.
Original Abstract
While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T$^5$ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4$\times$H20 GPUs, SD3.5-T$^5$ achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32$\times$ and 1.47$\times$ at 1K and 2K resolutions.
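The abstract's claim that TTT is structurally aligned with Softmax attention rests on its formulation as a learnable state updated at test time: instead of a growing KV cache, the layer keeps a fixed-size weight matrix that takes one gradient step per token, so cost grows linearly with sequence length. The sketch below shows the standard linear TTT update (one gradient step on a per-token reconstruction loss); the learning rate and the single-matrix state are simplifying assumptions, not the paper's exact two-layer formulation.

```python
import numpy as np

def ttt_linear(Q, K, V, lr=0.1):
    """Linear-complexity TTT-style attention sketch.

    The 'hidden state' is a weight matrix W, updated online for each
    token by one gradient step on the reconstruction loss
    0.5 * ||W k_t - v_t||^2, then read out with the query q_t.
    Q, K have shape (T, d_k); V has shape (T, d_v).
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    W = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        k, v, q = K[t], V[t], Q[t]
        grad = (W @ k - v)[:, None] * k[None, :]  # grad of 0.5||Wk - v||^2 wrt W
        W = W - lr * grad                         # one test-time gradient step
        out[t] = W @ q                            # readout with the query
    return out
```

Because W has fixed size, each token costs O(d_k * d_v) regardless of T, in contrast to the O(T) per-token cost of Softmax attention over a full context.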