Taming Outlier Tokens in Diffusion Transformers
Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu, Liang-Chieh Chen, Zhe Gan + 1 more
TLDR
This paper introduces Dual-Stage Registers (DSR) to effectively tame outlier tokens in Diffusion Transformers, improving image generation quality.
Key contributions
- Discovers outlier tokens in both ViT encoders and DiT denoisers, which corrupt local patch semantics in RAE-DiT models.
- Reveals that simple masking of high-norm tokens fails to improve DiT performance, indicating a deeper issue.
- Introduces Dual-Stage Registers (DSR), a register-based method to control outliers in both encoder and denoiser.
- DSR consistently reduces outlier artifacts and improves image generation quality across diverse benchmarks.
Why it matters
Outlier tokens degrade image quality in Diffusion Transformers, a critical issue for generative models. This work provides Dual-Stage Registers, an effective method for controlling these tokens in both the encoder and the denoiser, and highlights outlier management as a key ingredient for building more robust, higher-quality DiT models.
Original Abstract
We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
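The register idea the abstract builds on (originally proposed for ViTs) can be sketched minimally: extra learnable tokens are appended to the patch sequence so attention can dump global, high-norm signal into them instead of into patch tokens, and they are discarded from the output. The sketch below is illustrative only, with identity projections and hypothetical shapes; it is not the paper's DSR implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_registers(patches, registers):
    """Single-head self-attention over [patches; registers].

    Registers participate in attention, so they can absorb the global
    signal that would otherwise spike a patch token's norm, but their
    outputs are dropped, leaving the patch count unchanged.
    """
    x = np.concatenate([patches, registers], axis=0)  # (N + R, d)
    d = x.shape[-1]
    # Identity q/k/v projections for brevity; real blocks learn W_q, W_k, W_v.
    scores = x @ x.T / np.sqrt(d)
    out = softmax(scores) @ x
    return out[: patches.shape[0]]  # discard register outputs

# Hypothetical shapes: 16 patch tokens and 4 register tokens of dim 8.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
registers = rng.normal(size=(4, 8))
out = attend_with_registers(patches, registers)
```

In a real model the registers would be learned parameters (or, as the abstract's "recursive test-time registers" suggests, constructed at inference); here they are random placeholders to show only the append-attend-discard pattern.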