Taming Outlier Tokens in Diffusion Transformers
Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu, Liang-Chieh Chen, Zhe Gan + 1 more
TLDR
This paper introduces Dual-Stage Registers (DSR) to effectively tame outlier tokens in Diffusion Transformers, improving image generation quality.
Key contributions
- Discovers outlier tokens in both ViT encoders and DiT denoisers, which corrupt local patch semantics in RAE-DiT models.
- Reveals that simple masking of high-norm tokens fails to improve DiT performance, indicating a deeper issue.
- Introduces Dual-Stage Registers (DSR), a register-based method to control outliers in both encoder and denoiser.
- DSR consistently reduces outlier artifacts and improves image generation quality across diverse benchmarks.
Why it matters
Outlier tokens degrade image quality in Diffusion Transformers, a critical issue for generative models. This work provides Dual-Stage Registers, an effective method for controlling these tokens in both the encoder and the denoiser, and highlights outlier management as a key ingredient for building more robust, higher-quality DiT models.
Original Abstract
We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
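The register idea the abstract builds on (originally proposed for ViTs) can be sketched minimally: extra learnable tokens are appended to the patch sequence so attention can dump global, high-norm signal into them instead of into patch tokens, and they are discarded from the output. The sketch below is illustrative only, with identity projections and hypothetical shapes; it is not the paper's DSR implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_registers(patches, registers):
    """Single-head self-attention over [patches; registers].

    Registers participate in attention, so they can absorb the global
    signal that would otherwise spike a patch token's norm, but their
    outputs are dropped, leaving the patch count unchanged.
    """
    x = np.concatenate([patches, registers], axis=0)  # (N + R, d)
    d = x.shape[-1]
    # Identity q/k/v projections for brevity; real blocks learn W_q, W_k, W_v.
    scores = x @ x.T / np.sqrt(d)
    out = softmax(scores) @ x
    return out[: patches.shape[0]]  # discard register outputs

# Hypothetical shapes: 16 patch tokens and 4 register tokens of dim 8.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
registers = rng.normal(size=(4, 8))
out = attend_with_registers(patches, registers)
```

In a real model the registers would be learned parameters (or, as the abstract's "recursive test-time registers" suggests, constructed at inference); here they are random placeholders to show only the append-attend-discard pattern.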