ArXiv TLDR

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

arXiv:2103.14030

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei + 3 more

cs.CV · cs.LG

TLDR

The Swin Transformer introduces a hierarchical vision Transformer using shifted windows to efficiently model images at multiple scales, achieving state-of-the-art results across various vision tasks.

Key contributions

  • Proposes a hierarchical Transformer architecture with shifted, non-overlapping local windows that keep self-attention efficient while still connecting neighboring windows (see the sketch after this list).
  • Achieves linear computational complexity relative to image size, making it scalable to high-resolution images.
  • Demonstrates superior performance on image classification, object detection, and semantic segmentation benchmarks, surpassing previous state-of-the-art models.
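
Below is a minimal sketch of the two windowing operations the bullets describe, assuming PyTorch. `window_partition` and `shift_for_cross_window` are illustrative names rather than the official repository's API, and the real model additionally masks attention between wrapped regions after the cyclic shift.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (num_windows*B, ws*ws, C) token
    groups, so self-attention runs independently inside each local window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def shift_for_cross_window(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Cyclically shift the feature map by half a window so the next attention
    layer mixes tokens across the previous layer's window boundaries."""
    shift = window_size // 2
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# Usage: successive Swin blocks alternate regular and shifted windows.
x = torch.randn(1, 56, 56, 96)                 # stage-1 feature map, 7x7 windows
regular = window_partition(x, 7)               # attention within each window
shifted = window_partition(shift_for_cross_window(x, 7), 7)
print(regular.shape, shifted.shape)            # (64, 49, 96) for both
```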

Why it matters

This paper addresses critical challenges in adapting Transformers from language to vision by introducing a novel shifted window mechanism and hierarchical design that balance efficiency and modeling power. By enabling scalable and effective self-attention for high-resolution images, the Swin Transformer establishes a versatile backbone that significantly advances the performance of Transformer-based models in diverse computer vision tasks, paving the way for broader adoption in the field.

Original Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
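
The abstract's linear-complexity claim comes from replacing global self-attention with window attention. For an h×w patch map with channel dimension C and window size M, the paper gives the comparison (its Eqs. 1–2):

```latex
% Global MSA is quadratic in the number of patches hw;
% windowed MSA is linear in hw for a fixed window size M.
\begin{align}
  \Omega(\text{MSA})   &= 4hwC^2 + 2(hw)^2C \\
  \Omega(\text{W-MSA}) &= 4hwC^2 + 2M^2hwC
\end{align}
```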
