A Guide to Fully Understanding Swin Transformer (ICCV'21 Best Paper)

Yusuke Uchida (@yu4u)
Mobility Technologies Co., Ltd.
Swin Transformer:
Hierarchical Vision Transformer
Using Shifted Windows
This material is an expanded version of slides
presented at a DeNA + MoT reading group

2
▪ Official implementation
▪ https://github.com/microsoft/Swin-Transformer/blob/main/models/swin_transformer.py
▪ timm version (essentially a port of the official one)
▪ https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/swin_transformer.py
▪ Use this one if you need a backbone
▪ https://github.com/SwinTransformer/Swin-Transformer-Object-Detection/blob/master/mmdet/models/backbones/swin_transformer.py
The official implementation is a useful reference, so read it alongside these slides

3
▪ Way too many "equal contribution" authors
Starting with something trivial

4
User testimonials
(These are personal opinions)

5
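▪ The Transformer has become the de facto backbone in NLP
▪ Can the Transformer serve as a general-purpose backbone for vision,
the way CNNs do? → Swin Transformer!
▪ Proposes extensions that address the domain differences between NLP and vision
▪ The scale problem
▪ In NLP the word token is the smallest unit of processing, whereas
for images multi-scale processing matters in some tasks (e.g., detection)
→ Generate hierarchical feature maps via patch merging
▪ The resolution problem
▪ Some tasks require processing at a resolution finer than the patch level
→ Shifted windows reduce computation and make high-resolution feature maps feasible
Overview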

6
▪ Can output C2-C5 feature maps, so it is compatible with CNNs
▪ The channel count also doubles from stage to stage, as in CNNs
Architecture
C2 C3 C4 C5
In theory, at least
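To make the C2-C5 correspondence concrete, the following minimal sketch lists the feature-map shape of each stage. It assumes the default hyperparameters quoted later in these slides (img_size=224, patch_size=4, embed_dim=96); the function name `swin_stage_shapes` is hypothetical and not part of any library.

```python
def swin_stage_shapes(img_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (name, stride, height, width, channels) for each stage."""
    shapes = []
    for i in range(num_stages):
        stride = patch_size * 2 ** i      # 4, 8, 16, 32
        side = img_size // stride         # spatial size halves at each stage
        channels = embed_dim * 2 ** i     # channel count doubles at each stage
        shapes.append((f"C{i + 2}", stride, side, side, channels))
    return shapes


for row in swin_stage_shapes():
    print(row)
```

Under these assumptions the strides and channel doubling follow the same pattern as a typical CNN feature pyramid.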

7
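The timm version is hard to use as a backbone for anything other than classification
timm / Swin-Transformer-Object-Detection
timm: the features are already avg-pooled at this stage
Swin-Transformer-Object-Detection: the features of each level are
properly returned as a list of BCHW-shaped tensors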

8
The timm version is hard to use as a backbone for anything other than classification
https://github.com/rwightman/pytorch-image-models/issues/614

9
▪ Main building blocks
Architecture
Patch Partition & Linear Embedding
Patch Merging
Swin Transformer Block

10
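▪ Patch Partition
▪ As in ViT, the image is split into fixed-size patches
▪ The default patch size is 4x4
→ For an RGB image this yields tokens of 4x4x3 = 48 dimensions
▪ Linear Embedding
▪ Projects each patch (token) to C dimensions
▪ In practice both steps are done at once by a conv2d with
kernel_size = stride = patch size
▪ By default this is followed by Layer Normalization
Patch Partition & Linear Embedding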
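The partition step can be illustrated without any deep learning library. The sketch below is a pure-Python toy, not the actual implementation (which fuses partition and projection into one strided convolution): it only shows how a 4x4 RGB patch becomes one 48-dimensional token. The helper name `patch_partition` is hypothetical.

```python
def patch_partition(image, patch_size=4):
    """Split an H x W x C image (nested lists) into flattened patch tokens.

    Returns a grid of (H / patch_size) x (W / patch_size) tokens, each a list
    of patch_size * patch_size * C values.
    """
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    tokens = []
    for top in range(0, height, patch_size):
        row = []
        for left in range(0, width, patch_size):
            token = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    token.extend(image[y][x])  # append the C channel values
            row.append(token)
        tokens.append(row)
    return tokens


# Toy 8x8 RGB image whose pixel values encode their own position.
image = [[[y, x, 0] for x in range(8)] for y in range(8)]
tokens = patch_partition(image)
print(len(tokens), len(tokens[0]), len(tokens[0][0]))
```

A linear layer would then map each 48-dimensional token to C dimensions; that projection is omitted here.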

11
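▪ Merges each neighboring 2x2 group of C-dimensional patches
▪ concat → 4C dimensions
▪ Layer Normalization
▪ Linear → 2C dimensions
Patch Merging
Because the tensor is kept as (B, HW, C), pixel_unshuffle
may be awkward to use here?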
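The concatenation step of Patch Merging can be sketched in plain Python as follows. This is a shape-level toy: LayerNorm and the 4C → 2C linear layer are left out, the helper name `merge_patches` is hypothetical, and the concatenation order (top-left, bottom-left, top-right, bottom-right) is an assumption modeled on the official code.

```python
def merge_patches(grid):
    """Concatenate each 2x2 neighborhood of C-dim tokens into one 4C-dim token."""
    height, width = len(grid), len(grid[0])
    assert height % 2 == 0 and width % 2 == 0
    merged = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            # Assumed order: top-left, bottom-left, top-right, bottom-right.
            row.append(grid[y][x] + grid[y + 1][x]
                       + grid[y][x + 1] + grid[y + 1][x + 1])
        merged.append(row)
    return merged


# 4x4 grid of 2-dim tokens; each token stores its own (row, col).
grid = [[[y, x] for x in range(4)] for y in range(4)]
merged = merge_patches(grid)
print(len(merged), len(merged[0]), len(merged[0][0]))
```

The spatial size halves in each direction while the token dimension grows from C to 4C, which the subsequent linear layer reduces to 2C.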

12
▪ Almost identical to a Transformer encoder layer
▪ The difference is the Shifted Window-based Multi-head Self-attention
Swin Transformer Block
Two Successive Swin Transformer Blocks
This is the key point

13
Two Successive
Pre-norm
Post-norm

14
▪ Learning Deep Transformer Models for Machine Translation, ACL’19.
▪ On Layer Normalization in the Transformer Architecture, ICML’20.
Post-norm vs. Pre-norm
Reminiscent of post-activation vs. pre-activation
in ResNet, isn't it?
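For reference, the two layouts are commonly written as follows, where F denotes a sublayer (self-attention or MLP) and LN denotes Layer Normalization. This notation is an assumption for illustration and is not taken from the slides.

```latex
\begin{aligned}
\text{Post-norm:}\quad & y = \mathrm{LN}\bigl(x + F(x)\bigr) \\
\text{Pre-norm:}\quad  & y = x + F\bigl(\mathrm{LN}(x)\bigr)
\end{aligned}
```

In the pre-norm form the residual path carries x through unchanged, which is the analogy with pre-activation ResNet.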

15
Two Successive

16
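▪ The feature map is partitioned into windows of size MxM, and
self-attention is computed only within each window
▪ For a feature map with hxw patches, the cost drops from
$(hw) \times (hw)$ to $M^2 \times M^2 \times (h/M) \times (w/M) = M^2 hw$
▪ M = 7 (for an input size of 224)
▪ At C2 (stride = 4, 56x56 feature map), this gives 8x8 windows
Window-based Multi-head Self-attention (W-MSA)
Annotations: $M^2 \times M^2$ is the cost per window, $(h/M) \times (w/M)$ is the number of windows,
and $(hw) \times (hw)$ is the square of the number of patches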
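The reduction can be checked numerically for the C2 example above (h = w = 56, M = 7). The sketch below counts query-key pairs only, ignoring the channel dimension and the projections; `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(h, w, window=None):
    """Count query-key pairs, the quadratic term of the self-attention cost."""
    if window is None:
        return (h * w) ** 2                       # global: (hw) x (hw)
    assert h % window == 0 and w % window == 0
    num_windows = (h // window) * (w // window)   # (h/M) x (w/M)
    return (window ** 2) ** 2 * num_windows       # M^2 x M^2 per window


h = w = 56
M = 7
global_pairs = attention_pairs(h, w)
window_pairs = attention_pairs(h, w, M)
print(global_pairs, window_pairs, global_pairs // window_pairs)
```

With these numbers the windowed variant needs 64 times fewer pairs, which equals the number of windows (8x8).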

17
▪ W-MSA with the windows shifted by (⌊M/2⌋, ⌊M/2⌋)
▪ Alternating it with the regular window-based attention
creates connections between adjacent windows
Shifted Window-based Multi-head Self-attention (SW-MSA)
Example with h=w=8, M=4

18
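▪ The example below would produce 9 windows, but by shifting the feature map
attention is computed over the same 2x2 windows as in the unshifted case
▪ In practice, a window that mixes several original windows
uses a mask so that attention across those windows becomes 0
Efficient implementation of SW-MSA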
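The masking idea can be sketched in pure Python for the h=w=8, M=4 example. Each position of the shifted feature map is labeled with the shifted window it belongs to, and pairs with different labels receive a large negative bias so that softmax drives their attention to roughly zero. All function names are hypothetical, and the value -100 is an assumption modeled on the official code.

```python
def region_ids(size, window, shift):
    """Label each position of the shifted map by its shifted window (0..8)."""
    def band(i):
        if i < size - window:
            return 0
        if i < size - shift:
            return 1
        return 2
    return [[band(y) * 3 + band(x) for x in range(size)] for y in range(size)]


def window_partition(grid, window):
    """Cut a square 2-D grid into non-overlapping window x window tiles (flattened)."""
    size = len(grid)
    tiles = []
    for top in range(0, size, window):
        for left in range(0, size, window):
            tiles.append([grid[y][x]
                          for y in range(top, top + window)
                          for x in range(left, left + window)])
    return tiles


def attention_masks(size=8, window=4, shift=2, blocked=-100.0):
    """One (M^2 x M^2) additive mask per window: 0 within a region, `blocked` across."""
    masks = []
    for ids in window_partition(region_ids(size, window, shift), window):
        masks.append([[0.0 if a == b else blocked for b in ids] for a in ids])
    return masks


ids = region_ids(8, 4, 2)
masks = attention_masks()
print(len({v for row in ids for v in row}), len(masks))
```

The 9 distinct labels correspond to the 9 shifted windows, yet only 4 attention windows are computed; the top-left window contains a single region and therefore needs no masking.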

19
Implementation
shift
reverse shift
the (S)W-MSA itself
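The three labeled parts (shift, the attention itself, reverse shift) can be sketched on a plain 2-D grid. This is a toy under stated assumptions: the roll direction mirrors what `torch.roll` does with negative shifts, `window_attention` is a stand-in callable rather than a real attention layer, and both function names are hypothetical.

```python
def cyclic_shift(grid, shift):
    """Roll a 2-D grid by `shift` toward the top-left; a negative shift rolls back."""
    rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rows]


def shifted_window_step(grid, window_attention, shift):
    shifted = cyclic_shift(grid, shift)       # shift
    attended = window_attention(shifted)      # the (S)W-MSA itself (stand-in)
    return cyclic_shift(attended, -shift)     # reverse shift


# 8x8 grid whose cells store their own coordinates, shifted by M/2 = 2.
grid = [[(y, x) for x in range(8)] for y in range(8)]
shifted = cyclic_shift(grid, 2)
restored = shifted_window_step(grid, lambda g: g, shift=2)
print(shifted[0][0], shifted[7][7], restored == grid)
```

With an identity stand-in for attention, the reverse shift restores the original layout, which is why the block's output stays aligned with its input.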

20
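▪ Self-attention on its own is merely an encoder of a set
▪ Positional encoding is what tells it that the input is sequential data
▪ Swin uses a Relative Position Bias
▪ Making the bias relative expresses translation invariance
Relative Position Bias
Adjusts the attention strength according to the relative
positions within a window (learnable)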

21
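▪ Relative offsets lie in the range [−M + 1, M − 1] vertically and horizontally, giving $(2M-1)^2$ patterns
▪ The mapping between these biases and their indices is stored in advance and looked up when needed
Implementation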
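The index lookup described above can be sketched in pure Python. For every (query, key) pair inside a window, the 2-D offset is shifted into a non-negative range and flattened into a single index into a table with (2M-1)^2 entries. The function name and the zero-filled table are hypothetical; in a real model the table is learned and there is one per attention head.

```python
def relative_position_index(window=7):
    """Map every (query, key) pair in a window to an entry of the bias table."""
    coords = [(y, x) for y in range(window) for x in range(window)]
    span = 2 * window - 1                      # offsets lie in [-M+1, M-1]
    index = []
    for qy, qx in coords:
        row = []
        for ky, kx in coords:
            dy = qy - ky + window - 1          # shift into [0, 2M-2]
            dx = qx - kx + window - 1
            row.append(dy * span + dx)         # flatten the 2-D offset to 1-D
        index.append(row)
    return index


M = 7
index = relative_position_index(M)
table_size = (2 * M - 1) ** 2
bias_table = [0.0] * table_size                # placeholder for learned values
bias = [[bias_table[k] for k in row] for row in index]   # M^2 x M^2 bias
print(len(index), len(index[0]), table_size)
```

Pairs with the same relative offset share one table entry, so the M^2 x M^2 bias matrix is built from only (2M-1)^2 parameters.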

22
▪ On Position Embeddings in BERT, ICLR’21
▪ https://openreview.net/forum?id=onxoVA9FxMw
▪ https://twitter.com/akivajp/status/1442241252204814336
▪ Rethinking and Improving Relative Position Encoding for Vision Transformer, ICCV’21. thanks to @sasaki_ts
▪ CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, arXiv’21. thanks to @Ocha_Cocoa
Positional Encoding (an aside)

23
img_size (int | tuple(int)): Input image size. Default: 224
patch_size (int | tuple(int)): Patch size. Default: 4
in_chans (int): Number of input image channels. Default: 3
num_classes (int): Number of classes for classification head. Default: 1000
embed_dim (int): Patch embedding dimension. Default: 96
depths (tuple(int)): Depth of each Swin Transformer layer. Default: [2, 2, 6, 2]
num_heads (tuple(int)): Number of attention heads in different layers. Default: [3, 6, 12, 24]
window_size (int): Window size. Default: 7
mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4
qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True
qk_scale (float): Override default qk scale of head_dim ** -0.5 if set. Default: None
drop_rate (float): Dropout rate. Default: 0
attn_drop_rate (float): Attention dropout rate. Default: 0
drop_path_rate (float): Stochastic depth rate. Default: 0.1
norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.
ape (bool): If True, add absolute position embedding to the patch embedding. Default: False
patch_norm (bool): If True, add normalization after patch embedding. Default: True
use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False
Parameters and so on
Stochastic depth is used heavily
The number of heads grows along with the channel dimension
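The relation between channel width and head count can be checked with a few lines of Python, assuming the defaults listed above (embed_dim=96, num_heads=[3, 6, 12, 24]) and channel doubling at every stage. The helper name `head_dims` is hypothetical.

```python
def head_dims(embed_dim=96, num_heads=(3, 6, 12, 24)):
    """Per-head dimension at each stage when channels double per stage."""
    dims = []
    for stage, heads in enumerate(num_heads):
        channels = embed_dim * 2 ** stage
        assert channels % heads == 0
        dims.append(channels // heads)
    return dims


print(head_dims())
```

Under these defaults the per-head dimension stays at 32 in every stage, because the head count doubles together with the channels.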

24
▪ Stochastic depth drop probability for classification training
T: 0.2, S: 0.3, B: 0.5
▪ 0.2 for all models in detection and segmentation
Model Configuration

25
▪ Applied to both the MSA and the MLP (FF)
Stochastic Depth
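A minimal sketch of how stochastic depth might be wired in, using scalars instead of tensors. Two points are assumptions that the slides do not state: the drop probability ramps linearly from 0 to drop_path_rate over all blocks, and a kept branch is rescaled by 1 / keep_prob. Both function names are hypothetical.

```python
import random


def drop_path_schedule(depths=(2, 2, 6, 2), drop_path_rate=0.1):
    """Assumed rule: drop probability rises linearly from 0 to drop_path_rate."""
    total = sum(depths)
    rates = [drop_path_rate * i / max(total - 1, 1) for i in range(total)]
    # Split the flat schedule back into one list per stage.
    schedule, start = [], 0
    for depth in depths:
        schedule.append(rates[start:start + depth])
        start += depth
    return schedule


def residual_with_drop_path(x, branch, drop_prob, training=True, rng=random):
    """y = x + drop_path(branch(x)) for one sample (scalar toy version)."""
    out = branch(x)
    if training and drop_prob > 0.0:
        keep_prob = 1.0 - drop_prob
        # Drop the whole branch with probability drop_prob; rescale otherwise.
        out = out / keep_prob if rng.random() < keep_prob else 0.0
    return x + out


schedule = drop_path_schedule()
print([len(stage) for stage in schedule], schedule[0][0], schedule[-1][-1])
```

In a Swin block the same wrapper would be applied twice, once around the MSA branch and once around the MLP branch.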

26
▪ SOTA! Amazing!
Experimental results

27
▪ Shifted windows and relative position bias are both important
Ablation Study

28
▪ Shifted windows give comparable accuracy while being faster
Sliding window vs. shifted window

29
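▪ Splits the channels into two halves and applies self-attention within horizontal stripes for one half and vertical stripes for the other
Related method: CSWin Transformer
X. Dong, et al., "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows," in arXiv:2107.00652.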

30
Related method: Pyramid Vision Transformer
W. Wang, et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions," in Proc. of ICCV, 2021.
https://github.com/whai362/PVT

31

32
Multiple patches are merged, then flatten, linear, norm
Same as Patch Merging except that the order of linear and norm is reversed

33
The position embedding is
the ordinary learned kind

34
Spatial-Reduction Attention (SRA)
is the key point

35
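▪ Only K and V (the dictionary side) are reduced in spatial size
▪ Implemented as Conv2D -> LayerNorm
▪ Q is left as is, so
the output size does not change
▪ The reduction ratios are 8, 4, 2, 1, matched to the stride of each stage
Spatial-Reduction Attention (SRA)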
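The effect of the reduction ratios can be worked out per stage. The sketch below assumes a 224x224 input and stage strides of 4, 8, 16, 32 paired with the ratios 8, 4, 2, 1 above; `sra_token_counts` is a hypothetical helper that only counts tokens and does not perform attention.

```python
def sra_token_counts(img_size=224, strides=(4, 8, 16, 32), sr_ratios=(8, 4, 2, 1)):
    """Return (stride, query tokens, key/value tokens, query-key pairs) per stage."""
    rows = []
    for stride, ratio in zip(strides, sr_ratios):
        side = img_size // stride          # feature-map side at this stage
        q_tokens = side * side             # Q keeps the full resolution
        kv_side = side // ratio            # K and V are spatially reduced
        kv_tokens = kv_side * kv_side
        rows.append((stride, q_tokens, kv_tokens, q_tokens * kv_tokens))
    return rows


rows = sra_token_counts()
for row in rows:
    print(row)
```

Under these assumptions K and V always shrink to 7x7 = 49 tokens, so the attention cost grows linearly with the number of query tokens rather than quadratically.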

36
▪ There is a V2 as well!
▪ The year should be 2021, not 2020, so could someone please send them a PR
https://github.com/whai362/PVT

37
▪ Somehow managed to squeeze a huge model onto GPUs!
▪ It has switched to post-norm...
Related method: Swin Transformer V2
Ze Liu, et al., "Swin Transformer V2: Scaling Up Capacity and Resolution," in arXiv:2111.09883.

38
▪ The general structure of the Transformer itself matters more than the token mixer
▪ Token mixer = self-attention, MLP
▪ Proposes PoolFormer, whose token mixer is nothing more than pooling
Related method: MetaFormer
W. Yu, et al., "MetaFormer is Actually What You Need for Vision," in arXiv:2111.11418.
Figure labels: Conv3x3 (stride=2), Ave pool3x3
