TAVE Research
Seminar
2021.03.30
An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
Presenter: Changdae Oh
bnormal16@naver.com
ICLR 2021
2
Contents
1. Summing-up
2. Method
3. Experiments
4. Conclusion
3
Summing-up
Background
• Transformer: it has become the dominant standard architecture in NLP, but its use in vision remains limited.
• Even where attention is applied, it has been combined with convolutional networks, or used to replace only certain components while keeping the overall ConvNet structure intact.
• In computer vision, convolutional architectures are still dominant.
In the ImageNet top 10, all entries except two ViT models are based on EfficientNet or ResNet.
https://paperswithcode.com/sota/image-classification-on-imagenet
(as of 2021-03-29)
4
Summing-up
Contribution
• Performs image classification at a SOTA level using a pure Transformer architecture.
• Demonstrates that reliance on CNNs is not necessary in computer vision.
• Explores pre-training on large-scale datasets and draws insights from it.
• A sufficient amount of training data reduces the need for inductive bias.
Inductive bias for generalization
• Linear Regression: linear assumption
• Convolutional Networks: locality
• Recurrent Networks: sequentiality
5
Method
Overview (Vision Transformer)
• Split the original image into small patches.
• Feed the sequence of linear embeddings of the patches to a Transformer encoder for feature extraction.
• Add an MLP on top of the Transformer encoder as the classification head to perform the classification task.
Fine-tuning & Higher resolution
• Most experiments in the paper pre-train on a large dataset and then fine-tune on smaller downstream tasks.
• For fine-tuning, the MLP classification head used during pre-training is replaced with a single linear layer.
• Fine-tuning uses images at a higher resolution than pre-training.
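As a small worked example of why higher-resolution fine-tuning matters for the Transformer input: with the patch size held fixed, the number of patches (and so the sequence length) grows with the image area. The resolutions 224 and 384 and the helper name `num_patches` are illustrative assumptions, not values taken from the slides.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumed example resolutions; the patch size stays at 16 in both cases.
pretrain_tokens = num_patches(224, 16)  # 14 * 14 patches
finetune_tokens = num_patches(384, 16)  # 24 * 24 patches
print(pretrain_tokens, finetune_tokens)
```

Because learned position embeddings are tied to the pre-training sequence length, they have to be adapted when the patch grid grows; the paper does this by 2D interpolation of the pre-trained position embeddings.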
6
Method
ViT explain
1. Flatten
2. Linear projection (embedding)
3. Prepend [class] token
- similar to BERT
4. Add position embeddings
- use standard 1D p.e.
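The four steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `patchify` and `embed`, the tiny sizes (32×32 image, embedding width 8), and the random initial values are all hypothetical.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Step 1: split an (H, W, C) image into N flattened patches of length patch*patch*C."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)


def embed(image, patch, w_proj, b_proj, cls_token, pos_emb):
    """Steps 2-4: linear projection, prepend [class] token, add 1D position embeddings."""
    tokens = patchify(image, patch) @ w_proj + b_proj              # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # (N + 1, D)
    return tokens + pos_emb                                        # one embedding per position


# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
H = W = 32
C, P, D = 3, 16, 8
image = rng.normal(size=(H, W, C))
n = (H // P) * (W // P)

w_proj = rng.normal(size=(P * P * C, D)) * 0.02  # learnable in a real model
b_proj = np.zeros(D)
cls_token = np.zeros(D)                          # learnable in a real model
pos_emb = rng.normal(size=(n + 1, D)) * 0.02     # learnable, indexed by sequence position

tokens = embed(image, P, w_proj, b_proj, cls_token, pos_emb)
print(tokens.shape)
```

The position embedding is indexed only by sequence position, which is what "standard 1D" means here: the model is not told the 2D grid layout of the patches.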
7
Method
ViT explain
• LayerNorm is applied before every block
• Residual connection after every block
• MLP contains two layers
with a GELU non-linearity
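The three bullets above describe a pre-norm encoder block, which can be sketched in NumPy as below. This is a simplified sketch under assumptions: LayerNorm has no learnable scale or shift, all biases are omitted, GELU uses the common tanh approximation, and the sizes and parameter names (`wq`, `w1`, and so on) are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Simplification: no learnable scale/shift.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)


def self_attention(x, wq, wk, wv, wo, heads):
    n, d = x.shape
    dh = d // heads

    def split(t):
        return t.reshape(n, heads, dh).transpose(1, 0, 2)  # (heads, n, dh)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)        # merge heads
    return out @ wo


def encoder_block(x, p, heads):
    # LayerNorm before each sub-block, residual connection after it.
    x = x + self_attention(layer_norm(x), p["wq"], p["wk"], p["wv"], p["wo"], heads)
    # MLP: two linear layers with a GELU in between.
    x = x + gelu(layer_norm(x) @ p["w1"]) @ p["w2"]
    return x


# Hypothetical toy sizes.
rng = np.random.default_rng(0)
n_tokens, d, hidden, heads = 5, 8, 32, 2
p = {name: rng.normal(size=shape) * 0.1 for name, shape in [
    ("wq", (d, d)), ("wk", (d, d)), ("wv", (d, d)), ("wo", (d, d)),
    ("w1", (d, hidden)), ("w2", (hidden, d)),
]}
x = rng.normal(size=(n_tokens, d))
y = encoder_block(x, p, heads)
print(y.shape)
```

Because of the residual connections, a block whose output projections are zero passes its input through unchanged, which is a convenient sanity check for any implementation.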
8
Method
ViT explain
9
Method
10
Experiments
• Comparison to SOTA
• Pre-training Data Requirements
• Performance vs Compute trade-off
• Inspecting ViT
• Self-supervision
11
Experiments
0. Setup
1) Datasets
Pre-train
• ImageNet (1k classes, 1.3M images)
• ImageNet-21k (21k classes, 14M images)
• JFT (18k classes, 303M images)
Benchmark
• ImageNet / ImageNet ReaL
• CIFAR-10/100
• Oxford-IIIT Pets / Oxford Flowers
• VTAB
12
Experiments
0. Setup
2) Model Variants
• ViT / BiT (ResNet-based) / Hybrid
• ViT-L/16 means the “Large” variant with 16×16 input patch size.
• The Hybrid model feeds ViT with patches taken from a ResNet's intermediate feature maps rather than from the raw image.
13
Experiments
0. Setup
3) Training & Fine-tuning
Pre-training
• Adam, β1 = 0.9, β2 = 0.999
• Batch size = 4096, weight decay = 0.1
• Linear learning rate warmup and decay
Fine-tuning
• SGD with momentum
• Batch size = 512
4) Metrics
• Accuracy
• few-shot accuracy
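The "linear learning rate warmup and decay" item in the training setup can be sketched as a simple function of the step count. The function name and the values of `BASE_LR`, `WARMUP`, and `TOTAL` are hypothetical, chosen only to make the shape of the schedule visible; the slides do not give them.

```python
def linear_warmup_linear_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp 0 -> base_lr, then linear decay base_lr -> 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


# Hypothetical values, not taken from the slides.
BASE_LR, WARMUP, TOTAL = 1e-3, 10_000, 100_000
for s in (0, 5_000, 10_000, 55_000, 100_000):
    print(s, linear_warmup_linear_decay(s, BASE_LR, WARMUP, TOTAL))
```

The rate peaks exactly at the end of warmup and reaches zero at the final step; steps past `total_steps` are clamped to zero rather than going negative.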
14
Experiments
1. Comparison to SOTA
• ViT-H/14 and ViT-L/16 pre-trained on JFT outperform the previous SOTA.
• At the same time, they require fewer pre-training resources.
15
Experiments
2. Data Requirements
• The size of the pre-training dataset and the capacity of the model interact.
- Enough data + enough model capacity => better performance.
Cited from paper ‘BiT (Kolesnikov et al. 2020)’
16
Experiments
2. Data Requirements
• When the pre-training dataset is small, BiT clearly outperforms ViT.
• As the dataset grows, ViT gradually overtakes BiT.
17
Experiments
3. Performance vs Compute trade-off
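• ViT models dominate BiT on the performance/compute trade-off.
• ViT needs roughly 2 to 4 times less compute to reach the same performance.
• Hybrid models are slightly more compute-efficient than pure ViT at relatively small compute budgets.
(The gap vanishes for larger models, which is somewhat surprising, since one would expect convolutional local feature processing to help ViT at any size.)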
18
Experiments
4. Inspecting ViT
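• Even though no convolutional component is used at all, the model learns low-level representations that form a basis for elementary spatial features such as horizontal and vertical lines.
• The position embeddings learn to encode distance within the image.
=> Nearby patches, and patches in the same row or column, have similar embeddings.
Linear projection filters: principal components (PCA)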
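One way to inspect the position-embedding claim above is to compute the cosine similarity between one patch's position embedding and all others, then view it on the patch grid. The sketch below uses a synthetic stand-in for learned embeddings (row and column "bump" features), so it only illustrates the analysis, not the paper's result; `cosine_similarity_map` and the 7×7 grid are assumptions.

```python
import numpy as np


def cosine_similarity_map(pos_emb, grid, row, col):
    """Cosine similarity of patch (row, col) to every patch, returned as a (grid, grid) map."""
    normed = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    sims = normed @ normed[row * grid + col]
    return sims.reshape(grid, grid)


# Synthetic stand-in for learned embeddings: each patch gets a Gaussian bump
# feature for its row index concatenated with one for its column index.
# With a real model, drop the [class] token's row before this analysis.
grid = 7
idx = np.arange(grid)
bumps = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2)     # (grid, grid)
rows, cols = np.divmod(np.arange(grid * grid), grid)
pos_emb = np.concatenate([bumps[rows], bumps[cols]], axis=1)  # (grid*grid, 2*grid)

sim = cosine_similarity_map(pos_emb, grid, 3, 3)
print(np.round(sim, 2))
```

In this synthetic case the map is highest at the query patch, stays elevated along its row and column, and falls off toward the corners, which is the pattern the slide describes for the learned embeddings.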
19
Experiments
4. Inspecting ViT
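• In theory, self-attention lets the model integrate information across the whole image. How much of that capacity does the network actually use?
• Even in the lowest layers some heads attend globally, and the mean attention distance increases with depth.
• Mean attention distance is a measure analogous to receptive field size in CNNs.
Figure panels: hybrid / pure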
Model attends to image regions that are
semantically relevant for classification
https://arxiv.org/abs/2005.00928
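The mean attention distance discussed above can be sketched as the attention-weighted average pixel distance between a query patch and all key patches, averaged over queries. This is a minimal sketch assuming attention weights over patch tokens only (the [class] token removed) with shape `(heads, N, N)`; the function name and the toy 2×2 grid are hypothetical.

```python
import numpy as np


def mean_attention_distance(attn, grid, patch_size):
    """Per-head mean distance (in pixels) over which attention integrates information.

    attn: (heads, N, N) attention weights, each row summing to 1, with N = grid * grid.
    """
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([ys, xs], axis=1).astype(float) * patch_size
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # (N, N)
    # Weight distances by attention, sum over keys, average over queries.
    return (attn * dist[None]).sum(-1).mean(-1)


# Toy example: one purely local head and one uniformly global head.
grid, patch_size = 2, 16
n = grid * grid
local_head = np.eye(n)                   # each patch attends only to itself
global_head = np.full((n, n), 1.0 / n)   # each patch attends to all patches equally
attn = np.stack([local_head, global_head])

distances = mean_attention_distance(attn, grid, patch_size)
print(distances)
```

A head that attends only locally has a distance near zero, while a head that spreads attention over the whole image approaches the average pairwise patch distance, which is why the measure reads like a receptive field size.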
20
Experiments
5. Self-supervision
• Mimicking BERT's masked language modeling task, the authors experiment with masked patch prediction for self-supervision.
• It gives a meaningful improvement over training from scratch, but falls well short of supervised pre-training followed by transfer.
https://arxiv.org/pdf/2003.11562.pdf
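The input-corruption step of masked patch prediction can be sketched as replacing a random subset of patch embeddings with a mask embedding. The 50% mask ratio, the zero-valued `mask_embedding`, and the function name are assumptions for illustration; the slides do not say what the model predicts for the masked patches, so that part is left out.

```python
import numpy as np


def mask_patches(tokens, mask_embedding, mask_ratio, rng):
    """Replace a random subset of patch embeddings with `mask_embedding`.

    Returns the corrupted tokens and a boolean vector marking masked positions.
    """
    n = tokens.shape[0]
    n_mask = int(round(n * mask_ratio))
    masked_idx = rng.choice(n, size=n_mask, replace=False)
    is_masked = np.zeros(n, dtype=bool)
    is_masked[masked_idx] = True
    corrupted = np.where(is_masked[:, None], mask_embedding[None, :], tokens)
    return corrupted, is_masked


# Hypothetical toy sizes: 16 patch embeddings of width 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
mask_embedding = np.zeros(8)  # learnable in a real model
corrupted, is_masked = mask_patches(tokens, mask_embedding, 0.5, rng)
print(is_masked.sum(), corrupted.shape)
```

A training loss would then be computed only at the positions where `is_masked` is true, in the same spirit as BERT's masked language modeling.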
21
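Conclusion
Summary
• Achieves SOTA without injecting any image-specific inductive biases into the model.
(Instead, the image is treated as a sequence of patches and fed to a standard Transformer.)
• Good performance is obtained only when the model is pre-trained on a very large dataset.
• ViT has a favorable performance vs. computation trade-off.
Limitations and future work
• Extension to other vision tasks such as detection and segmentation.
• Improving self-supervised pre-training.
• Scaling ViT further to improve performance.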
22
Q & A
Discussion
23
Q & A
Discussion
Changdae Oh
bnormal16@naver.com
https://velog.io/@changdaeoh
https://github.com/changdaeoh


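Editor's notes

  1. TPUv3-core-days: the number of TPUv3 cores used for pre-training multiplied by the training time in days.
  2. An experiment on dataset size and model capacity carried out in the BiT paper.
  3. Since the gap might stem from the kind of dataset rather than its size, the authors subsample JFT to investigate.
  4. The range over which a given component of the network looks at the data.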