Visual Transformers
Kwanghee Choi (Jonas)
Table of Contents
● Preliminary
○ Key, Value, Query, Attention
○ Pooling
○ Multi-head Attention
○ Unsupervised Representation Learning
○ Syntactic Knowledge
● State-of-the-art Papers
○ Generative Pretraining from Pixels (ICML 2020)
○ An Image is Worth 16x16 Words (ICLR 2021)
○ End-to-End Object Detection with Transformers (ECCV 2020)
○ Additional Works
Key, Value, Query, Attention
● Problem: Given a set of data points (xᵢ, yᵢ), find the unknown y for a new x.
● Simplest approach:
● A bit more complicated approach: Watson-Nadaraya Estimator (1964)
● Key, value pairs (xᵢ, yᵢ)
● Query x
● Attention ⍺
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
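A minimal NumPy sketch of the Watson-Nadaraya estimator above: the attention weights ⍺ᵢ come from a Gaussian kernel comparing the query x to each key xᵢ, and the prediction is the ⍺-weighted average of the values yᵢ. The data and bandwidth below are made up for illustration.

```python
import numpy as np

def nadaraya_watson(x_query, x_keys, y_values, bandwidth=0.5):
    """Kernel regression: alpha_i = K(x - x_i) / sum_j K(x - x_j)."""
    # Similarity between the query and every key (Gaussian kernel).
    logits = -0.5 * ((x_query - x_keys) / bandwidth) ** 2
    # Normalize to attention weights alpha that sum to 1 (softmax form).
    alpha = np.exp(logits) / np.exp(logits).sum()
    # Weighted average of the values.
    return alpha @ y_values

# Toy data: noisy samples of sin(x).
x_keys = np.linspace(0, 5, 50)
y_values = np.sin(x_keys) + 0.1 * np.random.randn(50)
print(nadaraya_watson(2.5, x_keys, y_values))  # close to sin(2.5) ~ 0.6
```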
Pooling
● Nonlinearity ⍴, ɸ, learnable weight w
● Deep sets (Zaheer et al. 2017)
○ Permutation Invariant
● Word2Vec (Mikolov et al. 2013)
○ Embed each word in a sentence
● Attention Weighting (Wang et al. 2016)
○ Query x depends on the context ⍺
● Iterative Attention Pooling (Yang et al. 2016)
○ Repeatedly update the internal state qₜ
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
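As a concrete instance of the first bullet, here is a minimal PyTorch sketch of the deep-sets form ⍴(Σᵢ ɸ(xᵢ)); the layer sizes are arbitrary, and sum pooling is what buys permutation invariance.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """rho(sum_i phi(x_i)): sum pooling makes the output order-independent."""
    def __init__(self, d_in=4, d_hidden=32, d_out=1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.rho = nn.Linear(d_hidden, d_out)

    def forward(self, x):              # x: (batch, set_size, d_in)
        return self.rho(self.phi(x).sum(dim=1))

x = torch.randn(2, 5, 4)
model = DeepSets()
perm = x[:, torch.randperm(5)]        # shuffle the set elements
# Same output for any ordering of the set: permutation invariance.
print(torch.allclose(model(x), model(perm), atol=1e-5))
```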
Multi-head Attention
● Attention module
○ Softmax acts as an attention function.
○ Dot product of Q and K acts as a similarity.
○ √dₖ: standard deviation of the dot product when the entries of Q, K ~ N(0, 1)
● Multi-head Attention
○ A single head limits the model's ability to focus on specific positions.
○ Multiple heads give attention layers different representation subspaces.
Attention Is All You Need (Vaswani et al. NeurIPS 2017)
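A compact PyTorch sketch of the two modules above, following the formulas in Vaswani et al.; the shapes and names are illustrative rather than the paper's code.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    # Similarity = QK^T; divide by sqrt(d_k) to keep the softmax well-scaled.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        # Project, then split d_model into n_heads separate subspaces.
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = scaled_dot_product_attention(split(self.w_q(x)),
                                           split(self.w_k(x)),
                                           split(self.w_v(x)))
        return self.w_o(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 10, 512)
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 10, 512])
```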
Unsupervised Representation Learning
● Input sequence x = (x₁, x₂, …)
● Autoregressive (AR)
○ ex) ELMo, GPT
○ No bidirectional context.
○ ELMo: needs to train the forward and backward contexts separately.
● Auto Encoding (AE)
○ Corrupted input x’ = (x₁, x₂, …, [MASK], …)
○ ex) BERT
○ Bi-directional self-attention
○ Different input distribution due to corruption
Understanding XLNet https://www.borealisai.com/en/blog/understanding-xlnet
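A toy illustration of how the two objectives shape inputs and targets for the same sequence (my own example, not from the XLNet post):

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Autoregressive (GPT-style): predict each token from its left context only.
ar_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the", "cat", "sat"], "on") -- no access to the right context.

# Auto-encoding (BERT-style): corrupt the input with [MASK] and predict the
# masked token from *both* sides; [MASK] appears only at training time,
# which is the input-distribution mismatch noted above.
i = random.randrange(len(tokens))
ae_input = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
ae_target = tokens[i]
print(ar_pairs[2], (ae_input, ae_target))
```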
Syntactic Knowledge
● BERT representations are hierarchical rather
than linear.
○ Open Sesame: Getting Inside BERT’s Linguistic Knowledge
(Lin et al. ACLW 2019)
● BERT “naturally” learns some syntactic
information, although it is not very similar to
linguistic annotated resources.
○ Perturbed Masking: Parameter-free Probing for Analyzing
and Interpreting BERT (Wu et al. ACL 2020)
A Primer in BERTology: What we know about how BERT works (Rogers et al. TACL 2020)
Generative Pretraining from Pixels
ICML 2020, OpenAI
Towards a general “image” model
● Just as a general LM can generate coherent text, Image GPT can
generate coherent images.
● “Analysis by Synthesis” suggests that the model will also know about
object categories once it learns to generate coherent images.
● Generative sequence modeling is a universal unsupervised algorithm.
Image GPT (https://openai.com/blog/image-gpt/)
Approach
Generative Pretraining from Pixels (Chen et al. ICML 2020)
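The approach figure itself is not reproduced here. In outline, iGPT resizes each image to a low resolution, flattens it into a 1D pixel sequence, and trains a GPT on next-pixel prediction (or a BERT-style masked variant). A simplified sketch of the sequence construction; the real model maps colors to a learned 9-bit palette rather than grayscale:

```python
import numpy as np

def image_to_sequence(img):
    """Flatten an image into a raster-order pixel sequence for a GPT.

    Simplified sketch: iGPT actually downsamples (e.g. to 32x32) and
    clusters RGB values into a 512-entry palette; here we just grayscale.
    """
    gray = img.mean(axis=-1).astype(np.int64)   # (H, W) integer "tokens"
    return gray.reshape(-1)                     # length H*W sequence

img = np.random.randint(0, 256, (32, 32, 3))
seq = image_to_sequence(img)
# Autoregressive objective: maximize p(seq[t] | seq[:t]) for every t,
# exactly as a language model does over word tokens.
print(seq.shape)  # (1024,)
```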
What representation works best?
● In supervised pre-training, representation quality tends to increase
monotonically with depth, but with generative pre-training, it is not
obvious whether a task like pixel prediction is relevant to image
classification.
● Representations first improve as a function of depth, and then,
starting around the middle layer, begin to deteriorate.
○ In the first phase, each position gathers information from its surrounding context in
order to build a more global image representation.
○ In the second phase, this contextualized input is used to solve the conditional next
pixel prediction task.
○ This could resemble the behavior of encoder-decoder architectures, but learned
within a monolithic architecture via a pre-training objective.
Generative Pretraining from Pixels (Chen et al. ICML 2020)
Performance on CIFAR dataset
● We find that both increasing the
scale of our models and training for
more iterations result in better
generative performance, which
directly translates into better
feature quality.
● Generative models produce much
better features than BERT models
after pre-training, but BERT
models catch up after fine-tuning.
Generative Pretraining from Pixels (Chen et al. ICML 2020)
An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
ICLR 2021, Google
When do Transformers work?
● When trained on mid-sized datasets (e.g. ImageNet), Transformers
yield modest accuracies, a few percent below ResNets of comparable size.
● However, large-scale training (14M-300M images) trumps the inductive
biases of CNNs such as translation invariance & locality.
● Naive application of self-attention to images would require that each
pixel attends to every other pixel. With quadratic cost in the number
of pixels, this does not scale to realistic input sizes.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
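Back-of-the-envelope numbers for the last point, assuming a 224×224 input and the 16×16 patches of the paper's title:

```python
# Cost of full pixel-level self-attention vs. patch-level attention.
pixels = 224 * 224            # 50,176 tokens if every pixel is a token
patches = (224 // 16) ** 2    # 196 tokens with 16x16 patches

print(pixels ** 2)            # ~2.5e9 attention weights per head per layer
print(patches ** 2)           # 38,416 -- roughly 65,000x fewer
```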
Model overview
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
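The overview figure is not reproduced here; the input pipeline it depicts (split into 16×16 patches, linearly embed, prepend a learnable [class] token, add position embeddings) can be sketched in a few lines of PyTorch. Dimensions follow ViT-B/16, but the module is my reconstruction, not the authors' code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """'An image is worth 16x16 words': patches become the token sequence."""
    def __init__(self, img_size=224, patch=16, d_model=768):
        super().__init__()
        n_patches = (img_size // patch) ** 2            # 14 * 14 = 196
        # A strided conv is equivalent to slicing patches + a linear layer.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))

    def forward(self, img):                             # (B, 3, 224, 224)
        x = self.proj(img).flatten(2).transpose(1, 2)   # (B, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                  # prepend [class] token
        return x + self.pos_embed                       # (B, 197, 768)

print(PatchEmbedding()(torch.randn(2, 3, 224, 224)).shape)
```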
Performance
With self-supervised pre-training (masked patch prediction), our smaller ViT-B/16 model achieves 79.9% accuracy
on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Interpreting the Results
● Positional embeddings
○ We speculate that learning to represent the spatial relations in
this resolution (14 x 14) is equally easy for different strategies.
○ Closer patches tend to have more similar position embeddings.
○ Row-column and sinusoidal structures appear (see the sketch after this slide).
● Self-attention
○ “Attention distance” analogous to “receptive field size”.
○ Highly localized attention may serve a similar function as early
convolutional layers in CNNs.
○ Model attends to image regions that are semantically relevant
for classification.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
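A sketch of how one might reproduce the position-embedding similarity analysis, assuming a trained model that exposes a pos_embed of shape (1, 197, 768) as in the patch-embedding sketch above; with random weights, as here, the structure will of course be absent.

```python
import torch

def pos_embed_similarity(pos_embed, grid=14):
    """Cosine similarity of each patch's position embedding to all others.

    In a trained ViT, row and column neighbors come out most similar,
    which is the row-column structure the paper reports.
    """
    p = pos_embed[0, 1:]                       # drop the [class] token
    p = p / p.norm(dim=-1, keepdim=True)
    sim = p @ p.T                              # (196, 196)
    return sim.view(grid, grid, grid, grid)    # index by (row, col) pairs

sim = pos_embed_similarity(torch.randn(1, 197, 768))
print(sim[7, 7].shape)  # similarity map of the center patch: (14, 14)
```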
End-to-End Object Detection
with Transformers
ECCV 2020, Facebook
End-to-end object detection
Object detection as a direct set prediction problem.
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
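Direct set prediction hinges on a bipartite (Hungarian) matching between the fixed-size set of predictions and the ground-truth objects. A minimal sketch with SciPy's solver; the cost below combines class probability and an L1 box term, while DETR additionally uses a generalized IoU term.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy example: 4 predicted boxes vs. 2 ground-truth boxes.
pred_boxes = np.random.rand(4, 4)     # (cx, cy, w, h), normalized
gt_boxes = np.random.rand(2, 4)
pred_cls_prob = np.random.rand(4, 2)  # prob of each GT object's class

# Pairwise matching cost: -class probability + L1 box distance.
cost = -pred_cls_prob + np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)

pred_idx, gt_idx = linear_sum_assignment(cost)  # optimal 1-to-1 matching
# Losses are computed only on matched pairs; unmatched predictions are
# trained toward a "no object" class, which removes the need for NMS.
print(list(zip(pred_idx, gt_idx)))
```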
Removing NMS
● Conventional CNN to learn a 2D representation + Positional encoding
● 100 learned positional embeddings as object queries
● Global reasoning using pairwise relations
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
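A skeletal PyTorch version of the pipeline described above, in the spirit of the simplified model in the DETR paper's appendix; layer sizes are representative, and details (padding masks, auxiliary losses, proper 2D positional encodings) are omitted.

```python
import torch
import torch.nn as nn
import torchvision

class SimpleDETR(nn.Module):
    def __init__(self, n_classes=91, d_model=256, n_queries=100):
        super().__init__()
        # Conventional CNN backbone producing a 2D feature map.
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.conv = nn.Conv2d(2048, d_model, 1)
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6)
        # 100 learned embeddings: the "object queries".
        self.query_embed = nn.Parameter(torch.rand(n_queries, d_model))
        self.pos_embed = nn.Parameter(torch.rand(49, 1, d_model))  # 7x7 map
        self.class_head = nn.Linear(d_model, n_classes + 1)  # +1: "no object"
        self.bbox_head = nn.Linear(d_model, 4)

    def forward(self, img):                       # (B, 3, 224, 224)
        f = self.conv(self.backbone(img))         # (B, 256, 7, 7)
        src = f.flatten(2).permute(2, 0, 1)       # (49, B, 256) tokens
        tgt = self.query_embed.unsqueeze(1).expand(-1, img.size(0), -1)
        # Encoder reasons globally over all pairwise feature relations;
        # decoder turns each object query into one detection slot.
        h = self.transformer(src + self.pos_embed, tgt)   # (100, B, 256)
        return self.class_head(h), self.bbox_head(h).sigmoid()

logits, boxes = SimpleDETR()(torch.randn(1, 3, 224, 224))
print(logits.shape, boxes.shape)  # (100, 1, 92) (100, 1, 4)
```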
Encoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Decoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Object Detection
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Additional Works
Notable Extensions
● Training data-efficient image transformers & distillation through
attention (Touvron et al. Arxiv 2021)
○ Adds another token, a distillation token, to ViT; using only the
classification token doesn't help much.
○ Soft distillation (teacher model's softmax output) and hard distillation
(teacher model's argmax, with label smoothing); see the loss sketch after
this list.
○ Surpasses SOTA yet again.
● DALL·E: Creating Images from Text (Ramesh et al. 2021)
○ Decoder-only transformer that receives both the text and the image as a single
stream of tokens (Text: 256, Image: 1024) and models all of them autoregressively.
○ Creates images from text captions for a wide range of concepts expressible in natural
language.
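A sketch of the two distillation variants mentioned for DeiT; in DeiT the distillation loss is applied to the output of the distillation token, and the temperature and weights below are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, labels,
                           T=3.0, alpha=0.5):
    # KL between softened teacher and student distributions + the usual CE.
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return alpha * F.cross_entropy(student_logits, labels) + (1 - alpha) * kl

def hard_distillation_loss(student_logits, teacher_logits, labels):
    # The teacher's argmax is treated as a second hard label (DeiT pairs
    # this with label smoothing; omitted here for brevity).
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * (F.cross_entropy(student_logits, labels)
                  + F.cross_entropy(student_logits, teacher_labels))

s, t = torch.randn(8, 1000), torch.randn(8, 1000)
y = torch.randint(0, 1000, (8,))
print(soft_distillation_loss(s, t, y), hard_distillation_loss(s, t, y))
```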
Task-specific: Object Detection
● End-to-End Object Detection with Adaptive Clustering Transformer
(Zheng et al. Arxiv 2020)
○ ACT clusters the query features adaptively using Locality Sensitive Hashing (LSH) and
approximates the query-key interaction using the prototype-key interaction.
○ ACT can replace the original self-attention module in DETR without degrading the
performance of the pre-trained DETR model.
● Deformable DETR: Deformable Transformers for End-to-End Object
Detection (Zhu et al. ICLR 2021)
○ Deformable DETR achieves better performance than DETR (especially on small
objects) with 10× fewer training epochs.
○ Deformable attention module: Choose only prominent feature map pixels, aggregate
multi-scale features.
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Task-specific: Object Detection
● UP-DETR: Unsupervised Pre-training for Object Detection with
Transformers (Dai et al. Arxiv 2020)
○ Proposes a pretext task named random query patch detection to pretrain
DETR (UP-DETR) for object detection without supervision.
● Rethinking Transformer-based Set Prediction for Object Detection
(Sun et al. Arxiv 2020)
○ Encoder-only DETR significantly accelerates training for small-object
detection, as it removes cross-attention.
○ Feature generation for transformer encoders with FCOS (Fully Convolutional
One-Stage object detector) or RCNN
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Task-specific: Segmentation
● MaX-DeepLab: End-to-End Panoptic Segmentation with Mask
Transformers (Wang et al. Arxiv 2020)
○ Infers masks and classes directly without hand-coded priors like object boxes.
○ Dual-path transformer enables CNNs to read and write a global memory at any layer.
● End-to-End Video Instance Segmentation with Transformers (Wang
et al. Arxiv 2020)
○ Three dimensional (temporal, horizontal and vertical) positional encoding
○ Instance sequence matching strategy: applying the loss across different
time steps
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Additional Tasks
● Learning Joint Spatial-Temporal Transformations for Video
Inpainting (Zeng et al. ECCV 2020)
● End-to-End Dense Video Captioning with Masked Transformer (Zhou
et al. CVPR 2018)
● Hand-Transformer: Non-Autoregressive Structured Modeling for 3D
Hand Pose Estimation (Huang et al. ECCV 2020)
● Taming Transformers for High-Resolution Image Synthesis (Esser et
al. Arxiv 2020)
● Pre-Trained Image Processing Transformer (Chen et al. Arxiv 2020)
○ ImageNet pre-training for image denoising / super-resolution
A Survey on Visual Transformer (Han et al. Arxiv 2021)