[http://pagines.uab.cat/mcv/]
Module 6 - Day 8 - Lecture 2
The Transformer
in Vision
31st March 2022
Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Outline
2
1. Vision Transformer (ViT)
2. Beyond ViT
The Transformer for Vision: ViT
3
#ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. "An
image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code] [video by Yannic Kilcher]
Outline
4
1. Vision Transformer (ViT)
a. Tokenization
b. Position embeddings
c. Class embedding
d. Receptive field
e. Performance
2. Beyond ViT
The Transformer for Vision
5
Source: What’s AI, Will Transformers Replace CNNs in Computer Vision? + NVIDIA GTC Giveaway (2021)
The Transformer for Vision: ViT
6
#ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. "An
image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code] [video by Yannic Kilcher]
Linear projection of Flattened Patches
7
[Figure: a grayscale image is split into 3x3 patches; each patch is flattened and passed through a shared linear layer, yielding 9 patch embeddings of dimension 768.]
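A minimal PyTorch sketch of this tokenization step; the 48x48 grayscale input, the 3x3 patch grid and D=768 follow the figure, while the tensor values are random placeholders:

```python
import torch
import torch.nn as nn

# Toy setup matching the figure: a grayscale image split into a 3x3 grid
# of 16x16 patches, each embedded into D=768 dimensions.
img = torch.randn(1, 1, 48, 48)           # (batch, channels, height, width)

# unfold extracts the non-overlapping 16x16 patches and flattens each one.
patches = nn.functional.unfold(img, kernel_size=16, stride=16)  # (1, 256, 9)
patches = patches.transpose(1, 2)         # (1, 9, 256): 9 patches of 256 pixels

linear = nn.Linear(16 * 16 * 1, 768)      # one projection shared by all patches
embeddings = linear(patches)              # (1, 9, 768) patch embeddings
print(embeddings.shape)                   # torch.Size([1, 9, 768])
```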
The Transformer for Vision
Consider the case of patches of 16x16 pixels and an embedding size of D=768, as in ViT-Base. How could the linear layer be implemented with a convolutional layer?
[Figure: the same pipeline as on the previous slide, now with 16x16 grayscale patches; each patch is flattened to a 256-value vector and projected by the linear layer to a 768-dimensional embedding, giving 9 patch embeddings.]
The Transformer for Vision
9
Consider the case of patches of 16x16 pixels and an embedding size of D=768, as in ViT-Base. How could the linear layer be implemented with a convolutional layer?
[Figure: the answer, a 2D convolutional layer with 768 filters, kernel size 16x16 and stride 16, applied directly to the grayscale image; its 3x3x768 output map is exactly the 9 patch embeddings of dimension 768.]
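A sketch that checks this equivalence numerically under the setup above: once the weights are shared, the flatten-then-linear path and a single strided convolution produce identical patch embeddings.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 1, 48, 48)                      # 3x3 grid of 16x16 patches

linear = nn.Linear(16 * 16, 768)
conv = nn.Conv2d(1, 768, kernel_size=16, stride=16)  # 768 filters, kernel 16, stride 16

# Copy the linear weights into the conv filters so both compute the same map.
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(768, 1, 16, 16))
    conv.bias.copy_(linear.bias)

# Path 1: flatten the patches, then apply the linear layer.
patches = nn.functional.unfold(img, 16, stride=16).transpose(1, 2)  # (1, 9, 256)
out_linear = linear(patches)                                        # (1, 9, 768)

# Path 2: one strided convolution, then flatten its 3x3 output grid.
out_conv = conv(img).flatten(2).transpose(1, 2)                     # (1, 9, 768)

print(torch.allclose(out_linear, out_conv, atol=1e-5))              # True
```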
The Transformer for Vision
10
The Transformer for Vision
11
[Figure: a 3x2x2 input tensor (an RGB image of 2x2). Two fully connected neurons require (3x2x2) x 2 weights + 2 biases; two convolutional filters of size 3x2x2 (the same size as the input tensor) require the identical (3x2x2) x 2 weights + 2 biases.]
Observation: Fully connected neurons could be implemented as convolutional ones.
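The slide's toy case, sketched in PyTorch to confirm the observation: two fully connected neurons and two 3x2x2 convolutional filters with shared parameters produce the same two outputs.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 2, 2)                  # 3x2x2 tensor (an RGB image of 2x2)

fc = nn.Linear(3 * 2 * 2, 2)                 # 2 fully connected neurons
conv = nn.Conv2d(3, 2, kernel_size=(2, 2))   # 2 filters of 3x2x2, same size as input

with torch.no_grad():                        # share weights and biases
    conv.weight.copy_(fc.weight.view(2, 3, 2, 2))
    conv.bias.copy_(fc.bias)

out_fc = fc(x.flatten(1))                    # (1, 2)
out_conv = conv(x).flatten(1)                # (1, 2): each filter yields a 1x1 map
print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True
```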
Outline
12
1. Vision Transformer (ViT)
a. Tokenization
b. Position embeddings
c. Class embedding
d. Receptive field
e. Performance
Position Embeddings
13
#ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. "An
image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code] [video by Yannic Kilcher]
Position embeddings
14
#ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. "An
image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code] [video by Yannic Kilcher]
The model learns to encode the relative position between patches.
Each position embedding is most similar to
others in the same row and column, indicating
that the model has recovered the grid structure
of the original images.
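A sketch of how these learnable position embeddings enter the model, plus the similarity analysis behind the figure; note that with the random initialization used here the similarity matrix is unstructured, so the row/column pattern only emerges after training.

```python
import torch
import torch.nn as nn

num_patches, dim = 9, 768
patch_tokens = torch.randn(1, num_patches, dim)

# ViT adds a learnable 1D position embedding per patch position
# (the [class] token also gets one; it is omitted here for brevity).
pos_embed = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
tokens = patch_tokens + pos_embed           # position information added by summation

# The paper's analysis: cosine similarity between every pair of trained
# position embeddings reveals the 2D grid structure of the patches.
p = pos_embed[0]
sim = nn.functional.cosine_similarity(p.unsqueeze(1), p.unsqueeze(0), dim=-1)
print(sim.shape)                            # torch.Size([9, 9])
```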
Outline
15
1. Vision Transformer (ViT)
a. Tokenization
b. Position embeddings
c. Class embedding
d. Receptive field
e. Performance
Class embedding
16
#BERT Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional transformers for language
understanding." NAACL 2019.
[class] is a special learnable
embedding added in front of
every input example.
It triggers the class prediction.
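A minimal sketch of the [class] token mechanics, using PyTorch's generic nn.TransformerEncoder as a stand-in for the ViT encoder (dimensions follow ViT-Base, but the two-layer depth and the batch of one are illustrative):

```python
import torch
import torch.nn as nn

dim, num_classes = 768, 1000
patch_tokens = torch.randn(1, 9, dim)             # patch embeddings + positions

cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [class] embedding
tokens = torch.cat([cls_token, patch_tokens], dim=1)   # prepended: (1, 10, dim)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2,
)
out = encoder(tokens)                             # (1, 10, dim)

# Only the output at the [class] position feeds the classification head,
# which is why an encoder-only architecture suffices for classification.
head = nn.Linear(dim, num_classes)
logits = head(out[:, 0])                          # (1, num_classes)
```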
Class embedding
17
Why does the ViT not have a decoder in its architecture?
Outline
18
1. Vision Transformer (ViT)
a. Tokenization
b. Position embeddings
c. Class embedding
d. Receptive field
e. Performance
Receptive field
19
Average spatial distance between one element attending to another for each
transformer block:
Both short and wide attention ranges appear in the early layers (a CNN can only learn short ranges).
Deeper layers attend all over the image.
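A sketch of how this mean attention distance can be computed for one head, assuming a 3x3 grid of 16x16 patches and row-normalized attention weights (random weights stand in for a trained model's):

```python
import torch

def mean_attention_distance(attn, grid=3, patch_px=16):
    """Average pixel distance between attending and attended patches,
    weighted by the attention. attn: (N, N) row-normalized weights."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch_px
    dist = torch.cdist(coords, coords)         # (N, N) pairwise pixel distances
    return (attn * dist).sum(dim=-1).mean()    # expected distance, mean over queries

attn = torch.softmax(torch.randn(9, 9), dim=-1)  # stand-in for one head's weights
print(mean_attention_distance(attn))
```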
Outline
20
1. Vision Transformer (ViT)
a. Tokenization
b. Position embeddings
c. Class embedding
d. Receptive field
e. Performance
Performance: Accuracy
21
#BiT Kolesnikov, Alexander, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. "Big transfer (bit): General
visual representation learning." ECCV 2020.
#ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. "An
image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code] [video by Yannic Kilcher]
Slight improvement over CNN (BiT) when very large amounts of training data are available.
Worse
performance
than CNN (BiT)
with ImageNet
data only.
Performance: Computation
22
#BiT Kolesnikov, Alexander, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. "Big transfer (bit): General
visual representation learning." ECCV 2020.
#ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. "An
image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code] [video by Yannic Kilcher]
Requires less training computation than a comparable CNN (BiT).
Outline
23
1. Vision Transformer (ViT)
a. Tokenization
b. Position embeddings
c. Class embedding
d. Receptive field
e. Performance
2. Beyond ViT
3. Is attention all we need?
Data-efficient image Transformer (DeiT)
24
#DeiT Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. Training data-efficient image transformers & distillation through attention. ICML 2021.
A distillation token aims at predicting the label estimated by a CNN teacher. This introduces the convolutional inductive bias into ViT.
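A sketch of the hard-label variant of DeiT's distillation objective, assuming the student produces separate logits from its class and distillation tokens and that a frozen CNN teacher's logits are available:

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(logits_cls, logits_dist, teacher_logits, labels):
    """Hard distillation: the class token predicts the ground truth, while
    the distillation token predicts the CNN teacher's decision."""
    teacher_labels = teacher_logits.argmax(dim=-1)   # the teacher's hard prediction
    loss_cls = F.cross_entropy(logits_cls, labels)
    loss_dist = F.cross_entropy(logits_dist, teacher_labels)
    return 0.5 * (loss_cls + loss_dist)

# Toy usage with random tensors (batch of 4, 1000 classes):
B, C = 4, 1000
loss = deit_hard_distillation_loss(
    torch.randn(B, C), torch.randn(B, C), torch.randn(B, C),
    torch.randint(0, C, (B,)),
)
```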
Shifted WINdow (SWIN) Self-Attention (SA)
25
#SWIN Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. Swin Transformer: Hierarchical Vision Transformer using
Shifted Windows. ICCV 2021 [Best paper award].
Less computation by self-attending only within local windows (in grey).
[Figure: ViT applies global SA at every layer, while Swin Transformers apply localized SA within shifted windows.]
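A sketch of the window partitioning behind this saving: attention cost drops from quadratic in all H x W tokens to quadratic only in the fixed window size. The 56x56x96 map and 7x7 windows mirror Swin's first stage.

```python
import torch

def window_partition(x, window=7):
    """Split a feature map into non-overlapping windows so self-attention
    runs inside each window instead of globally.
    x: (B, H, W, C) -> (B * num_windows, window * window, C)"""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(-1, window * window, C)

feat = torch.randn(1, 56, 56, 96)        # a stage-1-like Swin feature map
windows = window_partition(feat)          # (64, 49, 96): 8x8 windows of 7x7 tokens
print(windows.shape)
```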
26
#SWIN Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. Swin Transformer: Hierarchical Vision Transformer using
Shifted Windows. ICCV 2021 [Best paper award].
Hierarchical feature maps are built by merging image patches (in red) across layers.
[Figure: Swin Transformers build feature maps from high to low resolution across stages, while ViT keeps a single low-resolution map throughout.]
Hierarchical ViT Backbone
Non-Hierarchical ViT Backbone
27
#ViTDet Li, Yanghao, Hanzi Mao, Ross Girshick, and Kaiming He. "Exploring Plain Vision Transformer Backbones for Object Detection." arXiv preprint arXiv:2203.16527 (2022).
Multi-scale detection by building a feature pyramid from only the last, large-stride (16) feature map of the plain backbone.
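A rough sketch of such a simple feature pyramid, building stride-4/8/32 maps from the single stride-16 map with deconvolutions and pooling; the layer choices below approximate, rather than reproduce, the ViTDet implementation.

```python
import torch
import torch.nn as nn

dim = 768
feat16 = torch.randn(1, dim, 32, 32)      # the single stride-16 map of a plain ViT

pyramid = {
    "stride4":  nn.Sequential(nn.ConvTranspose2d(dim, dim, 2, 2),   # upsample x4
                              nn.ConvTranspose2d(dim, dim, 2, 2))(feat16),
    "stride8":  nn.ConvTranspose2d(dim, dim, 2, 2)(feat16),         # upsample x2
    "stride16": feat16,                                             # keep as-is
    "stride32": nn.MaxPool2d(2)(feat16),                            # downsample x2
}
for name, fmap in pyramid.items():
    print(name, tuple(fmap.shape))
```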
Outline
28
1. Vision Transformer (ViT)
a. Tokenization
b. Position embeddings
c. Class embedding
d. Receptive field
e. Performance
2. Beyond ViT
3. Is attention all we need?
Object Detection
29
#DETR Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko.
"End-to-End Object Detection with Transformers." ECCV 2020. [code] [colab]
● Object detection formulated as a set prediction problem.
● DETR infers a fixed-size set of predictions.
● Comparable performance to Faster R-CNN.
30
#DETR Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko.
"End-to-End Object Detection with Transformers." ECCV 2020. [code] [colab]
● During training, bipartite matching uniquely assigns predictions to ground-truth boxes (see the sketch below).
● Predictions with no match should yield a “no object” (∅) class prediction.
[Figure: predicted boxes matched one-to-one against ground-truth boxes.]
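A sketch of the bipartite matching step with the Hungarian algorithm, using SciPy's linear_sum_assignment; the cost here is plain L1 box distance, whereas DETR's full matching cost also includes class-probability and generalized-IoU terms.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, gt_boxes):
    """One-to-one assignment of predictions to ground-truth boxes."""
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)     # (num_pred, num_gt) L1 costs
    pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
    return pred_idx, gt_idx                           # indices of matched pairs

preds = torch.rand(100, 4)   # DETR's fixed-size set of predicted boxes
gts = torch.rand(3, 4)       # 3 ground-truth boxes in this image
pred_idx, gt_idx = match(preds, gts)
# The 97 unmatched predictions are supervised towards the "no object" class.
```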
Object Detection
31
Is attention (or convolutions) all we need?
#MLP-Mixer Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica
Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy, “MLP-Mixer: An all-MLP Architecture for Vision”.
NeurIPS 2021. [tweet] [video by Yannic Kilcher]
“In this paper we show that while convolutions and attention are both sufficient
for good performance, neither of them are necessary.”
32
Is attention (or convolutions) all we need?
#MLP-Mixer Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica
Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy, “MLP-Mixer: An all-MLP Architecture for Vision”.
NeurIPS 2021. [tweet] [video by Yannic Kilcher] [code]
Two types of MLP layers:
● MLP 1: applied independently to image patches (i.e. “mixing” the per-location features).
● MLP 2: applied across patches (i.e. “mixing” spatial information).
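A compact sketch of one Mixer block along these lines; the dimensions are illustrative, with MLP 2 realized as the token-mixing step and MLP 1 as the channel-mixing step.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Token-mixing MLP across patches, then channel-mixing MLP per patch,
    with skip connections and LayerNorm as in the paper."""
    def __init__(self, num_patches=9, dim=768, token_hidden=384, channel_hidden=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(        # MLP 2: mixes spatial information
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(      # MLP 1: mixes per-location features
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                      # x: (batch, patches, dim)
        y = self.norm1(x).transpose(1, 2)      # (batch, dim, patches)
        x = x + self.token_mlp(y).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

out = MixerBlock()(torch.randn(1, 9, 768))    # (1, 9, 768)
```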
33
#MLP-Mixer Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica
Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy, “MLP-Mixer: An all-MLP Architecture for Vision”.
NeurIPS 2021. [tweet] [video by Yannic Kilcher]
[Charts: computation efficiency (training) and training-data efficiency.]
Is attention (or convolutions) all we need?
34
#MLP-Mixer Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica
Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy, “MLP-Mixer: An all-MLP Architecture for Vision”.
NeurIPS 2021. [tweet] [video by Yannic Kilcher]
Is attention (or convolutions) all we need?
35
#RepMLP Xiaohan Ding, Xiangyu Zhang, Jungong Han, Guiguang Ding, “RepMLP: Re-parameterizing Convolutions into
Fully-connected Layers for Image Recognition” CVPR 2022. [code] [tweet]
Is attention all we need?
36
#ConvNeXt Liu, Zhuang, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. "A ConvNet
for the 2020s." CVPR 2022. [code]
Is attention all we need?
Gradually “modernize” a standard ResNet towards the design of ViT.
Outline
37
1. Vision Transformer (ViT)
a. Tokenization
b. Position embeddings
c. Class embedding
d. Receptive field
e. Performance
2. Beyond ViT
38
Software
Learn more
39
Khan, Salman, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. "Transformers in vision: A survey."
ACM Computing Surveys (CSUR) (2021).
40
Learn more
● Papers with Code: Twitter thread about ViT (2022)
● IAML Distill Blog: Transformers in Vision (2021)
● Touvron, Hugo, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, and Hervé Jégou. "Three
things everyone should know about Vision Transformers." arXiv preprint arXiv:2203.09795
(2022).
● Tutorial: Fine-Tune ViT for Image Classification with 🤗 Transformers (2022).
41
Learn more
Ismael Elisi, “Transformers and its use in computer vision” (TUM 2021)
42
Questions?