Recent Breakthroughs in AI
- Clubhouse Podcasts
YouTube Link: https://www.youtube.com/watch?v=3OxEpGU1unA
Presenter: Sangmin Woo
2021.03.10
Andrej Karpathy Justin Johnson Lex Fridman Richard Socher Russell Kaplan
 Rise of multimodal learning: CLIP and DALL-E
• CLIP efficiently learns visual concepts from natural language supervision
• DALL-E creates images from text captions for a wide range of concepts expressible in natural language
 ‘Data’ is KING: the importance of data and datasets
• Academia: given a dataset, build a more powerful model vs. industry reality: given a model, collect/generate the dataset
• In fact, much of the innovation comes from data (not the model…)
• Data curation & MLOps will become more important
 Will Transformers overtake CNNs? And towards "generalized neural substrates"
• Image = CNN, sequence = RNN → All = Transformer (consolidation of architectures)
• 2020 was the year of the Transformer; all we need is the Transformer!
 Lifelong learning (need to consider catastrophic forgetting & semantic shift, …)
• Benchmarking is difficult, since the tasks & models all differ from the previous SOTA…
• Then why not fix the model? Model-first benchmark design!
 Taking hard data structures and "softening" them to make them differentiable
• The Transformer is a softened version of a hash table (see the sketch after this summary)
• What would be the next-generation data structure?
Talk Summary https://www.youtube.com/watch?v=3OxEpGU1unA
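To make the "softened hash table" point concrete, here is a minimal illustrative sketch (not from the talk): a hard hash table returns exactly one value for an exactly matching key, while attention returns a differentiable weighted average of all values, weighted by query-key similarity. All names and shapes below are made up for illustration.

```python
import numpy as np

def hard_lookup(query, keys, values):
    """Hard hash-table lookup: return the value whose key matches the query exactly."""
    for k, v in zip(keys, values):
        if np.array_equal(k, query):
            return v
    raise KeyError("query not found")

def soft_lookup(query, keys, values, temperature=1.0):
    """Attention-style 'softened' lookup: a differentiable weighted average of all
    values, with weights given by a softmax over query-key similarities."""
    scores = keys @ query / temperature          # (N,) dot-product similarities
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax -> a soft "address"
    return weights @ values                      # convex combination of the values

# Toy usage: 4 key/value pairs with 8-dim keys and 3-dim values.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(4, 8)), rng.normal(size=(4, 3))
print(soft_lookup(keys[2], keys, values))        # concentrates on values[2] when the match is sharp
```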
Learning Visual-Linguistic
Representation in the Wild
Presenter: Sangmin Woo
2021.03.10
CLIP – OpenAI / DALL-E – OpenAI / ALIGN – Google Research
(+ UniT – Facebook AI Research)
Blog Link: https://openai.com/blog/clip/
“Scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on
a great variety of image classification datasets”
 CLIP is trained on 400M (image, text) pairs found across the internet.
 Given an image, CLIP predicts which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.
 CLIP learns to recognize a wide variety of visual concepts in images and associate them with
their names.
 CLIP models can then be applied to nearly arbitrary visual classification tasks.
Summary
 Current approaches have several major problems:
 datasets are labor-intensive and costly to create
 models are good at one task and one task only
 models that perform well on benchmarks have disappointingly poor performance on real-world data.
 CLIP (Contrastive Language–Image Pre-training) aims to address these problems:
• It is trained on image & natural language supervision that’s abundantly available on the internet.
• It can be instructed in natural language to perform several classification benchmarks, without directly
optimizing for the benchmark’s performance (similar to the “zero-shot” capabilities of GPT-3).
• It matches the accuracy of the original ResNet-50 on ImageNet zero-shot without using any of the
1.28M training examples.
Introduction
 Both models show the same accuracy on the ImageNet test set.
 In non-ImageNet settings, CLIP significantly outperforms the ImageNet-trained model.
 ObjectNet checks a model’s ability to recognize
objects in many different poses and with many
different backgrounds inside homes.
 ImageNet Rendition and ImageNet Sketch check
a model’s ability to recognize more abstract
depictions of objects.
ImageNet ResNet-101 vs. CLIP ViT-L
 CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training
examples.
 At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of
the target dataset’s classes.
Approach
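A simplified sketch of the two steps above, in the spirit of the pseudocode in the CLIP paper: a symmetric contrastive loss over a batch of matched (image, text) embeddings, and a zero-shot classifier built by embedding class-name prompts. The encoders themselves are omitted, `text_encoder` is an assumed callable, and the prompt template and temperature value are illustrative defaults, not necessarily the exact ones used by OpenAI.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched (image, text) pairs.
    image_emb, text_emb: (N, D) outputs of the image / text encoders."""
    image_emb, text_emb = l2_normalize(image_emb), l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature               # (N, N) pairwise cosine similarities
    labels = np.arange(len(logits))                             # the i-th image matches the i-th text
    return 0.5 * (cross_entropy(logits, labels)                 # image -> text direction
                  + cross_entropy(logits.T, labels))            # text -> image direction

def zero_shot_classify(image_emb, class_names, text_encoder):
    """Zero-shot classifier: embed prompts like 'a photo of a {class}' and pick the
    class whose text embedding is most similar to each image embedding."""
    prompts = [f"a photo of a {c}" for c in class_names]
    text_emb = l2_normalize(np.stack([text_encoder(p) for p in prompts]))   # (C, D)
    scores = l2_normalize(image_emb) @ text_emb.T                           # (N, C)
    return scores.argmax(axis=-1)
```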
 Random, non-cherry-picked predictions of zero-shot CLIP classifiers on examples from various datasets.
Qualitative Examples
 CLIP is highly efficient
• CLIP learns from unfiltered, highly varied, and highly noisy data, and is intended to be used
in a zero-shot manner.
• CLIP (like GPT-2 and GPT-3) can achieve compelling zero-shot performance; however, it requires significant training compute.
• Two algorithmic choices to save compute:
 contrastive objective for connecting text with images.
 Vision Transformer gives 3x gain in compute efficiency over a standard ResNet.
Key takeaways
 An image-to-caption Transformer model struggles at zero-shot transfer: it achieves only 16% accuracy on ImageNet after training on 400M images.
 CLIP is much more efficient and achieves the same accuracy roughly 12x faster.
Key takeaways
 CLIP is flexible and general
• CLIP models are more flexible and general than ImageNet models because they learn a wide range of visual concepts directly from natural language; they can perform many different tasks zero-shot.
• CLIP's zero-shot performance has been validated on over 30 different datasets, including tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR.
• Learning OCR is an exciting behavior that does not occur in standard ImageNet models.
• The best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student
EfficientNet-L2, on 20 out of 26 different transfer datasets.
Key takeaways
 Across 27 tasks such as fine-
grained object classification, OCR,
activity recognition in videos, and
geo-localization, CLIP models
learn more widely useful image
representations.
Key takeaways
 While CLIP usually performs well at recognizing common objects, it struggles with counting the number of objects in an image and with predicting how close the nearest car is in a photo.
 Zero-shot CLIP also struggles compared to task-specific models on very fine-grained classification, such as telling apart car models, aircraft variants, or flower species.
 CLIP also generalizes poorly to images not covered in its pre-training dataset. For instance, although CLIP learns a capable OCR system, when evaluated on the MNIST dataset, zero-shot CLIP achieves only 88% accuracy, well below the 99.75% of humans on that dataset.
Limitations
Blog Link: https://openai.com/blog/dall-e/
YouTube: https://www.youtube.com/watch?v=az-OV47oKvA
(for more detailed and friendly explanation)
“DALL-E is a 12B-parameter autoregressive (AR) Transformer trained to generate images from text descriptions in
a zero-shot manner, using 250M text–image pairs collected from the internet”
 DALL-E achieves high-quality image generation on the MS-COCO dataset zero-shot, without using any of the training labels.
 It is preferred over prior work trained on that dataset by human evaluators 90% of the time.
 Image-to-image translation
Summary
DALL-E = Salvador Dalí + WALL-E
 GPT-3: text generation
 Image GPT: image generation
 Jukebox: music generation
 DALL-E extends these findings to show that manipulating visual concepts through language is now within reach.
 DALL-E can
• create anthropomorphized versions of animals and objects
• combine unrelated concepts in plausible ways
• render text
• apply transformations to existing images
 Qualitative examples: https://openai.com/blog/dall-e/
Introduction
Overview
 The goal is to train a transformer to autoregressively model the text and image tokens as a
single stream of data.
 However, using pixels directly as image tokens would require an inordinate amount of memory
for high-resolution images.
 A discrete variational autoencoder (dVAE) is trained to compress each 256×256 RGB image
into a 32×32 grid of image tokens, each element of which can assume 8192 possible values.
 Up to 256 BPE-encoded text tokens are concatenated with the 32×32 = 1024 image tokens, and an autoregressive transformer is trained to model the joint distribution over the text and image tokens.
Approach
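A schematic sketch of the single-stream setup described above (not the authors' code): `dvae_encoder` and `transformer` are assumed callables, the 16,384-token text vocabulary is an assumption, and details such as DALL-E's separate loss weighting for text and image tokens are omitted.

```python
import torch
import torch.nn.functional as F

TEXT_VOCAB = 16384                    # assumed text BPE vocabulary size
IMAGE_VOCAB = 8192                    # 8192 possible dVAE image codes (from the slide)
TEXT_LEN, IMAGE_LEN = 256, 32 * 32    # 256 text tokens + 1024 image tokens

def build_stream(text_tokens, images, dvae_encoder):
    """Concatenate BPE text tokens with dVAE image tokens into a single sequence.
    text_tokens: (B, 256) ids; dvae_encoder: assumed callable returning (B, 1024) codes."""
    image_tokens = dvae_encoder(images) + TEXT_VOCAB       # shift codes into a disjoint id range
    return torch.cat([text_tokens, image_tokens], dim=1)   # (B, 256 + 1024) joint stream

def autoregressive_loss(transformer, stream):
    """Standard next-token prediction over the joint text+image stream."""
    logits = transformer(stream[:, :-1])                   # (B, L-1, TEXT_VOCAB + IMAGE_VOCAB)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           stream[:, 1:].reshape(-1))
```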
 VQ-VAE (Vector Quantized Variational AutoEncoder) for image compression
Approach
Oord et al., Neural Discrete Representation Learning
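Since the slide only cites the VQ-VAE paper, here is a minimal sketch of its core step, nearest-codebook quantization with a straight-through gradient; note that DALL-E's dVAE instead relaxes this hard assignment with Gumbel-Softmax (next slide).

```python
import torch

def vector_quantize(z_e, codebook):
    """z_e: (B, H, W, D) encoder outputs; codebook: (K, D) embedding vectors.
    Each output vector is replaced by its nearest codebook entry; the
    straight-through estimator copies gradients from z_q back to z_e."""
    flat = z_e.reshape(-1, z_e.shape[-1])          # (B*H*W, D)
    dists = torch.cdist(flat, codebook)            # (B*H*W, K) distances to every code
    idx = dists.argmin(dim=-1)                     # index of the nearest code
    z_q = codebook[idx].reshape(z_e.shape)         # quantized tensor, (B, H, W, D)
    z_q = z_e + (z_q - z_e).detach()               # straight-through gradient trick
    return z_q, idx.reshape(z_e.shape[:-1])
```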
 Gumbel Softmax
Approach
Jang et al., Categorical Reparameterization with Gumbel-Softmax
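A minimal sketch of the Gumbel-Softmax relaxation that lets the dVAE sample discrete codes differentiably; PyTorch also ships an equivalent as torch.nn.functional.gumbel_softmax.

```python
import torch

def gumbel_softmax_sample(logits, tau=1.0, hard=False):
    """Differentiable (relaxed) sample from a categorical distribution over codes.
    logits: (..., K) unnormalized log-probabilities; tau: softmax temperature."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = torch.softmax((logits + gumbel) / tau, dim=-1)   # relaxed one-hot sample
    if hard:
        # Discretize on the forward pass but keep the soft gradients (straight-through).
        index = y_soft.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
        return y_hard + y_soft - y_soft.detach()
    return y_soft
```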
 A discrete variational autoencoder (dVAE) is
trained to compress each 256×256 RGB image
into a 32×32 grid of image tokens, each element
of which can assume 8192 possible values.
 The encoder downsamples the spatial resolution by
a factor of 8.
 While details are sometimes lost or distorted, the
main features of the image are still typically
recognizable.
Approach
Qualitative Examples
“ALIGN (A Large-scale ImaGe and Noisy-text embedding) uses a noisy dataset of over 1B image
alt-text pairs, obtained without expensive filtering or post-processing steps, to learn a simple dual-
encoder architecture (image and text) by aligning visual-language representations using a
contrastive loss”
 While representation learning in NLP has transitioned to training on raw text without human
annotations, visual and vision-language representations still rely heavily on curated training
datasets that are expensive or require expert knowledge. This costly curation process limits
the size of datasets and hence hinders the scaling of trained models.
 The scale of the corpus can make up for its noise and leads to state-of-the-art representations
even with such a simple learning scheme.
Summary
 Visual and language representations are jointly learned from noisy image alt-text data and can be used for
vision-only or vision-language task transfer.
 Without any fine-tuning, ALIGN powers cross-modal search including image-to-text search, text-to-image
search and even search with joint image+text queries.
Summary
 The goal is to align the visual-language representations in a shared latent embedding space using a simple dual-encoder architecture (image: EfficientNet, text: BERT).
 Image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pushes the embeddings of matched image-text pairs together while pushing those of non-matched pairs apart.
 Considering paired texts as fine-grained labels of images, the image-to-text contrastive loss is analogous to the conventional label-based classification objective; the key difference is that the text encoder generates the “label” weights.
Approach
 The image (EfficientNet) and text (BERT) encoders are optimized via a contrastive loss (the sum of two normalized softmax losses) that pushes the embeddings of matched image-text pairs (positives) together while pushing those of non-matched pairs (negatives) apart; the two terms are written out after the symbol definitions below.
• Image-to-text classification loss
• Text-to-image classification loss
Approach
𝑥𝑖: image embedding in the 𝑖-th pair
𝑦𝑗: text embedding in the 𝑗-th pair
𝑁: batch size
𝜎: (learnable) temperature
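The two loss terms were shown as images on the original slide; the formulas below are reconstructed from the symbol definitions above and are consistent with the ALIGN paper (embeddings are L2-normalized):

```latex
\mathcal{L}_{i2t} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{\exp\!\left(x_i^{\top} y_i / \sigma\right)}
           {\sum_{j=1}^{N}\exp\!\left(x_i^{\top} y_j / \sigma\right)}
\qquad
\mathcal{L}_{t2i} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{\exp\!\left(y_i^{\top} x_i / \sigma\right)}
           {\sum_{j=1}^{N}\exp\!\left(y_i^{\top} x_j / \sigma\right)}
\qquad
\mathcal{L} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}
```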
Qualitative Examples
“Unified Transformer (UniT) is built upon the transformer encoder-decoder architecture and jointly
learns multiple tasks across different modalities (image & text), ranging from object detection to
language understanding and multimodal reasoning”
 The UniT model encodes each input modality with an encoder and makes predictions for each task with a shared decoder over the encoded input representations, followed by task-specific output heads.
 Compared to previous efforts on multi-task learning with transformers, UniT shares the same model parameters across all tasks instead of separately fine-tuning task-specific models, and handles a much wider variety of tasks across different domains.
 UniT learns 7 tasks jointly over 8 datasets, achieving comparable performance to well-
established prior work on each domain under the same supervision with a compact set of
model parameters.
Summary
 UniT uses an image encoder, a text encoder, and a joint decoder with per-task query embeddings, followed by task-specific heads that produce the final outputs for each task.
Approach
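A schematic sketch of this shared-decoder pattern (illustrative, not the authors' implementation): the image/text encoders and the decoder are assumed callables (e.g., a DETR-style CNN backbone, a BERT-style encoder, and an nn.TransformerDecoder with batch_first=True), and the per-task query counts and heads are simplified.

```python
import torch
import torch.nn as nn

class UniTSketch(nn.Module):
    """Schematic UniT-style model: modality-specific encoders, one shared decoder
    with learned per-task query embeddings, and task-specific output heads."""
    def __init__(self, image_encoder, text_encoder, decoder, d_model, task_out_dims,
                 num_queries=100):
        super().__init__()
        self.image_encoder, self.text_encoder, self.decoder = image_encoder, text_encoder, decoder
        self.task_queries = nn.ParameterDict(          # one query set per task; decoder is shared
            {t: nn.Parameter(torch.randn(num_queries, d_model)) for t in task_out_dims})
        self.heads = nn.ModuleDict(                    # simple per-task output heads
            {t: nn.Linear(d_model, dim) for t, dim in task_out_dims.items()})

    def forward(self, task, image=None, text=None):
        feats = []
        if image is not None:
            feats.append(self.image_encoder(image))    # (B, N_img, d_model)
        if text is not None:
            feats.append(self.text_encoder(text))      # (B, N_txt, d_model)
        memory = torch.cat(feats, dim=1)               # concatenated encoder outputs
        queries = self.task_queries[task].unsqueeze(0).expand(memory.size(0), -1, -1)
        hidden = self.decoder(queries, memory)         # shared transformer decoder
        return self.heads[task](hidden)                # (B, num_queries, out_dim) for this task
```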
 Among the existing architectures, the Transformer is the most generic because it has less inductive bias than the others.
 A new recipe, "large Transformer + large-scale dataset", has begun to emerge (CLIP: 400M, DALL-E: 250M, ALIGN: 1B pairs).
 All we need is data: the recent big studies talk mostly about how they collected/curated data, not much about models.
 Transformers are replacing CNN-based SOTA models, which were considered the de facto standard in the image domain, on several benchmarks.
 Also, Transformers are indeed strong at multi-modality.
Wrap up
Thank You
shmwoo9395@{gmail.com, gist.ac.kr}
If you find my presentation interesting and it gives you new inspiration, please feel free to contact me!