Contrastive learning is a machine learning technique that learns features without any labels, based on whether two images are similar or dissimilar. It differs from conventional supervised learning: supervised learning incurs labeling cost and, because it is task-specific, its generalizability can suffer. Contrastive learning proceeds without labels, so there is no labeling cost and generalizability can be better. This paper proposes Joint Contrastive Learning (JCL) for more effective contrastive learning. https://youtu.be/0NLq-ikBP1I
3. Introduction Related Work Methods Experiments Conclusions
❖ What is Contrastive learning?
Contrastive learning is a machine learning technique used to learn the general features of a dataset
without labels by teaching the model which data points are similar or different.
https://amitness.com/2020/03/illustrated-simclr/
5. Introduction Related Work Methods Experiments Conclusions
❖ What is Contrastive learning?
Supervised learning (e.g., a CNN trained with labels):
1. Label cost
2. Task-specific solution (poor generalizability)
6. Introduction Related Work Methods Experiments Conclusions
❖ What is Contrastive learning?
Contrastive learning:
1. No label cost
2. Task-agnostic solution (good generalizability)
7. How can we train a model using the contrastive learning method?
Introduction Related Work Methods Experiments Conclusions
8. Introduction Related Work Methods Experiments Conclusions
❖ How do we train a model with contrastive learning?
Encoder: a mechanism (e.g., a CNN) that produces an image representation allowing the machine to understand the image.
Similarity measure: a mechanism to compute the similarity of two images from their representations, Similarity(x1, x2).
Similar and dissimilar images: example pairs of similar and dissimilar images (views of the same image count as similar, other images as different).
9. Introduction Related Work Methods Experiments Conclusions
❖ How do we train a model with contrastive learning?
A positive sample should be recognized as similar to the query, and a negative sample as different: we maximize Score(q, k+) and minimize Score(q, k-).
Putting the scores of one positive and K-1 negatives through a softmax and taking the negative log-likelihood (NLL) of the positive gives the InfoNCE loss:

\mathcal{L}_{\mathrm{InfoNCE}} = -\,\mathbb{E}\!\left[\log \frac{\exp(\mathrm{Score}(q, k^{+}))}{\exp(\mathrm{Score}(q, k^{+})) + \sum_{j=1}^{K-1}\exp(\mathrm{Score}(q, k_{j}^{-}))}\right]

https://ankeshanand.com/blog/2020/01/26/contrative-self-supervised-learning.html
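As a concrete illustration (not taken from the slides), here is a minimal InfoNCE loss sketch in PyTorch, assuming the score is a temperature-scaled dot product between L2-normalized embeddings:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, negative_keys, temperature=0.2):
    """Minimal InfoNCE: one positive and K-1 negatives per query.

    query:          (N, D) query embeddings
    positive_key:   (N, D) positive key embedding for each query
    negative_keys:  (K-1, D) negative keys shared by all queries (e.g., a queue)
    """
    q = F.normalize(query, dim=1)
    k_pos = F.normalize(positive_key, dim=1)
    k_neg = F.normalize(negative_keys, dim=1)

    pos_logits = (q * k_pos).sum(dim=1, keepdim=True) / temperature   # (N, 1)
    neg_logits = q @ k_neg.t() / temperature                          # (N, K-1)

    logits = torch.cat([pos_logits, neg_logits], dim=1)               # (N, K)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)                            # softmax + NLL
```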
10. Introduction Related Work Methods Experiments Conclusions
❖ InfoNCE loss
Maximize mutual information
https://arxiv.org/pdf/1807.03748.pdf
● Minimizing the InfoNCE loss maximizes a lower bound on the mutual information (MI) between the query and its positive key (CPC, van den Oord et al., 2018).
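For reference, the relation stated in the CPC paper linked above can be written compactly as follows (N is the number of samples in the softmax; the optimal score is proportional to a density ratio):

```latex
% Optimal score approximates a density ratio, and InfoNCE lower-bounds MI (CPC)
\exp\bigl(\mathrm{Score}^{*}(q, k)\bigr) \;\propto\; \frac{p(k \mid q)}{p(k)},
\qquad
I(q; k^{+}) \;\ge\; \log N \;-\; \mathcal{L}_{\mathrm{InfoNCE}} .
```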
12. Introduction Related Work Methods Experiments Conclusions
❖ Self-supervised learning
Pretext task: a pre-designed task for the network to solve; visual features are learned by optimizing the objective function of the pretext task (no labels required).
Downstream task: a computer vision application used to evaluate the quality of the features learned by self-supervised learning (labels required).
14. Introduction Related Work Methods Experiments Conclusions
❖ Self-supervised learning in the pretext task
https://arxiv.org/pdf/1803.07728.pdf, https://arxiv.org/pdf/1603.09246.pdf
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Doersch_Unsupervised_Visual_Representation_ICCV_2015_paper.pdf
Rotation: each image is randomly rotated by 0, 90, 180, or 270 degrees, and the network solves a 4-class classification problem (predict the rotation).
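A minimal sketch (illustrative names, not from the slides) of how rotation pretext labels could be generated in PyTorch, assuming square images:

```python
import torch

def rotation_pretext_batch(images):
    """Rotate each image by a random multiple of 90 degrees.

    images: (N, C, H, W) tensor with H == W (square images assumed)
    returns: rotated images and the rotation class in {0, 1, 2, 3}
             (0 -> 0 deg, 1 -> 90 deg, 2 -> 180 deg, 3 -> 270 deg)
    """
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, labels)]
    )
    return rotated, labels

# The pretext task is then plain 4-way classification:
# logits = model(rotated); loss = F.cross_entropy(logits, labels)
```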
15. Introduction Related Work Methods Experiments Conclusions
❖ Self-supervised learning in the pretext task
Jigsaw Puzzle: image patches are randomly shuffled, and the network predicts the permutation.
Context Prediction: the network predicts the relative location of a patch with respect to a reference patch.
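A rough sketch (illustrative only, with a toy 2x2 grid and a tiny permutation set; real jigsaw setups typically use around 100 permutations of 9 patches) of how jigsaw pretext labels could be produced:

```python
import random
import torch

# Tiny illustrative permutation set for 4 patches.
PERMUTATIONS = [
    (0, 1, 2, 3), (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0),
]

def jigsaw_pretext(image, grid=2):
    """Split an image (C, H, W) into grid x grid patches, shuffle them with a
    known permutation, and return the shuffled patches plus the permutation index."""
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    patches = [
        image[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
        for i in range(grid) for j in range(grid)
    ]
    label = random.randrange(len(PERMUTATIONS))
    shuffled = [patches[p] for p in PERMUTATIONS[label]]
    return torch.stack(shuffled), label  # the network predicts `label`
```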
16. Introduction Related Work Methods Experiments Conclusions
❖ MoCo
https://arxiv.org/abs/1911.05722 ,
https://arxiv.org/abs/2003.04297
● Dictionary as a queue
(JCL adopted)
● Momentum update
● Memory bank
● Contrastive loss function
○ one positive pair
○ many negative pairs
❖ SimCLR
https://arxiv.org/abs/2002.05709
● Random cropping (resized back to the original size), color distortion, Gaussian blur
● Encoder f(·)
● Projection head g(·) mapping
● Contrastive loss function
○ one positive pair
○ many negative pairs
❖ MoCo v2
● SimCLR augmentation strategy
● MLP projection head
● Cosine LR scheduler
● Same loss function as MoCo v1
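To make the MoCo mechanics concrete, here is a minimal sketch (illustrative, not the official implementation) of the momentum update of the key encoder and the queue-as-dictionary update:

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """EMA update: the key encoder slowly tracks the query encoder."""
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, keys, ptr):
    """Replace the oldest entries of the queue (D x K) with the newest keys (N x D).

    Assumes K is divisible by the batch size, so no wraparound handling is needed.
    """
    batch = keys.size(0)
    queue[:, ptr:ptr + batch] = keys.t()
    return (ptr + batch) % queue.size(1)
```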
17. How does JCL actually work?
Introduction Related Work Methods Experiments Conclusions
18. Introduction Related Work Methods Experiments Conclusions
❖ Positive, Negative key pairs
● Most existing contrastive learning methods only consider independently penalizing the incompatibility of each single positive query-key pair at a time.
● This does not fully leverage the assumption that all augmentations corresponding to a specific image are statistically dependent on each other, and are simultaneously similar to the query.
19. Introduction Related Work Methods Experiments Conclusions
❖ Positive, Negative key pairs
● Each instance x_i is first augmented M+1 times (M positive keys plus 1 extra for the query itself).
● Unfortunately, this is not computationally feasible: carrying all (M+1)×N pairs in a mini-batch would quickly drain GPU memory even for moderate values of M.
● Capitalizing on the infinity limit (M → ∞) instead, the statistics of the positive-key distribution become sufficient to reach the same goal as explicit multiple pairing.
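One way to write this step (my reconstruction of the idea, not the paper's exact notation): averaging the pairwise loss over M positive keys and letting M go to infinity turns the sum into an expectation over the positive-key distribution:

```latex
% Average over M positive keys, then take the limit M -> infinity
\mathcal{L}_i^{(M)}
  = \frac{1}{M}\sum_{m=1}^{M}
    -\log\frac{\exp(q_i^{\top}k_{i,m}^{+}/\tau)}
              {\exp(q_i^{\top}k_{i,m}^{+}/\tau)
               + \sum_{j}\exp(q_i^{\top}k_{j}^{-}/\tau)},
\qquad
\lim_{M\to\infty}\mathcal{L}_i^{(M)}
  = \mathbb{E}_{k_i^{+}}\!\left[
    -\log\frac{\exp(q_i^{\top}k_{i}^{+}/\tau)}
              {\exp(q_i^{\top}k_{i}^{+}/\tau)
               + \sum_{j}\exp(q_i^{\top}k_{j}^{-}/\tau)}\right].
```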
20. Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
https://en.wikipedia.org/wiki/Jensen%27s_inequality
22. Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
● Jensen's inequality
https://en.wikipedia.org/wiki/Jensen%27s_inequality
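Since -log(e^a / (e^a + S)) = log(1 + S e^{-a}) and log is concave, Jensen's inequality gives an upper bound on the expected loss above, which is what makes the infinite-positive formulation tractable. A sketch in the notation used here (my reconstruction, not the paper's exact equations):

```latex
% Jensen's inequality for the concave log: E[log X] <= log E[X]
\mathbb{E}_{k^{+}}\!\left[\log\Bigl(1
  + \textstyle\sum_{j}\exp(q^{\top}k_{j}^{-}/\tau)\,
    \exp(-q^{\top}k^{+}/\tau)\Bigr)\right]
\;\le\;
\log\Bigl(1
  + \textstyle\sum_{j}\exp(q^{\top}k_{j}^{-}/\tau)\,
    \mathbb{E}_{k^{+}}\!\bigl[\exp(-q^{\top}k^{+}/\tau)\bigr]\Bigr).
```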
23. Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
● Implicit representation (positive keys) vs. explicit representation (negative keys)
24. Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
● Under this Gaussian assumption, Eq. (7) eventually reduces to Eq. (8).
● For any random variable x that follows a Gaussian distribution x ∼ N(μ, Σ), where μ is the expectation of x and Σ is the covariance of x, the moment generating function satisfies E[exp(tᵀx)] = exp(tᵀμ + ½ tᵀΣt).
● The influence of Σ_{k+} is scaled by multiplying it with a scalar λ; this tuning of λ hopefully stabilizes the training.
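Plugging the Gaussian moment generating function (with t = -q/τ) into the Jensen bound above yields a closed-form surrogate loss. The following is one way to write the resulting Eq. (8)-style form under these assumptions, with λ scaling the covariance term; it is my reconstruction, not a verbatim copy of the paper's equation:

```latex
% Closed-form upper bound after the Gaussian assumption (reconstruction)
\mathcal{L}_i \;\le\;
-\log\frac{\exp\!\Bigl(\tfrac{q_i^{\top}\mu_{k^{+}}}{\tau}
        - \tfrac{\lambda}{2\tau^{2}}\, q_i^{\top}\Sigma_{k^{+}} q_i\Bigr)}
     {\exp\!\Bigl(\tfrac{q_i^{\top}\mu_{k^{+}}}{\tau}
        - \tfrac{\lambda}{2\tau^{2}}\, q_i^{\top}\Sigma_{k^{+}} q_i\Bigr)
      + \sum_{j}\exp\!\bigl(q_i^{\top}k_{j}^{-}/\tau\bigr)} .
```

Minimizing this bound pushes the query toward the mean of its positive keys while penalizing the variance of the positive keys along the query direction, which matches the consistency effect reported later (Fig. 3(b)).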
28. Introduction Related Work Methods Experiments Conclusions
❖ Implicit Semantic Data Augmentation
https://arxiv.org/abs/1909.12220
https://neurohive.io/en/news/semantic-data-augmentation-improves-neural-network-s-generalization/
● We hope that directions corresponding to meaningful semantic transformations for each class are well represented by the principal components of the covariance matrix of that class.
● Consider training a deep network G with weights Θ on a training set D = {(x_i, y_i)}_{i=1}^{N}, where y_i ∈ {1, ..., C} is the label of the i-th sample x_i over C classes.
● Let the A-dimensional vector a_i = [a_{i1}, ..., a_{iA}]ᵀ = G(x_i, Θ) denote the deep features of x_i learned by G, and a_{ij} the j-th element of a_i.
● To obtain semantic directions to augment a_i, vectors are randomly sampled from a zero-mean multivariate normal distribution N(0, Σ_{y_i}), where Σ_{y_i} is the class-conditional covariance matrix estimated from the features of all samples in class y_i.
● During training, C covariance matrices are computed, one for each class. The augmented feature ã_i is obtained by translating a_i along a random direction sampled from N(0, λΣ_{y_i}).
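A minimal sketch of this augmentation step (illustrative names, not the authors' code; the actual ISDA paper folds the expectation over sampled directions analytically into the loss, whereas this sketch just shows the sampling view):

```python
import torch

def class_covariances(features, labels, num_classes):
    """Estimate one covariance matrix per class from deep features (N x A).

    Assumes every class appears at least once in `features`.
    """
    covs = []
    for c in range(num_classes):
        f = features[labels == c]
        f = f - f.mean(dim=0, keepdim=True)
        covs.append(f.t() @ f / max(f.size(0) - 1, 1))
    return torch.stack(covs)  # (C, A, A)

def augment_features(features, labels, covs, lam=4.0):
    """Translate each feature along a direction sampled from N(0, lam * Sigma_y)."""
    eye = torch.eye(features.size(1), device=features.device)
    aug = []
    for a, y in zip(features, labels):
        dist = torch.distributions.MultivariateNormal(
            loc=torch.zeros_like(a),
            covariance_matrix=lam * covs[y] + 1e-4 * eye,  # jitter keeps it positive definite
        )
        aug.append(a + dist.sample())
    return torch.stack(aug)
```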
29. Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
● The variable k_i^+ is assumed to follow a Gaussian distribution k_i^+ ∼ N(μ_{k+}, Σ_{k+}), where μ_{k+} and Σ_{k+} are respectively the mean and the covariance matrix of the positive keys for the query q_i.
● This assumption is legitimate because positive keys more or less share similarities in the embedding space around some mean value, as they all mirror the nature of the query to some extent.
● Implicit representation for the positive keys; explicit representation for the negative keys.
https://blog.daum.net/shksjy/228
https://neurohive.io/en/news/semantic-data-augmentation-improves-neural-network-s-generalization/
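Putting the pieces together, here is a rough sketch (my own reconstruction under the bound sketched earlier, with illustrative names; not the authors' implementation) of a JCL-style loss that estimates μ_{k+} and Σ_{k+} from M augmented positive keys per query and penalizes the positive-key variance along the query direction:

```python
import torch
import torch.nn.functional as F

def jcl_style_loss(query, positive_keys, negative_queue, tau=0.2, lam=4.0):
    """query: (N, D); positive_keys: (N, M, D); negative_queue: (D, K), assumed normalized."""
    q = F.normalize(query, dim=1)
    k = F.normalize(positive_keys, dim=2)

    mu = k.mean(dim=1)                       # (N, D): mean of the M positive keys
    centered = k - mu.unsqueeze(1)           # (N, M, D)
    # Variance of the positive keys along the query direction: q^T Sigma q
    var_along_q = ((centered @ q.unsqueeze(2)).squeeze(2) ** 2).mean(dim=1)  # (N,)

    pos_logit = (q * mu).sum(dim=1) / tau - lam * var_along_q / (2 * tau ** 2)  # (N,)
    neg_logits = q @ negative_queue / tau                                       # (N, K)

    logits = torch.cat([pos_logit.unsqueeze(1), neg_logits], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)
```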
30. Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
● Features are directly extracted from a JCL pre-trained and a MoCo v2 pre-trained ResNet-18 network.
● Fig. 3(a): cosine similarities are computed for each pair of features (every pair of 2 out of 32 augmentations) belonging to the same identity image.
● Fig. 3(b): accordingly, the variance of the obtained features belonging to the same image is much smaller.
Hyper-parameters: positive key number M = 5, temperature τ = 0.2, λ = 4.0, batch size N = 512.
31. Introduction Related Work Methods Experiments Conclusions
❖ Conclusion in Joint Contrastive Learning
● JCL implicitly involves the joint learning of an infinite number of query-key pairs for each instance.
● By applying rigorous bounding techniques to the proposed formulation, the authors turn the originally intractable loss function into a practical implementation.
● Most notably, although JCL is an unsupervised algorithm, JCL pre-trained networks even outperform their supervised counterparts in many scenarios.
Joint Contrastive Learning with Infinite Possibilities: https://arxiv.org/abs/2009.14776
Implicit semantic augmentation does not explicitly teach the model with many concrete augmentations; rather, to teach meaningful directions in the latent feature space, it assumes a Gaussian distribution and samples from it. The mean and variance of the corresponding Gaussian distribution are learned during training, and the principal components of the covariance are relied on so that meaningful semantic directions are extracted.