Exploring Simple Siamese Representation Learning
Sungchul Kim
Contents
1. Introduction
2. Method
3. Empirical Study
4. Comparisons
5. (Hypothesis)
2
Introduction
3
• Un-/Self-supervised representation learning
• Pretext (auxiliary) task : image → model → extra module → pseudo-task
• Downstream task : image → model → new extra module
• Linear evaluation : freeze the extractor and train a 1-layer NN
• Finetune : train all networks for the downstream task
• Transfer learning : detection/segmentation
• Retrieval : calculate Recall among kNN
Introduction
4
• Siamese networks
• An undesired trivial solution : all outputs “collapsing” to a constant
• Contrastive Learning
• Attract the positive sample pairs and repulse the negative sample pairs
• SimCLR, MoCo, …
• Clustering
• Alternate between clustering the representations and learning to predict the cluster assignment
• SwAV
• BYOL
• A Siamese network in which one branch is a momentum encoder
Introduction
5
• SimSiam
• Without the momentum encoder, negative pairs, and online clustering
• Does not cause collapsing and can perform competitively
• Siamese networks : inductive biases for modeling invariance
• Invariance : two observations of the same concept should produce the same outputs.
• The weight-sharing Siamese networks can model invariance w.r.t. more complicated transformations
(augmentations).
Method
6
• SimSiam
• Two randomly augmented views 𝑥1 and 𝑥2 from an image 𝑥
• An encoder network 𝑓 consisting of a backbone and a projection MLP head
• Two output vectors $p_1 \triangleq h(f(x_1))$ and $z_2 \triangleq f(x_2)$
$\mathcal{D}(p_1, z_2) = -\dfrac{p_1}{\|p_1\|_2} \cdot \dfrac{z_2}{\|z_2\|_2}$
• A symmetrized loss
$\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, z_2) + \frac{1}{2}\mathcal{D}(p_2, z_1)$
• Stop-gradient operation
$\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, \text{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \text{stopgrad}(z_1))$
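A minimal sketch of this loss in PyTorch-style Python (the names `f`, `h`, and `simsiam_loss` are illustrative assumptions, not the paper's released code; `f` is the encoder with its projection head, `h` the prediction MLP):

```python
import torch.nn.functional as F

def D(p, z):
    # Negative cosine similarity; detaching z realizes the stop-gradient.
    z = z.detach()
    return -F.cosine_similarity(p, z, dim=-1).mean()

def simsiam_loss(f, h, x1, x2):
    """Symmetrized SimSiam loss for two augmented views x1, x2."""
    z1, z2 = f(x1), f(x2)          # projections
    p1, p2 = h(z1), h(z2)          # predictions
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)
```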
Method
7
• SimSiam
• Baseline Settings
• Optimizer
• SGD with 0.9 momentum and weight decay 0.0001
• Learning rate : 𝑙𝑟 × BatchSize/256 with a base 𝑙𝑟 = 0.05
• Batch size : 512 / Batch normalization synchronized across devices
• Projection MLP
• (FC (2048) + BN + ReLU) x 2 + FC (𝑑=2048) + BN (𝑧)
• Prediction MLP
• FC (512) + BN + ReLU + FC (𝑑=2048) (𝑝) (sketched below)
• Experimental setup
• Pre-training : 1000-class ImageNet training set
• Linear evaluation : training a supervised linear classifier on frozen representations
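The two MLP heads described above can be sketched roughly as follows (a PyTorch sketch under the stated baseline settings; the `bias=False` choice before BN and the 2048-dim ResNet-50 feature input are assumptions):

```python
import torch.nn as nn

def projection_mlp(in_dim=2048, hidden=2048, out_dim=2048):
    # (FC + BN + ReLU) x 2, then FC + BN on the output z (no ReLU at the end).
    return nn.Sequential(
        nn.Linear(in_dim, hidden, bias=False), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, hidden, bias=False), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, out_dim, bias=False), nn.BatchNorm1d(out_dim),
    )

def prediction_mlp(in_dim=2048, hidden=512, out_dim=2048):
    # Bottleneck hidden layer; no BN or ReLU on the output p.
    return nn.Sequential(
        nn.Linear(in_dim, hidden, bias=False), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, out_dim),
    )
```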
Empirical Study
8
• Stop-gradient
• (Left) Without stop-gradient, the optimizer quickly finds a degenerate solution that reaches the minimum possible loss of −1.
• (Middle) The std of the $l_2$-normalized output $z' = z / \|z\|_2$
• If the outputs collapse to a constant vector → the std over all samples should be zero for each channel! (the red curve)
• If the output $z$ has a zero-mean isotropic Gaussian distribution → the std of $z'$ is $\frac{1}{\sqrt{d}}$
→ The outputs do not collapse, and they are scattered on the unit hypersphere!
• $z'_i = z_i \big/ \big(\sum_{j=1}^{d} z_j^2\big)^{1/2} \approx z_i / d^{1/2}$ (if $z_j$ follows an i.i.d. Gaussian distribution) → $\operatorname{std}(z'_i) \approx 1/d^{1/2}$
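A quick numerical check of this argument (a toy sketch; the dimensions are illustrative):

```python
import torch

# If z has i.i.d. zero-mean Gaussian entries, the per-channel std of the
# l2-normalized output is close to 1/sqrt(d).
d, n = 2048, 10000
z = torch.randn(n, d)
z_norm = z / z.norm(dim=1, keepdim=True)
print(z_norm.std(dim=0).mean().item(), 1 / d ** 0.5)  # both ~0.022
```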
Empirical Study
9
• Stop-gradient
• (Right) the validation accuracy of a k-nearest-neighbor (kNN) classifier
• The kNN classifier can serve as a monitor of the progress.
• With stop-gradient, the kNN monitor shows a steadily improving accuracy!
• (Table) the linear evaluation result
• w/ stop-grad : a nontrivial accuracy of 67.7%
• w/o stop-grad : 0.1%
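A rough sketch of such a kNN monitor on frozen, $l_2$-normalized features (the value of k, the cosine similarity, and the majority vote are assumptions; the slides only say a kNN classifier is used as a progress monitor):

```python
import torch
import torch.nn.functional as F

def knn_accuracy(train_feats, train_labels, val_feats, val_labels, k=20):
    """kNN classification accuracy on l2-normalized features (a simple monitor)."""
    train_feats = F.normalize(train_feats, dim=1)
    val_feats = F.normalize(val_feats, dim=1)
    sims = val_feats @ train_feats.t()        # cosine similarities to all train features
    nn_idx = sims.topk(k, dim=1).indices      # indices of the k nearest neighbors
    nn_labels = train_labels[nn_idx]          # (num_val, k) neighbor labels
    preds = nn_labels.mode(dim=1).values      # majority vote per validation sample
    return (preds == val_labels).float().mean().item()
```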
Empirical Study
10
• Stop-gradient
• Discussion
• Collapsing solutions do exist.
• They are characterized by the minimum possible loss and constant outputs (reaching the minimum loss alone is not sufficient to indicate collapsing!)
• Architecture designs alone (e.g., predictor, BN, $l_2$-norm) are insufficient for our method to prevent collapsing.
Empirical Study
11
• Predictor
• (Table 1) the predictor MLP (ℎ)’s effect
• (a) Removing $h$ (i.e., $h$ is the identity mapping) → the model does not work!
$\mathcal{L}_{(a)} = \frac{1}{2}\mathcal{D}(z_1, \text{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(z_2, \text{stopgrad}(z_1))$
• The gradient has the same direction as the gradient of $\mathcal{D}(z_1, z_2)$, with the magnitude scaled by 1/2 → collapsing!
• The asymmetric variant $\mathcal{D}(p_1, \text{stopgrad}(z_2))$ also fails if $h$ is removed, while it can work if $h$ is kept.
Empirical Study
12
• Predictor
• (Table 1) the predictor MLP (ℎ)’s effect
• (b) ℎ is fixed as random initialization → the model does not work!
• This failure is not about collapsing. The training does not converge, and the loss remains high.
• (c) ℎ with a constant 𝑙𝑟 (without decay) → works well!
• It is not necessary to force $h$ to converge (by reducing 𝑙𝑟) before the representations are sufficiently trained.
Empirical Study
13
• Batch Size
• (Table 2) the results with a batch size from 64 to 4096
• Linear scaling rule (𝑙𝑟 × BatchSize/256) with base 𝑙𝑟 = 0.05
• 10 epochs of warm-up for batch sizes ≥ 1024
• SGD optimizer, not LARS
• Batch sizes 64 and 128 : drops of 0.8% and 2.0% in accuracy, respectively
• 256 to 2048 : similarly good
→ SimSiam, SimCLR, SwAV : Siamese network with direct weight-sharing
→ SimCLR and SwAV both require a large batch (e.g., 4096) to work well.
→ The standard SGD optimizer does not work well when the batch is too large.
→ A specialized optimizer is not necessary for preventing collapsing.
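For reference, a tiny sketch of the learning-rate rule used in this study (the linear warm-up shape is an assumption; the slides only state 10 epochs of warm-up for batch sizes ≥ 1024):

```python
def learning_rate(base_lr, batch_size, epoch, warmup_epochs=0):
    """Linear scaling rule: lr = base_lr * BatchSize / 256, with optional warm-up."""
    lr = base_lr * batch_size / 256
    if warmup_epochs and epoch < warmup_epochs:
        lr *= (epoch + 1) / warmup_epochs      # assumed linear warm-up
    return lr

print(learning_rate(0.05, 512, epoch=20))      # baseline setting -> 0.1
```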
Empirical Study
14
• Batch Normalization
• (Table 3) the configurations of BN on the MLP heads
• (a) remove all BN layers in the MLP heads
• This variant does not cause collapse, although the accuracy is low (34.6%).
• (b) adding BN to the hidden layers
• It increases accuracy to 67.4%.
• (c) adding BN to the output of the projection MLP (default)
• It boosts accuracy to 68.1%.
• Disabling learnable affine transformation (scale and offset) in 𝑓’s output BN : 68.2%
• (d) adding BN to the output of the prediction MLP ℎ → does not work well!
• This failure is not about collapsing; the training is unstable and the loss oscillates.
→ BN is helpful for optimization when used appropriately.
→ No evidence that BN helps to prevent collapsing.
Empirical Study
15
• Similarity Function
• Modify $\mathcal{D}$ as: $\mathcal{D}(p_1, z_2) = -\operatorname{softmax}(z_2) \cdot \log \operatorname{softmax}(p_1)$
• The output of softmax can be thought of as the probabilities of belonging to each of 𝑑 pseudo-categories.
• Simply replace the cosine similarity with the cross-entropy similarity, and symmetrize it.
• The cross-entropy variant can converge to a reasonable result without collapsing.
→ The collapsing prevention behavior is not just about the cosine similarity.
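A minimal sketch of this cross-entropy variant in the same PyTorch style as before (illustrative, not the paper's code):

```python
import torch.nn.functional as F

def D_ce(p, z):
    # Cross-entropy "similarity": treat the d channels as pseudo-category logits.
    z = z.detach()                               # stop-gradient, as in the cosine version
    return -(F.softmax(z, dim=-1) * F.log_softmax(p, dim=-1)).sum(dim=-1).mean()

# The symmetrized loss is formed exactly as before:
# L = 0.5 * D_ce(p1, z2) + 0.5 * D_ce(p2, z1)
```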
Empirical Study
16
• Symmetrization
• Observe that SimSiam’s behavior of preventing collapsing does not depend on symmetrization.
• Compare with the asymmetric variant $\mathcal{D}(p_1, \text{stopgrad}(z_2))$.
• The asymmetric variant achieves reasonable results.
→ Symmetrization is helpful for boosting accuracy, but it is not related to collapse prevention.
• Sampling two pairs for each image in the asymmetric version (“2x”)
→ It makes the gap smaller.
Empirical Study
17
• Summary
• SimSiam can produce meaningful results without collapsing.
• The optimizer (batch size), batch normalization, similarity function, and symmetrization
→ affect accuracy, but we have seen no evidence that they are related to collapse prevention
→ the stop-gradient operation may play an essential role!
Hypothesis
18
• What is implicitly optimized by SimSiam?
• Formulation
• SimSiam is an implementation of an Expectation-Maximization (EM) like algorithm.
$\mathcal{L}(\theta, \eta) = \mathbb{E}_{x,\mathcal{T}}\!\left[\,\big\|\mathcal{F}_\theta(\mathcal{T}(x)) - \eta_x\big\|_2^2\,\right]$
• ℱ : a network parameterized by 𝜃
• 𝒯 : the augmentation / 𝑥 : an image
• 𝜂 : another set of variables (proportional to the number of images)
• 𝜂𝑥 : the representation of the image 𝑥 / the subscript 𝑥 : using the image index to access a sub-vector of 𝜂
• The expectation 𝔼 ⋅ is over the distribution of images and augmentations.
• The mean squared error $\|\cdot\|_2^2$ : equivalent to the cosine similarity if the vectors are $l_2$-normalized
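For $l_2$-normalized vectors, the squared error and the negative cosine similarity differ only by a constant and a scale, which is why the two forms are interchangeable here:

```latex
\|a - b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 - 2\, a \cdot b = 2 - 2\, a \cdot b
\qquad \text{when } \|a\|_2 = \|b\|_2 = 1 .
```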
Hypothesis
19
• Formulation
• Solve $\min_{\theta,\eta} \mathcal{L}(\theta, \eta)$
• This formulation is analogous to k-means clustering.
• The variable 𝜃 is analogous to the clustering centers (the learnable parameters of an encoder)
• The variable 𝜂𝑥 is analogous to the assignment vector of the sample 𝑥 (a one-hot vector in k-means; here, the representation of 𝑥)
• An alternating algorithm
• Solving for $\theta$ : $\theta^t \leftarrow \arg\min_\theta \mathcal{L}(\theta, \eta^{t-1})$
• The stop-gradient operation is a natural consequence : the gradient does not back-propagate to $\eta^{t-1}$, which is a constant.
• Solving for $\eta$ : $\eta^t \leftarrow \arg\min_\eta \mathcal{L}(\theta^t, \eta)$
• Minimize $\mathbb{E}_\mathcal{T}\!\left[\,\big\|\mathcal{F}_{\theta^t}(\mathcal{T}(x)) - \eta_x\big\|_2^2\,\right]$ for each image $x$ : $\eta_x^t \leftarrow \mathbb{E}_\mathcal{T}\!\left[\mathcal{F}_{\theta^t}(\mathcal{T}(x))\right]$ (due to the mean squared error)
• $\eta_x$ is assigned the average representation of $x$ over the distribution of augmentations.
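A runnable toy illustration of this alternation (a sketch with a linear encoder, random data, and noise augmentations; every name and size is an illustrative assumption, not the paper's setup):

```python
import torch

torch.manual_seed(0)
num_images, in_dim, out_dim = 32, 16, 8
images = torch.randn(num_images, in_dim)                 # toy "images"
theta = torch.randn(in_dim, out_dim, requires_grad=True)  # linear "encoder" F_theta
eta = torch.zeros(num_images, out_dim)                    # one target sub-vector per image
opt = torch.optim.SGD([theta], lr=0.1)

def augment(x):                                           # toy stand-in for the augmentation T
    return x + 0.1 * torch.randn_like(x)

for t in range(100):
    # Solve for theta with eta fixed: eta enters the loss as a constant (stop-gradient).
    opt.zero_grad()
    loss = ((augment(images) @ theta - eta) ** 2).mean()
    loss.backward()
    opt.step()
    # Solve for eta with theta fixed: the MSE minimizer is the average representation
    # of each image over (sampled) augmentations.
    with torch.no_grad():
        eta = torch.stack([augment(images) @ theta for _ in range(8)]).mean(dim=0)
```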
Hypothesis
20
• Formulation
• One-step alternation
• Approximate $\eta_x^t \leftarrow \mathbb{E}_\mathcal{T}\!\left[\mathcal{F}_{\theta^t}(\mathcal{T}(x))\right]$ by sampling the augmentation only once ($\mathcal{T}'$) and ignoring $\mathbb{E}_\mathcal{T}[\cdot]$ :
$\eta_x^t \leftarrow \mathcal{F}_{\theta^t}(\mathcal{T}'(x))$
• Then,
$\theta^{t+1} \leftarrow \arg\min_\theta \mathbb{E}_{x,\mathcal{T}}\!\left[\,\big\|\mathcal{F}_\theta(\mathcal{T}(x)) - \mathcal{F}_{\theta^t}(\mathcal{T}'(x))\big\|_2^2\,\right]$
• $\theta^t$ : a constant in this sub-problem
• $\mathcal{T}'$ : another view due to its random nature
→ The SimSiam algorithm: a Siamese network with stop-gradient naturally applied.
Hypothesis
21
• Formulation
• Predictor
• Assumption : $h$ is helpful in our method because of the one-step approximation above.
• Minimizing $\mathbb{E}_z\!\left[\,\big\|h(z_1) - z_2\big\|_2^2\,\right]$ → the optimal solution : $h(z_1) = \mathbb{E}_z[z_2] = \mathbb{E}_\mathcal{T}\!\left[f(\mathcal{T}(x))\right]$ for any image $x$
• In practice, it would be unrealistic to actually compute the expectation $\mathbb{E}_\mathcal{T}$
• But it may be possible for a neural network (e.g., the predictor $h$) to learn to predict the expectation, while the sampling of $\mathcal{T}$ is implicitly distributed across multiple epochs.
• Symmetrization
• Symmetrization is like denser sampling of $\mathcal{T}$.
• $\mathbb{E}_{x,\mathcal{T}}[\cdot]$ is approximated by sampling a batch of images and one pair of augmentations $(\mathcal{T}_1, \mathcal{T}_2)$
• Symmetrization supplies an extra pair $(\mathcal{T}_2, \mathcal{T}_1)$
• Not necessary, yet improving accuracy
Hypothesis
22
• Proof of concept
• Multi-step alternation
• SimSiam : alternating between updating $\theta^t$ and $\eta^t$, with an interval of one SGD step
→ here, the interval has multiple steps of SGD
• $t$ : the index of an outer loop
• $\theta^t$ is updated by an inner loop of $k$ SGD steps
• Pre-compute the $\eta_x$ required for all $k$ SGD steps and cache them in memory
• 1-step : SimSiam / 1-epoch : $k$ is the number of steps required for one epoch
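Continuing the toy setup from the sketch after the alternating algorithm (images, theta, opt, augment), the multi-step variant can be sketched as follows (again illustrative, not the paper's implementation):

```python
# theta gets an inner loop of k SGD steps while the cached eta stays fixed
# for the whole interval.
k = 10
for outer in range(10):
    with torch.no_grad():                     # pre-compute eta for this interval and cache it
        eta = torch.stack([augment(images) @ theta for _ in range(8)]).mean(dim=0)
    for step in range(k):                     # inner loop: k SGD steps on theta
        opt.zero_grad()
        loss = ((augment(images) @ theta - eta) ** 2).mean()
        loss.backward()
        opt.step()
```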
Hypothesis
23
• Proof of concept
• Expectation over augmentations
• Do not update $\eta_x$ directly; maintain a moving average:
$\eta_x^t \leftarrow m \cdot \eta_x^{t-1} + (1 - m) \cdot \mathcal{F}_{\theta^t}(\mathcal{T}'(x))$, with $m = 0.8$
• This moving-average provides an approximated expectation of multiple views.
• 55.0% accuracy without the predictor ℎ
• It fails completely if $h$ is removed and the moving average is not maintained (Table 1a).
→ The usage of the predictor $h$ is related to approximating $\mathbb{E}_\mathcal{T}[\cdot]$
• Discussion
• This hypothesis does not explain why collapsing is prevented.
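Reusing the same toy setup, the moving-average target can be sketched as:

```python
# eta is updated as a running average over augmented views (m = 0.8, as on the
# slide) instead of being replaced outright, approximating the expectation over
# augmentations.
m = 0.8
for t in range(100):
    opt.zero_grad()
    loss = ((augment(images) @ theta - eta) ** 2).mean()
    loss.backward()
    opt.step()
    with torch.no_grad():
        eta = m * eta + (1 - m) * (augment(images) @ theta)
```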
Comparisons
24
• Result Comparisons
• ImageNet
• SimSiam achieves competitive results despite its simplicity.
• The highest accuracy among all methods under 100-epoch pre-training
• Better results than SimCLR in all cases
Comparisons
25
• Result Comparisons
• Transfer Learning
• VOC (object detection) and COCO (object detection and instance segmentation)
• SimSiam, base : the baseline pre-training recipe
• SimSiam, optimal : optimal recipe (𝑙𝑟=0.5 and 𝑤𝑑=1e-5)
• The Siamese structure is a core factor for their general success.
Comparisons
26
• Methodology Comparisons
• Relation to SimCLR
• SimSiam is SimCLR without negatives.
• Negative samples (“dissimilarity”) to prevent collapsing
• Appending the prediction MLP $h$ and stop-gradient to SimCLR
• Neither the stop-gradient nor the extra predictor is necessary or helpful for SimCLR.
• Their effectiveness in SimSiam is presumably a consequence of another underlying optimization problem.
• It is different from the contrastive learning problem!
• Relation to BYOL
• SimSiam is BYOL without the momentum encoder.
• The momentum encoder may be beneficial for accuracy, but it is not necessary for preventing collapsing.
Thank you
27