Regularizing Class-wise Predictions via Self-knowledge Distillation (CVPR 2020)
1. Regularizing Class-wise Predictions via Self-knowledge Distillation
Sukmin Yun1, Jongjin Park1, Kimin Lee2, Jinwoo Shin1
1 Korea Advanced Institute of Science and Technology (KAIST)
2 University of California, Berkeley
CVPR 2020
2. Introduction
• We propose a new output regularizer that utilizes dark knowledge, i.e., the information in a model's predictions on non-target labels
[Figure: a DNN's softmax prediction; the probability mass on non-target labels is the "dark knowledge"]
• Supervision on dark knowledge leads to better generalization!
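A small illustration of the idea, with hypothetical logit values: raising the softmax temperature makes the probability structure over non-target labels (the dark knowledge) more visible.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a 4-class problem; class 0 is the target
logits = torch.tensor([[8.0, 4.5, 4.0, 0.5]])
for T in (1.0, 4.0):
    p = F.softmax(logits / T, dim=1)
    print(f"T={T}: {[round(v, 3) for v in p.squeeze().tolist()]}")
# At T=1 nearly all mass sits on the target class; at T=4 the relative
# probabilities of the non-target labels become apparent.
```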
3. Class-wise Self-knowledge Distillation (CS-KD)
• Self-supervision by penalizing the difference between the predictions on similar samples
• $P(y \mid x'; \tilde{\theta})$ is the self-supervision of $P(y \mid x; \theta)$, where $x$ and $x'$ are similar samples (same class) and $\tilde{\theta}$ is a fixed copy of the parameters $\theta$
[Figure: two samples of the same class are fed to the network; the prediction on one supervises the other]
4. Class-wise Self-knowledge Distillation (CS-KD)
• The total training loss is defined as follows:
$$\mathcal{L}_{\text{tot}}(x, x'; \theta, T) = \mathcal{L}_{\text{CE}}(x, y; \theta) + \lambda_{\text{cls}} \cdot T^2 \cdot \mathrm{KL}\big(P(y \mid x'; \tilde{\theta}, T) \,\|\, P(y \mid x; \theta, T)\big)$$
where $\mathrm{KL}$ denotes the Kullback-Leibler divergence, $\mathcal{L}_{\text{CE}}$ denotes the cross-entropy loss, $T$ is the softmax temperature, and $\lambda_{\text{cls}}$ is a balancing weight
[Figure: similar samples (same class) paired for the class-wise regularization term]
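A minimal PyTorch sketch of this loss, assuming `model` is any classifier returning logits and that `x_prime[i]` shares the label `y[i]` with `x[i]`; the values T = 4.0 and lambda_cls = 1.0 are illustrative defaults, not the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def cs_kd_loss(model, x, x_prime, y, T=4.0, lambda_cls=1.0):
    """Cross-entropy on x plus class-wise self-distillation from x_prime."""
    logits = model(x)                      # predictions being trained (theta)
    with torch.no_grad():                  # \tilde{theta}: fixed copy, no gradient
        logits_prime = model(x_prime)

    ce = F.cross_entropy(logits, y)        # standard supervised term
    # KL( P(y|x'; \tilde{theta}, T) || P(y|x; theta, T) ), scaled by T^2
    kl = F.kl_div(
        F.log_softmax(logits / T, dim=1),
        F.softmax(logits_prime / T, dim=1),
        reduction="batchmean",
    )
    return ce + lambda_cls * (T ** 2) * kl
```

Note that `F.kl_div` takes the log-probabilities of the trained prediction as its first argument and the (detached) soft target as its second, which matches the KL direction in the formula above.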
5. Class-wise Self-knowledge Distillation (CS-KD)
• CS-KD achieves two desirable goals simultaneously (contrasted with label smoothing in the sketch after this slide):
1. Preventing overconfident predictions
• The goal of entropy regularization methods [1]
• CS-KD utilizes the model's predictions on other samples as soft labels
2. Reducing the intra-class variations
• The goal of margin-based methods [2]
• CS-KD minimizes the distance between the logits of two samples of the same class

[1] When does label smoothing help? In NeurIPS, 2019.
[2] AdaCos: Adaptively scaling cosine logits for effectively learning deep face representations. In CVPR, 2019.
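An illustrative contrast of the two kinds of soft target, with assumed values (5 classes, smoothing eps = 0.1, temperature T = 4.0): label smoothing [1] uses a fixed, uniform soft target, while CS-KD's soft target is the model's own prediction on another same-class sample, so it carries sample-dependent dark knowledge.

```python
import torch
import torch.nn.functional as F

num_classes, eps, T = 5, 0.1, 4.0
y = 2  # true class of the current sample

# Label smoothing [1]: (1 - eps) on the true class, eps spread uniformly
ls_target = torch.full((1, num_classes), eps / num_classes)
ls_target[0, y] += 1.0 - eps

# CS-KD: the soft target is the (temperature-scaled) prediction on another
# sample x' of the same class -- sample-dependent, not uniform
logits_prime = torch.randn(1, num_classes)        # stand-in for model(x')
cskd_target = F.softmax(logits_prime / T, dim=1)
```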
6. Our Contributions
• We demonstrate the effectiveness of CS-KD in four respects:
• Improving the generalization ability
• Reducing the intra-class variations
• Relaxing the overconfident predictions
• Enhancing model calibration [3] (see the ECE sketch after this slide)

[3] Predicting good probabilities with supervised learning. In ICML, 2005.
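A minimal sketch of expected calibration error (ECE), the standard metric for the calibration claim above; the choice of 15 equal-width confidence bins is an assumption, as the bin count varies across papers.

```python
import torch

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE = sum_b (|B_b| / N) * |acc(B_b) - conf(B_b)| over confidence bins."""
    ece = torch.zeros(1)
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = (predictions[in_bin] == labels[in_bin]).float().mean()
            conf = confidences[in_bin].mean()
            ece += in_bin.float().mean() * (acc - conf).abs()
    return ece.item()

# Usage on random stand-in data
confidences, predictions = torch.max(torch.softmax(torch.randn(100, 10), dim=1), dim=1)
labels = torch.randint(0, 10, (100,))
print(expected_calibration_error(confidences, predictions, labels))
```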
7. Conclusion
• The proposed CS-KD is arguably the simplest way to achieve two goals via a single mechanism:
• Preventing overconfident predictions (measured by the expected calibration error)
• Reducing the intra-class variations (measured by the generalization error)
• We believe the proposed method could find broader use in other applications.

Thank you for your attention