PR-383
http://arxiv.org/abs/2204.03475
Sunghoon Joo (주성훈), VUNO Inc.


2022. 4. 24.
1. Research Background
The performance of a vision model
•Research on training methods that push deep-learning classification accuracy on ImageNet higher is still ongoing
Previous works
•Regularizations
  •Stronger augmentations: AutoAugment, RandAugment
  •Image-based regularizations: Cutout, CutMix and Mixup
  •Architecture regularizations such as drop-path and drop-block
  •Label smoothing
  •Progressive image resizing during training
  •Different train-test resolutions
•Training configuration
  •More training epochs
  •Dedicated optimizers for large batch sizes (LAMB), scaling the learning rate with batch size
  •Exponential moving average (EMA) of model weights
  •Improved weight initializations
  •Decoupled weight decay (AdamW)
Yun, S. et al., CutMix: Regularization strategy to train strong classifiers with localizable features. ICCV 2019
Touvron, H. et al., Fixing the train-test resolution discrepancy. NeurIPS 2019
Previous works
•Architecture
  •VGG, ResNet, Inception, ResNeXt
  •Architectures found by automated architecture search
   [67 (NASNet), 41 (AmoebaNet, 83.9%), 55 (EfficientNet-B7, 84.4%, 2019)]
  •Adapting self-attention to the visual domain
    •AA-ResNet-152, 79.1%, 2019
    •ViT-L/16, 87.76±0.03%, 2020
    •LambdaResNet200, 84.3%, 2021
Motivation: a training scheme that works well regardless of architecture is needed
•Each architecture family is trained with its own tailor-made training scheme
  •ResNet-like models (TResNet, SEResNet, ResNet-D, ...)
    •Generally train well under a variety of training schemes.
    •The scheme proposed by Ross Wightman et al. (2021) is reported to have become the standard for training ResNet-like models.
  •Mobile-oriented models
    •Rely heavily on depth-wise convolutions
    •Their dedicated training schemes usually consist of the RMSProp optimizer, waterfall learning rate scheduling and EMA.
  •Transformer-based and MLP-only models
    •Lack inductive bias, which makes them hard to train -> longer training (1000 epochs), strong cutmix-mixup and drop-path regularizations, large weight decay and repeated augmentations
•A training scheme tailored to one model lowers accuracy when it is applied to a different model
  •Applying a training scheme designed for ResNet50 to an EfficientNetV2 model gives 3.3% lower accuracy than its tailor-made training scheme (Mingxing Tan et al., PMLR, 2021)
Objective:
We introduce a unified training scheme for ImageNet without any hyper-parameter tuning or tailor-made tricks per model.
2. Methods
knowledge distillation (KD) for classification
Image from: https://intellabs.github.io/distiller/knowledge_distillation.html
Hinton et al., (2015). Distilling the Knowledge in a Neural Network
•Applications of KD in previous work
  •Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network, 2021
    •Showed that KD plays an important role in improving the image classification accuracy of ResNet50
  •DeiT (PR-297)
    •Keeps the same architecture as ViT, but improves the training recipe and applies KD through a distillation token, outperforming EfficientNet using ImageNet data only
  •Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR 2020
    •Applies KD to neural architecture search and proposes a cost-effective way to train sub-networks
  •Circumventing outliers of autoaugment with knowledge distillation, ECCV 2020
    •Shows that KD reduces the noise introduced by data augmentation, which makes stronger augmentation applicable
•However, KD is not a common practice for ImageNet training.
Insight and motivation into the impact of KD
•Wing, warplane: the teacher network compensates for cases where the classes present in an image are not fully mutually exclusive
•(c) Hen, 55.5%: the image is ambiguous even to a human, and the teacher's classification output reflects that ambiguity.
•(d) The prediction is wrong with respect to the task, but the English setter can be regarded as the main object in the image
ImageNet ground-truth label glossary: hen (female chicken), cock (rooster), forklift, English setter and Gordon setter (dog breeds), ice lolly (ice cream)
•Teacher labels carry more information than the ground-truth labels (similarities and correlations between classes)
•They can correct label errors, and separate label smoothing is no longer needed
•This leads to a more effective and robust optimization process, compared to training with hard labels only.
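To make the contrast between hard labels, label smoothing and teacher soft labels concrete, here is a minimal sketch in plain Python. The three-class setup and every probability in it are made-up values for illustration, not numbers from the paper or the slides; `smooth_label` and `cross_entropy` are hypothetical helper names.

```python
import math

def smooth_label(num_classes, target, eps=0.1):
    """Label smoothing: move eps of the probability mass to a uniform distribution."""
    return [(1.0 - eps) * (1.0 if i == target else 0.0) + eps / num_classes
            for i in range(num_classes)]

def cross_entropy(target_probs, pred_probs):
    """Cross-entropy H(target, pred) = -sum_i t_i * log(p_i)."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)

# Hypothetical 3-class problem: ["hen", "cock", "forklift"], true class "hen".
hard = [1.0, 0.0, 0.0]                         # one-hot ground truth
smoothed = smooth_label(3, target=0, eps=0.1)  # every wrong class gets the same mass
teacher = [0.55, 0.43, 0.02]                   # made-up teacher output: hen vs. cock is ambiguous

student = [0.60, 0.38, 0.02]                   # made-up student prediction
for name, target in [("hard", hard), ("smoothed", smoothed), ("teacher", teacher)]:
    print(name, round(cross_entropy(target, student), 4))
```

Label smoothing treats "cock" and "forklift" as equally plausible alternatives, while the teacher target keeps the information that "cock" is far closer to "hen" than "forklift" is.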
The Proposed Training Scheme
•Proposes using KD so that the same training configuration can be applied even when the architecture differs.
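The slides do not spell out the loss, so the following is only a sketch of a common KD formulation: cross-entropy on the hard label plus a weighted KL divergence between teacher and student predictions. The function names, the default values of `alpha_kd` and `tau`, and the logits are assumptions for illustration. Some formulations (for example Hinton et al.) also multiply the KD term by `tau ** 2`, which is omitted here.

```python
import math

def softmax(logits, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_training_loss(student_logits, teacher_logits, label, alpha_kd=1.0, tau=1.0):
    """Hard-label cross-entropy plus alpha_kd * KL(teacher || student)."""
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])
    q_teacher = softmax(teacher_logits, tau)
    q_student = softmax(student_logits, tau)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)
    return ce + alpha_kd * kl

# Made-up logits for a 3-class example; the teacher is a frozen, pretrained model.
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [0.5, 2.5, 0.1]
print(kd_training_loss(student_logits, teacher_logits, label=0, alpha_kd=1.0))
```

Because the teacher term depends only on the teacher's outputs and not on the student's architecture, the same loss and hyperparameters can be reused for any student backbone, which is the point the slide makes.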
3. Experimental Results
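•Verify the robustness of USI (the unified scheme for ImageNet proposed in the paper)
•Confirm that the proposed training scheme (KD) and loss function work well
•Propose ways to improve accuracy further
•Application: speed-accuracy comparison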
Comparison to Previous Schemes
•Applying the same USI scheme to all of the compared models gives better top-1 accuracy than each paper reports with its tailor-made scheme
Robustness to Batch-size
•Previous work (Yang You et al., 2017) suggested that larger batch sizes require a larger learning rate or a dedicated optimizer
•With USI, accuracy does not change much with the choice of batch size.
•Larger batch sizes give higher training speed
  •GPU: 8x V100
  •TResNet-L teacher, TResNet-M student
  •TResNet-M uses inplace-activated batchnorm, so the batch size can be increased substantially
Robustness to Teacher Type
•With USI, there is a wide range of choices for the teacher network
  •Volo-d1 and TResNet-L have similar top-1 accuracy (83.9% for TResNet-L, 84.1% for Volo-d1).
Robustness to architecture-based regularization
•Whether an architecture-based regularization can be applied depends on the model architecture.
•The authors show that adding drop-path on top of USI does not affect accuracy, apparently to emphasize the architecture robustness of USI
Huang, G. et al., (2016). Deep networks with stochastic depth.
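For readers unfamiliar with drop-path, the idea can be sketched for a single sample as below. This is an illustrative, framework-free version with hypothetical function names, not code from the paper. It uses the variant that rescales the kept branch during training; Huang et al. instead rescale the branch at test time.

```python
import random

def drop_path(branch_output, drop_prob, training, rng):
    """Stochastic depth for one sample: randomly drop the residual branch.

    During training the branch is zeroed with probability drop_prob and
    otherwise scaled by 1 / (1 - drop_prob), so its expected value is unchanged.
    At inference the branch is returned unchanged.
    """
    if not training or drop_prob == 0.0:
        return list(branch_output)
    keep_prob = 1.0 - drop_prob
    if rng.random() < keep_prob:
        return [v / keep_prob for v in branch_output]
    return [0.0 for _ in branch_output]

def residual_block(x, branch, drop_prob, training, rng):
    """Compute y = x + drop_path(branch(x)) for a list-valued input x."""
    fx = drop_path(branch(x), drop_prob, training, rng)
    return [a + b for a, b in zip(x, fx)]

rng = random.Random(0)
double = lambda v: [2.0 * t for t in v]   # stand-in for a real residual branch
print(residual_block([1.0, 2.0], double, drop_prob=0.5, training=True, rng=rng))
```

The technique only makes sense for architectures with residual branches, which is why it counts as an architecture-dependent regularization.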
Ablations about loss function
•Relative weight αkd
  •Demonstrates that KD is effective for ImageNet training
  •Figure annotation: one setting is 6.5% less accurate than the setting marked "(Default)"
•KD Temperature (τ)
  •τ < 1: sharpens the teacher predictions
  •τ > 1: softens the teacher predictions, so the differences between the per-class softmax outputs shrink
  •Using vanilla softmax probabilities (τ = 1) works best
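The effect of the temperature follows directly from the temperature-scaled softmax. The worked numbers below use a made-up logit gap of 2 between two classes; the symbols z_i for the teacher logits are notation introduced here, not taken from the slides.

```latex
p_i(\tau) = \frac{\exp(z_i/\tau)}{\sum_j \exp(z_j/\tau)},
\qquad
\frac{p_i(\tau)}{p_k(\tau)} = \exp\!\left(\frac{z_i - z_k}{\tau}\right)

% Example with z_i - z_k = 2:
%   \tau = 0.5: ratio = e^{4} \approx 54.6   (sharpened)
%   \tau = 1:   ratio = e^{2} \approx 7.39   (vanilla softmax)
%   \tau = 2:   ratio = e^{1} \approx 2.72   (softened)
```

As τ grows the ratio tends to 1, so the class probabilities move toward uniform; as τ shrinks the top class dominates.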
Verifying methods that can further improve accuracy
•Training Epochs
  •The USI default is 300 epochs, but training for more epochs can improve accuracy further (patient teacher)
•Mixup-Cutmix vs. Cutout augmentation
  •Cutout: mostly used to train CNNs and mobile-oriented models
  •Mixup-Cutmix: mostly used to train transformer-based models
  •Applying augmentation is beneficial
Yun, S. et al., CutMix: Regularization strategy to train strong classifiers with localizable features. ICCV 2019
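As a rough illustration of the two augmentation families compared above, the sketch below implements simplified Cutout and Mixup on tiny nested-list "images". The function names and all values are hypothetical. Two simplifications are assumed: the Cutout square is always kept fully inside the image, and the Mixup weight `lam` is passed in directly rather than sampled from a Beta distribution. CutMix, which pastes a patch from one image into another, is not shown.

```python
import random

def cutout(image, size, rng):
    """Zero out a size x size square at a random location (image is a list of rows)."""
    h, w = len(image), len(image[0])
    top = rng.randrange(0, h - size + 1)
    left = rng.randrange(0, w - size + 1)
    out = [row[:] for row in image]          # copy so the input is not modified
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = 0.0
    return out

def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images and their one-hot labels with weight lam in [0, 1]."""
    mixed_image = [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
                   for row_a, row_b in zip(image_a, image_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label

rng = random.Random(0)
img_a = [[1.0] * 4 for _ in range(4)]        # made-up 4x4 "images"
img_b = [[0.0] * 4 for _ in range(4)]
cut = cutout(img_a, size=2, rng=rng)
mixed_img, mixed_lbl = mixup(img_a, [1.0, 0.0], img_b, [0.0, 1.0], lam=0.7)
```

Cutout leaves the label untouched, while Mixup also produces a soft label.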
Speed-Accuracy comparison
•USI applies the same hyperparameters to every backbone, which enables a reproducible and reliable comparison of speed-accuracy trade-offs.
[Figure: speed-accuracy comparison, GPU inference]
[Figure: speed-accuracy comparison, CPU inference]
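A speed-accuracy comparison of this kind is typically summarized by the set of models that no other model beats on both axes. The sketch below computes such a Pareto front; the model names and numbers are invented placeholders, not results from the paper, and `pareto_front` is a hypothetical helper.

```python
def pareto_front(models):
    """Return the models that are not dominated by any other model.

    Each model is (name, throughput, accuracy); higher is better for both.
    A model is dominated if another model is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for name, speed, acc in models:
        dominated = any(
            (s >= speed and a >= acc) and (s > speed or a > acc)
            for _, s, a in models
        )
        if not dominated:
            front.append((name, speed, acc))
    return sorted(front, key=lambda m: m[1])   # order by throughput

# Hypothetical backbones with made-up numbers (images/sec, top-1 %).
candidates = [
    ("model_a", 3000.0, 78.0),
    ("model_b", 2000.0, 81.0),
    ("model_c", 1800.0, 80.0),   # slower and less accurate than model_b
    ("model_d", 900.0, 84.0),
]
front = pareto_front(candidates)
print([name for name, _, _ in front])
```

Such a comparison is only meaningful when every backbone is trained under the same scheme, which is what a unified scheme provides.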
4. Conclusion
•Main contribution
  •(1) We introduce a unified, efficient training scheme for ImageNet dataset, USI, that does not require hyperparameter tuning.
  •(2) We show it consistently and reliably achieves state-of-the-art results, compared to tailor-made schemes per model (ResNet-like, Mobile-oriented, Transformer-based and MLP-only models).
  •(3) We use USI to perform a methodological speed-accuracy comparison of modern deep learning models, and identify efficient backbones along the Pareto curve.
•Discussion on extending to other classification datasets: the parameters from this paper (high learning rate, training epochs, strong augmentation) are unlikely to transfer directly, but applying KD itself is expected to be beneficial
Thank you.

PR-383: Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results
