딥러닝 논문읽기 efficient netv2 논문리뷰

EfficientNetV2:
Smaller Models and Faster Training
Mingxing Tan (Google Research, Brain Team)
Quoc V. Le (Google Research, Brain Team)
2021.05.02
딥러닝논문읽기모임 이미지처리팀
홍은기, 김병현, 안종식, 이찬혁

1. 개요
2. EfficientNetV1
3. EfficientNetV1의 문제점
4. EfficientNetV2
1) Progressive Learning
2) Fused-MBConv
3) Training-Aware NAS & Scaling
5. Main Results
6. Ablation Studies
7. 결론
2
목차

• EfficientDet - Scalable and Efficient Object Detection (2020)
• Mobiledets: Searching for object detection architectures for mobile accelerators (2020)
• 83% ImageNet Accuracy in One Hour (2020)
• EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (2019)
• Mnasnet: Platform-aware neural architecture search for mobile (2019)
• MixConv: Mixed Depthwise Convolutional Kernels (2019)
• Searching for mobilenetv3 (2019)
...
3
Mingxing Tan (Google Brain)
Keywords: Mobile, Edge Device, Efficiency, NAS (Neural Architecture Search)

1. 딥러닝 모델과 데이터의 사이즈가 점점 커짐에 따라 학습 효율(training efficiency)의
중요성 또한 커짐
2. NFNet (batch norm 제거), RestNet-RS (scaling hyperparameter 최적화), Vision
Transformer (Transformer block 사용) 등은 제각기 학습 효율을 개선했으나,
파라미터 효율(모델 크기)이 나빠지는 한계가 있었음
3. 본 연구는 이전 연구(EfficientNetV1), 컨볼루션 커널(Fused-MBConv), 뉴럴넷 설계 방법
론(Training-aware NAS), 학습 방식(Progressive Learning) 등 다양한 테크닉을 이용하여
높은 training efficiency 및 parameter efficiency와 accuracy를 함께 달성
4
개요

5
개요
4. EfficientNetV1 대비 학습 속도는 4x 개선, 파라
미터 사이즈는 6.8x 축소할 수 있었으며,
(ImageNet21k로 사전 학습하였을 때)
ImageNet ILSVRC2012 대상으로 87.3 % top-1
accuracy 달성
5. ConvNet이 정확도 및 학습 효율 측면에서
Vision Transformer보다 뛰어나다는 것을 증명

6
EfficientNetV1 – Intro
1. 사용자의 컴퓨팅 자원에 따라 취사 선택할 수 있는 B0 ~ B7 모델 제안
2. 모델의 width (채널 수), depth (레이어 개수), resolution (이미지 사이즈)을 동시에 조절할
수 있는 compound scaling method를 적용함으로써, B0를 기준으로 B1, B2 … B7까지
scale up
3. B7의 경우, 당시 SOTA 대비 8.4x smaller & 6.1x faster한 모델로서 ImageNet 84.3 %
top-1 accuracy 달성
EfficientNet - Rethinking Model Scaling for Convolutional Neural Networks (2019, Tan et al.)

7
EfficientNetV1 – Compounding Model Scaling
1. Width: 채널 수
2. Depth: 레이어 개수
3. Resolution: 이미지 사이즈

8
EfficientNetV1 – Compound Model Scaling
아이디어: 모델의 depth, width, resolution
중 단 하나만을 조절한 시도들은 많았으나,
이 3가지 factor는 서로 독립적이지 않다.
예를 들어, 이미지 사이즈가 커질 수록 더
많은 픽셀을 처리하기 위해 레이어 및 채널
개수도 올려야 한다.
따라서, 3가지 요소를 동시에 조절할 수 있
는 compound scaling을 적용

9
EfficientNetV1 – Baseline B0 architecture
MNasNet: Platform-Aware Neural Architecture Search for Mobile (2019, Tan et al.)
MNasNet: Platform-Aware Neural
Architecture Search for Mobile의 NAS 방법론
을 이용하여 탐색
Accuracy와 FLOP을 NAS의 objective로 설정

10
EfficientNetV1 – Compound Model Scaling
STEP 1: ϕ = 1로 고정한 뒤, grid search를 이용하여
alpha, beta, gamma 탐색
STEP 2: 탐색된 alpha=1.2, beta=1.1, gamma=1.15를 고
정한 채로, ϕ를 2로 증가시킴으로써 B1 획득
ϕ를 증가시킴으로써 B2, B3, …, B7 획득
Compound scaling method
(Compound coefficient ϕ가 depth,
width, resolution을 동시에 scale up)

1. 해상도가 높은 이미지로 학습하면 학습 속도가 느리다
• GPU 메모리는 동일하므로, 이미지 사이즈가 클 경우 mini-batch size 감소
• 이에 따라 학습 속도가 크게 느려짐
2. Depthwise Separable Convolution은 초기 layer에서 느리다
• Depthwise Separable Conv는 이론적으로는 일반 Conv보다 8~9배 빠르지만, modern accelerator는
그 계산 방식에 최적화되어 있지 않음
3. 모든 stage를 동일하게 scaling하는 것은 sub-optimal하다.
• EfficientNetV1은 간단한 compound scaling rule을 이용하여 모든 stage를 동일하게 scale up 하므로,
학습 속도 및 파라미터 효율을 최적화한다고 볼 수 없음
12
EfficientNetV1의 문제점

1. 해상도가 높은 이미지로 학습하면 학습 속도가 느리다
• Progressive Learning을 이용하여 학습 중에 image size 및 regularization을 동적으로 변경
2. Depthwise Convolution은 초기 layer에서 느리다
• 초기 layer들을 MBConv 대신, Fused-MBConv로 교체
3. 모든 stage를 동일하게 scaling하는 것은 sub-optimal하다.
• non-uniform scaling 전략을 적용
13
EfficientNetV1의 문제점 – 해결 방안

14
EfficientNetV2 – Progressive Learning
1. 기존 여러 연구들이 학습 중 이미지 사이즈를 변경하는 시도를 하였으나 정확도가 감소
하였음 (Progressive resizing)
2. 저자의 가설: 정확도 감소의 원인은 불균형한 regularization이다.
1) 사이즈가 작은 이미지는 상대적으로 약한 regularization 필요
2) 반면, 사이즈가 큰 이미지는 상대적으로 강한 regularization 필요

15
EfficientNetV2 – Progressive Learning (Adaptive Regularization)
RandAug + Mixup example
* Vision Transformer는 embedding vector가 입력 이미지의 사이즈에 의존적인
반면, ConvNet은 이미지 사이즈에 독립적이므로, 서로 다른 사이즈 이미지에서
학습한 weight를 그대로 사용 가능

16
Accelerator-aware Neural Network Design using AutoML (2020, Gupta & Akin)
MBConv
(1) FLOP of Standard Conv.
-> (K x K x M) x (H x W x N)
-> (3 x 3 x 128) x (14 x 14 x 512) = 115,605,504
(2) FLOP of Depthwise Separable Conv. :
-> (K x K x H x W x M) + (H x W x M x N)
-> (3 x 3 x 14 x 14 x 512) + (14 x 14 x 128 x 512)
= 903,168 + 12,845,056 = 13,748,224
(theoretically) 8.4x less computation
Depthwise Separable Conv. vs Standard Conv.
where,
K: filter size
M: input channel
N: output channel
H: output height
W: output width
Fused-MBConv

17
Depthwise Conv. is slow in early layers
vs
Accelerator-aware Neural Network Design using AutoML (2020, Gupta & Akin)
Later layer
runtime latency
Early layer
runtime latency

18
MBConv vs Fused-MBConv
Params 및 FLOP 자체는 단순 MBConv보다 많지만,
실제 throughput은 Fused-MBConv를 조합한 것이 더 빠르다
EfficientNetV2 - Smaller Models and Faster Training (2021, Tan et al.)

19
Training-aware NAS & non-uniform scaling
1. EfficientNetV1에서 정확도 및 FLOP을 최적화했던 것과 달리, 정확도, 학습 효율 및 파라
미터 효율을 NAS의 objective로 설정하여 최적화
1) Search Reward: (Accuracy A) * (normalized training step time Sw) * (parameter size Pv)
2) w = -0.07, v = -0.05
2. Design choices:
1) Conv ops type: {MBConv, Fused-MBConv}
2) Number of layers
3) Kernel size: {3 x 3, 5 x 5}
4) Expansion ratio: {1, 4, 6}
학습 효율을 objective로 잡았기 때문
에 training-aware….

20
EfficientNetV1 vs EfficientNetV2 Architecture
EfficientNetV2 - Smaller Models and Faster Training (2021, Tan et al.)
EfficientNetV1-B0 EfficientNetV2-S

21
Training-aware NAS & non-uniform scaling
1. EfficientNetV1와 같은 방법으로 EfficientNet-V2-S를 EfficientNet-V2-M/L로 scale up
2. Additional Optimizations:
1. 지나치게 커다란 이미지 사이즈는 메모리/학습 속도에 부하를 줄 수 있으므로, inference시 최대 이미지 사
이즈를 480으로 제한.
2. “Heuristic”하게 runtime에 부하를 주지 않는 선에서 network capacity를 늘리기 위하여 later stage에 점진
적으로 레이어 추가

23
Main Results – ImageNet
Training setups
• EfficientNetV1과 유사하게 설정
• RMSProp, decay 0.9, Momentum 0.0
• batch norm momentum 0.99
• weight decay 1e-5
• 350 epochs
• batch size 4096
• learning rate 0 ~ 0.256 (decay: 0.97
every 2.4 epochs
• exponential moving average with
0.9999 decay rate
Progressive Learning setups
• 87 epoch을 1 stage로 설정, 4 stage로 분할
• 최대 이미지 사이즈를 inference 사이즈의
약 20 %로 설정

25
Main Results – ImageNet
Two interesting observations:
1) Scaling up data size is more effective than simply scaling up model size in high-accuracy
regime.
 top-1 accuracy가 85 %를 넘어서면, 단순히 모델 크기를 늘리는 것은 과적합으로 인해 정확도를 더욱 끌어올리기
어렵다. 반면, ImageNet21k로 사전 학습하였을 때 정확도를 크게 향상할 수 있었다.
2) Pretraining on ImageNet21k could be quite efficient.
 ImageNet21k는 데이터 크기가 ImageNet ILSVRC2012의 10배가 넘지만, EfficientNet2의 학습 효율 & 파라미터 효
율, Progressive Learning 덕분에 32 TPU core로 이틀 안에 사전 학습을 완료할 수 있었다(ViT는 몇 주 걸림).

26
Ablation Studies
1. Comparison to EfficientNetV1
동일한 학습 방법을 EfficientNetV1에 적용한 경우에도, EfficientNetV2가 나은 성능을 보임
2. Progressive Learning for Different Networks
Progressive Learning을 적용하였을 때, 학습 시간이 줄어들고 정확도가 향상되는 효과를 모든 네트워크에서
관찰하였음

27
Ablation Studies
3. Importance of Adaptive Regularization
모든 이미지 사이즈에 대해 동일한 regularization을 적용한 기존 방법들보다 나은 정확도를 보임

28
결론 및 논의
1. 이전 연구 결과(multi-objective NAS, Fused-MBConv) 및 EfficientNetV1의 단점에 대한 분석을 통해
네트워크 아키텍처 개선
2. Adaptive regularization을 동반한 progressive learning 제안
3. 학습 효율 & 파라미터 효율 & 정확도 ↑
학습에 막대한 비용/시간이 소요되는 초대형 모델에 비해 훨씬 효율적이며 탄소발자국이 적은 모델로서
환경 및 운영 비용 측면에서 바람직한 모델

딥러닝 논문읽기 efficient netv2 논문리뷰

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie 딥러닝 논문읽기 efficient netv2 논문리뷰

Ähnlich wie 딥러닝 논문읽기 efficient netv2 논문리뷰 (20)

Mehr von taeseon ryu

Mehr von taeseon ryu (20)

딥러닝 논문읽기 efficient netv2 논문리뷰