1. Recent Trends in DNN Compression
October 12th, 2018
Kaushalya Madhawa
Murata Laboratory, Tokyo Tech
2. Back then…
• Size of commonly used DNNs
• AlexNet: 240MB
• VGG-16: 552MB
• Inception V3: 109MB
• Running models on the cloud has its own disadvantages
• Network latency
• Privacy
3. DNN Compression
• Can we achieve the same accuracy with smaller models?
• There are several approaches to obtaining smaller models
– Compressing pre-trained networks (a minimal pruning sketch follows this slide)
• Deep Compression (Han+, 2016)
– Designing compact models
• SqueezeNet (Iandola+, 2016)
• MobileNets (Howard+, 2017)
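Pruning, the first stage of Deep Compression, zeroes out small-magnitude weights and then retrains. A minimal numpy sketch of that magnitude criterion (the `magnitude_prune` helper and the sparsity level are illustrative, not the paper's code):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights until `sparsity`
    fraction of the entries are zero (illustrative threshold choice)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)          # index of the threshold weight
    threshold = np.partition(flat, k)[k]   # k-th smallest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Example: prune a random fully connected layer to ~90% sparsity
w = np.random.randn(256, 128).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"non-zero fraction: {mask.mean():.2f}")  # ~0.10
```

In Deep Compression this pruning step is followed by retraining, weight quantization, and Huffman coding; the sketch covers only the first step.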
4. Deep Compression (Han+, ICLR 2016)
• One of the first papers to introduce model compression
• Requires custom hardware to realize speedups at inference time
• Sparsity doesn’t always translate to reduced inference time
7. Compact Models
• Designing networks with fewer parameters
• SqueezeNet: AlexNet-level accuracy with 50x fewer parameters
• MobileNets: depthwise separable convolutions (a minimal sketch follows this slide)
[Figure: the Fire module of SqueezeNet]
Requires a lot of expertise and is time-consuming!
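MobileNets' saving comes from factorizing a standard convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution. A minimal PyTorch sketch of that factorization, omitting the BatchNorm/ReLU pairs MobileNets inserts after each stage; the class name is illustrative:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv (one filter per input channel) followed by a
    1x1 pointwise conv that mixes channels, as popularized by MobileNets."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch makes each filter see exactly one input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   stride=stride, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# A standard 3x3 conv uses in_ch * out_ch * 9 weights; this factorization
# uses in_ch * 9 + in_ch * out_ch, roughly an 8-9x reduction for 3x3 kernels.
x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```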
9. State-of-the-art (SOTA) in 2018
• Mobile devices
• More memory
• Dedicated hardware for running ML models
• Deep Learning frameworks
• Models
• Directly optimize models for the resource constraint (e.g., size)
• More focus on latency
• Optimize for multiple objectives
10. SOTA in 2018: Devices
[Two device columns compared side by side in the original slide]
• Storage: <128MB vs. <512MB
• Neural Engine: dedicated hardware for ML algorithms
• CoreML / TF-Lite
11. SOTA in 2018: Models
• Model compression
• Structured pruning is used to reduce latency (a minimal sketch follows this slide)
• Designing compact models
• Neural architecture search finds models that satisfy the resource restrictions
• In addition to accuracy, latency or model size is also incorporated into the objective
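Unlike the unstructured pruning in Deep Compression, structured pruning removes whole channels, so the resulting tensor is genuinely smaller and runs faster on stock hardware. A minimal numpy sketch of one common criterion, L1-norm channel scoring (helper names and the keep ratio are illustrative):

```python
import numpy as np

def prune_channels(conv_weight, keep_ratio=0.5):
    """Rank the output channels of a conv weight (out_ch, in_ch, kH, kW)
    by the L1 norm of their filters and keep only the top `keep_ratio`.
    Dropping whole channels shrinks the dense tensor, so the speedup
    needs no sparse-matrix hardware, unlike unstructured pruning."""
    scores = np.abs(conv_weight).reshape(conv_weight.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(keep_ratio * conv_weight.shape[0]))
    keep = np.sort(np.argsort(scores)[-n_keep:])  # surviving channel indices
    return conv_weight[keep], keep

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
w_small, kept = prune_channels(w, keep_ratio=0.5)
print(w_small.shape)  # (32, 32, 3, 3): a genuinely smaller layer
```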
12. Neural Architecture Search
• Automates the design of neural network models
• NASNet (Zoph and Le, 2017): accuracy is used as the reward in a reinforcement learning model
• DPP-Net (Dong+, 2018): a multi-objective architecture search that optimizes for both accuracy and inference time
13. MnasNet (Tan+, 2018)
• Neural architecture search for mobile devices
• Optimized for both accuracy and latency
• Multiple Pareto-optimal solutions are found in a single architecture search
• Latency is directly measured on a mobile phone
• Able to find models that run 1.5x faster than MobileNet V2
[Figure: the MnasNet search loop. A controller samples models from the search space, a trainer measures accuracy, real mobile phones measure latency, and a multi-objective reward is fed back to the controller.]
\[
\max_{m}\; \mathrm{ACC}(m) \times \left[ \frac{\mathrm{LAT}(m)}{T} \right]^{w},
\qquad
w =
\begin{cases}
\alpha, & \text{if } \mathrm{LAT}(m) \le T \\
\beta, & \text{otherwise}
\end{cases}
\]
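Here ACC(m) is the measured accuracy of model m, LAT(m) its measured phone latency, and T the target latency; the exponents α and β trade accuracy against latency. A minimal sketch of this reward, using α = β = −0.07, the soft-constraint setting reported in the paper (function and argument names are illustrative):

```python
def mnas_reward(acc, latency_ms, target_ms=75.0, alpha=-0.07, beta=-0.07):
    """Multi-objective reward ACC(m) * [LAT(m) / T]^w from MnasNet
    (Tan+, 2018). With alpha = beta = -0.07 (the paper's soft-constraint
    setting), models over the latency target are penalized smoothly
    rather than rejected outright."""
    w = alpha if latency_ms <= target_ms else beta
    return acc * (latency_ms / target_ms) ** w

# A model slightly over a 75 ms target keeps most of its reward:
print(mnas_reward(acc=0.75, latency_ms=80.0))  # ~0.7466
```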
14. MnasNet
Model Name       | Model Size | Top-1 Accuracy | Top-5 Accuracy | TF Lite Latency
MnasNet_0.50_224 | 8.5 MB     | 68.03%         | 87.79%         | 37 ms
MnasNet_0.75_224 | 12 MB      | 71.72%         | 90.17%         | 61 ms
MnasNet_1.3_224  | 24 MB      | 75.24%         | 92.55%         | 152 ms
SqueezeNet       | 5.0 MB     | 49.0%          | 72.9%          | 224 ms
ResNet_V2_101    | 178.3 MB   | 76.8%          | 93.6%          | 1880 ms
Inception_V3     | 95.3 MB    | 77.9%          | 93.8%          | 1433 ms
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/models.md
16. Summary
• Mobile devices are increasingly capable of running DNN models
• Unstructured pruning is out of fashion
• Accuracy and platform-dependent restrictions are incorporated into multi-objective model search
17. References
• Dong, Jin-Dong, et al. "DPP-Net: Device-aware Progressive Search for Pareto-optimal Neural Architectures." arXiv preprint arXiv:1806.08198 (2018).
• Han, Song, Huizi Mao, and William J. Dally. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." arXiv preprint arXiv:1510.00149 (2015).
• Tan, Mingxing, et al. "MnasNet: Platform-Aware Neural Architecture Search for Mobile." arXiv preprint arXiv:1807.11626 (2018).
• Zoph, Barret, and Quoc V. Le. "Neural Architecture Search with Reinforcement Learning." arXiv preprint arXiv:1611.01578 (2016).