Review: Learning Sparse Networks Using Targeted Dropout
- by Seunghyun Hwang (Yonsei University, Severance Hospital, Center for Clinical Data Science)
Learning Sparse Networks Using Targeted Dropout
Hwang Seung Hyun
Yonsei University Severance Hospital CCIDS
Google Brain, University of Oxford, for.ai, Geoffrey Hinton | NeurIPS 2018
2020.08.09
Contents
01 Introduction
02 Related Work
03 Methods and Experiments
04 Conclusion
Targeted Dropout
Introduction – Background
• A large number of learnable parameters can lead to overfitting.
• There has been a large body of work on compressing neural networks.
<Sparsification Techniques>
1. Sparsity-inducing regularisers
- L1 penalty
- L2 penalty
2. Post hoc pruning (train a full-size network → prune)
- Remove the weights with the smallest magnitude [1]
- Rank the weights by the sensitivity of task performance to their removal, and remove the least sensitive [2]
[1] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[2] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 1990.
3. Dropout Regularisation
- Standard Dropout
- Variational Dropout
Targeted Dropout
Introduction – Proposal
• Issues of Previous works
- Standard training does not encourage nets to be amenable to pruning
- Applying sparsification techniques with little negative impact on task performance is difficult
• Propose “Targeted Dropout”
- Specifically apply dropout to the set of units that are believed to be less useful
- Rank weights or units and apply dropout primarily to those elements with small magnitudes
- The network learns to be robust to the choice of post hoc pruning strategy.
Targeted Dropout
Introduction – Contribution
• Makes networks extremely robust to the post hoc pruning strategy of choice
• Gives intimate control over the desired sparsity patterns
• Easy to implement
• Achieves impressive sparsity rates on a wide range of architectures and datasets
- 99% sparsity on ResNet-32 with less than a 4% drop in test set accuracy on CIFAR-10
Related Work
Dropout
• Two kinds of Bernoulli dropout techniques (a minimal sketch of both follows below)
1. Unit Dropout
- Randomly drops units at each training step to reduce dependence between units and prevent overfitting
2. Weight Dropout
- Randomly drops individual weights in the weight matrices at each training step, i.e., drops connections between layers
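A minimal NumPy sketch (not the authors' implementation) contrasting the two variants for a single dense layer y = xW; the shapes and drop rate here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))     # batch of 8 examples, 16 input features
W = rng.normal(size=(16, 32))    # dense layer with 32 output units
rate = 0.5                       # drop probability

# Unit dropout: one Bernoulli draw per output unit; whole columns are dropped.
unit_mask = rng.random(32) >= rate
y_unit = (x @ W) * unit_mask / (1.0 - rate)   # rescale to keep the expectation

# Weight dropout: one Bernoulli draw per weight; individual connections are dropped.
weight_mask = rng.random(W.shape) >= rate
y_weight = x @ (W * weight_mask) / (1.0 - rate)
```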
Related Work
Magnitude-based pruning
• Treat the top-k largest-magnitude weights as important (use argmax-k); a sketch of both variants follows below
1. Unit Pruning
- Considers the units (column vectors) of the weight matrices under the L2 norm (usually faster, with less computation)
2. Weight Pruning
- Considers the entries of each weight matrix under the L1 norm (usually more accurate)
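A hedged sketch of both pruning variants for a single weight matrix W whose columns correspond to units; the function names are ours, not the paper's.

```python
import numpy as np

def unit_prune(W, k):
    """Keep the k units (columns) with the largest L2 norm; zero out the rest."""
    norms = np.linalg.norm(W, axis=0)           # one L2 norm per column/unit
    keep = np.argsort(norms)[-k:]               # argmax-k over unit norms
    mask = np.zeros(W.shape[1], dtype=bool)
    mask[keep] = True
    return W * mask                             # broadcasts the column mask

def weight_prune(W, k):
    """Keep the k individual weights with the largest magnitude; zero out the rest."""
    flat = np.abs(W).ravel()
    keep = np.argsort(flat)[-k:]                # argmax-k over |W| entries
    mask = np.zeros(flat.size, dtype=bool)
    mask[keep] = True
    return W * mask.reshape(W.shape)
```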
Related Work
Sparsification Methods
• L1 regularisation [3] (a minimal sketch follows below)
- A cost added to the loss function, intended to drive unimportant weights to zero.
[3] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[4] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l_0 regularization. arXiv preprint arXiv:1712.01312, 2017.
• L0 regularisation [4]
- Apply an augmentation of Concrete Dropout to parameters.
- Weights follow a Hard-Concrete distribution where each weight is associated with a gating
parameter that determines the drop rate.
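For the L1 regulariser of [3], a minimal sketch of how such a penalty is added to the training loss; the function name and the coefficient `lam` are illustrative assumptions.

```python
import numpy as np

def l1_penalty(weight_matrices, lam=1e-4):
    """L1 sparsity penalty: lam * sum of absolute weight values."""
    return lam * sum(np.abs(W).sum() for W in weight_matrices)

# total_loss = task_loss + l1_penalty(model_weights)
```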
Related Work
Sparsification Methods
• Variational Dropout [5]
- Apply Gaussian dropout with trainable drop rates to the weights and interpret the model as a variational posterior with a particular prior.
[5] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
[6] Guillaume Leclerc, Manasi Vartak, Raul Castro Fernandez, Tim Kraska, and Samuel Madden. Smallify: Learning network size while training. arXiv preprint arXiv:1806.03723, 2018.
• Smallify [6] (sketched below)
- Use trainable gates on weights/units and regularise the gates towards zero using L1 regularisation.
- Shown to be extremely effective at reaching high prune rates on VGG networks.
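A rough sketch of the Smallify idea as described above: one trainable gate per unit, with an L1 penalty pushing the gates (and the units they scale) toward zero. Variable names are assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
gates = np.ones(32)                    # one trainable gate per output unit

def forward(x):
    return (x @ W) * gates             # gates scale each unit's output

def gate_penalty(lam=1e-3):
    return lam * np.abs(gates).sum()   # L1 regularisation on the gates

# Units whose gates are driven to (near) zero can be pruned after training.
```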
Methods and Experiments
Targeted Dropout
• Want low-magnitude elements to be able to increase their value if they become important during training.
• Introduce stochasticity into the process using two parameters: the targeting proportion γ and the drop rate α.
• The targeting proportion γ means selecting the bottom γ|θ| weights (by magnitude) as candidates for dropout; each candidate is then dropped with probability α.
• The expected number of units kept during each round of targeted dropout is (1 − γα)|θ|.
• The result is a reduction in the important subnetwork's dependency on the unimportant subnetwork (a minimal sketch follows below).
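A minimal sketch of targeted weight dropout under the description above, using the paper's γ (targeting proportion) and α (drop rate). Targeting the lowest-magnitude weights within each column follows the slide's description; the remaining implementation details are assumptions.

```python
import numpy as np

def targeted_weight_dropout(W, gamma=0.5, alpha=0.66, rng=None):
    """Among the bottom gamma fraction of weights (by magnitude, per column/unit),
    drop each with probability alpha during training."""
    rng = rng or np.random.default_rng()
    out = W.copy()
    n_target = int(gamma * W.shape[0])                 # candidates per column
    for j in range(W.shape[1]):
        order = np.argsort(np.abs(W[:, j]))            # smallest magnitudes first
        candidates = order[:n_target]                  # bottom gamma fraction
        drop = candidates[rng.random(n_target) < alpha]
        out[drop, j] = 0.0
    return out

# Expected fraction of weights kept per step: 1 - gamma * alpha.
```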
Methods and Experiments
Dependence between the important and unimportant subnetworks
• Ideally, the important subnetwork is completely separated from the unimportant one.
• Estimate the effect of pruning weights by considering the second-order Taylor expansion of the change in loss: ΔL ≈ ∇θL(θ)ᵀ Δθ + ½ Δθᵀ H Δθ, where H is the Hessian of the loss (a small numeric sketch follows below).
• At the end of training the parameters sit near a critical point, so ∇θL(θ) ≈ 0 (the gradients of the loss with respect to the parameters vanish), leaving only the Hessian term ½ Δθᵀ H Δθ.
• Compute Hessian-weight product matrix as an estimate of weight correlations and network
dependence
• Empirically confirm that targeted dropout reduces dependence between the important
and unimportant subnetworks by an order of magnitude.
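A small numeric sketch of the estimate above: with the gradient assumed to vanish at a critical point, the change in loss from pruning a set of weights reduces to the Hessian term ½ Δθᵀ H Δθ. The function and argument names are ours, not the paper's.

```python
import numpy as np

def pruning_loss_change(theta, H, prune_idx):
    """Estimate delta_L ~= 0.5 * d^T H d, where d sets the pruned weights to zero."""
    d = np.zeros_like(theta)
    d[prune_idx] = -theta[prune_idx]      # pruning moves these weights to zero
    return 0.5 * d @ H @ d
```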
Methods and Experiments
Experiments
• Perform experiments using the original ResNet, Wide ResNet, and Transformer architectures
applied to the CIFAR-10, ImageNet, and WMT English-German Translation datasets
Methods and Experiments
Experiments – scheduling targeted dropout
• In the comparison with Smallify, the authors found that scheduling the targeting proportion and the dropout rate can dramatically improve accuracy (a minimal annealing sketch follows below).
- Annealing the targeting proportion from 0% to 95%, and the dropout rate from 0% to 100%
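A minimal sketch of linearly annealing the two rates over training, consistent with the schedule described above; the exact schedule shape used by the authors may differ.

```python
def annealed(step, total_steps, start, end):
    """Linear ramp from `start` to `end` over `total_steps` training steps."""
    t = min(step / total_steps, 1.0)
    return start + t * (end - start)

# e.g. targeting proportion 0% -> 95% and drop rate 0% -> 100% over training:
# gamma = annealed(step, total_steps, 0.0, 0.95)
# alpha = annealed(step, total_steps, 0.0, 1.00)
```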
Conclusion
• Targeted dropout is a simple and effective regularisation tool for training
neural networks that are robust to post hoc pruning.
• Targeted dropout performs well across a range of network architectures
and tasks.
• Showed how dropout can be used as a tool to encode prior structural
assumptions into neural networks