
Learning Sparse Networks using Targeted Dropout

Deep Learning Engineer at Yonsei University, Severance Hospital
25 Oct 2020

Learning Sparse Networks using Targeted Dropout

  1. Learning Sparse Networks Using Targeted Dropout Hwang Seung Hyun Yonsei University Severance Hospital CCIDS Google Brain, University of Oxford, for.ai, Geoffrey Hinton | NeurIPS 2018 2020.08.09
  2. Contents 01 Introduction 02 Related Work 03 Methods and Experiments 04 Conclusion Yonsei University Severance Hospital CCIDS
  3. Targeted Dropout Introduction – Background • A large number of learnable parameters can lead to overfitting. • There has been a lot of work on compressing neural networks. <Sparsification Techniques> Introduction / Related Work / Methods and Experiments / Conclusion 01 1. Sparsity-inducing regularisers (L1 penalty, L2 penalty) 2. Post hoc pruning (train the full-size network → prune) - Removing the weights with the smallest magnitude [1] - Ranking the weights by the sensitivity of the task performance to them and removing the least sensitive [2] 3. Dropout regularisation - Standard Dropout - Variational Dropout [1] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015. [2] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 1990.
  4. Targeted Dropout Introduction – Proposal • Issues of previous works - Standard training does not encourage networks to be amenable to pruning - Applying sparsification techniques with little negative impact on task performance is difficult • Propose “Targeted Dropout” - Specifically apply dropout to the set of units that are believed to be less useful - Rank weights or units and apply dropout primarily to those elements with small magnitudes - The network learns to be robust to the choice of post hoc pruning strategy. Introduction / Related Work / Methods and Experiments / Conclusion 02
  5. Targeted Dropout Introduction – Contribution • Makes networks extremely robust to the post hoc pruning strategy of choice • Gives intimate control over the desired sparsity patterns • Easy to implement • Achieved impressive sparsity rates on a wide range of architectures and datasets - 99% sparsity on ResNet-32 with less than a 4% drop in test set accuracy on CIFAR-10 Introduction / Related Work / Methods and Experiments / Conclusion 03
  6. Related Work Introduction / Related Work / Methods and Experiments / Conclusion 04 Dropout • Two kinds of Bernoulli dropout techniques 1. Unit Dropout - Randomly drops units at each training step to reduce dependence between units and prevent overfitting 2. Weight Dropout - Randomly drops individual weights in the weight matrices at each training step. Dropping connections between layers
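A minimal NumPy sketch of the two variants (the function names, toy shapes, and the inverted-dropout rescaling are my own choices, not from the slides): unit dropout zeroes whole columns of a layer's weight matrix, while weight dropout zeroes individual entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_dropout(W, rate):
    # Drop whole units (columns of the weight matrix) with probability `rate`,
    # rescaling the survivors so the expected output is unchanged.
    keep = rng.random(W.shape[1]) >= rate      # one Bernoulli draw per unit
    return W * keep[None, :] / (1.0 - rate)

def weight_dropout(W, rate):
    # Drop individual weights (single connections) with probability `rate`.
    keep = rng.random(W.shape) >= rate         # one Bernoulli draw per weight
    return W * keep / (1.0 - rate)

W = rng.normal(size=(4, 6))                    # toy layer: 4 inputs, 6 units
print(unit_dropout(W, 0.5))                    # whole columns zeroed
print(weight_dropout(W, 0.5))                  # scattered zeros
```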
  7. Related Work Introduction / Related Work / Methods and Experiments / Conclusion 05 Magnitude-based pruning • Treat the top-k largest-magnitude weights as important (select them with an arg max-k / top-k operation) 1. Unit Pruning - Considers the units (column vectors) of weight matrices under the L2 norm (usually faster, with less computation) 2. Weight Pruning - Considers the entries of each feature vector under the L1 norm (usually more accurate)
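As a rough illustration of the two pruning flavours, here is a sketch assuming the weight matrix stores one unit per column; the function names and the hard top-k masks are mine, not the paper's.

```python
import numpy as np

def unit_prune(W, k):
    # Keep the k columns (units) with the largest L2 norm, zero the rest.
    norms = np.linalg.norm(W, axis=0)
    mask = np.zeros(W.shape[1], dtype=bool)
    mask[np.argsort(norms)[-k:]] = True
    return W * mask[None, :]

def weight_prune(W, k):
    # Keep the k individual weights with the largest magnitude, zero the rest.
    order = np.argsort(np.abs(W).ravel())
    mask = np.zeros(W.size, dtype=bool)
    mask[order[-k:]] = True
    return W * mask.reshape(W.shape)
```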
  8. Related Work Introduction / Related Work / Methods and Experiments / Conclusion 06 Sparsification Methods • L1 regularisation [3] - A cost added to the loss function, intended to drive unimportant weights to zero. • L0 regularisation [4] - Applies an augmentation of Concrete Dropout to the parameters. - Weights follow a Hard-Concrete distribution in which each weight is associated with a gating parameter that determines its drop rate. [3] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015. [4] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l_0 regularization. arXiv preprint arXiv:1712.01312, 2017.
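To make the L1 item concrete, a minimal sketch of a sparsity-inducing penalty added to the task loss (the helper name and the lambda value are illustrative, not taken from [3]):

```python
import numpy as np

def l1_penalty(weight_matrices, lam=1e-4):
    # Sparsity-inducing regulariser: its gradient (lam * sign(w)) pushes
    # unimportant weights towards exactly zero during training.
    return lam * sum(np.abs(W).sum() for W in weight_matrices)

# total_loss = task_loss + l1_penalty([W1, W2, ...])
```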
  9. Related Work Introduction / Related Work / Methods and Experiments / Conclusion 07 Sparsification Methods • Variational Dropout [5] - Applies Gaussian dropout with trainable drop rates to the weights and interprets the model as a variational posterior with a particular prior. • Smallify [6] - Uses trainable gates on weights/units and regularises the gates towards zero using L1 regularisation. - Shown to be extremely effective at reaching high prune rates on VGG networks. [5] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017. [6] Guillaume Leclerc, Manasi Vartak, Raul Castro Fernandez, Tim Kraska, and Samuel Madden. Smallify: Learning network size while training. arXiv preprint arXiv:1806.03723, 2018.
  10. Methods and Experiments Targeted Dropout Introduction / Related Work / Methods and Experiments / Conclusion 08 • Want low-magnitude elements to be able to increase their value if they become important during training. • Introduce stochasticity into the process using two parameters: the targeting proportion γ and the drop probability α. • The targeting proportion γ selects the bottom γ|θ| weights (by magnitude) as candidates for dropout; each candidate is then dropped with probability α. • The expected number of units kept during each round of targeted dropout is (1 − γα)|θ|. • The result is a reduction in the important subnetwork’s dependency on the unimportant subnetwork.
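A minimal NumPy sketch of the weight-level variant as I read it: γ (targeting proportion) selects the smallest-magnitude candidates, and each candidate is dropped independently with probability α. The function name and implementation details are assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def targeted_weight_dropout(W, gamma, alpha):
    # Candidates: the bottom gamma-fraction of weights by magnitude.
    n_target = int(round(gamma * W.size))
    candidates = np.argsort(np.abs(W).ravel())[:n_target]
    # Drop each candidate independently with probability alpha.
    drop = rng.random(n_target) < alpha
    mask = np.ones(W.size, dtype=bool)
    mask[candidates[drop]] = False
    return W * mask.reshape(W.shape)

# In expectation gamma * alpha of the weights are zeroed each step,
# so roughly (1 - gamma * alpha) * |theta| weights are kept.
```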
  11. Methods and Experiments Dependence between the important and unimportant subnetworks Introduction / Related Work / Methods and Experiments / Conclusion 09 • Ideally, the important subnetwork is completely separated from the unimportant one. • Estimate the effect of pruning weights by considering the second-order Taylor expansion of the change in loss: ΔL ≈ ∇L(θ)ᵀ Δθ + ½ Δθᵀ H Δθ. • At the end of training the parameters are near a critical point, so the gradient of the loss with respect to the parameters, ∇L(θ), is approximately zero, leaving only the Hessian term ½ Δθᵀ H Δθ. • Compute the Hessian–weight product matrix as an estimate of weight correlations and network dependence. • Empirically confirm that targeted dropout reduces dependence between the important and unimportant subnetworks by an order of magnitude.
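A toy illustration of the Taylor-expansion argument above (the Hessian and weight values are invented for the example; the paper applies this estimate to real networks): pruning a small weight changes the loss far less than pruning a large one.

```python
import numpy as np

def pruning_loss_change(H, theta, pruned_idx):
    # Second-order estimate of the loss change when the weights at
    # `pruned_idx` are set to zero (delta = -theta on those entries).
    # Near a critical point the gradient term vanishes, so
    #   delta_L ~= 0.5 * delta^T H delta.
    delta = np.zeros_like(theta)
    delta[pruned_idx] = -theta[pruned_idx]
    return 0.5 * delta @ H @ delta

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # made-up Hessian
theta = np.array([0.05, 1.3])              # one small weight, one large weight
print(pruning_loss_change(H, theta, [0]))  # prune the small weight: ~0.0025
print(pruning_loss_change(H, theta, [1]))  # prune the large weight: ~0.845
```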
  12. Methods and Experiments Experiments Introduction / Related Work / Methods and Experiments / Conclusion 10 • Perform experiments using the original ResNet, Wide ResNet, and Transformer architectures applied to the CIFAR-10, ImageNet, and WMT English-German Translation datasets
  13. Methods and Experiments Experiments Introduction / Related Work / Methods and Experiments / Conclusion 11
  14. Methods and Experiments Experiments Introduction / Related Work / Methods and Experiments / Conclusion 12
  15. Methods and Experiments Experiments Introduction / Related Work / Methods and Experiments / Conclusion 13
  16. Methods and Experiments Experiments – scheduling targeted dropout Introduction / Related Work / Methods and Experiments / Conclusion 14 • In the comparison with Smallify, the authors found that scheduling the targeting proportion and the dropout rate can dramatically improve accuracy - annealing the targeting proportion from zero to 95% and the drop rate from 0% to 100% over training.
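A minimal sketch of the kind of linear annealing schedule described on this slide (the helper name and the per-step update are my assumptions):

```python
def linear_ramp(step, total_steps, start, end):
    # Linearly anneal a rate from `start` to `end` over `total_steps`.
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

# e.g. per training step:
#   gamma = linear_ramp(step, total_steps, 0.0, 0.95)  # targeting proportion
#   alpha = linear_ramp(step, total_steps, 0.0, 1.00)  # drop rate
```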
  17. Methods and Experiments Experiments Introduction / Related Work / Methods and Experiments / Conclusion 15 • Comparison with the Random Pruning method (pruning away a random subnetwork before training)
  18. Conclusion Introduction / Related Work / Methods and Experiments / Conclusion • Targeted dropout is a simple and effective regularisation tool for training neural networks that are robust to post hoc pruning. • Targeted dropout performs well across a range of network architectures and tasks. • Showed how dropout can be used as a tool to encode prior structural assumptions into neural networks 16