
Semi orthogonal low-rank matrix factorization for deep neural networks

  1. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohamadi, Sanjeev Khudanpur. Center for Language and Speech Processing, Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA; University of Chinese Academy of Sciences, Beijing, China. 2020/01 陳品媛
  2. 2/31 Outline ■ Introduction ■ Training with semi-orthogonal constraint ■ Factorized model topologies ■ Experimental setup ■ Experiments ■ Conclusion
  3. 3/31 Introduction
  4. 4/31 Introduction ■ Automatic Speech Recognition - Acoustic modeling ■ The authors propose a factored form of TDNN whose layers are compressed via SVD, together with a second idea: skip connections.
  5. 5/31 Introduction - TDNN ■ Time Delay Neural Networks ■ One-dimensional Convolutional Neural Networks (1-d CNNs) ■ The context width increases as we go to upper layers.
  6. 8/31 Introduction - SVD ■ Why SVD in DNNs? • DNNs have a huge computation cost → reduce the model size • A large portion of the weight parameters in a DNN are very small. • Fast computation and small memory usage can be obtained at runtime. [Figure: cumulative % of total singular values vs. number of singular values]
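The size reduction described on this slide can be illustrated with a minimal NumPy sketch: truncate the SVD of a weight matrix to its top k singular values and keep the two factors instead of the full matrix. The dimensions here are illustrative, not taken from any particular model.

```python
import numpy as np

# Low-rank approximation via SVD: discard the small singular values
# and keep only the top k.
rng = np.random.default_rng(0)
W = rng.normal(size=(70, 210))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 25
A = U[:, :k] * s[:k]      # 70 x 25  (columns scaled by singular values)
B = Vt[:k, :]             # 25 x 210
W_approx = A @ B          # best rank-k approximation of W (Eckart-Young)

rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(W_approx.shape, rel_err)
```

The two factors A and B together hold 70*25 + 25*210 = 7000 parameters versus 14700 for W, which is the memory/computation saving that motivates SVD-based compression.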
  7. 10/31 Training with semi-orthogonal constraint
  8. 11/31 Training with semi-orthogonal constraint ■ Basic case - Update equation After every few (specifically, every 4) time-steps of SGD, we apply an efficient update that brings the parameter matrix M closer to being a semi-orthogonal matrix. Define P ≡ MM^T and force P = I (orthogonal matrix property: AA^T = I). Define Q ≡ P − I and minimize the function f = tr(QQ^T), i.e. the sum of squared elements of Q.
  9. 13/31 Training with semi-orthogonal constraint ■ Basic case - Update equation (cont.) With P ≡ MM^T, Q ≡ P − I, f = tr(QQ^T): ∂f/∂Q = 2Q, ∂f/∂P = 2Q, ∂f/∂M = 4QM. (Here the derivative of a scalar w.r.t. a matrix is not transposed w.r.t. that matrix.) The update is M ← M − 4νQM (ν = learning rate); ν = 1/8 leads to quadratic convergence, giving M ← M − (1/2)(MM^T − I)M (1)
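The basic update rule can be sketched in NumPy as below. This is an illustration of the equation on the slide, not Kaldi's implementation; the matrix dimensions are the illustrative ones used later in the deck.

```python
import numpy as np

def semi_orthogonal_step(M):
    # Update (1): M <- M - 1/2 (M M^T - I) M, i.e. one step on
    # f = tr(Q Q^T) with Q = M M^T - I and learning rate nu = 1/8.
    P = M @ M.T
    return M - 0.5 * (P - np.eye(M.shape[0])) @ M

rng = np.random.default_rng(0)
rows, cols = 250, 2100
# Glorot-style initialization keeps M close enough to semi-orthogonal
# that the update converges rather than diverges.
M = rng.normal(0.0, 1.0 / np.sqrt(cols), size=(rows, cols))

for _ in range(10):
    M = semi_orthogonal_step(M)

# Deviation from semi-orthogonality, ||M M^T - I||_F, shrinks
# quadratically toward zero.
print(np.linalg.norm(M @ M.T - np.eye(rows)))
```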
  10. 14/31 Training with semi-orthogonal constraint ■ Basic case - Weight initialization • Update (1), M ← M − (1/2)(MM^T − I)M, diverges if M is too far from being orthonormal to start with, but this does not happen with Glorot-style (Xavier) initialization (σ = 1/√(#cols)). Reference: Understanding the difficulty of training deep feedforward neural networks (Xavier Glorot and Yoshua Bengio)
  11. 16/31 Training with semi-orthogonal constraint ■ Scaled case • Suppose we want M to be a scaled version of a semi-orthogonal matrix, for some specified constant α. Substituting M with (1/α)M in (1) gives M ← M − (1/(2α²))(MM^T − α²I)M (2)
  12. 17/31 Training with semi-orthogonal constraint ■ Floating case • Control how fast the parameters of the various layers change • Apply l2 regularization to the constrained layers • Compute the scale α = √(tr(PP^T)/tr(P)) (3), with P ≡ MM^T, and apply it in (2): M ← M − (1/(2α²))(MM^T − α²I)M
  13. 18/31 Training with semi-orthogonal constraint ■ Floating case Why α = √(tr(PP^T)/tr(P))? M is a matrix with (scaled) orthonormal rows. We pick the scale that gives an update to M that is orthogonal to M (viewed as a vector): i.e., writing the update as M := M + X, we want tr(MX^T) = 0. From (2), X = −(1/(2α²))(MM^T − α²I)M; ignoring the constant −1/(2α²), the condition becomes tr(M M^T (MM^T − α²I)) = 0, i.e. tr(PP^T − α²P) = 0 with P ≡ MM^T, or α² = tr(PP^T)/tr(P), which gives (3).
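The floating case can be checked numerically with a small NumPy sketch (again an illustration of the slide's equations, not Kaldi's code): compute α from equation (3) at each step and apply the scaled update (2). Starting from a matrix whose rows are not unit-norm, M converges to α times a semi-orthogonal matrix, with α determined by the data rather than fixed in advance.

```python
import numpy as np

def floating_step(M):
    # alpha^2 = tr(P P^T) / tr(P), Eq. (3), with P = M M^T,
    # followed by the scaled update of Eq. (2).
    P = M @ M.T
    alpha2 = np.trace(P @ P.T) / np.trace(P)
    M = M - (1.0 / (2.0 * alpha2)) * (P - alpha2 * np.eye(M.shape[0])) @ M
    return M, np.sqrt(alpha2)

rng = np.random.default_rng(1)
# Deliberately initialized with a non-unit scale.
M = rng.normal(0.0, 3.0 / np.sqrt(512), size=(64, 512))

for _ in range(12):
    M, alpha = floating_step(M)

# M M^T is now (approximately) alpha^2 I: a scaled semi-orthogonal matrix.
print(alpha, np.allclose(M @ M.T, alpha**2 * np.eye(64), atol=1e-6))
```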
  14. 19/31 Factorized model topologies
  15. 20/31 Factorized model topologies 1. Basic factorization M = AB, with B constrained to be semi-orthogonal. M: 700 x 2100; A: 700 x 250; B: 250 x 2100. We call 250 the linear bottleneck dimension. 2. Tuning the dimensions Tuning on 300 hrs, we ended up using larger matrix sizes, with a hidden-layer dimension of 1280 or 1536, a linear bottleneck dimension of 256, and more hidden layers. 3. Factorizing the convolution In the part-1 example, the setup uses a constrained 3x1 convolution followed by a 1x1 convolution. We found better results when using a constrained 2x1 convolution followed by a 2x1 convolution.
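The parameter saving of the basic factorization on this slide is simple arithmetic (ignoring biases): the full 700 x 2100 matrix versus the two factors A and B with a 250-dimensional linear bottleneck.

```python
# Full matrix vs. factored M = A B with a 250-dim linear bottleneck.
full = 700 * 2100                   # 1,470,000 parameters
factored = 700 * 250 + 250 * 2100   # 175,000 + 525,000 = 700,000 parameters
print(full, factored, round(factored / full, 3))  # under half the parameters
```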
  16. 21/31 Factorized model topologies 4. 3-stage splicing A constrained 2x1 convolution to dimension 256, followed by another constrained 2x1 convolution to dimension 256, followed by a 2x1 convolution back to the hidden-layer dimension. Even better than part 3. 5. Dropout The dropout mask is shared across time; dropout schedule α (dropout strength): 0 → 0.5 → 0; continuous dropout scale drawn from a uniform distribution on [1 − 2α, 1 + 2α]. 6. Factorizing the final layer Even on very small datasets where factorizing the TDNN layers was not helpful, factorizing the final layer was helpful.
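The time-shared continuous dropout of item 5 can be sketched as follows. This is a NumPy illustration under an assumed batch# x dim x seq_len tensor layout, not the Kaldi implementation: one scale per (batch, dim) pair is drawn uniformly from [1 − 2α, 1 + 2α] and broadcast over the time axis.

```python
import numpy as np

def continuous_dropout_mask(batch, dim, seq_len, alpha, rng):
    # One scale per (batch, dim), shared (broadcast) across all time steps.
    mask = rng.uniform(1.0 - 2.0 * alpha, 1.0 + 2.0 * alpha,
                       size=(batch, dim, 1))
    return np.broadcast_to(mask, (batch, dim, seq_len))

rng = np.random.default_rng(0)
m = continuous_dropout_mask(4, 256, 50, alpha=0.5, rng=rng)
# The same scale appears at every time step for a given (batch, dim):
print(np.all(m[:, :, :1] == m))
```

With α = 0.5 the scales are uniform on [0, 2], so the mask has mean 1 and the layer's expected output is unchanged; the schedule 0 → 0.5 → 0 ramps this noise up and back down over training.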
  17. 22/31 Factorized model topologies 7. Skip connections • Some layers receive as input not just the output of the previous layer but also selected other prior layers (up to 3), which are appended to the previous one. • This helps with the vanishing gradient problem.
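The skip-connection input assembly described above amounts to a concatenation along the feature axis; a minimal sketch (dimensions are illustrative, not the paper's exact configuration):

```python
import numpy as np

def layer_input_with_skips(prev_out, skip_outs):
    # Append the selected earlier layers' outputs (up to 3) to the
    # previous layer's output along the feature dimension.
    return np.concatenate([prev_out] + list(skip_outs), axis=-1)

prev = np.ones((8, 1280))                       # previous layer: (batch, dim)
skips = [np.ones((8, 256)), np.ones((8, 256))]  # two selected earlier layers
x = layer_input_with_skips(prev, skips)
print(x.shape)  # (8, 1792)
```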
  18. 23/31 Experiments
  19. 24/31 Experiments • Experimental setup 1. Basic factorization: Switchboard (300 hours); Fisher+Switchboard (2000 hours); MATERIAL • two low-resource languages: Swahili and Tagalog • 80 hours each
  20. 25/31 Experiments 2. Comparing model types
  21. 26/31 Conclusions
  22. 27/31 Conclusions 1. Factorized TDNN (TDNN-F): an effective way to train networks with parameter matrices represented as the product of two or more smaller matrices, with all but one of the factors constrained to be semi-orthogonal. 2. Skip connections can mitigate the vanishing gradient problem. 3. A dropout mask that is shared across time. 4. Better results and faster decoding.
  23. 28/31 Appendix
  24. 29/31 Appendix Factorizing the convolution: time-stride = 1: 1024-128 time-offsets = -1, 0; 128-1024 time-offsets = 0, 1. time-stride = 3: 1024-128 time-offsets = -3, 0; 128-1024 time-offsets = 0, 3. time-stride = 0: 1024-128 time-offset = 0; 128-1024 time-offset = 0. TDNN-F structure; time-shared dropout; weighted sum of the input and the output.
  25. 30/31 Factorized model topologies 5. Dropout The dropout mask is shared across time (mask dimensions: batch# × dim × seq_len); dropout schedule α (dropout strength): 0 → 0.5 → 0; continuous dropout scale drawn from a uniform distribution on [1 − 2α, 1 + 2α]. 6. Factorizing the final layer Even on very small datasets where factorizing the TDNN layers was not helpful, factorizing the final layer was helpful.

Editor's notes

  • 2018 Interspeech
  • Subsampling: the context windows at adjacent time steps overlap to a large extent, so we can subsample and keep only some of the connections; this approximates the original model while greatly reducing its computational cost.

    https://www.twblogs.net/a/5c778577bd9eee3399183f67
  • Eigendecomposition of a matrix: https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix

    low rank: obtained when the small but non-zero eigenvalues are discarded
  • where Σ is a diagonal matrix with A's singular values on the diagonal in decreasing order.
    The m columns of U and the n columns of V are called the left-singular vectors and right-singular vectors of A, respectively.
  • Previous work applied SVD to an already-trained model; this paper instead trains with this structure directly, so let's look at how the update is done.
  • Product of independent variables

    implemented in the Caffe library, not in the paper
    https://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
  • Because we use batchnorm and because the ReLU nonlinearity is scale invariant, l2 does not have
    a true regularization effect when applied to the hidden layers; but it reduces the scale of the parameter matrix which makes
    it learn faster
  • tr(MX^T) = 0 is the definition of orthogonality here

    https://github.com/kaldi-asr/kaldi/blob/7b762b1b32140cbf8fbf4c72b713b4bd18c71104/src/nnet3/nnet-utils.cc#L1009
  • 2. The current implementation is 1280*128
  • In the current Kaldi implementation, the "3-stage splicing" aspect and the skip connections were taken out.
    https://groups.google.com/forum/#!topic/kaldi-help/gBinGgj6Xy4
  • Eval2000: full HUB5'00 evaluation set (also known as Eval2000) and its “switchboard” subset
    RT03: test set (LDC2007S10)
  • In order: 1*1, 2*1, 3*1
