1. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks
Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li,
Hainan Xu, Mahsa Yarmohammadi, Sanjeev Khudanpur
Center for Language and Speech Processing,
Human Language Technology Center of Excellence,
Johns Hopkins University, Baltimore, MD, USA
University of Chinese Academy of Sciences, Beijing, China
2020/01 陳品媛
4. 4/31
Introduction
■ Automatic Speech Recognition - Acoustic modeling
■ The authors propose a factored form of TDNN whose layers are
compressed via SVD, plus one further idea: skip connections.
5. 5/31
Introduction - TDNN
■ Time Delay Neural Networks
■ One-dimensional Convolutional Neural Networks (1-d CNNs)
■ The context width increases as we go to upper layers.
6. 8/31
Introduction - SVD
■ Why SVD in DNN?
• DNNs have a huge computation cost → reduce the model size.
• A large portion of the weight parameters in a DNN are very small.
• Fast computation and small memory usage can be obtained for
runtime evaluation (see the SVD sketch below).
[Figure: % of total singular values retained vs. number of singular values kept]
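As a rough illustration of this low-rank idea (not taken from the paper), here is a minimal numpy sketch that compresses a weight matrix by keeping only its largest singular values; the 700x2100 shape and the rank of 250 are just example numbers.

```python
import numpy as np

# Hypothetical weight matrix; shapes are illustrative only.
W = np.random.default_rng(0).normal(size=(700, 2100))

# SVD: W = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Low-rank approximation: keep only the top-k singular values.
k = 250
A = U[:, :k] * s[:k]   # 700 x 250
B = Vt[:k, :]          # 250 x 2100
W_approx = A @ B       # rank-k approximation of W

# Parameter count drops from 700*2100 to 700*250 + 250*2100.
print(A.size + B.size, W.size)
print(np.linalg.norm(W - W_approx) / np.linalg.norm(W))  # relative error
```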
8. 11/31
Training with semi-orthogonal constraint
■ Basic case - Update equation
After every few (specifically, every 4) time-steps of SGD, we
apply an efficient update that brings M closer to being a semi-
orthogonal matrix.
P ≡ MM^T   (M = parameter matrix)
Force P = I (orthogonal-matrix property: AA^T = I)
Q ≡ P − I
Minimize the function f = tr(QQ^T), i.e. the sum of squared elements of Q
9. 13/31
Training with semi-orthogonal constraint
■ Basic case - Update equation (cont.)
• Convention: the derivative of a scalar w.r.t. a matrix is not transposed
w.r.t. that matrix.
Recall P ≡ MM^T, Q ≡ P − I, f = tr(QQ^T). Then
∂f/∂Q = 2Q,   ∂f/∂P = 2Q,   ∂f/∂M = 4QM
Gradient step: M ← M − 4νQM   (ν = learning rate)
• ν = 1/8 leads to quadratic convergence, giving
M ← M − (1/2)(MM^T − I)M   (1)
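A minimal numpy sketch of update (1) under the definitions above; the 250x2100 shape and the number of constraint steps are only illustrative:

```python
import numpy as np

def orthonormal_step(M):
    """One step of update (1): M <- M - 0.5 * (M M^T - I) M."""
    P = M @ M.T
    Q = P - np.eye(M.shape[0])
    return M - 0.5 * Q @ M

# Glorot-style start (std = 1/sqrt(#cols)), so M is not too far from semi-orthogonal.
rng = np.random.default_rng(0)
M = rng.normal(scale=1.0 / np.sqrt(2100), size=(250, 2100))

for _ in range(4):   # quadratic convergence: a few steps suffice
    M = orthonormal_step(M)

Q = M @ M.T - np.eye(250)
print(np.trace(Q @ Q.T))   # f = tr(Q Q^T), now close to 0
```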
10. 14/31
Training with semi-orthogonal constraint
■ Basic case - Weight Initialization
• Update (1) can diverge if M is too far from being orthonormal to start
with, but this does not happen when using Glorot-style (Xavier)
initialization (σ = 1/√(#cols)).
M ← M − (1/2)(MM^T − I)M   (1)
Reference: Understanding the difficulty of training deep feedforward neural networks (Xavier Glorot and Yoshua Bengio)
11. 16/31
Training with semi-orthogonal constraint
■ Scale case
• Suppose we want M to be a scaled version of a semi-orthogonal matrix,
i.e. α times a semi-orthogonal matrix for some specified constant α.
Substituting (1/α)M for M in (1) gives
M ← M − (1/(2α²))(MM^T − α²I)M   (2)
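A small sketch of update (2) with a fixed, illustrative scale α; repeated application drives M toward α times a semi-orthogonal matrix (shapes and the value of α are made-up):

```python
import numpy as np

def scaled_orthonormal_step(M, alpha):
    """Update (2): M <- M - 1/(2 alpha^2) * (M M^T - alpha^2 I) M."""
    P = M @ M.T
    return M - (P - alpha**2 * np.eye(M.shape[0])) @ M / (2.0 * alpha**2)

M = np.random.default_rng(0).normal(scale=1.0 / np.sqrt(512), size=(128, 512))
for _ in range(8):
    M = scaled_orthonormal_step(M, alpha=2.0)
print(np.linalg.norm(M @ M.T - 4.0 * np.eye(128)))  # M M^T is now close to alpha^2 I
```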
12. 17/31
Training with semi-orthogonal constraint
■ Floating case
• Controls how fast the parameters of the various layers change.
• Apply l2 regularization to the constrained layers.
• Compute the scale α from P ≡ MM^T and apply it in (2):
M ← M − (1/(2α²))(MM^T − α²I)M   (2)
α = √(tr(PP^T) / tr(P))   (3)
13. 18/31
Training with semi-orthogonal constraint
■ Floating case
Why α = √(tr(PP^T)/tr(P)) ?
M is a matrix with orthonormal rows (up to the scale α). We pick the scale
that gives an update to M that is orthogonal to M (viewed as a vector):
i.e., writing the update as M := M + X, we want tr(MX^T) = 0.
From (2), M ← M − (1/(2α²))(MM^T − α²I)M, so X ∝ (MM^T − α²I)M
(ignoring the constant −1/(2α²)).
With P ≡ MM^T, requiring tr(M M^T (MM^T − α²I)) = 0 gives
tr(PP^T − α²P) = 0, i.e. α² = tr(PP^T)/tr(P), so
α = √(tr(PP^T) / tr(P))   (3)
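To see (2) and (3) together, here is a small numpy sketch (shapes arbitrary) that computes α from P and applies the update; it also checks the property the derivation asks for, tr(MX^T) ≈ 0:

```python
import numpy as np

def floating_orthonormal_step(M):
    """Floating case: alpha^2 from (3), then update (2)."""
    P = M @ M.T
    alpha2 = np.trace(P @ P.T) / np.trace(P)   # alpha^2 = tr(P P^T) / tr(P)
    return M - (P - alpha2 * np.eye(M.shape[0])) @ M / (2.0 * alpha2)

rng = np.random.default_rng(1)
M = 3.0 * rng.normal(scale=1.0 / np.sqrt(512), size=(128, 512))   # deliberately scaled

X = floating_orthonormal_step(M) - M   # the update applied to M
print(np.trace(M @ X.T))               # ~0: the update is orthogonal to M
```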
15. 20/31
Factorized model topologies
1. Basic factorization
M = AB, with B constrained to be semi-orthogonal (see the sketch after this list).
M: 700 x 2100
A: 700 x 250, B: 250 x 2100
We call 250 the linear bottleneck dimension.
2. Tuning the dimensions
Tuning on 300 hours of data, we ended up using larger matrix sizes, with a hidden-layer dimension
of 1280 or 1536, a linear bottleneck dimension of 256, and more hidden layers.
3. Factorizing the convolution
In the part-1 example, the setup uses a constrained 3x1 convolution followed by a 1x1 convolution.
We found better results when using a constrained 2x1 convolution followed by a 2x1
convolution.
[Figure: the 700 x 2100 matrix M factored into A (700 x 250) and B (250 x 2100)]
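A numpy sketch of the basic factorization (item 1) with the dimensions quoted above; the matrices here are random stand-ins, with B being the factor that would carry the semi-orthogonal constraint during training:

```python
import numpy as np

rng = np.random.default_rng(2)

# Factored layer y = A (B x), replacing a single 700 x 2100 matrix M = A B.
B = rng.normal(scale=1.0 / np.sqrt(2100), size=(250, 2100))  # constrained semi-orthogonal
A = rng.normal(scale=1.0 / np.sqrt(250), size=(700, 250))    # unconstrained

x = rng.normal(size=2100)   # spliced input (e.g. 3 frames of 700 dims)
bottleneck = B @ x          # 250-dim linear bottleneck
y = A @ bottleneck          # back to the 700-dim hidden layer

# Parameters: 700*250 + 250*2100 = 700,000 vs. 700*2100 = 1,470,000 for M.
print(A.size + B.size, 700 * 2100)
```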
16. 21/31
Factorized model topologies
4. 3-stage splicing
A constrained 2x1 convolution to dimension 256, followed by another constrained 2x1 convolution
to dimension 256, followed by a 2x1 convolution back to the hidden-layer dimension.
Even better than part 3.
5. Dropout
The dropout mask is shared across time.
Dropout schedule for α (dropout strength): 0 → 0.5 → 0.
Continuous dropout scale drawn from a uniform distribution on [1 − 2α, 1 + 2α].
6. Factorizing the final layer
Even with very small datasets in which factorizing the TDNN layers was not helpful, factorizing the
final layer was helpful.
[Figure: layer diagram with hidden dimension 1280 and bottleneck dimension 256]
17. 22/31
Factorized model topologies
7. Skip connections
• Some layers receive as input not just the output of the previous layer, but also the outputs of
selected other prior layers (up to 3), which are appended to the previous layer's output (see the sketch below).
• This helps mitigate the vanishing-gradient problem.
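A tiny numpy sketch of this wiring; the layer dimension and the choice of which prior layers are appended are made-up examples:

```python
import numpy as np

def layer_input(outputs, skip_from):
    """Concatenate the previous layer's output with up to 3 selected earlier outputs."""
    prev = outputs[-1]
    skips = [outputs[i] for i in skip_from]
    return np.concatenate([prev] + skips)

# Hypothetical per-layer outputs, 1280 dims each.
outputs = [np.zeros(1280) for _ in range(6)]
x = layer_input(outputs, skip_from=[1, 3])   # previous layer + layers 1 and 3
print(x.shape)                               # (3840,)
```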
22. 27/31
Conclusions
1. Factorized TDNN (TDNN-F): an effective way to train networks
with parameter matrices represented as the product of two or
more smaller matrices, with all but one of the factors
constrained to be semi-orthogonal.
2. Skip connections help with the vanishing-gradient problem.
3. A dropout mask that is shared across time.
4. Better results and faster decoding.
24. 29/31
Appendix
Factorizing the convolution (see the sketch below):
time-stride = 1:  1024→128 with time-offsets −1, 0;  128→1024 with time-offsets 0, +1
time-stride = 3:  1024→128 with time-offsets −3, 0;  128→1024 with time-offsets 0, +3
time-stride = 0:  1024→128 with time-offset 0;  128→1024 with time-offset 0
[Figure: TDNN-F structure with time-shared dropout and a weighted sum of the layer input and output]
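One way (an assumption, not the actual Kaldi code) to realize the factorized 2x1 convolutions in the table above is to splice frames at the listed time-offsets and apply linear maps; the 1024/128 dimensions come from the slide:

```python
import numpy as np

def spliced_linear(W, frames, offsets, t):
    """Apply W to the concatenation of frames at t + offsets (a kx1 'convolution')."""
    return W @ np.concatenate([frames[t + o] for o in offsets])

rng = np.random.default_rng(3)
hidden, bottleneck, T = 1024, 128, 10
frames = [rng.normal(size=hidden) for _ in range(T)]

# time-stride = 1: constrained 2x1 conv (offsets -1, 0) down to 128,
# then a 2x1 conv (offsets 0, +1) back up to 1024.
W1 = rng.normal(scale=1.0 / np.sqrt(2 * hidden), size=(bottleneck, 2 * hidden))
W2 = rng.normal(scale=1.0 / np.sqrt(2 * bottleneck), size=(hidden, 2 * bottleneck))

t = 5
mid = [spliced_linear(W1, frames, [-1, 0], tt) for tt in (t, t + 1)]  # bottleneck at t, t+1
out = W2 @ np.concatenate(mid)   # output at time t
print(out.shape)                 # (1024,)
```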
25. 30/31
Factorized model topologies
5. Dropout (see the sketch below)
The dropout mask is shared across time.
Dropout schedule for α (dropout strength): 0 → 0.5 → 0.
Continuous dropout scale drawn from a uniform distribution on [1 − 2α, 1 + 2α].
Tensor layout: batch# × dim × seq_len; the mask is drawn per (batch, dim) and shared along seq_len.
6. Factorizing the final layer
Even with very small datasets in which factorizing the TDNN layers
was not helpful, factorizing the final layer was helpful.
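A numpy sketch of the time-shared dropout in item 5, using the batch# × dim × seq_len layout from the slide: one continuous scale per (sequence, dimension) is drawn from [1 − 2α, 1 + 2α] and broadcast over time (the shapes and the value of α are illustrative):

```python
import numpy as np

def time_shared_dropout(x, alpha, rng):
    """x has shape (batch#, dim, seq_len); the scale is shared across seq_len."""
    if alpha == 0.0:
        return x
    scale = rng.uniform(1.0 - 2.0 * alpha, 1.0 + 2.0 * alpha,
                        size=(x.shape[0], x.shape[1], 1))
    return x * scale   # the same scale multiplies every time step

rng = np.random.default_rng(4)
x = rng.normal(size=(8, 1280, 50))               # batch# x dim x seq_len
y = time_shared_dropout(x, alpha=0.5, rng=rng)   # alpha follows the 0 -> 0.5 -> 0 schedule
print(y.shape)
```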
Eigendecomposition of a matrix: https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix
low rank: obtained by discarding the small but non-zero eigenvalues
where Σ is a diagonal matrix with A's singular values on the diagonal in decreasing order.
The m columns of U and the n columns of V are called the left-singular vectors and right-singular vectors of A, respectively.
Product of independent variables
Implemented in the Caffe library, not in the paper
https://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
Because we use batchnorm and because the ReLU nonlinearity is scale invariant, l2 does not have
a true regularization effect when applied to the hidden layers; but it reduces the scale of the parameter matrix which makes
it learn faster
In the current Kaldi implementation, the "3-stage splicing" aspect and the skip connections were taken out.
https://groups.google.com/forum/#!topic/kaldi-help/gBinGgj6Xy4
Eval2000: full HUB5'00 evaluation set (also known as Eval2000) and its “switchboard” subset
RT03: test set (LDC2007S10)