Semi-Orthogonal Low-Rank
Matrix Factorization for Deep
Neural Networks
Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li,
Hainan Xu, Mahsa Yarmohamadi, Sanjeev Khudanpur
Center for Language and Speech Processing,
Human Language Technology Center of Excellence,
Johns Hopkins University, Baltimore, MD, USA
University of Chinese Academy of Sciences, Beijing, China
2020/01 陳品媛
2/31
Outline
■ Introduction
■ Training with semi-orthogonal constraint
■ Factorized model topologies
■ Experimental setup
■ Experiments
■ Conclusion
3/31
Introduction
4/31
Introduction
■ Automatic Speech Recognition - Acoustic modeling
■ The authors propose a factored form of TDNN whose layers are
compressed via SVD, plus one further idea: skip connections.
5/31
Introduction - TDNN
■ Time Delay Neural Networks
■ One-dimensional Convolutional Neural Networks (1-d CNNs)
■ The context width increases as we go to upper layers.
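To make the context growth concrete, here is a minimal NumPy sketch (my illustration, not from the slides) of a TDNN layer as frame splicing plus a shared affine transform; stacking such layers widens the receptive field:

```python
import numpy as np

def tdnn_layer(x, weight, offsets):
    """One TDNN layer: splice the frames at the given time offsets and
    apply a shared affine transform. x: (T, in_dim) -> (T', out_dim)."""
    lo, hi = min(offsets), max(offsets)
    rows = []
    for t in range(-lo, x.shape[0] - hi):
        spliced = np.concatenate([x[t + o] for o in offsets])
        rows.append(weight @ spliced)
    return np.stack(rows)

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 40))              # 100 frames of 40-dim features
w1 = rng.standard_normal((64, 3 * 40)) * 0.1
w2 = rng.standard_normal((64, 3 * 64)) * 0.1
h1 = tdnn_layer(x, w1, offsets=(-1, 0, 1))      # each output frame sees 3 input frames
h2 = tdnn_layer(h1, w2, offsets=(-1, 0, 1))     # ... and now 5 input frames
print(h1.shape, h2.shape)                       # (98, 64) (96, 64)
```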
8/31
Introduction - SVD
■ Why SVD in DNN?
• DNNs have a huge computation cost → reduce the model size
• A large portion of the weight parameters in a DNN are very small.
• Fast computation and low memory usage can be obtained for
runtime evaluation.
[Figure: % of total singular values vs. number of singular values, with markers at 15% and 40%]
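As a rough illustration of SVD-based compression (a sketch of the general idea, not the paper's training method, which comes next):

```python
import numpy as np

def svd_compress(W, rank):
    """Approximate W (m x n) by A @ B, keeping only the largest
    `rank` singular values. A: (m, rank), B: (rank, n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]     # fold the singular values into A
    B = Vt[:rank]                  # rows of B are orthonormal
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((700, 2100))
A, B = svd_compress(W, rank=250)
# Parameter count drops from 700*2100 = 1.47M to 700*250 + 250*2100 = 0.7M.
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(A.shape, B.shape, round(rel_err, 3))
```

Note that B = Vt[:rank] has orthonormal rows, i.e. it is exactly semi-orthogonal; this is what motivates the constraint used during training below.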
10/31
Training with semi-orthogonal constraint
11/31
Training with semi-orthogonal constraint
■ Basic case - Update equation
After every few (specifically, every 4) time-steps of SGD, we
apply an efficient update that brings the parameter matrix M closer
to being a semi-orthogonal matrix.

Define $P \equiv M M^T$ and force $P = I$ (the semi-orthogonality
property $M M^T = I$). Let $Q \equiv P - I$ and minimize the
function $f = \mathrm{tr}(Q Q^T)$, i.e. the sum of squared elements of $Q$.
13/31
Training with semi-orthogonal constraint
■ Basic case - Update equation (cont.)
• Convention: the derivative of a scalar w.r.t. a matrix is not
transposed w.r.t. that matrix.
• $\nu = \frac{1}{8}$ leads to quadratic convergence.

With $P \equiv M M^T$, $Q \equiv P - I$, $f = \mathrm{tr}(Q Q^T)$:

$\frac{\partial f}{\partial Q} = 2Q, \qquad \frac{\partial f}{\partial P} = 2Q, \qquad \frac{\partial f}{\partial M} = 4QM$

Gradient step: $M \leftarrow M - 4\nu Q M$ ($\nu$ = learning rate)

With $\nu = \frac{1}{8}$:

$M \leftarrow M - \tfrac{1}{2}(M M^T - I)M \qquad (1)$
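A minimal NumPy sketch of update (1) as the slide describes it, applied every 4 SGD time-steps (variable names are mine, not Kaldi's):

```python
import numpy as np

def orthonormality_error(M):
    """f = tr(Q Q^T), the sum of squared elements of Q = M M^T - I."""
    Q = M @ M.T - np.eye(M.shape[0])
    return float(np.sum(Q * Q))

def constrain_semi_orthogonal(M):
    """Update (1): M <- M - 1/2 (M M^T - I) M, i.e. nu = 1/8 folded in."""
    Q = M @ M.T - np.eye(M.shape[0])
    return M - 0.5 * Q @ M

rng = np.random.default_rng(0)
M = rng.standard_normal((250, 2100)) / np.sqrt(2100)  # Glorot-style init

for step in range(1, 13):
    # ... a normal SGD step on the task loss would go here ...
    if step % 4 == 0:                 # every 4th time-step of SGD
        M = constrain_semi_orthogonal(M)
        print(step, orthonormality_error(M))   # error shrinks rapidly
```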
14/31
Training with semi-orthogonal constraint
■ Basic case - Weight Initialization
• Update (1) can diverge if M is too far from being orthonormal to
start with, but this does not happen when using Glorot-style
initialization (Xavier initialization), $\sigma = 1/\sqrt{\#\mathrm{cols}}$.

$M \leftarrow M - \tfrac{1}{2}(M M^T - I)M \qquad (1)$
Reference: Understanding the difficulty of training deep feedforward neural networks (Xavier Glorot and Yoshua Bengio)
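A quick way to see this behaviour (my illustration): run update (1) from a Glorot-scaled start and from a badly scaled one:

```python
import numpy as np

def run_update_1(M, steps=20):
    """Iterate update (1); return the final orthonormality error."""
    I = np.eye(M.shape[0])
    for _ in range(steps):
        M = M - 0.5 * (M @ M.T - I) @ M
        if not np.isfinite(M).all():
            return float("inf")        # the iteration blew up
    Q = M @ M.T - I
    return float(np.sum(Q * Q))

rng = np.random.default_rng(0)
good = rng.standard_normal((250, 2100)) / np.sqrt(2100)  # sigma = 1/sqrt(#cols)
bad = 2.0 * rng.standard_normal((250, 2100))             # far from orthonormal
print(run_update_1(good))   # converges: error ~ 0
print(run_update_1(bad))    # diverges: inf
```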
16/31
Training with semi-orthogonal constraint
■ Scale case
• Suppose we want M to be a scaled version of a semi-orthogonal
matrix. Substituting $\frac{1}{\alpha}M$ for $M$ in (1), for some
specified constant $\alpha$, gives:

$M \leftarrow M - \frac{1}{2\alpha^2}(M M^T - \alpha^2 I)M \qquad (2)$
17/31
Training with semi-orthogonal constraint
■ Floating case
• Control how fast the parameters of the various layers change
• Apply l2 regularization to the constrained layers
• Compute scale 𝛼 and apply to (2)
$M \leftarrow M - \frac{1}{2\alpha^2}(M M^T - \alpha^2 I)M \qquad (2)$

With $P \equiv M M^T$:

$\alpha = \sqrt{\frac{\mathrm{tr}(P P^T)}{\mathrm{tr}(P)}} \qquad (3)$
18/31
Training with semi-orthogonal constraint
■ Floating case
Why $\alpha = \sqrt{\mathrm{tr}(P P^T)/\mathrm{tr}(P)}$?

$M$ is, up to scale, a matrix with orthonormal rows. We pick the
scale that will give us an update to M that is orthogonal to M
(viewed as a vector): i.e., writing update (2) as $M := M + X$,
we want $\mathrm{tr}(M X^T) = 0$.

Ignoring the constant factor $-\frac{1}{2\alpha^2}$ in $X$ and with
$P \equiv M M^T$:

$\mathrm{tr}\!\left(M \left((M M^T - \alpha^2 I) M\right)^T\right) = \mathrm{tr}(P P^T - \alpha^2 P) = 0$

so $\alpha^2 = \mathrm{tr}(P P^T)/\mathrm{tr}(P)$, giving

$\alpha = \sqrt{\frac{\mathrm{tr}(P P^T)}{\mathrm{tr}(P)}} \qquad (3)$
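A NumPy sketch combining (2) and (3) for the floating case (again, names are mine):

```python
import numpy as np

def constrain_floating(M):
    """Floating-case update: compute alpha from (3), then apply (2),
    pulling M toward alpha times a semi-orthogonal matrix."""
    P = M @ M.T
    alpha2 = np.trace(P @ P.T) / np.trace(P)      # alpha^2 = tr(PP^T)/tr(P)
    Q = P - alpha2 * np.eye(M.shape[0])
    return M - (0.5 / alpha2) * Q @ M

rng = np.random.default_rng(0)
M = 3.0 * rng.standard_normal((250, 2100)) / np.sqrt(2100)  # overall scale ~ 3
for _ in range(6):
    M = constrain_floating(M)

P = M @ M.T
alpha2 = np.trace(P @ P.T) / np.trace(P)
print(np.sqrt(alpha2))                          # the scale "floats", staying near 3
print(np.abs(P - alpha2 * np.eye(250)).max())   # P ~ alpha^2 I
```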
19/31
Factorized model topologies
20/31
Factorized model topologies
1. Basic factorization
M=AB, with B constrained to be semi-orthogonal
M: 700 x 2100
A: 700 x 250, B: 250 x 2100
We call 250 the linear bottleneck dimension (see the sketch after this list).
2. Tuning the dimensions
Tuning on the 300-hour setup, we ended up using larger matrix sizes, with a hidden-layer dimension
of 1280 or 1536, a linear bottleneck dimension of 256, and more hidden layers.
3. Factorizing the convolution
In the item-1 example, the setup uses constrained 3x1 convolutions followed by a 1x1 convolution.
We found better results when using a constrained 2x1 convolution followed by a 2x1
convolution.
[Diagram: M (700×2100) factored as A (700×250) × B (250×2100)]
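A shape-level sketch of item 1, ignoring the time dimension (the real layers are convolutions over time):

```python
import numpy as np

rng = np.random.default_rng(0)
# M (700 x 2100) is replaced by A @ B with a 250-dim linear bottleneck.
A = rng.standard_normal((700, 250)) / np.sqrt(250)     # trained freely
B = rng.standard_normal((250, 2100)) / np.sqrt(2100)   # kept semi-orthogonal

def factored_layer(x):
    """B first projects the 2100-dim input down to the 250-dim
    bottleneck; A then expands it to the 700-dim output."""
    return A @ (B @ x)

x = rng.standard_normal(2100)
print(factored_layer(x).shape)        # (700,)
# 700*250 + 250*2100 = 0.7M parameters vs. 700*2100 = 1.47M unfactored.
```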
21/31
Factorized model topologies
4. 3-stage splicing
A constrained 2x1 convolution to dimension 256, followed by another constrained 2x1 convolution
to dimension 256, followed by a 2x1 convolution back to the hidden-layer dimension.
Even better than part 3.
5. Dropout
The dropout mask is shared across time.
Dropout schedule $\alpha$ (dropout strength): 0 → 0.5 → 0
Continuous dropout scale drawn from a uniform distribution on $[1-2\alpha,\ 1+2\alpha]$ (see the sketch after this list)
6. Factorizing the final layer
Even with very small datasets in which factorizing the TDNN layers was not helpful, factorizing the
final layer was helpful.
[Diagram: 3-stage splicing — hidden dimension 1280 reduced to 256-dim bottlenecks and back to 1280]
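A sketch of item 5's continuous, time-shared dropout, under my reading of the slide (one scale per (sequence, dimension) pair, shared across all time steps):

```python
import numpy as np

def time_shared_dropout(x, alpha, rng):
    """x: (batch, dim, seq_len). Draw one scale per (batch, dim) from
    Uniform(1 - 2*alpha, 1 + 2*alpha) and broadcast it over time."""
    if alpha == 0.0:
        return x                       # schedule endpoints: no dropout
    scale = rng.uniform(1 - 2 * alpha, 1 + 2 * alpha,
                        size=(x.shape[0], x.shape[1], 1))
    return x * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1280, 50))        # batch# x dim x seq_len
y = time_shared_dropout(x, alpha=0.5, rng=rng)
ratios = y[0, 0] / x[0, 0]
print(np.allclose(ratios, ratios[0]))         # True: same scale at every time step
```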
22/31
Factorized model topologies
7. Skip connections
• Some layers receive as input not just the output of the previous layer but also selected
earlier layers (up to 3), which are appended to the previous layer's output.
• This helps mitigate the vanishing-gradient problem (see the sketch below).
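A sketch of the idea (which earlier layers are selected is a tuning choice; the dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

def affine(in_dim, out_dim):
    W = rng.standard_normal((out_dim, in_dim)) / np.sqrt(in_dim)
    return lambda v: np.tanh(W @ v)

def forward_with_skips(x, layers, skip_from):
    """skip_from[i] lists indices of earlier activations (0 = the input)
    appended to layer i's input."""
    acts = [x]
    for i, layer in enumerate(layers):
        parts = [acts[-1]] + [acts[j] for j in skip_from.get(i, [])]
        acts.append(layer(np.concatenate(parts)))
    return acts[-1]

# Layer 2 sees the previous output plus two earlier activations appended,
# so its input dimension is 3 * dim.
layers = [affine(dim, dim), affine(dim, dim), affine(3 * dim, dim)]
skip_from = {2: [0, 1]}
y = forward_with_skips(rng.standard_normal(dim), layers, skip_from)
print(y.shape)   # (64,)
```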
23/31
Experiments
24/31
Experiments
• Experimental setup
1. basic factorization
Switchboard: 300 hours
Fisher+Switchboard: 2000 hours
MATERIAL: two low-resource languages (Swahili and Tagalog), 80 hours each
25/31
Experiments
2. Comparing model types
26/31
Conclusions
27/31
Conclusions
1. Factorized TDNN (TDNN-F): an effective way to train networks
with parameter matrices represented as the product of two or
more smaller matrices, with all but one of the factors
constrained to be semi-orthogonal.
2. Skip connections help with the vanishing-gradient problem.
3. A dropout mask that is shared across time.
4. Better results and faster decoding.
28/31
Appendix
29/31
Appendix
Factorizing the convolution
time-stride = 1:  1024→128: time-offsets −1, 0;  128→1024: time-offsets 0, 1
time-stride = 3:  1024→128: time-offsets −3, 0;  128→1024: time-offsets 0, 3
time-stride = 0:  1024→128: time-offset 0;  128→1024: time-offset 0
[Diagram: TDNN-F structure — time-shared dropout, and a weighted sum of the layer's input and output]
30/31
Factorized model topologies
5. Dropout
The dropout mask is shared across time.
Dropout schedule $\alpha$ (dropout strength): 0 → 0.5 → 0
Continuous dropout scale drawn from a uniform distribution on $[1-2\alpha,\ 1+2\alpha]$
Input tensor shape: batch# × dim × seq_len
6. Factorizing the final layer
Even with very small datasets in which factorizing the TDNN layers
was not helpful, factorizing the final layer was helpful.
[Diagram: the dropout mask varies over batch# and dim but is shared along seq_len]
Editor's Notes
  1. Interspeech 2018.
  2. Subsampling: the context windows at adjacent time steps overlap heavily, so we can subsample and keep only some of the connections; this approximates the original model while greatly reducing its computation. https://www.twblogs.net/a/5c778577bd9eee3399183f67
  3. Eigendecomposition of a matrix: https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix — low rank: obtained by discarding the small but nonzero eigenvalues.
  4. where Σ is a diagonal matrix with A's singular values on the diagonal in decreasing order. The m columns of U and the n columns of V are called the left-singular vectors and right-singular vectors of A, respectively.
  5. Previous work applied SVD to a model after training; this paper instead trains the factored architecture directly, hence the discussion of how the update is done.
  6. Product of independent variables; implemented in the Caffe library, not in the paper. https://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
  7. Because we use batchnorm and because the ReLU nonlinearity is scale invariant, l2 does not have a true regularization effect when applied to the hidden layers; but it reduces the scale of the parameter matrix, which makes it learn faster.
  8. tr(MXᵀ) = 0 is the definition of orthogonality. https://github.com/kaldi-asr/kaldi/blob/7b762b1b32140cbf8fbf4c72b713b4bd18c71104/src/nnet3/nnet-utils.cc#L1009
  9. (re: item 2) The current implementation is 1280×128.
  10. In the current Kaldi implementation, the "3-stage splicing" aspect and the skip connections were taken out. https://groups.google.com/forum/#!topic/kaldi-help/gBinGgj6Xy4
  11. Eval2000: the full HUB5'00 evaluation set (also known as Eval2000) and its "switchboard" subset. RT03: test set (LDC2007S10).
  12. In order: 1×1, 2×1, 3×1.