Vanishing & Exploding Gradients
Vanishing Gradients – What?
1. “Vanishing” means disappearing. Vanishing gradients means that the error gradients become so small that we can barely see any update
on the weights (refer to the gradient descent equation). Hence, convergence is not achieved.
2. Before going further, note that when we multiply numbers that lie between 0 and 1, the product is smaller than both of the input
numbers (for example, 0.4 × 0.5 = 0.2).
3. Let’s assume the network shown on the next page, with sigmoid activation used across the network layers. The sigmoid squashes its
output z to the range (0, 1), and its derivative lies between 0 and 0.25 (tanh is similarly bounded, with a derivative of at most 1).
Multiplying any number by these derivatives shrinks it in absolute terms, as seen in step 2.
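To make points 2 and 3 concrete, below is a minimal NumPy sketch (not part of the original slides; the ten-layer figure is an arbitrary assumption) that checks the 0.25 bound on the sigmoid derivative and shows how a product of such factors collapses toward zero.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_derivative(z):
        s = sigmoid(z)
        return s * (1.0 - s)  # peaks at 0.25 when z = 0

    z = np.linspace(-10, 10, 1001)
    print(sigmoid_derivative(z).max())  # ~0.25: the derivative never exceeds 0.25

    # Backprop multiplies one such factor per layer; ten layers at the
    # maximum value already shrink the gradient by a factor of ~1e-6.
    factors = sigmoid_derivative(np.zeros(10))
    print(np.prod(factors))  # 0.25 ** 10 ≈ 9.5e-07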
Vanishing Gradients
Vanishing Gradients – How to Avoid?
1. Reason  Let’s see the equation for gradient of error w.r.t w17 and gradient of error w.r.t w23. The number of items required to be
multiplied to calculate gradient of error w.r.t w17 (a weight in initial layer) is way more than number of items required to be multiplied to
calculate gradient of error w.r.t w23 (a weight in later layers). Now, the terms in these gradients that do partial derivative of activation will
be valued between 0 to 0.25 (refer point 3). Since number of terms less than 1 is more for error gradients in initial layers, hence,
vanishing gradient effect is seen more prominently in the initial layers of network. The number of terms required to compute gradient
w.r.t w1, w2 etc. will be quite high.
Resolution  The way to avoid the chances of a vanishing gradient problem is to use activations whose derivative is not limited to values less
than 1. We can use Relu activation. Relu’s derivative for positive values is 1. The issue with Relu is it’s derivative for negative values is 0 which
makes contribution of some nodes 0. This can be managed by using Leaky Relu instead.
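A small illustrative sketch (the 20-layer chain and random pre-activations are assumptions for demonstration, not the network from the slides) comparing how the multiplicative chain of activation derivatives behaves for sigmoid versus ReLU and Leaky ReLU:

    import numpy as np

    def relu_derivative(z):
        return np.where(z > 0, 1.0, 0.0)

    def leaky_relu_derivative(z, alpha=0.01):
        return np.where(z > 0, 1.0, alpha)

    layers = 20
    z = np.random.randn(layers)  # hypothetical pre-activations, one per layer

    # Sigmoid-style chain: every factor is at most 0.25, so the product vanishes.
    sigmoid_like = np.full(layers, 0.25)
    print("sigmoid chain:", np.prod(sigmoid_like))  # ~9e-13

    # ReLU chain: factors are 1 for positive z, so there is no shrinkage,
    # but a single negative pre-activation zeroes the whole product (dead unit).
    print("relu chain:   ", np.prod(relu_derivative(z)))

    # Leaky ReLU keeps a small non-zero slope for negative z,
    # so the gradient is attenuated rather than zeroed.
    print("leaky chain:  ", np.prod(leaky_relu_derivative(z)))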
Vanishing Gradients – How to Avoid?
2. Reason  The first problem that we discussed was the usage of activations whose derivatives are low. The second problem deals with
low value of initialized weights. We can understand this from simple example as shown in network on previous page. The equations for
error grad w.r.t w1 includes value of w5 as well. Hence, if value of w5 is initialized very low, it will also plays a role in making the gradient
w.r.t w1 smaller i.e vanishing gradient.
We can also say that vanishing gradient problems are more prominent in deep networks, because the number of multiplicative terms needed to
compute the gradients of the initial layers of a deep network is very high.
Resolution  As we can see from below equations, the derivative of activation function along with weights play a role in causing vanishing
gradients because both are there in equation for computation of error gradient. We need to initialize the weights properly to avoid vanishing
gradient problem. We will discuss about it further in weight initialization strategy section.
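As one hedged illustration of “initialize the weights properly” (a common Keras recipe, not necessarily the exact strategy covered in the later section; layer sizes here are arbitrary), ReLU-family layers are often paired with He initialization:

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential([
        Dense(64, activation="relu", kernel_initializer="he_normal", input_shape=(20,)),
        Dense(64, activation="relu", kernel_initializer="he_normal"),
        Dense(1, activation="sigmoid"),  # sigmoid only at the output layer
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")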
Exploding Gradients – What?
1. “Exploding” means increasing to a large extent. Exploding gradients means that the error gradients become so big that the update on the
weights is too large in every iteration. This causes the weights to swing wildly and the error to keep missing the global minimum. Hence,
convergence becomes hard to achieve.
2. Exploding gradients are caused by the use of large weights in the network.
3. Probable resolutions
1. Keep a low learning rate to compensate for the larger weights
2. Gradient clipping
3. Gradient scaling
4. Gradient scaling
1. For every batch, get the gradient vectors for all samples.
2. Find the L2 norm of the concatenated error gradient vector.
1. If the L2 norm > 1 (1 is used only as an example here)
2. Scale/normalize the gradient terms so that the L2 norm becomes 1
3. Code example  opt = SGD(lr=0.01, momentum=0.9, clipnorm=1.0) (see the fuller sketch after this list)
5. Gradient clipping
1. For every sample in a batch, if the gradient value w.r.t. any weight falls outside a range (say -0.5 <= gradient_value <= 0.5), we clip
it to the nearest boundary value. If a gradient value is 0.6, we clip it to 0.5.
2. Code example  opt = SGD(lr=0.01, momentum=0.9, clipvalue=0.5)
6. The generic practice is to use the same clipping / scaling values throughout the network.
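Expanding the one-line code examples above into a self-contained Keras sketch (the model, data shapes, and thresholds are arbitrary assumptions; lr is the argument name in older Keras versions, newer versions use learning_rate):

    import numpy as np
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import SGD

    # Toy data just to make the example runnable.
    X = np.random.randn(256, 20)
    y = (X.sum(axis=1) > 0).astype("float32")

    model = Sequential([
        Dense(32, activation="relu", input_shape=(20,)),
        Dense(1, activation="sigmoid"),
    ])

    # Gradient scaling: rescale the whole gradient vector if its L2 norm exceeds 1.0.
    opt = SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0)

    # Gradient clipping (alternative): clip each gradient element to [-0.5, 0.5].
    # opt = SGD(learning_rate=0.01, momentum=0.9, clipvalue=0.5)

    model.compile(optimizer=opt, loss="binary_crossentropy")
    model.fit(X, y, batch_size=32, epochs=2, verbose=0)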


Editor's Notes

Why is BN not applied in batch or stochastic mode? When using ReLU, you can encounter the dying ReLU problem; then use Leaky ReLU with the He initialization strategy – cover in the activation function video. (This note is repeated on all six slides.)