[course site]
Day 3 Lecture 1
Backpropagation
Elisa Sayrol
Acknowledgements
Kevin McGuinness
kevin.mcguinness@dcu.ie
Research Fellow
Insight Centre for Data Analytics
Dublin City University
…in our last lecture
Multilayer perceptrons
When each node in each layer is a linear combination of all inputs from the previous layer, the network is called a multilayer perceptron (MLP).
Weights can be organized into matrices. The forward pass computes the activations $\mathbf{a}^{(k)}$ layer by layer.
Training MLPs
With multiple layers we need to minimize the loss function $\mathcal{L}(f_{\theta}(x), y)$ with respect to all the parameters of the model $\theta = (W^{(k)}, b^{(k)})$:

$\theta^{*} = \arg\min_{\theta} \mathcal{L}(f_{\theta}(x), y)$

Gradient descent: move each parameter $\theta_j$ in small steps in the direction opposite to the sign of the derivative of the loss with respect to $\theta_j$:

$\theta_j^{(n)} = \theta_j^{(n-1)} - \alpha^{(n-1)} \cdot \nabla_{\theta_j} \mathcal{L}(y, f(x))$
Stochastic gradient descent (SGD): estimate the gradient with one sample, or better, with a minibatch
of examples.
For MLPs, gradients can be found using the chain rule of differentiation.
The calculations reveal that the gradient w.r.t. the parameters in layer k only depends on the error from the layer above and the output from the layer below. This means that the gradients for each layer can be computed iteratively, starting at the last layer and propagating the error back through the network. This is known as the backpropagation algorithm.
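As a rough sketch of how these pieces fit together in practice (our own illustration, not code from the lecture), the loop below shuffles the data each epoch and takes one SGD step per minibatch; the `forward_backward` function standing in for the forward pass plus backpropagation is hypothetical.

```python
import numpy as np

def sgd_step(params, grads, lr):
    """One SGD update: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

# Hypothetical training loop: `forward_backward` is assumed to return
# the loss and the gradient of the loss w.r.t. every parameter.
def train(params, data, labels, forward_backward,
          lr=0.1, batch_size=32, epochs=10):
    n = len(data)
    for epoch in range(epochs):
        order = np.random.permutation(n)           # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one minibatch
            loss, grads = forward_backward(params, data[idx], labels[idx])
            params = sgd_step(params, grads, lr)
    return params
```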
Backpropagation algorithm
• Computational graphs
• Examples applying the chain rule in simple graphs
• Backpropagation applied to the multilayer perceptron
• Issues on backpropagation and training
Computational graphs
[Figure: four example computational graphs built from nodes such as ·, +, matmul, relu, σ, sum, and sqrt]

$z = xy$

$\hat{y} = \sigma(\mathbf{x}^{\mathsf{T}}\mathbf{w} + b)$

$\mathbf{H} = \max(0, \mathbf{X}\mathbf{W} + \mathbf{b})$

$\hat{y} = \mathbf{x}^{\mathsf{T}}\mathbf{w}$, with weight penalty $\lambda \sum_i w_i^2$

From the Deep Learning Book
Computational graphs
Applying the Chain Rule to Computational Graphs
For scalars: if $y = g(x)$ and $z = f(g(x)) = f(y)$, then

$\dfrac{dz}{dx} = \dfrac{dz}{dy}\,\dfrac{dy}{dx}$

[Figure: inputs $x_1$, $x_2$ feed $g$, which produces $y$; $f$ maps $y$ to $z$.]

When several inputs $x_1, x_2$ feed $y$, the same rule applies to each:

$\dfrac{dz}{dx_1} = \dfrac{dz}{dy}\,\dfrac{dy}{dx_1}, \qquad \dfrac{dz}{dx_2} = \dfrac{dz}{dy}\,\dfrac{dy}{dx_2}$

For vectors:

$\dfrac{\partial z}{\partial x_i} = \sum_j \dfrac{\partial z}{\partial y_j}\,\dfrac{\partial y_j}{\partial x_i}, \qquad \nabla_{\mathbf{x}} z = \left(\dfrac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^{\mathsf{T}} \nabla_{\mathbf{y}} z$
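The chain rule is easy to sanity-check numerically. The sketch below (our own illustration, with arbitrarily chosen $g$ and $f$) compares the analytic derivative $dz/dx = (dz/dy)(dy/dx)$ against a centered finite difference.

```python
import numpy as np

# Chain rule check: z = f(g(x)) with g(x) = x**2 and f(y) = sin(y).
g  = lambda x: x**2
f  = lambda y: np.sin(y)
dg = lambda x: 2 * x        # dy/dx
df = lambda y: np.cos(y)    # dz/dy

x = 1.3
analytic = df(g(x)) * dg(x)  # dz/dx = dz/dy * dy/dx

# Numerical check with a centered finite difference.
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
print(analytic, numeric)     # the two values should agree closely
```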
Computational graphs
Numerical Examples
From Stanford Course: Convolutional Neural Networks for Visual Recognition 2017
$f(x, y, z) = (x + y)\,z$

Intermediate variables: $q = x + y$ and $f = qz$, with local derivatives

$\dfrac{\partial q}{\partial x} = 1, \quad \dfrac{\partial q}{\partial y} = 1, \quad \dfrac{\partial f}{\partial q} = z, \quad \dfrac{\partial f}{\partial z} = q$

We want to compute $\dfrac{\partial f}{\partial x}$, $\dfrac{\partial f}{\partial y}$, $\dfrac{\partial f}{\partial z}$.

Example: $x = -2$, $y = 5$, $z = -4$, so the forward pass gives $q = 3$ and $f = -12$.

Backward pass, seeded with $\dfrac{\partial f}{\partial f} = 1$:

$\dfrac{\partial f}{\partial z} = q = 3$

$\dfrac{\partial f}{\partial q} = z = -4$

$\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial q}\dfrac{\partial q}{\partial x} = -4 \cdot 1 = -4$

$\dfrac{\partial f}{\partial y} = \dfrac{\partial f}{\partial q}\dfrac{\partial q}{\partial y} = -4 \cdot 1 = -4$
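The same example written out as a forward and backward pass in plain Python (a direct transcription of the slide's numbers):

```python
# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (reverse order of the forward pass)
df_df = 1.0                 # seed: df/df = 1
df_dz = q * df_df           # product gate: df/dz = q = 3
df_dq = z * df_df           # product gate: df/dq = z = -4
df_dx = 1.0 * df_dq         # sum gate passes the gradient: df/dx = -4
df_dy = 1.0 * df_dq         # df/dy = -4
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```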
Computational graphs
Numerical Examples
$f(w, x) = \sigma(w_0 x_0 + w_1 x_1 + b)$, with $\sigma(x) = \dfrac{1}{1 + e^{-x}}$

$\dfrac{d\sigma(x)}{dx} = \dfrac{e^{-x}}{(1 + e^{-x})^2} = \left(\dfrac{1 + e^{-x} - 1}{1 + e^{-x}}\right)\left(\dfrac{1}{1 + e^{-x}}\right) = (1 - \sigma(x))\,\sigma(x)$

Example values: $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $b = -3$. Forward pass: $w_0 x_0 = -2$, $w_1 x_1 = 6$, their sum is $4$, adding $b$ gives $1$, and $\sigma(1) \approx 0.73$.

Backward pass: the local gradient at the sigmoid is $(1 - 0.73) \cdot 0.73 \approx 0.2$, which flows back to give gradients of $-0.2$ on $w_0$, $0.4$ on $x_0$, $-0.4$ on $w_1$, $-0.6$ on $x_1$, and $0.2$ on $b$.

From Stanford Course: Convolutional Neural Networks for Visual Recognition
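This sigmoid-neuron example can be checked in a few lines of NumPy; the variable names below are ours, but the numbers reproduce the slide.

```python
import numpy as np

w = np.array([2.0, -3.0])
x = np.array([-1.0, -2.0])
b = -3.0

# Forward pass
a = w @ x + b                  # 2*(-1) + (-3)*(-2) + (-3) = 1
s = 1.0 / (1.0 + np.exp(-a))   # sigma(1) ~ 0.73

# Backward pass: d sigma/da = (1 - sigma) * sigma ~ 0.27 * 0.73 ~ 0.2
ds_da = (1 - s) * s
dw = ds_da * x                 # [-0.2, -0.4]
dx = ds_da * w                 # [ 0.4, -0.6]
db = ds_da * 1.0               # 0.2
print(np.round(dw, 1), np.round(dx, 1), round(db, 1))
```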
Computational graphs
Gates. Backward Pass
Recall $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ with $\dfrac{d\sigma(x)}{dx} = (1 - \sigma(x))\,\sigma(x)$, and for $q = x + y$, $f = qz$: $\dfrac{\partial q}{\partial x} = \dfrac{\partial q}{\partial y} = 1$, $\dfrac{\partial f}{\partial q} = z$, $\dfrac{\partial f}{\partial z} = q$.

Sum: distributes the gradient to both branches.
Product: swaps the gradient values, so each input receives the upstream gradient times the other input.
Max: routes the gradient only to the higher input branch (not sensitive to the lower branch). E.g., with inputs $x = 2$, $y = 1$ and upstream gradient $0.2$, the max gate passes $0.2$ to $x$ and $0$ to $y$.
In general: the local gradient is the derivative of the gate's function.
Add branches: branches that split in the forward pass and merge in the backward pass add their gradients.
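These gate rules translate directly into local backward functions. The following sketch (our own formulation, not from the lecture) makes the three behaviors explicit:

```python
# Local gradients of three common gates, given the upstream gradient dL.
def sum_gate_backward(dL):
    # Sum: distributes dL unchanged to both inputs.
    return dL, dL

def mul_gate_backward(x, y, dL):
    # Product: each input receives dL times the *other* input.
    return y * dL, x * dL

def max_gate_backward(x, y, dL):
    # Max: routes dL only to the larger input.
    return (dL, 0.0) if x >= y else (0.0, dL)

print(sum_gate_backward(0.2))             # (0.2, 0.2)
print(mul_gate_backward(3.0, -4.0, 1.0))  # (-4.0, 3.0)
print(max_gate_backward(2.0, 1.0, 0.2))   # (0.2, 0.0)
```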
Computational graphs
Numerical Examples
From Stanford Course: Convolutional Neural Networks for Visual Recognition 2017
$f(\mathbf{x}, \mathbf{W}) = \lVert \mathbf{W}\mathbf{x} \rVert^2 = \sum_{i=1}^{n} (\mathbf{W}\mathbf{x})_i^2 = \sum_{i=1}^{n} q_i^2$, with $\mathbf{q} = \mathbf{W}\mathbf{x}$

$\dfrac{\partial f}{\partial q_i} = 2 q_i, \qquad \nabla_{\mathbf{q}} f = 2\mathbf{q}, \qquad \nabla_{\mathbf{W}} f = 2\mathbf{q}\,\mathbf{x}^{\mathsf{T}}, \qquad \nabla_{\mathbf{x}} f = 2\mathbf{W}^{\mathsf{T}}\mathbf{q}$

Example (L2): $\mathbf{W} = \begin{pmatrix} 0.1 & 0.5 \\ -0.3 & 0.8 \end{pmatrix}$, $\mathbf{x} = \begin{pmatrix} 0.2 \\ 0.4 \end{pmatrix}$, so $\mathbf{q} = \begin{pmatrix} 0.22 \\ 0.26 \end{pmatrix}$ and $f \approx 0.116$.

Backward pass: $\nabla_{\mathbf{q}} f = \begin{pmatrix} 0.44 \\ 0.52 \end{pmatrix}$, $\nabla_{\mathbf{W}} f = \begin{pmatrix} 0.088 & 0.176 \\ 0.104 & 0.208 \end{pmatrix}$, $\nabla_{\mathbf{x}} f = \begin{pmatrix} -0.112 \\ 0.636 \end{pmatrix}$
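A NumPy transcription of this matrix example, reproducing the slide's numbers (variable names are ours):

```python
import numpy as np

W = np.array([[0.1, 0.5], [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# Forward pass
q = W @ x              # [0.22, 0.26]
f = np.sum(q ** 2)     # 0.116

# Backward pass
dq = 2 * q             # grad of f w.r.t. q: [0.44, 0.52]
dW = np.outer(dq, x)   # 2 q x^T: [[0.088, 0.176], [0.104, 0.208]]
dx = W.T @ dq          # 2 W^T q: [-0.112, 0.636]
print(q, f, dW, dx)
```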
Backpropagation applied to Multilayer Perceptron
For a single neuron with its linear and non-linear part
$\mathbf{h}^{k+1} = g(\mathbf{W}^{k}\mathbf{h}^{k} + \mathbf{b}^{k}) = g(\mathbf{a}^{k+1})$

$\dfrac{\partial \mathbf{h}^{k}}{\partial \mathbf{a}^{k}} = g'(\mathbf{a}^{k}), \qquad \dfrac{\partial \mathbf{a}^{k+1}}{\partial \mathbf{h}^{k}} = \mathbf{W}^{k}$
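In code, one layer's forward and backward steps look like the sketch below, assuming $g = \tanh$ as the nonlinearity (any differentiable activation would do; the function names are ours):

```python
import numpy as np

def layer_forward(W, b, h):
    a = W @ h + b          # linear part a_{k+1}
    return np.tanh(a), a   # h_{k+1} = g(a_{k+1}); keep a for the backward pass

def layer_backward(W, a, dh_next):
    da = (1 - np.tanh(a) ** 2) * dh_next  # dh/da = g'(a) for g = tanh
    dh_prev = W.T @ da                    # da_{k+1}/dh_k = W_k
    return dh_prev, da
```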
Forward Pass
[Figure: MLP with input x, pre-activations a2, a3, a4, hidden activations h2, h3 and output h4, connected by weights W1, W2, W3 and followed by a Loss node. The output h4 gives the probability of each class given the input (softmax). Figure credit: Kevin McGuinness]
Forward Pass
[Figure: same network, now annotated with the loss function, e.g. negative log-likelihood (good for classification), plus a regularization term (L2 norm), also known as weight decay. Figure credit: Kevin McGuinness]
Forward Pass
[Figure: same network. Training minimizes the loss (plus the regularization term) w.r.t. the parameters over the whole training set. Figure credit: Kevin McGuinness]
Backward Pass
1. Find the error in the top layer.
[Figure: same network, with the error signal entering at the Loss node. Figure credit: Kevin McGuinness]
Backward Pass
1. Find the error in the top layer. 2. Compute the weight updates.
To simplify, we don't consider the bias.
[Figure credit: Kevin McGuinness]
Backward Pass
1. Find the error in the top layer. 2. Compute the weight updates. 3. Backpropagate the error to the layer below.
To simplify, we don't consider the bias.
[Figure credit: Kevin McGuinness]
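Putting the three steps together, a minimal backward pass over the whole MLP might look like the sketch below (our own formulation; biases are omitted, as on the slides, and the top-layer error `delta` is assumed given, e.g. softmax output minus target for a negative log-likelihood loss):

```python
import numpy as np

# Conventions (assumed): As[k] = Ws[k] @ Hs[k] and Hs[k+1] = g(As[k]),
# with Hs[0] = x. `g_prime` is the derivative of the activation g.
def backprop(Ws, As, Hs, delta, g_prime):
    grads = [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        grads[k] = np.outer(delta, Hs[k])   # 2. weight update: delta h^T
        if k > 0:                           # 3. backpropagate error below
            delta = (Ws[k].T @ delta) * g_prime(As[k - 1])
    return grads
```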
Issues on Backpropagation and Training
Gradient descent: move each parameter $\theta_j$ in small steps in the direction opposite to the sign of the derivative of the loss with respect to $\theta_j$:

$\theta^{(n)} = \theta^{(n-1)} - \alpha^{(n-1)} \cdot \nabla_{\theta}\mathcal{L}(y, f(x)) - \lambda\,\theta^{(n-1)}$
Weight Decay: Penalizes large weights, distributes values among all the parameters
Stochastic gradient descent (SGD): estimate the gradient with one sample, or better, with a
minibatch of examples.
Momentum: the parameter update direction averages the current gradient estimate with previous ones.
Several strategies have been proposed for updating the weights: these are known as optimizers.
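A minimal sketch combining these ingredients in a single update (hyperparameter values are illustrative defaults, not from the lecture):

```python
# SGD update with weight decay and momentum for one parameter array.
# `grad` is the minibatch estimate of the loss gradient.
def sgd_momentum_step(theta, grad, velocity,
                      lr=0.01, momentum=0.9, weight_decay=1e-4):
    grad = grad + weight_decay * theta           # L2 penalty pulls weights to 0
    velocity = momentum * velocity - lr * grad   # average with past directions
    return theta + velocity, velocity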
Weight initialization
Need to pick a starting point for gradient
descent: an initial set of weights
Zero is a very bad idea!
Zero is a critical point
Error signal will not propagate
Gradients will be zero: no progress
A constant value is also a bad idea:
we need to break symmetry
Use small random values:
E.g. zero mean Gaussian noise with constant
variance
Ideally we want the inputs to the activation functions (e.g. sigmoid, tanh, ReLU) to be mostly in the linear region, so that larger gradients can propagate and training converges faster.
[Figure: tanh activation. Near 0 the gradient is large (good); in the saturated tails the gradient is small (bad).]
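One common concrete recipe consistent with these guidelines (the $1/\sqrt{\text{fan\_in}}$ scaling is one standard choice, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    # Small zero-mean Gaussian weights break symmetry; scaling the standard
    # deviation by 1/sqrt(fan_in) keeps pre-activations near the linear
    # region of tanh/sigmoid, so gradients stay large early in training.
    W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)  # zero biases are fine once weights are random
    return W, b
```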
In the backward pass you might be in the flat part of the sigmoid (or of any other saturating activation function such as tanh), so the derivative tends to zero and the training loss will not go down: the problem of "vanishing gradients".
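The effect is easy to see numerically: the sigmoid's derivative peaks at 0.25 and collapses in the saturated regions, so a gradient that passes through many saturated sigmoids shrinks multiplicatively toward zero.

```python
import numpy as np

sigma = lambda x: 1 / (1 + np.exp(-x))

# Sigmoid derivative (1 - sigma) * sigma at increasingly saturated inputs.
for a in [0.0, 2.0, 5.0, 10.0]:
    s = sigma(a)
    print(a, (1 - s) * s)   # 0.25, ~0.105, ~0.0066, ~0.000045
```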
Note on hyperparameters
So far we have lots of hyperparameters to choose:
1. Learning rate (α)
2. Regularization constant (λ)
3. Number of epochs
4. Number of hidden layers
5. Nodes in each hidden layer
6. Weight initialization strategy
7. Loss function
8. Activation functions
9. …