DEEP FEEDFORWARD NETWORKS
AND REGULARIZATION
LICHENG ZHANG
OVERVIEW
• Regularization
  • L2/L1/elastic
  • Dropout
  • Batch normalization
  • Data augmentation
  • Early stopping
• Neural network
  • Perceptron
  • Activation functions
  • Back-propagation
FEEDFORWARD NETWORK
“3-layer neural net” or “2-hidden-layers neural net”
FEEDFORWARD NETWORK (ANIMATION)
NEURON (UNIT)
PERCEPTRON FORWARD PASS
[Diagram: inputs (2, 3, -1) plus a constant bias input of 1, weights (0.1, 0.5, 2.5, 3.0), feeding a weighted sum ∑ followed by an activation function f, which produces the output.]
Output = f( (2*0.1) + (3*0.5) + (-1*2.5) + (1*3.0) )
PERCEPTRON FORWARD PASS
[Same diagram: inputs (2, 3, -1), bias input 1, weights (0.1, 0.5, 2.5, 3.0), sum ∑, activation f, output.]
Output = f(2.2) = 𝜎(2.2) = 1 / (1 + e^(-2.2)) ≈ 0.90
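A minimal NumPy sketch of this forward pass, using the slide's inputs, weights, and bias, and assuming the sigmoid as the activation f:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs and weights from the slide; the bias weight 3.0 multiplies a constant input of 1.
x = np.array([2.0, 3.0, -1.0])
w = np.array([0.1, 0.5, 2.5])
b = 3.0

z = np.dot(x, w) + b   # weighted sum: 2*0.1 + 3*0.5 + (-1)*2.5 + 1*3.0 = 2.2
out = sigmoid(z)       # f(2.2) ≈ 0.90
print(z, out)
```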
MULTI-OUTPUT PERCEPTRON
[Diagram: input layer (x0, x1, x2) fully connected to output layer (o0, o1).]
MULTI-LAYER PERCEPTRON (MLP)
[Diagram: input layer (x0, x1, x2), one hidden layer (h0, h1, h2, h3), output layer (o0, o1), fully connected.]
DEEP NEURAL NETWORK
[Diagram: input layer (x0, x1, x2), multiple hidden layers (h0 … h3 each), output layer (o0, o1), fully connected.]
http://www.asimovinstitute.org/neural-network-zoo/
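A sketch of a forward pass through a small fully connected network like the one pictured (the 3 → 4 → 4 → 2 layer sizes are read off the diagram; ReLU hidden units, a linear output layer, and random weights are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

# Layer sizes: 3 inputs, two hidden layers of 4 units, 2 outputs (as in the diagram).
sizes = [3, 4, 4, 2]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                  # hidden layers use ReLU
    return h @ weights[-1] + biases[-1]      # linear output layer

print(forward(np.array([1.0, -2.0, 0.5])))
```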
UNIVERSAL APPROXIMATION THEOREM
“A feedforward network with a linear output layer and at least one hidden layer with any ‘squashing’
activation function (such as the logistic sigmoid) can approximate any Borel measurable function from one
finite-dimensional space to another with any desired nonzero amount of error, provided that the network
is given enough hidden units.”
----- Hornik et al. (1989); Cybenko (1989)
COMPUTATIONAL GRAPHS
Z = x * y
y = 𝜎(wx + b)
H = relu(WX + b) = max(0, WX + b)
y = wx,  u(3) = λ ∑_j w_j²
LOSS FUNCTION
• A loss function (cost function) tells us how good our current model is, or
how far away our model is from the real answer.
L(w) = (1/N) ∑_{i=1}^{N} loss( f(x^(i); w), y^(i) )     (N = # examples; f(x^(i); w) = predicted, y^(i) = actual)
• Hinge loss
• Softmax loss
• Mean Squared Error (L2 loss) → Regression: L(w) = (1/N) ∑_{i=1}^{N} ( f(x^(i); w) - y^(i) )²
• Cross entropy loss → Classification: L(w) = -(1/N) ∑_{i=1}^{N} [ y^(i) log f(x^(i); w) + (1 - y^(i)) log(1 - f(x^(i); w)) ]
• …
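The two losses written out in NumPy (the small epsilon clip inside the log is an added numerical-stability assumption):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # Mean Squared Error (L2 loss) for regression
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy_loss(y_pred, y_true, eps=1e-12):
    # Binary cross entropy for classification; y_pred are probabilities in (0, 1)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(mse_loss(y_pred, y_true), cross_entropy_loss(y_pred, y_true))
```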
GRADIENT DESCENT
• Designing and training a neural network is not much different from training
any other machine learning model with gradient descent: use calculus to get the
derivatives of the loss function with respect to each parameter.
w_j = w_j - α ∂L(w)/∂w_j        (α is the learning rate)
https://developers.google.com/machine-learning/crash-course/fitter/graph
GRADIENT DESCENT
• In practice, instead of using all data points, we do
• Stochastic gradient descent (using 1 sample at each iteration)
• Mini-Batch gradient descent (using n samples at each iteration)
Problems with SGD:
• If the loss changes quickly in one direction and slowly in another → jitter along the steep direction
• If the loss function has a local minimum or saddle point → zero gradient, SGD gets stuck
Solutions:
• SGD + momentum, etc. (sketched below)
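A sketch of mini-batch SGD with momentum on a one-parameter least-squares problem (the toy data, batch size, learning rate α, and momentum coefficient are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3*x + noise; we fit a single weight w with an MSE loss.
X = rng.normal(size=200)
y = 3.0 * X + 0.1 * rng.normal(size=200)

w, velocity = 0.0, 0.0
alpha, beta, batch_size = 0.1, 0.9, 32   # learning rate, momentum, mini-batch size

for step in range(100):
    idx = rng.choice(len(X), size=batch_size, replace=False)   # sample a mini-batch
    xb, yb = X[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)      # dL/dw on the mini-batch
    velocity = beta * velocity - alpha * grad   # accumulate momentum
    w += velocity                               # parameter update
print(w)  # should approach 3.0
```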
BACK-PROPAGATION
• It allows the information from the loss to flow backward through
the network in order to compute the gradient.
[Diagram: x0 →(W1)→ h0 →(W2)→ O0 → L(w)]
Chain rule: ∂L(w)/∂w2 = ∂L(w)/∂O0 * ∂O0/∂w2
BACK-PROPAGATION
• It allows the information from the loss to flow backward through
the network in order to compute the gradient.
[Diagram: x0 →(W1)→ h0 →(W2)→ O0 → L(w)]
Chain rule: ∂L(w)/∂w1 = ∂L(w)/∂O0 * ∂O0/∂h0 * ∂h0/∂w1
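A sketch of these two chain-rule expressions on the tiny x0 → h0 → O0 network, assuming sigmoid units and a squared-error loss (the activation, loss, input, target, and weight values are illustration choices, not specified on the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x0, y_target = 1.5, 0.0
w1, w2 = 0.4, -0.6

# Forward pass: x0 -> h0 -> O0 -> L(w)
h0 = sigmoid(w1 * x0)
O0 = sigmoid(w2 * h0)
L = 0.5 * (O0 - y_target) ** 2

# Backward pass (chain rule)
dL_dO0 = O0 - y_target
dO0_dw2 = O0 * (1 - O0) * h0
dO0_dh0 = O0 * (1 - O0) * w2
dh0_dw1 = h0 * (1 - h0) * x0

dL_dw2 = dL_dO0 * dO0_dw2              # dL/dw2 = dL/dO0 * dO0/dw2
dL_dw1 = dL_dO0 * dO0_dh0 * dh0_dw1    # dL/dw1 = dL/dO0 * dO0/dh0 * dh0/dw1
print(dL_dw1, dL_dw2)
```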
BACK-PROPAGATION: SIMPLE EXAMPLE
f(x, y, z) = (x + y) z        e.g. x = -2, y = 5, z = -4
q = x + y,   ∂q/∂x = 1,  ∂q/∂y = 1
f = q z,     ∂f/∂q = z,  ∂f/∂z = q
[Graph: x = -2 and y = 5 feed a "+" node giving q = 3; q and z = -4 feed a "*" node giving f = -12.]
Want: ∂f/∂x, ∂f/∂y, ∂f/∂z
Backward pass: ∂f/∂f = 1,   ∂f/∂z = q = 3,   ∂f/∂q = z = -4
Chain rule: ∂f/∂y = (∂f/∂q)(∂q/∂y) = -4
Chain rule: ∂f/∂x = (∂f/∂q)(∂q/∂x) = -4
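The same worked example transcribed into code so the backward values can be checked:

```python
# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y        # q = 3
f = q * z        # f = -12

# Backward pass (chain rule)
df_df = 1.0
df_dz = q                 # = 3
df_dq = z                 # = -4
df_dx = df_dq * 1.0       # dq/dx = 1  ->  -4
df_dy = df_dq * 1.0       # dq/dy = 1  ->  -4
print(f, df_dx, df_dy, df_dz)
```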
ACTIVATION FUNCTIONS
The importance of activation functions is that they introduce non-linearity into the network.
ACTIVATION FUNCTIONS
For output layer:
• Sigmoid
• Softmax
• Tanh
For hidden layer:
• ReLU
• LeakyReLU
• ELU
ACTIVATION FUNCTIONS
3 problems:
1. Saturated neurons “kill” the gradients
[Diagram: a sigmoid gate receives input x and passes the upstream gradient ∂L/∂𝜎 back through its local gradient ∂𝜎/∂x.]
𝜎(x) = 1 / (1 + e^(-x))
∂L/∂x = (∂𝜎/∂x) * (∂L/∂𝜎)
• What happens when x = -10?
• What happens when x = 0?
• What happens when x = 10?
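A quick numeric check of the three questions above, assuming an upstream gradient ∂L/∂𝜎 of 1.0: near x = ±10 the local gradient 𝜎'(x) is almost zero, so almost nothing flows back:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

upstream = 1.0  # dL/dσ flowing back into the gate (assumed 1 for illustration)
for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    local = s * (1 - s)                 # dσ/dx: ~0 at x = ±10, 0.25 at x = 0
    print(x, local, upstream * local)   # dL/dx = dσ/dx * dL/dσ
```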
ACTIVATION FUNCTIONS
3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
Consider what happens when the input to a neuron is always positive…
f( ∑_i w_i x_i + b )
What can we say about the gradients on w?
∂L/∂w_i = (∂L/∂f) * (∂f/∂w_i) = (∂L/∂f) * x_i
Always all positive or all negative → inefficient!
(this is also why you want zero-mean data!)
ACTIVATION FUNCTIONS
3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit computationally expensive
ACTIVATION FUNCTIONS: ReLU
• Not zero-centered output
• An annoyance when x < 0 (the gradient is exactly zero there)
People like to initialize ReLU
neurons with slightly positive biases
(e.g. 0.01)
ACTIVATION FUNCTIONS: ELU
Clevert et al., 2015
MAXOUT “NEURON”
IN PRACTICE (GOOD RULE OF THUMB)
• For hidden layers:
• Use ReLU. Be careful with your learning rates
• Try out Leaky ReLU / Maxout / ELU
• Try out tanh but don’t expect too much
• Don’t use Sigmoid
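The activations discussed above as plain NumPy functions (the 0.01 Leaky ReLU slope and α = 1.0 for ELU are common defaults assumed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, f(x))
```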
REGULARIZATION
• Regularization is “any modification we make to the
learning algorithm that is intended to reduce the
generalization error, but not its training error”.
REGULARIZATION
L(W) = (1/N) ∑_{i=1}^{N} L_i( f(x^(i); W), y^(i) )
Data loss: model predictions
should match training data
REGULARIZATION
L(W) = (1/N) ∑_{i=1}^{N} L_i( f(x^(i); W), y^(i) ) + λR(W)
Data loss: model predictions should match training data
Regularization: the model should be “simple”, so it works on test data
Occam’s Razor: “Among competing hypotheses, the simplest is the best”
----- William of Ockham, 1285-1347
REGULARIZATION
• In common use:
• L2 regularization
• L1 regularization
• Elastic net (L1 + L2)
• Dropout
• Batch normalization
• Data Augmentation
• Early Stopping
R(w) = ∑_j w_j²               (L2)
R(w) = ∑_j |w_j|              (L1)
R(w) = ∑_j (β w_j² + |w_j|)   (Elastic net)
Regularization is a technique designed to counter neural
network over-fitting.
L(W) = (1/N) ∑_{i=1}^{N} L_i( f(x^(i); W), y^(i) ) + λR(W)
L2 REGULARIZATION
• penalizes the squared value of the weights (which also explains the “2” in the name).
• tends to drive all the weights to smaller values.
L(W) = (1/N) ∑_{i=1}^{N} L_i( f(x^(i); W), y^(i) ) + λ ∑_j w_j²
[Figure: histograms of the weight distribution with no regularization vs. with L2 regularization.]
L1 REGULARIZATION
• penalizes the absolute value of the weights (a V-shaped function)
• tends to drive some weights to exactly zero (introducing sparsity in the
model), while allowing some weights to be big
L(W) = (1/N) ∑_{i=1}^{N} L_i( f(x^(i); W), y^(i) ) + λ ∑_j |w_j|
[Figure: histograms of the weight distribution with no regularization vs. with L1 regularization.]
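A small NumPy sketch contrasting the two penalties and the gradient terms they add to the update (the λ value and the weight vector are arbitrary, for illustration only):

```python
import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])
lam = 0.01

l2_penalty = lam * np.sum(w ** 2)       # λ ∑ w_j²
l1_penalty = lam * np.sum(np.abs(w))    # λ ∑ |w_j|

# Gradients added to the data-loss gradient during training:
l2_grad = lam * 2 * w                   # shrinks every weight toward zero
l1_grad = lam * np.sign(w)              # constant push toward zero -> sparsity
print(l2_penalty, l1_penalty, l2_grad, l1_grad)
```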
DROPOUT
In each forward pass, randomly set
some neurons to zero. Probability of
dropping is a hyperparameter; 0.5 is
common.
You can imagine that if neurons are
randomly dropped out of the network
during training, other neurons will
have to step in and handle the
representation required to make
predictions for the missing neurons.
This is believed to result in multiple
independent internal representations
being learned by the network.
DROPOUT
Another interpretation:
• Dropout is training a large ensemble of models (that
share parameters)
• Each binary mask is one model
A fully connected layer with 4096 units has
2^4096 ≈ 10^1233 possible masks!
Only ~10^82 atoms in the universe…
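A minimal sketch of inverted dropout for one layer's activations (the train-time 1/(1−p) rescaling is a standard trick, not spelled out on the slide, that keeps expected activations unchanged at test time; drop probability 0.5 as above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Inverted dropout: zero each unit with probability p_drop at training time."""
    if not train:
        return h                      # test time: identity, no rescaling needed
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

h = np.ones((2, 8))                   # pretend activations of a hidden layer
print(dropout(h, train=True))
print(dropout(h, train=False))
```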
DENSE-SPARSE-DENSE TRAINING
https://arxiv.org/pdf/1607.04381v1.pdf
BATCH NORMALIZATION
“you want unit Gaussian activations? Just make them so.”
BATCH NORMALIZATION
Usually inserted after fully
connected or convolutional layers,
and before nonlinearity.
• Improves gradient flow through the network
• Allows higher learning rates
• Reduces the strong dependence on initialization
• Acts as a form of regularization in a funny way,
and slightly reduces the need for dropout, maybe
Note: at test time BatchNorm layer
functions differently:
The mean/std are not computed
based on the batch. Instead, a
single fixed empirical mean of
activations during training is used.
(e.g. can be estimated during
training with running averages)
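A minimal sketch of a batch-norm layer's forward pass showing the train/test difference described above (the learnable γ/β, momentum, and ε values are standard defaults assumed here; the backward pass is omitted):

```python
import numpy as np

class BatchNorm1d:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)   # learnable scale/shift
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, train=True):
        if train:
            mu, var = x.mean(axis=0), x.var(axis=0)           # batch statistics
            # keep running averages for use at test time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var     # fixed statistics at test time
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1d(4)
x = np.random.default_rng(0).normal(size=(16, 4))
print(bn.forward(x, train=True).mean(axis=0))   # ≈ 0 per feature during training
```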
DATA AUGMENTATION
The best way to make a machine learning model generalize better is to train it on more data.
DATA AUGMENTATION
Horizontal flips
Random crops and scales
Color Jitter
• Simple: Randomize contrast and brightness
Get creative for your problem!
• Translation
• Rotation
• Stretching
• Shearing
• Lens distortions
• (go crazy)
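A sketch of a few of these augmentations applied to an image stored as an H×W×3 float array in [0, 1] (the crop size and jitter ranges are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img, p=0.5):
    return img[:, ::-1, :] if rng.random() < p else img

def random_crop(img, crop_h, crop_w):
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

def color_jitter(img, brightness=0.2, contrast=0.2):
    img = img * (1 + rng.uniform(-contrast, contrast))   # random contrast
    img = img + rng.uniform(-brightness, brightness)     # random brightness
    return np.clip(img, 0.0, 1.0)

img = rng.random((32, 32, 3))                            # fake 32x32 RGB image
augmented = color_jitter(random_crop(horizontal_flip(img), 28, 28))
print(augmented.shape)
```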
EARLY STOPPING
It is probably the most commonly used form of
regularization in deep learning to prevent overfitting:
• Effective
• Simple
Think of this as a hyperparameter selection
algorithm. The number of training steps is another
hyperparameter.
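A sketch of the usual early-stopping loop: train while the validation loss keeps improving, and stop after `patience` epochs without improvement (the `train_one_epoch` and `validate` callbacks are hypothetical placeholders, not from the slides):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validate, max_epochs=100, patience=5):
    """train_one_epoch(model) and validate(model) -> val_loss are assumed callbacks."""
    best_loss, best_model, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, best_model = val_loss, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break              # stop: validation loss stopped improving
    return best_model, best_loss
```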
REFERENCE
• Deep Learning book ------ http://www.deeplearningbook.org/
• Stanford CNN course ----- http://cs231n.stanford.edu/index.html
• Regularization in deep learning ----- https://chatbotslife.com/regularization-in-deep-learning-f649a45d6e0
• So much more to learn, go explore!
• THANK YOU