Deep learning tools and techniques can be used to build convolutional neural networks (CNNs). Neural networks learn from observational training data by automatically inferring rules to solve problems. Neural networks use multiple hidden layers of artificial neurons to process input data and produce output. Techniques like backpropagation, cross-entropy cost functions, softmax activations, and regularization help neural networks learn more effectively and avoid issues like overfitting.
2. Outline
Part I: Introduction to Neural Networks
Part II: Improving the Way Neural Networks Learn
Part III: Deep Learning
Part IV: Tools and Technology to Build CNNs
4. Basic Approach
• Break a big problem into many small tasks that a computer can easily perform
• In a neural network we don't tell the computer how to solve our problem
• Instead, it learns from observational/training data, figuring out its own solution (automatically inferring rules) to the problem
5. Handwritten Digit Recognition (Prototype Problem)
[Figure: a 16 × 16 = 256-pixel image is fed to the network as inputs x1, x2, …, x256 (ink → 1, no ink → 0). The output layer has 10 neurons y1, y2, …, y10; each dimension represents the confidence that the image is a particular digit. For example, outputs of 0.1 for "is 1", 0.7 for "is 2", and 0.2 for "is 0" mean the image is classified as "2".]
7. Artificial Neuron -- Perceptron
• x1, x2, x3 are binary inputs
• The neuron produces a binary output
• Introduce a weight on each input
• A perceptron makes its decision by weighing up different factors/evidence: the output is 1 if w · x + b > 0 and 0 otherwise (a minimal sketch follows below)
• Here b = −threshold
• b is called the bias
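A minimal sketch of this decision rule in Python; the weights, bias, and inputs below are illustrative, not taken from the slides:

```python
import numpy as np

def perceptron(x, w, b):
    """Binary perceptron: output 1 if the weighted evidence exceeds the threshold (-b)."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative example: three binary factors weighed against a bias of -2.
x = np.array([1, 0, 1])        # binary inputs x1, x2, x3
w = np.array([2.0, 1.0, 0.5])  # weights on each input
b = -2.0                       # bias = -threshold
print(perceptron(x, w, b))     # -> 1, since 2*1 + 1*0 + 0.5*1 - 2 = 0.5 > 0
```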
8. Learning Algorithm
• Automatically tunes the weights and biases of a network
• Desired property: a small change in some weight (or bias) should cause only a small corresponding change in the output
With perceptrons, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to flip completely, say from 0 to 1. This may make the network classify one digit correctly but go completely wrong on the other digits.
9. Artificial Neuron -- Sigmoid
• Instead of inputs that are only 0 or 1, the inputs can take any value between 0 and 1
• The output can also be any value between 0 and 1
• The sigmoid neuron is a smoothed-out perceptron: its output is σ(z) = 1 / (1 + e^(−z)) with z = w · x + b (a minimal sketch follows below)
• A small change in a weight or bias now makes only a small change in the output – the desired property is achieved
• The shape of the function is what matters here, so later we can think about other activation functions
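A small sketch of the sigmoid neuron; the input, weight, and bias values are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    """Smooth squashing function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    """Sigmoid neuron: a smoothed-out perceptron, output = sigma(w.x + b)."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.2, 0.9, 0.4])
w = np.array([2.0, 1.0, 0.5])
b = -2.0
print(sigmoid_neuron(x, w, b))        # ~0.38
print(sigmoid_neuron(x, w + 0.01, b)) # small weight change -> small output change
```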
10. Some Intuitive Explanation of NN
• Say the input to the neural network is a handwritten digit image
• Decide whether or not the digit is a 0 by weighing up evidence from the hidden layer of neurons
• The first neuron in the hidden layer detects whether one particular part of the shape is present (e.g., one arc of the loop of a 0)
• The second, third, and fourth neurons in the hidden layer each detect other parts of the shape
• If all these pieces of evidence are present, the network classifies the image as a 0
This is just a heuristic way to think about good neural network architecture.
12. Learning with Gradient Descent
Assume there are only two parameters v1 and v2 in the network, θ = (v1, v2). Picture the cost function C as a surface over the (v1, v2) plane; the colors on the slide represent the value of C, and θ* marks the minimum.
• Randomly pick a starting point θ⁰
• Compute the gradient at θ⁰: ∇C(θ⁰)
• Amount of change in the parameters: −η∇C(θ⁰)
According to calculus, a small change of C results from small changes in the directions of v1 and v2: ΔC ≈ ∇C(θ) · Δθ. Choosing Δθ = −η∇C(θ) makes ΔC negative, so these parameter learning steps decrease the cost.
13. Learning with Gradient Descent
Assume again that there are only two parameters, θ = (v1, v2). According to calculus, a small change of C is due to small changes in the directions of v1 and v2.
Parameter learning steps:
• Randomly pick a starting point θ⁰; compute the gradient ∇C(θ⁰) and change the parameters by −η∇C(θ⁰)
• At θ¹, compute ∇C(θ¹) and change the parameters by −η∇C(θ¹)
• At θ², compute ∇C(θ²) and change the parameters by −η∇C(θ²), and so on
• Eventually, we would reach a minimum
Final formula for parameter optimization: θ^(t+1) = θ^t − η∇C(θ^t). (A minimal code sketch follows below.)
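A minimal gradient-descent sketch in Python, assuming a toy two-parameter cost C(v1, v2) = v1² + v2²; the cost function and learning rate here are illustrative, not from the slides:

```python
import numpy as np

def grad_C(theta):
    """Gradient of the toy cost C(v1, v2) = v1^2 + v2^2."""
    return 2 * theta

eta = 0.1                       # learning rate
theta = np.random.randn(2)      # randomly pick a starting point theta^0
for step in range(100):
    theta = theta - eta * grad_C(theta)  # theta^(t+1) = theta^t - eta * grad C(theta^t)
print(theta)                    # close to the minimum at (0, 0)
```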
14. List of Further Improvements
• As mentioned earlier, there are other types of cost functions
• Researchers have come up with different forms of gradient descent, trying to introduce concepts from the physical world (e.g., momentum)
• Many advances are happening on the learning rate itself
• Different techniques have been devised to initialize the starting values of the parameters for gradient descent
• A lot of improvement has happened on the neuron activation function itself
15. Stochastic Gradient Descent
The cost is an average over all training inputs, C = (1/n) Σ_x C_x, so the exact gradient ∇C = (1/n) Σ_x ∇C_x has high time complexity when the sample size is huge.
Workaround: estimate the gradient ∇C by computing ∇C_x only for a small sample of randomly chosen training inputs – a mini-batch.
16. Mini-batch
Example with a mini-batch size of 2 (ŷ denotes the target output):
• Mini-batch 1 contains x1 and x31: the network outputs y1 and y31 are compared with the targets ŷ1 and ŷ31, giving costs C1 and C31, so for this batch C = C1 + C31 + ⋯
• Mini-batch 2 contains x2 and x16, giving costs C2 and C16, so for this batch C = C2 + C16 + ⋯
Procedure (a code sketch follows below):
• Randomly initialize θ⁰
• Pick the 1st mini-batch and update: θ¹ ← θ⁰ − η∇C(θ⁰)
• Pick the 2nd mini-batch and update: θ² ← θ¹ − η∇C(θ¹)
• … until all mini-batches have been picked – that is one epoch
• Repeat the above process
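A minimal sketch of mini-batch SGD over one epoch, assuming a generic grad_C(theta, X_batch, Y_batch) function and NumPy data arrays; all names here are illustrative:

```python
import numpy as np

def run_epoch(theta, X, Y, grad_C, eta=0.1, batch_size=2):
    """One epoch of mini-batch SGD: shuffle, split into mini-batches, update per batch."""
    idx = np.random.permutation(len(X))        # shuffle the data
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # pick the next mini-batch
        theta = theta - eta * grad_C(theta, X[batch], Y[batch])
    return theta

# Repeat run_epoch(...) for several epochs until the cost stops improving.
```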
17. Backpropagation
Goal: to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the whole network.
Required assumption on the cost function:
the cost function should be an average of the individual costs for each input, C = (1/n) Σ_x C_x.
Benefit of this assumption:
the total gradient is calculated from all inputs, but the network can be trained one input at a time.
Notations:
• w^l_{jk}: weight for the connection from the kth neuron in the (l−1)th layer to the jth neuron in the lth layer
• b^l_j: bias of the jth neuron in the lth layer
• a^l_j: activation of the jth neuron in the lth layer
• z^l_j = Σ_k w^l_{jk} a^(l−1)_k + b^l_j: the weighted input to the activation function for neuron j in layer l
18. Backpropagation cont..
Fundamental equations behind backpropagation:
• Equation for the error in the output layer: δ^L = ∇_a C ⊙ σ′(z^L), where ⊙ denotes the element-wise (Hadamard) product
• Equation for the error δ^l in terms of the error in the next layer, δ^(l+1): δ^l = ((w^(l+1))ᵀ δ^(l+1)) ⊙ σ′(z^l) – this moves the error backward through the network
• Equation for the rate of change of the cost with respect to any bias in the network: ∂C/∂b^l_j = δ^l_j
• Equation for the rate of change of the cost with respect to any weight in the network: ∂C/∂w^l_{jk} = a^(l−1)_k δ^l_j
(A minimal code sketch of these equations follows below.)
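A minimal NumPy sketch of these four equations for a single training example and one hidden layer; the quadratic cost and the variable names are illustrative assumptions, not the slides' own code:

```python
import numpy as np

def sigmoid(z):       return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(z): return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of a quadratic cost for a 1-hidden-layer sigmoid network."""
    # Forward pass: store the weighted inputs z and activations a.
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    # Output error: delta^L = grad_a C (*) sigma'(z^L); for quadratic cost grad_a C = a - y.
    delta2 = (a2 - y) * sigmoid_prime(z2)
    # Backward error: delta^l = (W^(l+1).T delta^(l+1)) (*) sigma'(z^l).
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
    # dC/db^l = delta^l,  dC/dw^l_{jk} = a^(l-1)_k delta^l_j (an outer product).
    return np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2
```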
25. Backpropagation cont..
Exercise
1. Write Python code to implement the backpropagation algorithm as mentioned in slide #24. You can download and use any suitable dataset for parameter training.
2. Modify your code to remove the loop mentioned in step #2. Can we replace this loop with a single matrix operation?
27. Learning Slow Down Problem
Toy example: train a single-neuron network to produce output 0 from input 1, where the cost function is quadratic and the activation function is sigmoid.
Using the chain rule and differentiating with respect to the weight and bias (with x = 1, y = 0): ∂C/∂w = (a − y)σ′(z)x = aσ′(z) and ∂C/∂b = (a − y)σ′(z) = aσ′(z).
We can see from the sigmoid graph that when the neuron's output is close to 1 or 0, the curve gets very flat, so σ′(z) gets very small, and therefore ∂C/∂w and ∂C/∂b get very small.
The quadratic cost function has a learning-slowness issue when the network output approaches 0 or 1.
28. Cross-Entropy Cost Function
Cross-entropy functional form for this toy example: C = −[y ln a + (1 − y) ln(1 − a)]
Now we can show that ∂C/∂w_j = (1/n) Σ_x x_j (σ(z) − y) and ∂C/∂b = (1/n) Σ_x (σ(z) − y): neither has a σ′(z) term (see the derivation sketch below).
• The larger the error, the faster the neuron will learn
• No more slowdown in learning when σ(z) is close to 0 or 1
• Cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons
Generalized cost function (deriving its gradient is an exercise for you): C = −(1/n) Σ_x Σ_j [y_j ln a^L_j + (1 − y_j) ln(1 − a^L_j)]
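A short derivation sketch (standard calculus for a single training input, using σ′(z) = σ(z)(1 − σ(z))) showing why the σ′(z) term cancels:

```latex
\frac{\partial C}{\partial w_j}
  = -\left(\frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)}\right)\sigma'(z)\,x_j
  = \frac{\sigma(z)-y}{\sigma(z)\bigl(1-\sigma(z)\bigr)}\;\sigma'(z)\,x_j
  = \bigl(\sigma(z)-y\bigr)\,x_j
```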
29. Exercise
Show that the slowness problem can be resolved if we use linear neurons in the output layer, even if we use a quadratic cost function and sigmoid activations in all internal neurons.
30. Softmax
• Softmax layer as the output layer
The softmax layer takes the weighted inputs z1, z2, z3 from the last layer, exponentiates each one, and normalizes by the sum of the exponentials:
y_j = e^(z_j) / Σ_{k=1..3} e^(z_k)
Example: for (z1, z2, z3) = (3, 1, −3), the exponentials are approximately (20, 2.7, 0.05), so the outputs are y ≈ (0.88, 0.12, ≈0).
The outputs can be read as probabilities: 1 > y_i > 0 and Σ_i y_i = 1. (A minimal code sketch follows below.)
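A minimal softmax sketch in NumPy; the max-subtraction for numerical stability is an added convention, not something from the slides:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the outputs form a probability distribution."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

y = softmax(np.array([3.0, 1.0, -3.0]))
print(y)          # ~[0.88, 0.12, 0.002], matching the example above
print(y.sum())    # 1.0
```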
31. Softmax with Log-likelihood Cost Function
Solves the learning slowdown.
Log-likelihood cost: C = −ln a^L_y, where a^L_y is the softmax output for the correct class y.
• When the output probability → 1 (the network is doing a good job), the cost is small
• When the output probability → 0 (the network isn't doing a good job), the cost is large
The key to the learning slowdown is the behaviour of the quantities ∂C/∂w^L_{jk} and ∂C/∂b^L_j. Here
∂C/∂b^L_j = a^L_j − y_j and ∂C/∂w^L_{jk} = a^(L−1)_k (a^L_j − y_j),
which is exactly the same form as cross-entropy with a sigmoid output layer.
A softmax output layer with the log-likelihood cost function behaves similarly to a sigmoid output layer with the cross-entropy cost.
33. Overfitting Problem
• Training-data accuracy keeps increasing as we increase the number of epochs on a fixed network architecture and fixed training dataset, but test-data accuracy saturates after some time – this is how the overfitting issue is observed
• Overfitting arises from a complex network with many parameters but not enough training examples
General strategies to overcome overfitting:
• Increase the training data size
• Reduce the network size
• Use a validation set to determine the best hyperparameter settings
34. Regularization
Regularization can reduce overfitting, even when we have a fixed network and fixed training data. It helps the network resist learning peculiarities (noise) in the training data and learn only the common patterns.
• L2 regularization: modify the cost function to force the network weights not to grow too large because of peculiarities in the training data; in the learning rule this adds only a weight-rescaling (weight decay) factor (see the sketch below)
• L1 regularization: the same idea, but penalizing the sum of the absolute values of the weights
• Dropout: doesn't rely on modifying the cost function; instead, it modifies the network itself
• Artificially increasing the training set size: introduce small distortions in the training data to increase its total size, e.g., small rotations of an image, or background noise added to speech data
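For reference, a sketch of the standard L2-regularized cost and the resulting gradient-descent weight update (these are the standard formulas, not reproduced from the slides; λ is the regularization parameter, n the training-set size, and C_0 the unregularized cost):

```latex
C = C_0 + \frac{\lambda}{2n}\sum_{w} w^{2},
\qquad
w \;\rightarrow\; \Bigl(1 - \frac{\eta\lambda}{n}\Bigr) w \;-\; \eta\,\frac{\partial C_0}{\partial w}
```

The factor (1 − ηλ/n) is exactly the weight-rescaling (weight decay) term mentioned above.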
35. Dropout
Training:
• Pick a mini-batch
• Each time, before computing the gradients, each neuron has a p% chance of being dropped out
• Update the parameters: θ^t ← θ^(t−1) − η∇C(θ^(t−1))
36. Dropout
Training:
• Pick a mini-batch
• Each time, before computing the gradients, each neuron has a p% chance of being dropped out – the structure of the network is changed, and the resulting network is thinner
• Use this new, thinner network for training: θ^t ← θ^(t−1) − η∇C(θ^(t−1))
• For each mini-batch, we resample the dropout neurons
37. Dropout
Testing:
• No dropout is applied
• If the dropout rate during training is p%, multiply all the weights by (1 − p)%
• Example: assume the dropout rate is 50%; if training gives a weight w = 1, set w = 0.5 for testing (a minimal sketch follows below)
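A minimal sketch of dropout at training time and weight scaling at test time for one sigmoid layer; the layer shapes and the 50% rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p = 0.5                                        # dropout rate

def layer_train(x, W, b):
    """Training: drop each neuron's output with probability p before it is used further."""
    a = sigmoid(W @ x + b)
    mask = (np.random.rand(a.shape[0]) >= p)   # keep a neuron with probability 1 - p
    return a * mask

def layer_test(x, W, b):
    """Testing: no dropout; scale the weights (not the bias) by (1 - p) instead."""
    return sigmoid((W * (1 - p)) @ x + b)
```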
38. Dropout - Intuitive Reason
• When people team up, if everyone expects their partner to do the work, nothing gets done in the end.
• However, if you know your partner may drop out, you will do better yourself ("my partner may do a bad job, so I should do it well").
• When testing, no one actually drops out, so we end up obtaining good results.
39. Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p)% (p being the dropout rate) when testing?
Assume the dropout rate is 50%. During training, on average half of the inputs to a neuron are dropped, so its weighted input z = w1·x1 + w2·x2 + w3·x3 + w4·x4 is computed with roughly half the terms missing. At testing, no dropout is applied, so with the raw training weights the weighted input would be about twice as large: z′ ≈ 2z. Multiplying each weight by (1 − p) = 0.5 restores z′ ≈ z, so the test-time behaviour matches what the network saw during training.
40. Dropout is a kind of ensemble.
Ensemble: split the training set into subsets (Set 1, Set 2, Set 3, Set 4) and train a bunch of networks with different structures (Network 1, 2, 3, 4), one on each subset.
41. Dropout is a kind of ensemble.
Ensemble at test time: feed the testing data x into each of Network 1–4, obtain the outputs y1, y2, y3, y4, and average them.
42. Dropout is a kind of ensemble.
Training with dropout: each mini-batch (mini-batch 1, 2, 3, 4, …) is used to train one "thinned" network, and some parameters in these networks are shared across mini-batches. With M neurons, there are 2^M possible thinned networks.
43. Dropout is a kind of ensemble.
Testing with dropout: ideally we would feed the testing data x to all of the thinned networks, obtain outputs y1, y2, y3, …, and average them. In practice, using the full network with all the weights multiplied by (1 − p)% gives approximately the same result: the output ≈ y, the ensemble average.
44. A Better Way of Weight Initialization
With Gaussian weight initialization (mean 0, stdev 1), the weighted input z = Σ_j w_j x_j + b to a hidden neuron is a sum of many such Gaussians, so its standard deviation is roughly √(number of non-zero input neurons + 1).
When there is a large number of non-zero input neurons, |z| is therefore likely to be large, and the output σ(z) from the hidden neuron will be very close to either 1 or 0.
This saturates the hidden neuron, and training is slowed down.
A clever choice of cost function helps with saturated output neurons, but it does nothing at all for the problem of saturated hidden neurons.
45. A Better Way of Weight Initialization cont..
We need a better technique to bring down the standard deviation of z.
New kind of weight initialization: Gaussian random variables with mean 0 and standard deviation 1/√n_in, where n_in is the number of input weights to the neuron.
The standard deviation of z is then much smaller, so the hidden neurons are not saturated. (A quick comparison is sketched below.)
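A quick NumPy comparison of the two initializations; the layer size and the all-ones input are illustrative:

```python
import numpy as np

n_in = 1000                                    # number of input weights to a neuron
x = np.ones(n_in)                              # a worst-case input: every pixel "on"

w_old = np.random.randn(n_in)                  # mean 0, stdev 1
w_new = np.random.randn(n_in) / np.sqrt(n_in)  # mean 0, stdev 1/sqrt(n_in)

print(abs(w_old @ x))   # typically on the order of sqrt(1000) ~ 30: z is huge, sigma(z) saturates
print(abs(w_new @ x))   # typically ~1: z stays small, the neuron does not saturate
```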
48. Why Deep Learning?
• Deep networks increase accuracy
• A deep network breaks a complex question down into very simple questions; it does this through a series of many layers
• It modularizes the classification task
49. Why Are Deep Networks Hard to Train?
The vanishing gradient problem.
Deep network toy example: a chain of layers with a single neuron per layer, with the weights initialized from a Gaussian with mean 0 and standard deviation 1.
The gradient with respect to a bias in an early layer is a product of terms of the form w_j σ′(z_j), one for each later layer. Since σ′(z) ≤ 1/4 and the initialized weights usually satisfy |w_j| < 1, each factor is typically smaller than 1/4, so the gradient two layers earlier is roughly (1/4)² = 1/16, i.e. about 16 times smaller.
As a result, neurons in the earlier layers learn much more slowly than neurons in the later layers.
50. Convolutional Neural Network (CNN)
Three basic ideas:
• Local receptive fields: the network doesn't connect every input pixel to every hidden neuron; instead, it only makes connections in small, localized regions of the input image
• Shared weights: it uses the same weights and bias for each of the hidden neurons in a particular hidden layer
• Pooling: simplifies (condenses) the information in the output from the convolutional layer
51. Local Receptive Fields
• Each hidden neuron connects to a small, localized region of the input neurons, called its local receptive field
• Stride length: the length of the shift of the local receptive field window used to create successive hidden neurons
52. Shared Weights and Biases
• Share the same set of weights and bias across all local receptive field windows
• Activation value of the (j, k)th hidden neuron: a_{j,k} = σ(b + Σ_{l=0..4} Σ_{m=0..4} w_{l,m} x_{j+l, k+m}), where the local receptive field window size is 5 × 5 (a code sketch follows below)
• All the neurons in one hidden layer detect exactly the same feature, just at different locations in the input data
• The shared weights and bias are often said to define a kernel or filter
• The map from the input layer to the hidden layer is called a feature map
• Multiple feature maps form the convolutional layer
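A small sketch of computing one feature map with a shared 5 × 5 kernel and a stride of 1 (NumPy only; the input size and random kernel are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(image, w, b):
    """Slide a shared 5x5 kernel w (and bias b) over the image with stride 1."""
    H, W = image.shape
    k = w.shape[0]                              # kernel size, 5 here
    out = np.zeros((H - k + 1, W - k + 1))
    for j in range(out.shape[0]):
        for m in range(out.shape[1]):
            window = image[j:j + k, m:m + k]    # local receptive field
            out[j, m] = sigmoid(b + np.sum(w * window))
    return out

image = np.random.rand(28, 28)     # e.g., one MNIST-sized image
w = np.random.randn(5, 5) / 5.0    # one shared kernel / filter
print(feature_map(image, w, 0.0).shape)   # (24, 24)
```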
53. Pooling Layer
• Pooling layers are usually used immediately after convolutional layers
• A pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map
• One common procedure for pooling is known as max-pooling: it simply outputs the maximum activation in a given input region (a minimal sketch follows below)
• This helps reduce the number of parameters needed in later layers
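A minimal 2 × 2 max-pooling sketch over a feature map; it assumes the map's height and width are even:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Condense a feature map by keeping the maximum activation in each 2x2 region."""
    H, W = fmap.shape
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.random.rand(24, 24)      # e.g., the 24x24 feature map from above
print(max_pool_2x2(fmap).shape)    # (12, 12)
```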
55. Tips for Training CNN
• Use rectified linear units (ReLU) instead of the sigmoid activation function (this handles the vanishing gradient problem)
• Expand the training data by introducing some distortion, rotation, shift, background noise, etc.
• Try introducing extra convolutional-pooling layers
• Try inserting an extra fully-connected layer
• Use dropout regularization on the fully-connected layers
57. Open Source Libraries
• Theano (machine learning library)
-- it has implementations of backpropagation for CNNs, dropout, and the other useful components needed to build a CNN
-- it can run code on either a CPU or, if available, an NVIDIA GPU
• Caffe
• Deeplearning4j
• Torch
58. Exercise
Install Caffe after installing the NVIDIA driver and the CUDA platform on your machine. Then run the AlexNet CNN model in GPU mode.
Speaker notes:
• "Hello world" for deep learning – data: http://yann.lecun.com/exdb/mnist/
• The same approach applies to even more complex tasks.
• Fully connected feedforward network: you can always connect the neurons in your own way.
• The "+" is ignored.
• Each output dimension corresponds to a digit (10 dimensions are needed).
• With softmax, the sum of all the outputs is one, so they can be interpreted as probabilities if you want.
• η (eta) is the learning rate.
• Standing at θ0, look around to see which direction goes down the most; that (negative gradient) direction is the step to take.
• Shuffle the data, and repeat the above process.
• Three questions: model, cost, training.
• Why is it named "softmax"? Note the monotonicity and the non-locality of softmax.
• Iteration vs. epoch.
• Do not worry that some neurons will not be updated in a given mini-batch – this is reasonable.
• The bias does not have to be multiplied by the scaling factor.
• Why should the weights be multiplied by (1 − p) (where p is the dropout rate) at testing?