Opening of our Deep Learning Lunch & Learn series. First session: introduction to Neural Networks, Gradient descent and backpropagation, by Pablo J. Villacorta, with a prologue by Fernando Velasco
5. Índice Analítico Si menos de dos horas en
comer tú tardas, apresurado
eres
Maestro Yoda. Leal y muy noble
6. Pablo J. Villacorta Iglesias
pvillacorta@stratio.com
July 2017
Deep Learning Course
Session 1: Introduction to
Artificial Neural Networks
6
7. Contents
Artificial Neural Networks: concept and motivation
Gradient descent in Logistic regression
The backpropagation algorithm
1
2
3
References and further reading4
7
8. Review: learning a model from data
Features Target (only in supervised
learning)
8
x1
5.1
x1
= x2
= 3.5
x3
1.4
x4
0.2
10. Motivation I: the need for non-linear decision boundaries
● What happens if we have n = 50 features?
○ Too many new features generated
● Even worse: what if we have an image,
where each pixel is a feature, and want to
learn the concept being displayed?
● Trick: create new features as non-linear combinations of
existing features, and give them to the linear classifier
○ Eg: use x1
and x2
but also x1
x2
, x1
2
, x2
2
○ Pros: still using a simple (white-box) classifier,
understandable by business people (econometrics)
Decision boundary = set of points that are equally likely to belong to any of two classes
Simple classifiers such as linear regression and logistic regression can only find linear boundaries
10
11. Motivation II: the brain as a “universal” learning algorithm
11
● Neurons behave as computing units. Each neuron
receives electric input signals (called spikes) through the
dendrites, computes an output with them and sends it
through the axon to other neurons connected to it
● The human brain can learn almost any task
○ Let’s see how it is structured and try to imitate
it if we want a really good learning machine
● Axon-dendrite transmission is done in
a mechanism called synapse.
● This process never changes, yet the
brain is always learning
12. A computational model of a natural neuron
12
x0
=1
x1
x2
x3
Output (called activation):
g( T
x) = g( 0
+ 1
x1
+ 2
x2
+ 3
x3
)
The value of the parameters =
depends on the neuron (they
represent “what the neuron has
learned upt to now”).
Function g(·) is known as the activation
function and it is a non-linear function
of the input.
Most often, it is the sigmoid function:
g(z) = 1 / ( 1 + e-z
)
0
1
2
3
Conclusion: with the sigmoid, each
neuron actually learns a logistic
regression for a (mysterious)
sub-task which contributes
appropriately to the network task.
g( T
x) = 1 / (1 + e-( 0 + 1x1 + 2x2 + 3x3)
)
x1
x = x2
x3
SIMPLE PERCEPTRON
13. Equation (“hypothesis”) of a small neural network
13
x1
x2
x3
a2
(2)
a1
(2)
a3
(2)
a1
(3)
Input layer
(layer 1)
Output layer
(layer 3)
Hidden layer(s)
(layer 2)
ai
(j)
= activation of the i-th neuron of layer j
B(j)
= matrix of parameters multiplied by the inputs (activations)
from layer j to compute the activations of layer j+1
h (x)
Neuron 1
of layer 2
N2 of L2
Neuron 3
of layer 2
Neuron 1
of layer 3
MULTILAYER PERCEPTRON (MLP)
14. Equation (“hypothesis”) of a simple neural network
14
x1
x2
x3
Input layer
(layer 1)
Output layer
(layer 3)
Hidden layer(s)
(layer 2)
ai
(j)
= activation of the i-th neuron of layer j
B(j)
= matrix of parameters to be multiplied by the inputs
(activations) from layer j to compute the activations of layer j+1
h (x)
( 1
(1)
)T
1,0
(1)
1,1
(1)
1,2
(1)
1,3
(1)
B(1)
= ( 2
(1)
)T
= 2,0
(1)
2,1
(1)
2,2
(1)
2,3
(1)
( 3
(1)
)T
3,0
(1)
3,1
(1)
3,2
(1)
3,3
(1)
B(2)
= ( 1
(2)
)T
= 1,0
(2)
1,1
(2)
1,2
(2)
1,3
(2)
1
(1)
a1
(2)
2
(1)
a2
(2)
3
(1)
a3
(2)
1
(2)
a1
(3)
a1
(2)
= g(B(1)
10
+ B(1)
11
x1
+ B(1)
12
x2
+ B(1)
13
x3
)
a2
(2)
= g(B(1)
20
+ B(1)
21
x1
+ B(1)
22
x2
+ B(1)
23
x3
)
a3
(2)
= g(B(1)
30
+ B(1)
31
x1
+ B(1)
32
x2
+ B(1)
33
x3
)
h (x) = a1
(3)
= g(B(2)
10
+ B(2)
11
a1
(2)
+ B(2)
12
a2
(2)
+ B(2)
13
a3
(2)
)
which is a logistic regression with new variables a1
, a2
,
a3
created as non-linear transformations of x1
, x2
, x3
16. Two-class and multiclass classification with a neural network
16
K classes = K neurons in the output layer (K > 2)
2 classes = 1 neuron in the output layer
x1
x2
x3
a2
(2)
a1
(2)
a3
(2)
a1
(3)
We see the whole network
as a function hB
: ℝp
→ℝ
(p is the number of features)
h (x)
We see the whole network as a function hB
: ℝp
→ℝK
(p is the number of features, K is the number of classes)
18. The concept of cost function in Machine Learning
● In any Machine Learning model, fitting a model means finding the best values of its parameters.
● The best model is that whose parameter values minimize the total error with respect to the actual outputs
● Since the error depends on the parameters chosen, it is a function called cost function J. The most common error
measure is the MSE (Mean Squared Error):
J ( 0
, 1
, …, R
) = [ 1/(2m) ] i=1
m
( ŷi
- yi
)2
= m = number of training examples
= [ 1/(2m) ] i=1
m
(h 0, 1, …, R
(xi
) - yi
)2
h 0, 1, …, R
= hypothesis (the equation of the model)
● Instead of minimizing the total sum, we minimize the average error, (1/m).
● Finding the optimum of any function f is equivalent to finding the optimum of f / 2.
● Hence we divide by 2 because it eases further calculations.
18
Ideally the cost
function is convex:
for every pair of
points, the curve is
always below the
line (or hyper-plane)
between them
Cost function of one parameter Cost function of two parameters
19. Gradient descent with one variable
● If the cost function is convex: only one global optimum, which is the global minimum.
● Closed form of the minimum:
○ Compute the derivative of the model equation and find the point where it is 0 (exact solution).
○ Multiple variables: partial derivatives, and solve an equation system to find where all are 0 simultaneously.
● Can be difficult if the model equation is complicated, has many variables (large equation system) or is not derivable.
● Solution: approximate iterative algorithm called gradient descent (also valid for non-convex functions!)
19
GRADIENT DESCENT ALGORITHM
0
←some initial value
←some fixed (small) constant
( is called the learning ratio )
t←0
tolerance←small value (e.g: 0.000001)
while dJ/d | = (t)
> tolerance :
(t+1)
← (t)
- ( dJ/d | = (t)
)
derivate < 0 derivate > 0
⟶
dJ/d < 0 :
(t+1) > (t):
increases
⟵
dJ/d < 0 :
(t+1) < (t):
decreases
* : derivate = 0
20. Gradient descent in general (variables = ( 0
, 1
, …, p
))
● With multiple variables: evaluate the modulus (noted ||·||) of the gradient vector to test the stopping criterion
20
GRADIENT DESCENT ALGORITHM (p variables)
0
←some initial vector ( 01
, …, 0p
)
←some fixed (small) constant
( is called the learning ratio )
t←0
tolerance←small value (e.g: 0.000001)
while ||∇g | = (t)
|| > tolerance:
(t+1)
← (t)
- || ∇g | = (t)
||
NOTE: ∇g = ( J/ 0
, J/ 1
, … , J/ p
) ∈ ℝp
● In general, cost functions are not convex: many local
optima. Gradient descent does not guarantee the minimum
● The solution found depends on the starting point
In summary: we need to compute the
partial derivatives at each parameter point
21. Gradient descent with two variables
● Note that the error function J is determined uniquely by the dataset and the shape model being fitted.
● The function (and hence its derivative) does not change during the algorithm, we just evaluate the derivate function
at different values of the model parameters, which are the variables of that function.
● E.g: linear regression with 2 variables x1
, x2
. The model is
h (x) = 0
+ 1
x1
+ 2
x2
Imagine this dataset: x1
= (2, 3, 1.8)
x2
= (4, 5, 3.2) (having only m = 2 examples)
Then
J( 0
, 1
, 2
) = (1/(2·2)) [( 0
+ 2 1
+ 3 2
- 1.8)2
+ ( 0
+ 4 1
+ 5 2
- 3.2)2
]
J/ 0
= (1/2) ( ( 0
+ 2 1
+ 3 2
- 1.8) + ( 0
+ 4 1
+ 5 2
- 3.2) )
J/ 1
= (1/2) ( ( 0
+ 2 1
+ 3 2
- 1.8)·2 + ( 0
+ 4 1
+ 5 2
- 3.2)·4 )
J/ 2
= (1/2) ( ( 0
+ 2 1
+ 3 2
- 1.8)·3 + ( 0
+ 4 1
+ 5 2
- 3.2)·5 )
∇g = ( (1/2) ( ( 0
+ 2 1
+ 3 2
- 1.8) + ( 0
+ 4 1
+ 5 2
- 3.2) ),
(1/2) ( ( 0
+ 2 1
+ 3 2
- 1.8)·2 + ( 0
+ 4 1
+ 5 2
- 3.2)·4 ),
(1/2) ( ( 0
+ 2 1
+ 3 2
- 1.8)·3 + ( 0
+ 4 1
+ 5 2
- 3.2)·5 ) ) 21
and now we can start evaluating ∇g at different
points (t)
, each point being a vector of ℝ3
.
22. Cost function of logistic regression and a neural network
● In logistic regression, h (x) = 1 / (1 + e-( )
) and so the cost function of a logistic regression,
J( 0
, 1
, …, R
) = [ 1/(2m) ] i=1
m
( 1 / (1 + e-( )
) - yi
)2
is non-convex. A somewhat equivalent, convex cost function in logistic regression is
J( 0
, 1
, …, R
) = -(1/m) [ i=1
m
yi
log(h (xi
)) + (1-yi
)log(1 - h (xi
)) ] (we ignore any regularization term)
22
T
x
T
x
Recall that a neural network can be seen as an aggregation of logistic regressions. In a NN with
K neurons in the output layer (no matter how many are inside), the cost function is
J(B) = -(1/m) [ i=1
m
k=1
K
yi
log(hB
(xi
))k
+ (1-yi
)log(1 - (hB
(xi
))k
) ] B = {B(1)
, B(2)
, … }
which is again non-convex.
How can we compute the partial derivatives J / Bij
(ℓ)
at each step… and not die along the
way? The task seems a bit (computationally) heavy...
24. Backpropagation
Algorithm to compute the partial derivatives of the cost function with respect to each parameter.
● Intuition: compute the contribution of each neuron to the final error, and change its weights
accordingly
○ Compute the contribution (deltas) of each neuron to the error of each example separately,
and then accumulate over all examples to obtain the total contribution of each neuron to the
total error.
Example: we first apply forward propagation to compute every aj
(ℓ)
and the output of the network hB
(x)
24
z1
(3)
= B10
(2)
+ B11
(2)
a1
(2)
+ B12
(2)
a2
(2)
a1
(3)
= g(z1
(3)
)
B10
(1)
,B11
(1)
,B12
(1)
z1
(2)
→a1
(2)
B20
(1)
,B21
(1)
,B22
(1)
z2
(2)
→a2
(2)
B10
(2)
,B11
(2)
,B12
(2)
z1
(3)
→a1
(3)
B20
(2)
,B21
(2)
,B22
(2)
z2
(3)
→a2
(3)
x1
x2
B10
(3)
,B11
(3)
,B12
(3)
z1
(4)
→a1
(4)
a1
(2)
a2
(2)
25. Backpropagation: contribution of a1
(3)
to the error
How wrong is a1
(3)
? In other words:
How much did neuron 1 of layer 3 contribute to the network error on a given example (xi
, yi
) ?
1
(4)
= yi
- a1
(4)
1
(3)
= B11
(3)
1
(4)
because a1
(3)
had contributed to a1
(4)
with the term a1
(3)
B11
(3)
(recall: a1
(4)
= g( B10
(3)
+ a1
(3)
B11
(3)
+ a2
(3)
B12
(3)
) )
25
z1
(2)
→a1
(2)
z2
(2)
→a2
(2)
z1
(3)
→a1
(3)
Error: 1
(3)
z2
(3)
→a2
(3)
x1
x2
B10
(3)
,B11
(3)
,B12
(3)
z1
(4)
→a1
(4)
Error: 1
(4)
26. Backpropagation: contribution of a1
(2)
to the error
How wrong is a1
(2)
? In other words:
How much did neuron 1 of layer 2 contribute to the network error on a given example (xi
, yi
) ?
1
(2)
= B11
(2)
1
(3)
+ B21
(2)
2
(3)
because a1
(2)
contributed to
26
z1
(2)
→a1
(2)
Error: 1
(2)
z2
(2)
→a2
(2)
B10
(2)
,B11
(2)
,B12
(2)
z1
(3)
→a1
(3)
Error: 1
(3)
B20
(2)
,B21
(2)
,B22
(2)
z2
(3)
→a2
(3)
Error: 2
(3)
x1
x2
z1
(4)
→a1
(4)
Error: 1
(4)
a1
(3)
with term a1
(2)
B11
(2)
(recall: a1
(3)
= g(B10
(2)
+ a1
(2)
B11
(2)
+ a2
(2)
B12
(2)
) )
a2
(3)
with term a1
(2)
B21
(2)
(recall: a2
(3)
= g(B10
(2)
+ a1
(2)
B11
(2)
+ a2
(2)
B12
(2)
) )
27. Backpropagation algorithm
INPUT: training set {(x1
, y1
), (x2
, y2
), …, (xm
, ym
)}
Initialize ij
(ℓ)
←0 for every i, j, ℓ
For i = 1 to m:
a(1)
← xi
Perform forward propagation to compute every aj
(ℓ)
for ℓ = 1, 2, …,L (recall the aj
(L)
are the outputs of the network)
(L)
← a(L)
- yi
(errors of the output layer)
Compute the (L-1)
, (L-2)
, …, (2)
as (ℓ)
= (B(ℓ)
)T (ℓ+1)
.* (a(ℓ)
.* (1 - a(ℓ)
) (where .* means element-wise product)
ij
(ℓ)
← ij
(ℓ)
+ aj
(ℓ)
i
(ℓ+1)
Dij
(ℓ)
= (1/m) ij
(ℓ)
(assuming no regularization)
Finally……….. the D’s are the components of the gradient vector evaluated at the current values of the parameters:
J / Bij
(ℓ)
| Bij(ℓ) = Bij(ℓ)(t)
= Dij
(ℓ)
and now we can use them to update the parameter values B(t)
to obtain B(t+1)
, either as in gradient descent or any other
optimization method
27
28. Summary
● Neural Networks are a machine learning model inspired in the human brain
● They appear as a way to create highly non-linear features in an intelligent way
○ It is not the only model dealing with a non-linear frontier, e.g. Support Vector Machines
● Training a Neural Network requires a lot of training data
○ … because they are needed to obtain a good approximation of the gradient at each point of the parameter
space (and because there are a lot of parameters, it is a high-dimensional space!)
● The backpropagation algorithm allows computing the gradient at each point much more efficiently than doing it
directly.
28