Christof Monz
Informatics Institute
University of Amsterdam
Data Mining
Part 3: Neural Networks
Overview
Perceptrons
Gradient descent search
Multi-layer neural networks
The backpropagation algorithm
Neural Networks
Analogy to biological neural systems, the most
robust learning systems we know
Attempt to understand natural biological
systems through computational modeling
Massive parallelism allows for computational
efficiency
Helps understand the ‘distributed’ nature of neural representations
Intelligent behavior as an ‘emergent’ property of a large number of simple units rather than of explicitly encoded symbolic rules and algorithms
Neural Network Learning
Learning approach based on modeling
adaptation in biological neural systems
Perceptron: Initial algorithm for learning
simple neural networks (single layer) developed
in the 1950s
Backpropagation: More complex algorithm for
learning multi-layer neural networks developed
in the 1980s.
Real Neurons
Human Neural Network
Modeling Neural Networks
Perceptrons
A perceptron is a single-layer neural network with one output unit
The output of a perceptron is computed as follows:
o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and −1 otherwise
Assuming a ‘dummy’ input x_0 = 1, we can write:
o(x_1, ..., x_n) = 1 if ∑_{i=0}^{n} w_i x_i > 0, and −1 otherwise
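To make the definition concrete, a minimal sketch in Python (the name perceptron_output and the example values are illustrative, not part of the slides):

def perceptron_output(weights, inputs):
    # weights = [w_0, w_1, ..., w_n]; inputs = [x_1, ..., x_n]; w_0 acts on the dummy input x_0 = 1
    activation = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if activation > 0 else -1

print(perceptron_output([-0.5, 1.0, 1.0], [0.2, 0.9]))  # weighted sum is 0.6 > 0, so the output is 1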
Learning a perceptron involves choosing the ‘right’ values for the weights w_0, ..., w_n
The set of candidate hypotheses is H = { w | w ∈ ℝ^(n+1) }
Representational Power
A single perceptron can represent many boolean functions, e.g. AND, OR, NAND (¬AND), ..., but not all (e.g., XOR)
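For instance, one hand-picked (purely illustrative) weight setting that realises AND for inputs in {0, 1} is w_0 = −1.5, w_1 = 1, w_2 = 1; no comparable setting exists for XOR, since XOR is not linearly separable:

w = (-1.5, 1.0, 1.0)                        # hand-picked weights for AND (illustrative)
for x1 in (0, 1):
    for x2 in (0, 1):
        s = w[0] + w[1] * x1 + w[2] * x2    # w_0 + w_1 x_1 + w_2 x_2
        print(x1, x2, 1 if s > 0 else -1)   # prints -1, -1, -1, 1: exactly AND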
Perceptron Training Rule
The perceptron training rule can be defined for each weight as:
w_i ← w_i + ∆w_i
where ∆w_i = η(t − o)x_i
and where t is the target output, o is the output of the perceptron, and η is the learning rate
This scenario assumes that we know what the target outputs are supposed to be
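A minimal sketch of this rule in Python (assuming thresholded outputs in {−1, +1}; the name perceptron_update is illustrative):

def perceptron_update(weights, inputs, target, eta=0.1):
    # inputs includes the dummy input x_0 = 1 at position 0
    output = 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else -1
    # w_i <- w_i + eta * (t - o) * x_i for every weight
    return [w + eta * (target - output) * x for w, x in zip(weights, inputs)]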
Perceptron Training Rule Example
If t = o then η(t − o)x_i = 0 and ∆w_i = 0, i.e. the weight w_i remains unchanged, regardless of the learning rate and the input value x_i
Let’s assume a learning rate of η = 0.1 and an input value of x_i = 0.8
• If t = +1 and o = −1, then ∆w_i = 0.1 · (1 − (−1)) · 0.8 = 0.16
• If t = −1 and o = +1, then ∆w_i = 0.1 · (−1 − 1) · 0.8 = −0.16
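The same two cases can be checked directly (values taken from the example above):

eta, x_i = 0.1, 0.8
print(eta * (1 - (-1)) * x_i)   # t = +1, o = -1: prints approximately  0.16
print(eta * (-1 - 1) * x_i)     # t = -1, o = +1: prints approximately -0.16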
The perceptron training rule converges after a finite number of iterations, but only if the training examples are linearly separable
The stopping criterion holds if the amount of change falls below a pre-defined threshold θ, e.g., if ‖∆w‖_1 < θ
The Delta Rule
The delta rule overcomes the shortcoming of the perceptron training rule that it is not guaranteed to converge if the examples are not linearly separable
The delta rule is based on gradient descent search
Let’s assume we have an unthresholded perceptron: o(x) = w · x
We can define the training error as:
E(w) = 1/2 ∑_{d∈D} (t_d − o_d)^2
where D is the set of training examples
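A small sketch of this error function in Python (the name training_error and the data layout are assumptions made for illustration):

def training_error(weights, examples):
    # examples: list of (inputs, target) pairs; inputs include the dummy x_0 = 1
    total = 0.0
    for inputs, target in examples:
        output = sum(w * x for w, x in zip(weights, inputs))  # unthresholded: o(x) = w · x
        total += (target - output) ** 2
    return 0.5 * total  # E(w) = 1/2 * sum_d (t_d - o_d)^2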
Error Surface
Gradient Descent
The gradient of E is the vector pointing in the direction of the steepest increase for any point on the error surface:
∇E(w) = ( ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n )
Since we are interested in minimizing the error, we consider the negative gradient: −∇E(w)
The training rule for gradient descent is:
w ← w + ∆w
where ∆w = −η∇E(w)
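As a toy illustration of this update rule, consider a one-dimensional error E(w) = (w − 3)^2, whose gradient 2(w − 3) is known in closed form (the numbers below are made up for illustration):

w, eta = 0.0, 0.1
for _ in range(50):
    grad = 2 * (w - 3)   # ∇E(w) for E(w) = (w - 3)^2
    w = w - eta * grad   # w ← w + ∆w with ∆w = −η∇E(w)
print(w)                 # close to the minimum at w = 3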
The training rule for individual weights is defined as w_i ← w_i + ∆w_i
where ∆w_i = −η ∂E/∂w_i
Instantiating E for the error function we use gives:
∂E/∂w_i = ∂/∂w_i [ 1/2 ∑_{d∈D} (t_d − o_d)^2 ]
How do we use partial derivatives to actually compute updates to weights at each step?
∂E/∂w_i = ∂/∂w_i [ 1/2 ∑_{d∈D} (t_d − o_d)^2 ]
= 1/2 ∑_{d∈D} ∂/∂w_i (t_d − o_d)^2
= 1/2 ∑_{d∈D} 2 (t_d − o_d) · ∂/∂w_i (t_d − o_d)
= ∑_{d∈D} (t_d − o_d) · ∂/∂w_i (t_d − o_d)
= ∑_{d∈D} (t_d − o_d) · (−x_{id})
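A quick numerical sanity check of this result, comparing the closed-form gradient with a finite-difference estimate on a tiny made-up dataset (all values and names below are illustrative):

def error(weights, examples):
    # E(w) = 1/2 * sum_d (t_d - o_d)^2 with o_d = w · x_d
    return 0.5 * sum((t - sum(w * x for w, x in zip(weights, xs))) ** 2 for xs, t in examples)

examples = [([1.0, 0.5, -0.2], 1.0), ([1.0, -0.3, 0.8], -1.0)]  # (inputs with dummy x_0 = 1, target)
weights = [0.1, 0.2, -0.1]
i, eps = 1, 1e-6
analytic = -sum((t - sum(w * x for w, x in zip(weights, xs))) * xs[i] for xs, t in examples)
bumped = list(weights); bumped[i] += eps
numeric = (error(bumped, examples) - error(weights, examples)) / eps
print(analytic, numeric)  # the two values should agree up to the finite-difference error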
The delta rule for individual weights can now be written as w_i ← w_i + ∆w_i
where ∆w_i = η ∑_{d∈D} (t_d − o_d) x_{id}
The gradient descent algorithm
• picks initial random weights
• computes the outputs
• updates each weight by adding ∆w_i
• repeats until convergence
The Gradient Descent Algorithm
Each training example is a pair ⟨x, t⟩
1 Initialize each w_i to some small random value
2 Until the termination condition is met do:
2.1 Initialize each ∆w_i to 0
2.2 For each ⟨x, t⟩ ∈ D do
2.2.1 Compute o(x)
2.2.2 For each weight w_i do
∆w_i ← ∆w_i + η(t − o)x_i
2.3 For each weight w_i do
w_i ← w_i + ∆w_i
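A direct Python transcription of this algorithm for an unthresholded (linear) unit; the function name and the fixed-epochs termination condition are assumptions made for illustration:

import random

def gradient_descent(examples, eta=0.05, epochs=100):
    # examples: list of (x, t) pairs, where x already includes the dummy input x_0 = 1
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]   # step 1
    for _ in range(epochs):                               # step 2
        delta = [0.0] * n                                  # step 2.1
        for x, t in examples:                              # step 2.2
            o = sum(wi * xi for wi, xi in zip(w, x))       # step 2.2.1 (unthresholded output)
            for i in range(n):                             # step 2.2.2
                delta[i] += eta * (t - o) * x[i]
        for i in range(n):                                 # step 2.3
            w[i] += delta[i]
    return w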
The gradient descent algorithm will find the global minimum, provided that the learning rate is small enough
If the learning rate is too large, the algorithm runs the risk of overstepping the global minimum
It is a common strategy to gradually decrease the learning rate
This algorithm also works when the training examples are not linearly separable
Shortcomings of Gradient Descent
Converging to a minimum can be quite slow (i.e. it can take thousands of steps); increasing the learning rate, on the other hand, can lead to overstepping minima
If there are multiple local minima in the error
surface, gradient descent can get stuck in one
of them and not find the global minimum
Stochastic gradient descent alleviates these
difficulties
Stochastic Gradient Descent
Gradient descent updates the weights after summing over all training examples
Stochastic (or incremental) gradient descent updates the weights incrementally, after calculating the error for each individual training example
To this end, step 2.3 is deleted and step 2.2.2 is modified
Each training example is a pair ⟨x, t⟩
1 Initialize each w_i to some small random value
2 Until the termination condition is met do:
2.1 Initialize each ∆w_i to 0
2.2 For each ⟨x, t⟩ ∈ D do
2.2.1 Compute o(x)
2.2.2 For each weight w_i do
w_i ← w_i + η(t − o)x_i
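The stochastic variant in the same sketch style as before: the only change is that each weight is updated immediately after each example, so no ∆w_i is accumulated:

import random

def stochastic_gradient_descent(examples, eta=0.05, epochs=100):
    # examples: list of (x, t) pairs, where x already includes the dummy input x_0 = 1
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]   # step 1
    for _ in range(epochs):                               # step 2
        for x, t in examples:                              # step 2.2
            o = sum(wi * xi for wi, xi in zip(w, x))       # step 2.2.1
            for i in range(n):                             # step 2.2.2 (modified: update w_i directly)
                w[i] += eta * (t - o) * x[i]
    return w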
Comparison
In standard gradient descent, summing over multiple examples requires more computation per weight update step
As a consequence, standard gradient descent often uses larger learning rates than stochastic gradient descent
Stochastic gradient descent can avoid falling into local minima because it uses the individual gradients ∇E_d(w) rather than the overall ∇E(w) to guide its search
Multi-Layer Neural Networks
Perceptrons only have two layers: the input
layer and the output layer
Perceptrons only have one output unit
Perceptrons are limited in their expressiveness
Multi-layer neural networks consist of an input
layer, a hidden layer, and an output layer
Multi-layer neural networks can have several
output units
The units of the hidden layer function as input
units to the next layer
However, multiple layers of linear units still
produce only linear functions
The step function in perceptrons is another
choice, but it is not differentiable, and therefore
not suitable for gradient descent search
Solution: the sigmoid function, a non-linear,
differentiable threshold function
Sigmoid Unit
The Sigmoid Function
The output is computed as o = σ(w · x)
where σ(y) = 1 / (1 + e^(−y))
i.e. o = σ(w · x) = 1 / (1 + e^(−w·x))
Another nice property of the sigmoid function is that its derivative is easily expressed:
dσ(y)/dy = σ(y) · (1 − σ(y))
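In Python, the sigmoid and its derivative follow directly from these formulas (a small sketch):

import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))       # σ(y) = 1 / (1 + e^(−y))

def sigmoid_derivative(y):
    s = sigmoid(y)
    return s * (1.0 - s)                    # dσ(y)/dy = σ(y) · (1 − σ(y))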
Learning with Multiple Layers
The gradient descent search can be used to train multi-layer neural networks, but the algorithm has to be adapted
Firstly, there can be multiple output units, and therefore the error function has to be generalized:
E(w) = 1/2 ∑_{d∈D} ∑_{k∈outputs} (t_{kd} − o_{kd})^2
Secondly, the error ‘feedback’ has to be fed through multiple layers
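The generalized error E(w) above can be sketched as a double sum over examples and output units (the name and data layout below are assumptions made for illustration):

def multi_output_error(targets, outputs):
    # targets[d][k] = t_kd and outputs[d][k] = o_kd for example d and output unit k
    return 0.5 * sum(
        (t_kd - o_kd) ** 2
        for t_d, o_d in zip(targets, outputs)
        for t_kd, o_kd in zip(t_d, o_d)
    )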
Backpropagation Algorithm
For each training example ⟨x, t⟩ do
1. Input x to the network and compute the output o_u for every unit u in the network
2. For each output unit k calculate its error δ_k:
δ_k ← o_k (1 − o_k)(t_k − o_k)
3. For each hidden unit h calculate its error δ_h:
δ_h ← o_h (1 − o_h) ∑_{k∈outputs} w_{kh} δ_k
4. Update each network weight w_{ji}:
w_{ji} ← w_{ji} + ∆w_{ji}
where ∆w_{ji} = η δ_j x_{ji}
Note: x_{ji} is the value passed from unit i to unit j, and w_{ji} is the weight connecting unit i to unit j
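A sketch of one such update for a network with a single hidden layer of sigmoid units; the weight layout (one bias per unit via a dummy input of 1) and all names are illustrative assumptions, not taken from the slides:

import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(x, t, w_hidden, w_output, eta=0.1):
    # w_hidden[h][i]: weight from input i to hidden unit h (x includes the dummy input x_0 = 1)
    # w_output[k][j]: weight from hidden unit j to output unit k (index 0 is the bias weight)
    # Step 1: propagate the input forward through the network
    o_hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    h = [1.0] + o_hidden                                    # hidden outputs with dummy input 1
    o_out = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_output]
    # Step 2: errors of the output units
    delta_k = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o_out, t)]
    # Step 3: errors of the hidden units (sum over the output units each hidden unit feeds into)
    delta_h = [oh * (1 - oh) * sum(w_output[k][j + 1] * delta_k[k] for k in range(len(delta_k)))
               for j, oh in enumerate(o_hidden)]
    # Step 4: weight updates  ∆w_ji = η · δ_j · x_ji
    for k, row in enumerate(w_output):
        for j in range(len(row)):
            row[j] += eta * delta_k[k] * h[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_h[j] * x[i]
    return w_hidden, w_output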
Step 1 propagates the input forward through
the network
Steps 2–4 propagate the errors backward
through the network
Step 2 is similar to the delta rule in gradient
descent (step 2.3)
Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units)
Applications of Neural Networks
Text to speech
Fraud detection
Automated vehicles
Game playing
Handwriting recognition
Summary
Perceptrons: simple single-layer neural networks
Perceptron training rule
Gradient descent search
Multi-layer neural networks
Backpropagation algorithm