This lecture provides a comprehensive introduction to basic deep learning concepts, including the feed-forward pass, the backpropagation algorithm, activation functions, and vanishing gradients.
50. Sigmoid activation function
[Figure: plot of sig(x) over x from -8 to 8; the curve rises smoothly from 0 to 1, reaching about 0.99 near the right edge]
sig(x) = 1 / (1 + e^(−x))
No matter how big your x grows, sig(x) will always stay between 0 and 1.
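The boundedness claim is easy to check directly. A minimal Python sketch of the sigmoid (an illustration added here, not part of the original slides):

```python
import math

def sig(x):
    """Sigmoid activation: squashes any real x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Bounded even for large |x|:
print(sig(-8))  # ~0.000335
print(sig(0))   # 0.5
print(sig(8))   # ~0.999665
```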
92. Adding one more type of neurons
[Figure: a 2-2-2 network taking the point coordinates (0.1, 0.05) as input. Hidden layer: weights 0.15 (w1), 0.20 (w2), 0.25 (w3), 0.30 (w4), bias b1 = 0.35. Output layer: weights 0.40 (w5), 0.45 (w6), 0.50 (w7), 0.55 (w8), bias b2 = 0.6. Truth values are 0 and 1.0; the network's answers and errors are still unknown (?)]
94. What is the role of bias in Neural Networks?
[Figure: a single neuron with input x and weight w1 computing sig(w1 * x); plot of sig(w1 * x) over x from -8 to 8]
Source: http://www.uta.fi/sis/tie/neuro/index/Neurocomputing2.pdf
95. What is the role of bias in Neural Networks?
[Figure: the same neuron; plot of sig(w1 * x) for w1 = [0.5, 1.0, 2.0], showing that the weight changes the steepness of the curve]
96. What is the role of bias in Neural Networks?
[Figure: two setups side by side: the neuron computing sig(w1 * x) with w1 = [0.5, 1.0, 2.0], and a neuron with an added bias input b1 computing sig(w1 * x + b1)]
97. Bias helps to shift the resulting curve
[Figure: with w1 = [1.0] fixed, plots of sig(w1 * x) versus sig(w1 * x + b1) for b1 = [-4.0, 0.0, 4.0]; the bias slides the curve left or right along the x axis]
98. Bias helps to shift the resulting curve
sig(x) = 1 / (1 + e^(−x))
[Figure: same plots as before: sig(w1 * x) for w1 = [0.5, 1.0, 2.0], and sig(w1 * x + b1) for b1 = [-4.0, 0.0, 4.0] with w1 = [1.0]]
99. Bias helps to shift the resulting curve
With the weight and bias folded into the activation: sig(x) = 1 / (1 + e^(−(wx+b)))
[Figure: same plots as before]
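The two roles can be seen numerically as well. A short Python sketch (added for illustration; the values of w1 and b1 are the ones from the slides):

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

x = 2.0

# The weight scales the input, controlling how steep the sigmoid is:
for w1 in [0.5, 1.0, 2.0]:
    print(f"w1={w1}: sig(w1*x) = {sig(w1 * x):.3f}")

# The bias shifts the curve along the x axis without changing its shape:
for b1 in [-4.0, 0.0, 4.0]:
    print(f"b1={b1}: sig(1.0*x + b1) = {sig(1.0 * x + b1):.3f}")
```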
127. Feed-forward path
[Figure: the 2-2-2 network with point coordinates (0.1, 0.05) as input, weights w1-w8 and biases b1 = 0.35, b2 = 0.6 as before. Hidden layer outputs: out(h1) = 0.5933, out(h2) = 0.5969. Output layer: out(o1) = 0.7511, out(o2) = 0.7732, i.e. answers 0.751 and 0.773 against truths 0 and 1.0; the errors are still unknown]
How good are these predictions?
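The feed-forward pass on this network fits in a few lines. A Python sketch (added for illustration; it takes the first input as 0.05 and the second as 0.1, which is the assignment that reproduces the slide's numbers):

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs (the point's coordinates) and parameters from the slide
i1, i2 = 0.05, 0.1
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.6

# Hidden layer
out_h1 = sig(w1 * i1 + w2 * i2 + b1)   # ~0.5933
out_h2 = sig(w3 * i1 + w4 * i2 + b1)   # ~0.5969

# Output layer takes the hidden outputs as its inputs
out_o1 = sig(w5 * out_h1 + w6 * out_h2 + b2)  # ~0.751
out_o2 = sig(w7 * out_h1 + w8 * out_h2 + b2)  # ~0.773
```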
128. [Figure: the same network and feed-forward values as on the previous slide]
In order to find out how good these predictions are, we should compare them to the expected outcomes.
129. [Figure: same network as before]
In order to find out how good these predictions are, we should compare them to the expected outcomes:
Eo1 = 1/2 * (truth − out(o1))²
130. [Figure: same network as before]
Eo1 = 1/2 * (truth − out(o1))² = 1/2 * (0 − 0.751)²
131. [Figure: same network as before]
Eo1 = 1/2 * (truth − out(o1))² = 1/2 * (0 − 0.751)² = 1/2 * (−0.751)²
132. [Figure: same network; the error on o1 is now filled in as 0.282]
Eo1 = 1/2 * (truth − out(o1))² = 1/2 * (0 − 0.751)² = 0.282
133. [Figure: same network; error on o1 = 0.282]
Eo1 = 1/2 * (truth − out(o1))² = 1/2 * (0 − 0.751)² = 0.282
Eo2 = 1/2 * (truth − out(o2))² = 1/2 * (1 − 0.773)² = 0.025
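Both squared-error terms can be checked directly. A small Python sketch (added for illustration; it reuses the outputs from the feed-forward pass):

```python
def squared_error(truth, out):
    """Half squared error, as defined on the slides: E = 1/2 (truth - out)^2."""
    return 0.5 * (truth - out) ** 2

e_o1 = squared_error(0.0, 0.7511)  # ~0.282
e_o2 = squared_error(1.0, 0.7732)  # ~0.026 (the slides round this to 0.025)
e_total = e_o1 + e_o2              # ~0.307
```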
134. [Figure: same network; both output errors are now filled in: 0.282 for o1 and 0.025 for o2]
Eo1 = 1/2 * (truth − out(o1))² = 0.282
Eo2 = 1/2 * (truth − out(o2))² = 0.025
135. [Figure: same network with errors 0.282 and 0.025]
How good are these predictions?
136. [Figure: same network with errors 0.282 and 0.025]
How good are these predictions? Well, they are this good.
137. [Figure: same network with errors 0.282 and 0.025]
Overall the error is: Etotal = Eo1 + Eo2
138. [Figure: same network]
Overall the error is: Etotal = Eo1 + Eo2 = 0.282 + 0.025 = 0.307
139. [Figure: same network; Etotal = 0.307]
Our goal is to reduce the total error (Etotal)!
140. [Figure: same network; Etotal = 0.307]
Our goal is to reduce the total error (Etotal)! The only thing we can change is the weights.
141. [Figure: same network with w5 = 0.40 highlighted; errors 0.282 and 0.025, Etotal = 0.307]
How does w5 influence the total error?
142. [Figure: same network, with w5 highlighted and candidate values 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55; Etotal = 0.307]
How does w5 influence the total error? We can change it randomly (increase or decrease it) and see what happens to Etotal.
143. [Figure: increasing w5 to 0.45 gives answer 0.757, Eo1 = 0.286, Etotal = 0.311]
How does w5 influence the total error? We can change it randomly (increase or decrease it) and see what happens to Etotal.
144. [Figure: back to w5 = 0.40: answer 0.751, Eo1 = 0.282, Etotal = 0.307]
How does w5 influence the total error? We can change it randomly (increase or decrease it) and see what happens to Etotal.
145. [Figure: decreasing w5 to 0.35 gives answer 0.745, Eo1 = 0.278, Etotal = 0.303]
How does w5 influence the total error? We can change it randomly (increase or decrease it) and see what happens to Etotal.
146. [Figure: decreasing w5 further to 0.30 gives answer 0.740, Eo1 = 0.273, Etotal = 0.299]
How does w5 influence the total error? We can change it randomly (increase or decrease it) and see what happens to Etotal.
147. [Figure: same network with w5 = 0.30: answer 0.740, Eo1 = 0.273, Etotal = 0.299]
How does w5 influence the total error? A lower weight gives a lower Etotal; a higher weight gives a higher Etotal.
148. [Figure: same setup as before]
The problem is that every time we change a weight, we need to recalculate all the neuron values again.
149. [Figure: same setup as on the previous slide]
150. [Figure: same setup as on the previous slide]
153. [Figure: plot of Etotal as a function of a weight, with the neighbouring points weight − 1 and weight + 1 marked]
We want a way to efficiently update all our weights so that the total error decreases most substantially.
154. [Figure: same error-vs-weight plot]
We want a way to efficiently update all our weights so that the total error decreases most substantially. We want to compute the gradient.
155. [Figure: same error-vs-weight plot]
We want to compute the gradient. The gradient shows the direction of the greatest increase of the function.
156. [Figure: same plot, now annotated with the gradient ∂E/∂w]
We want to compute the gradient ∂E/∂w. The gradient shows the direction of the greatest increase of the function.
157. [Figure: same plot, annotated with both ∂E/∂w and its negative, −∂E/∂w]
We want to compute the gradient ∂E/∂w. The gradient shows the direction of the greatest increase of the function, so to decrease the error we step in the direction of −∂E/∂w.
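The stepping idea can be sketched in one dimension. A toy Python example (not the lecture's network: it minimises the hypothetical error E(w) = (w − 3)² by repeatedly moving against the gradient):

```python
def grad(w):
    # Derivative of E(w) = (w - 3)^2
    return 2 * (w - 3)

w = 0.0
eta = 0.1  # step size
for _ in range(100):
    w -= eta * grad(w)  # step in the direction of -dE/dw

print(w)  # converges towards 3, the minimum of E
```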
158. [Figure: Etotal as a function of w5, with ∂E/∂w and the descent direction −∂E/∂w marked]
It looks OK when there is just one weight to take care of.
160. [Figure: the same descent picture, now over all weights w1, w2, w3, w4, w5, w6, w7, w8]
It starts to look a lot scarier when we try to optimise all the weights. Still, the gradient can be computed efficiently using the backpropagation algorithm.
161. Backpropagation algorithm
[Figure: the familiar 2-2-2 network with w5 = 0.40 highlighted; errors 0.282 and 0.025, Etotal = 0.307]
How does w5 influence the total error?
162. Backpropagation algorithm
[Figure: same network]
How does w5 influence the total error?
∂Etotal/∂w5 = . . .
163. Backpropagation algorithm
[Figure: same network]
∂Etotal/∂w5 = ∂Etotal/∂Eo1 * . . .
How does the error on out(o1) influence the total error?
164. Backpropagation algorithm
[Figure: same network]
∂Etotal/∂w5 = ∂Etotal/∂Eo1 * ∂Eo1/∂out(o1) * . . .
How does out(o1) influence the error on out(o1)?
165. Backpropagation algorithm
[Figure: same network]
∂Etotal/∂w5 = ∂Etotal/∂Eo1 * ∂Eo1/∂out(o1) * ∂out(o1)/∂in(o1) * . . .
How does in(o1) influence out(o1)?
174. Backpropagation algorithm
[Figure: same network; out(h1) = 0.5933 highlighted]
∂Etotal/∂w5 = 1 * 0.751 * ∂out(o1)/∂in(o1) * ∂in(o1)/∂w5
What is ∂out(o1)/∂in(o1)? Recall sig(x) = 1 / (1 + e^(−x)).
175. Backpropagation algorithm
[Figure: same network]
∂out(o1)/∂in(o1) = out(o1) * (1 − out(o1)) = 0.751 * (1 − 0.751) = 0.186
∂Etotal/∂w5 = 1 * 0.751 * ∂out(o1)/∂in(o1) * ∂in(o1)/∂w5
176. Backpropagation algorithm
[Figure: same network]
∂Etotal/∂w5 = 1 * 0.751 * 0.186 * ∂in(o1)/∂w5
What is ∂in(o1)/∂w5? in(o1) = . . .
177. Backpropagation algorithm
[Figure: same network]
∂Etotal/∂w5 = 1 * 0.751 * 0.186 * ∂in(o1)/∂w5
in(o1) = out(h1) * w5 + out(h2) * w6 + b2
178. Backpropagation algorithm
[Figure: same network]
in(o1) = out(h1) * w5 + out(h2) * w6 + b2
∂in(o1)/∂w5 = out(h1) = . . .
179. Backpropagation algorithm
[Figure: same network; out(h1) = 0.5933]
∂in(o1)/∂w5 = out(h1) = 0.5933
180. Backpropagation algorithm
[Figure: same network]
∂Etotal/∂w5 = 1 * 0.751 * 0.186 * 0.5933
181. Backpropagation algorithm
[Figure: same network]
∂Etotal/∂w5 = 1 * 0.751 * 0.186 * 0.5933 = 0.083
If the value of the gradient is positive, then if we increase w5, the total error will …?
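The chain-rule product can be double-checked numerically. A Python sketch (added for illustration; it reuses the feed-forward definitions from before, and the finite-difference estimate should land close to the analytic value of about 0.083):

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def e_total(w5):
    """Total error of the slides' network, viewed as a function of w5 alone."""
    i1, i2, b1, b2 = 0.05, 0.1, 0.35, 0.6
    out_h1 = sig(0.15 * i1 + 0.20 * i2 + b1)
    out_h2 = sig(0.25 * i1 + 0.30 * i2 + b1)
    out_o1 = sig(w5 * out_h1 + 0.45 * out_h2 + b2)
    out_o2 = sig(0.50 * out_h1 + 0.55 * out_h2 + b2)
    return 0.5 * (0 - out_o1) ** 2 + 0.5 * (1 - out_o2) ** 2

# Analytic chain rule: dE/dw5 = (out_o1 - truth) * out_o1 * (1 - out_o1) * out_h1
out_h1 = sig(0.15 * 0.05 + 0.20 * 0.1 + 0.35)
out_h2 = sig(0.25 * 0.05 + 0.30 * 0.1 + 0.35)
out_o1 = sig(0.40 * out_h1 + 0.45 * out_h2 + 0.6)
analytic = (out_o1 - 0) * out_o1 * (1 - out_o1) * out_h1   # ~0.083

# Numerical check by central finite differences around w5 = 0.40
eps = 1e-6
numeric = (e_total(0.40 + eps) - e_total(0.40 - eps)) / (2 * eps)
```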
182. Backpropagation algorithm
[Figure: same network]
∂Etotal/∂w5 = 1 * 0.751 * 0.186 * 0.5933 = 0.083
If the value of the gradient is positive, then if we increase w5, the total error will increase!
183. Backpropagation algorithm
[Figure: same network]
If the value of the gradient is positive, then if we increase w5, the total error will increase!
So we update w5: w5new = w5old − ∂Etotal/∂w5
184. Backpropagation algorithm
[Figure: same network]
So we update w5: w5new = w5old − 0.083
185. Backpropagation algorithm
[Figure: plot of Etotal versus w5 showing a step that overshoots the minimum]
So we update w5: w5new = w5old − 0.083
If we update by a lot, we can miss the optimum.
186. Backpropagation algorithm
[Figure: Etotal versus w5 with the step size marked]
Thus, we make a bit smaller step: w5new = w5old − η * 0.083
187. Backpropagation algorithm
[Figure: same network and descent plot]
Thus, we make a bit smaller step: w5new = 0.4 − 0.5 * 0.083
188. Backpropagation algorithm
[Figure: same network and descent plot]
w5new = 0.4 − 0.5 * 0.083 = 0.3585
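The update step itself is one line of arithmetic. A Python sketch (η is the learning rate, the "step size" from the slide):

```python
eta = 0.5        # learning rate ("step size" on the slides)
w5_old = 0.40
grad_w5 = 0.083  # dEtotal/dw5 from the backpropagation computation

w5_new = w5_old - eta * grad_w5
print(w5_new)  # 0.3585
```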
189. Backpropagation algorithm
[Figure: same network, now with w5 = 0.359; out(o1) and Etotal are being recomputed: Etotal = . . .]
So we update w5: w5new = 0.4 − 0.5 * 0.083 = 0.3585
190. Backpropagation algorithm
[Figure: with w5 = 0.359 the new forward pass gives out(o1) = 0.6112, i.e. answer 0.611; Etotal = . . .]
191. Backpropagation algorithm
[Figure: out(o1) = 0.6112 yields Eo1 = 0.187, while Eo2 stays 0.025; Etotal = . . .]
192. Backpropagation algorithm
[Figure: same network with w5 = 0.359 and out(o1) = 0.6112; shown for comparison are the earlier random-search values, where w5 = 0.35 gave answer 0.745, Eo1 = 0.278, Etotal = 0.303]
193. Backpropagation algorithm
[Figure: same network with w5 = 0.359]
After updating all the remaining weights, the total error = 0.131. Repeating this 1000 times decreases it to 0.001.
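The whole procedure, forward pass, backpropagation, and weight updates, can be put together in one loop. A self-contained Python sketch of the slides' 2-2-2 network (the delta-based update formulas are the standard ones for sigmoid units with the half-squared-error loss; biases are kept fixed here, which is an assumption, as the slides only ever update weights):

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs and targets from the slides
i1, i2 = 0.05, 0.1
t1, t2 = 0.0, 1.0

# Initial weights and (fixed) biases
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.6
eta = 0.5

for _ in range(1000):
    # Feed-forward pass
    out_h1 = sig(w1 * i1 + w2 * i2 + b1)
    out_h2 = sig(w3 * i1 + w4 * i2 + b1)
    out_o1 = sig(w5 * out_h1 + w6 * out_h2 + b2)
    out_o2 = sig(w7 * out_h1 + w8 * out_h2 + b2)

    # Backpropagation: delta = dE/d(in) at each neuron
    d_o1 = (out_o1 - t1) * out_o1 * (1 - out_o1)
    d_o2 = (out_o2 - t2) * out_o2 * (1 - out_o2)
    d_h1 = (d_o1 * w5 + d_o2 * w7) * out_h1 * (1 - out_h1)
    d_h2 = (d_o1 * w6 + d_o2 * w8) * out_h2 * (1 - out_h2)

    # Gradient-descent updates: w_new = w_old - eta * dE/dw
    w5, w6 = w5 - eta * d_o1 * out_h1, w6 - eta * d_o1 * out_h2
    w7, w8 = w7 - eta * d_o2 * out_h1, w8 - eta * d_o2 * out_h2
    w1, w2 = w1 - eta * d_h1 * i1, w2 - eta * d_h1 * i2
    w3, w4 = w3 - eta * d_h2 * i1, w4 - eta * d_h2 * i2

# Total error after the loop (using the outputs of the last forward pass);
# the slides report it dropping to around 0.001 after 1000 repetitions
e_total = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2
print(e_total)
```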
194. Why should we understand backpropagation?
https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
195. Why should we understand backpropagation? Vanishing gradients on sigmoids
[Figure: plot of sig(x) over x from -8 to 8]
196. Vanishing gradients on sigmoids
[Figure: sig(x) alongside its derivative sig′(x)]
197. Vanishing gradients on sigmoids
[Figure: sig(x) and sig′(x)]
The derivative is zero at the tails.
198. Vanishing gradients on sigmoids
[Figure: sig(x) and sig′(x); the derivative is zero at the tails]
Recall the chain rule:
∂Etotal/∂w5 = ∂Etotal/∂Eo1 * ∂Eo1/∂out(o1) * ∂out(o1)/∂in(o1) * ∂in(o1)/∂w5
199. Vanishing gradients on sigmoids
[Figure: sig(x) and sig′(x)]
Recall the chain rule:
∂Etotal/∂w5 = ∂Etotal/∂Eo1 * ∂Eo1/∂out(o1) * ∂out(o1)/∂in(o1) * ∂in(o1)/∂w5
So, if x is too high (e.g. > 8) or too low (e.g. < -8), then ∂out(o1)/∂in(o1) ≈ 0, and thus ∂Etotal/∂w5 ≈ 0.
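How small the factor gets is easy to quantify. A Python sketch of the sigmoid's derivative (added for illustration):

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def sig_prime(z):
    # Derivative of the sigmoid: sig'(z) = sig(z) * (1 - sig(z))
    s = sig(z)
    return s * (1 - s)

print(sig_prime(0))   # 0.25, the maximum
print(sig_prime(8))   # ~0.000335: the gradient has all but vanished
print(sig_prime(-8))  # the same, by symmetry
```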
200. Dying ReLUs
[Figure: sig(x) and sig′(x) as before, with the derivative zero at the tails; next to them, the ReLU max(0, x) plotted over x from -8 to 8]
201. Dying ReLUs
[Figure: the ReLU max(0, x) and its derivative max′(0, x), a step function]
202. Dying ReLUs
[Figure: the ReLU max(0, x) and its derivative max′(0, x)]
For x < 0 the derivative is zero, so a neuron stuck in that region passes no gradient and stops learning.
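The same chain-rule argument applies to ReLUs. A short Python sketch (added for illustration):

```python
def relu(x):
    return max(0.0, x)

def relu_prime(x):
    # Subgradient of the ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

# A neuron whose pre-activation is negative passes no gradient at all,
# so the weights feeding it receive no update: the ReLU has "died".
print(relu_prime(-3.0))  # 0.0
print(relu_prime(3.0))   # 1.0
```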
210. Resources:
Brandon Rohrer's youtube video: How Convolutional Neural Networks work (https://youtu.be/FmpDIaiMIeA)
Brandon Rohrer's youtube video: How Deep Neural Networks Work (https://youtu.be/ILsA4nyG7I0)
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition (github.io version): http://cs231n.github.io/
Matt Mazur's blog: A Step by Step Backpropagation Example (https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/)
Andrej Karpathy's blog post: Yes you should understand backprop (https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b)
Raul Vicente's lecture: From brain to Deep Learning and back (https://www.uttv.ee/naita?id=23585)
Mayank Agarwal's blog: Back Propagation in Convolutional Neural Networks — Intuition and Code (https://becominghuman.ai/back-propagation-in-convolutional-neural-networks-intuition-and-code-714ef1c38199)