5. Solutions in general
$X = \{x_1, x_2, x_3, x_4, \ldots, x_i, \ldots\},\ i \in \mathbb{N}$
$Y = \{y_1, y_2, \ldots, y_i, \ldots\},\ i \in \mathbb{N}$
$F: X \to Y$
Classification (one-hot targets; the subscript is the index of the sample in the dataset):
$y_1 = (1, 0, 0)$ — sample of class "0"
$y_2 = (0, 0, 1)$ — sample of class "2"
$y_3 = (0, 1, 0)$ — sample of class "1"
$y_4 = (0, 1, 0)$ — sample of class "1"
Regression
$y_1 = 0.3$
$y_2 = 0.2$
$y_3 = 1.0$
$y_4 = 0.65$
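A minimal numpy sketch (the labels array and names are mine, not from the slides) of how the one-hot classification targets above can be built from integer class labels:

```python
import numpy as np

# Hypothetical integer labels for the four samples above (0-indexed classes).
labels = np.array([0, 2, 1, 1])
num_classes = 3

Y = np.eye(num_classes)[labels]  # each row is a one-hot target vector
print(Y)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]]
```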
6. What are Artificial Neural Networks?
Is it biology?
Simulating biological neural networks (synapses, axons, chains, layers, etc.) is a good abstraction for understanding the topology.
But a biological NN is only an inspiration and an illustration, nothing more!
7. What are Artificial Neural Networks?
Let's imagine a black box!
[Diagram: a box F, with inputs and params entering and outputs leaving]
General form: $\mathit{outputs} = F(\mathit{inputs}, \mathit{params})$
Steps:
1) choose โformโ of F
2) find params
8. What are Artificial Neural Networks?
It's simple math!
Output of the i-th neuron:
$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i$
$y_i = f(s_i)$
Here $w_{ij}$ and $b_i$ are the free parameters, and $f$ is the activation function.
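As a sketch of this formula (the numbers and the choice of sigmoid for $f$ are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

x = np.array([0.5, -1.0, 2.0])   # n = 3 inputs
w = np.array([0.1, 0.4, -0.2])   # weights w_ij of the i-th neuron
b = 0.3                          # bias b_i

s = w @ x + b                    # s_i = sum_j w_ij * x_j + b_i
y = sigmoid(s)                   # y_i = f(s_i)
```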
9. What are Artificial Neural Networks?
It's simple math!
activation: $y = f(wx + b) = \mathrm{sigmoid}(wx + b)$
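[Slides 10–13 repeat this formula alongside plots of the activation.] Assuming the usual candidates for $f$, a sketch of a few common activation functions:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))  # output in (0, 1)

def tanh(s):
    return np.tanh(s)                # output in (-1, 1)

def relu(s):
    return np.maximum(0.0, s)        # 0 for s < 0, linear for s >= 0
```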
14. What are Artificial Neural Networks?
It's simple math!
$n$ inputs, $m$ neurons in the hidden layer.
Output of the i-th neuron:
$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i, \quad y_i = f(s_i)$
Output of the k-th layer:
1) $S^{(k)} = W^{(k)} X^{(k)} + B^{(k)} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_m \end{pmatrix}$
2) $Y^{(k)} = f(S^{(k)})$, with $f$ applied element-wise
Form of F: a Kolmogorov & Arnold function superposition.
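A minimal numpy sketch of one layer in this matrix form (the sizes and the tanh choice are illustrative):

```python
import numpy as np

def layer_forward(W, X, B, f=np.tanh):
    S = W @ X + B   # S^(k) = W^(k) X^(k) + B^(k)
    return f(S)     # Y^(k) = f(S^(k)), element-wise

n, m = 4, 3                      # n inputs, m neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))      # weight matrix, shape (m, n)
B = np.zeros(m)                  # bias vector, shape (m,)
X = rng.normal(size=n)           # input vector, shape (n,)
Y = layer_forward(W, X, B)       # layer output, shape (m,)
```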
15. Neural networks. Overview
โข Common principles
โ Structure
โ Learning
โข Shallow and Deep NN
โข Additional methods
โ Conventional
โ Voodoo
16. How to find the parameters W and B?
Supervised learning:
Training set (pairs of variables and responses): $(X; Y)_i, \ i = 1..N$
Find: $W^*, B^* = \operatorname{argmin}_{W,B} L(F(X), Y)$
Cost function (loss, error):
logloss: $L(F(X), Y) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log p_{i,j}$
where $y_{i,j}$ is 1 if the i-th sample is of class j, else 0, and the predictions $p_{i,j}$ are previously scaled: $p_{i,j} = p_{i,j} / \sum_{j} p_{i,j}$
rmse: $L(F(X), Y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (F(x_i) - y_i)^2}$
These are just examples: the cost function depends on the problem (classification, regression) and on domain knowledge.
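Sketches of the two example costs in numpy (function names are mine):

```python
import numpy as np

def logloss(P, Y, eps=1e-12):
    """Y: one-hot targets, shape (N, M); P: predicted probabilities,
    shape (N, M), assumed already scaled so each row sums to 1."""
    P = np.clip(P, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def rmse(pred, target):
    """Root-mean-square error for regression."""
    return np.sqrt(np.mean((pred - target) ** 2))
```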
17. Training or optimization algorithm
So, we have model cost ๐ฟ (or error of prediction)
And we want to update weights in order to minimize ๐ณ:
๐คโ = ๐ค + ๐ผฮ๐ค
In accordance to gradient descent: ฮ๐ค = โ๐ป๐ฟ
Itโs clear for network with only one layer (we have
predicted outputs and targets, so can evaluate ๐ฟ).
But how to find ๐๐ for hidden layers?
18. Meet "Error Back Propagation"
Find $\Delta w$ for each layer, from the last to the first, as the influence of the weights on the cost:
$\Delta w_{i,j} = \frac{\partial L}{\partial w_{i,j}}$
and, by the chain rule:
$\frac{\partial L}{\partial w_{i,j}} = \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial s_i} \cdot \frac{\partial s_i}{\partial w_{i,j}}$
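To make the chain rule concrete, a sketch for a single sigmoid neuron with squared error $L = (y - t)^2 / 2$ (my choice of $f$ and $L$, not from the slides):

```python
import numpy as np

def grad_one_neuron(x, w, b, t):
    s = w @ x + b                  # forward: weighted sum
    y = 1.0 / (1.0 + np.exp(-s))   # forward: sigmoid activation
    dL_dy = y - t                  # dL/dy
    dy_ds = y * (1.0 - y)          # dy/ds (sigmoid derivative)
    ds_dw = x                      # ds/dw_j = x_j
    return dL_dy * dy_ds * ds_dw   # product of the three chain-rule factors
```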
20. Gradient Descent in real life
Recall gradient descent:
$w^* = w + \alpha \Delta w$
$\alpha$ is a "step" coefficient; in ML terms, the learning rate. Typical: $\alpha = 0.01..0.1$
Recall the cost function:
$L = \frac{1}{N} \sum_{i=1}^{N} \ldots$
The sum runs over all samples. And what if $N = 10^6$ or more?
GD modification: update $w$ for each sample.
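A sketch of the update loop on a toy quadratic loss $L(w) = \|w - w_{opt}\|^2$ (the loss is mine, chosen only so the gradient is trivial):

```python
import numpy as np

w_opt = np.array([1.0, -2.0])
w = np.zeros(2)
alpha = 0.05                    # learning rate, in the typical 0.01..0.1 range

for _ in range(200):
    grad = 2.0 * (w - w_opt)    # nabla L
    w = w - alpha * grad        # w* = w + alpha * dw, with dw = -nabla L
# w is now close to w_opt
```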
21. Gradient Descent: Stochastic & Minibatch
• "Batch" GD ($L$ over the full set): needs a lot of memory
• Stochastic GD ($L$ for each sample): fast, but fluctuates
• Minibatch GD ($L$ over subsets): less memory & fewer fluctuations (see the iterator sketch below)
The size of the minibatch depends on the HW. Typical: minibatch = 32…256
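A minibatch iterator sketch (names are mine):

```python
import numpy as np

def minibatches(X, Y, batch_size=64):
    idx = np.random.permutation(len(X))          # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], Y[sel]

# batch_size = len(X) gives "batch" GD; batch_size = 1 gives stochastic GD.
```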
22. Termination criteria
• By epoch count: a maximum number of iterations over the whole data set. Typical: epochs = 50…200
• By the value of the gradient: a gradient equal to 0 means a minimum, but a small gradient => very slow learning
• When the cost hasn't changed for several epochs: if the error does not change, the training procedure is not converging
• Early stopping: stop when the "validation" score starts to increase, even while the "train" score continues to decrease (sketch below)
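An early-stopping loop sketch; `train_one_epoch` and `validation_error` are hypothetical helpers standing in for a real training step and a held-out evaluation:

```python
best, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):                # typical: epochs = 50..200
    train_one_epoch(model)              # hypothetical: one pass over the train set
    val = validation_error(model)       # hypothetical: score on the validation set
    if val < best:
        best, bad_epochs = val, 0       # validation still improving
    else:
        bad_epochs += 1
        if bad_epochs >= patience:      # validation started to get worse
            break                       # early stopping
```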
23. Neural networks. Overview
โข Common principles
โ Structure
โ Learning
โข Shallow and Deep NN
โข Additional methods
โ Conventional
โ Voodoo
24. What about the "form" of F? Network topology
"Shallow" networks: 1–2 hidden layers => not enough parameters => poor separation abilities
"Deep" networks are NNs with 2..10 layers
"Very deep" networks are NNs with >10 layers
25. Deep learning. Problems
• Big networks => too much separating ability => overfitting
• Vanishing gradient problem during training
• Complex error surface => local minima
• Curse of dimensionality => memory & computation: the weight matrix between layers of widths $n^{(k-1)}$ and $n^{(k)}$ has $\dim W^{(k)} = n^{(k-1)} \cdot n^{(k)}$ (see the count below)
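To see how fast $\dim W^{(k)}$ grows, a quick count for illustrative layer widths:

```python
layer_sizes = [1000, 512, 512, 10]  # illustrative widths n^(0)..n^(3)

# dim W^(k) = n^(k-1) * n^(k), summed over layers (biases not counted):
weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
print(weights)  # 1000*512 + 512*512 + 512*10 = 779264
```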
26. Neural networks. Overview
โข Common principles
โ Structure
โ Learning
โข Shallow and Deep NN
โข Additional methods
โ Conventional
โ Voodoo
27. Additional methods: Conventional
• Momentum (damps the oscillations on the error surface):
$\Delta w^{(t)} = -\alpha \nabla L(w^{(t)}) + \beta \Delta w^{(t-1)}$, where the second term is the momentum. Typical: $\beta = 0.9$
• LR decay (make smaller steps near the optimum):
$\alpha^{(t)} = k \alpha^{(t-1)}, \ 0 < k < 1$. Typical: apply LR decay ($k = 0.1$) every 10..100 epochs
• Weight decay (prevents weight growth and smooths F):
$L^* = L + \lambda \|w^{(t)}\|$. L1 or L2 regularization is often used. Typical: L2 with $\lambda = 0.0005$ (a combined update sketch follows below)
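A sketch combining momentum and L2 weight decay in one update step (names are mine; the constant factor from differentiating $\lambda \|w\|^2$ is folded into $\lambda$):

```python
import numpy as np

def sgd_step(w, grad, velocity, alpha, beta=0.9, lam=0.0005):
    grad = grad + lam * w                       # L2 weight-decay term added to the gradient
    velocity = -alpha * grad + beta * velocity  # dw(t) = -alpha*grad + beta*dw(t-1)
    return w + velocity, velocity

# LR decay is applied outside the step, e.g. every 10..100 epochs:
# alpha = 0.1 * alpha
```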
28. Neural networks. Overview
โข Common principles
โ Structure
โ Learning
โข Shallow and Deep NN
โข Additional methods
โ Conventional
โ Voodoo