8. Background
Purpose: predict Y from X.
What we have: n observations of (X, Y): (x1, y1), . . . , (xn, yn).
What we want: estimate the unknown Y for new values of X: xn+1, . . . , xm.
X can be:
numeric variables;
or factors;
or a combination of numeric variables and factors.
Y can be:
a numeric variable (Y ∈ R) ⇒ (supervised) regression;
a factor ⇒ (supervised) classification.
Nathalie Villa-Vialaneix | Introduction to statistical learning 4/58
14. Basics
From (xi, yi)i, definition of a machine Φn such that:
ŷnew = Φn(xnew).
if Y is numeric, Φn is called a regression function;
if Y is a factor, Φn is called a classifier;
Φn is said to be trained or learned from the observations (xi, yi)i.
Desirable properties
accuracy to the observations: predictions made on known data are close to the observed values;
generalization ability: predictions made on new data are also accurate.
Conflicting objectives!
15. Underfitting/Overfitting
[Figure series: the function x → y to be estimated; the observations we might have; the observations we do have; then three estimations built from the observations — the first underfitting, the second an accurate estimation, the third overfitting.]
27. Errors
training error (measures the accuracy to the observations)
if y is a factor: misclassification rate
♯{i = 1, . . . , n : ŷi ≠ yi} / n
if y is numeric: mean square error (MSE)
MSE = (1/n) ∑_{i=1}^{n} (ŷi − yi)²
or root mean square error (RMSE = √MSE), or pseudo-R²: 1 − MSE / Var((yi)i)
test error: a way to prevent overfitting (it estimates the generalization error) is simple validation:
1 split the data into training/test sets (usually 80%/20%)
2 train Φn from the training dataset
3 calculate the test error on the remaining data
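These error measures and the simple-validation split are easy to compute directly; here is a minimal sketch in plain Python (illustrative only, not the slides' code — all function names are ours):

```python
import random

def misclassification_rate(y_hat, y):
    # proportion of predictions that differ from the observed labels
    return sum(a != b for a, b in zip(y_hat, y)) / len(y)

def mse(y_hat, y):
    # mean square error
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)

def pseudo_r2(y_hat, y):
    # 1 - MSE / Var((yi)i), with the (biased) empirical variance
    mean_y = sum(y) / len(y)
    var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
    return 1 - mse(y_hat, y) / var_y

def split_train_test(data, test_ratio=0.2, seed=0):
    # simple validation: random 80%/20% split of the observations
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

print(misclassification_rate(["a", "b", "a"], ["a", "a", "a"]))  # 1 error out of 3
print(mse([1.0, 2.0], [1.0, 4.0]))  # (0 + 4)/2 = 2.0
```

The test error is then simply one of these measures evaluated on the held-out 20%.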
33. Bias / Variance trade-off
This problem is also related to the well-known bias / variance trade-off
bias: error that comes from erroneous assumptions in the learning algorithm (average error of the predictor)
variance: error that comes from sensitivity to small fluctuations in the training set (variance of the predictor)
Overall error is: E(MSE) = Bias² + Variance
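The decomposition can be checked numerically: for predictions of a fixed (noiseless) target obtained from repeated training sets, the average squared error equals bias² + variance exactly. A sketch with hypothetical numbers:

```python
# hypothetical predictions of the same target from 4 different training sets
preds = [2.0, 2.5, 3.5, 4.0]
y = 2.0  # true value to predict

mean_pred = sum(preds) / len(preds)
avg_sq_error = sum((p - y) ** 2 for p in preds) / len(preds)
bias = mean_pred - y
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)

print(avg_sq_error)          # 1.625
print(bias ** 2 + variance)  # 1.625: the decomposition is exact
```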
34. Consistency in the parametric/non parametric case
Example in the parametric framework (linear methods)
an assumption is made on the form of the relation between X and Y:
Y = βᵀX + ε
β is estimated from the observations (x1, y1), . . . , (xn, yn) by a given method which calculates a β̂n.
The estimation is said to be consistent if β̂n → β as n → +∞, under (possibly) technical assumptions on X, ε, Y.
35. Consistency in the parametric/non parametric case
Example in the nonparametric framework
the form of the relation between X and Y is unknown:
Y = Φ(X) + ε
Φ is estimated from the observations (x1, y1), . . . , (xn, yn) by a given method which calculates a Φ̂n.
The estimation is said to be consistent if Φ̂n → Φ as n → +∞, under (possibly) technical assumptions on X, ε, Y.
37. Consistency from the statistical learning perspective
[Vapnik, 1995]
Question: Are we really interested in estimating Φ or...
... rather in having the smallest prediction error?
Statistical learning perspective: a method that builds a machine Φn from the observations is said to be (universally) consistent if, given a risk function R : R × R → R⁺ (which calculates an error),
E(R(Φn(X), Y)) → inf_{Φ: X→R} E(R(Φ(X), Y)) as n → +∞,
for any distribution of (X, Y) ∈ X × R.
Definitions: L* = inf_{Φ: X→R} E(R(Φ(X), Y)) and LΦ = E(R(Φ(X), Y)).
39. Desirable properties from a mathematical perspective
Simplified framework: X ∈ X and Y ∈ {−1, 1} (binary classification)
Learning process: choose a machine Φn in a class of functions C ⊂ {Φ : X → R} (e.g., C is the set of all functions that can be built using an SVM).
Error decomposition
LΦn − L* ≤ (LΦn − inf_{Φ∈C} LΦ) + (inf_{Φ∈C} LΦ − L*)
with
inf_{Φ∈C} LΦ − L*: governed by the richness of C (i.e., C must be rich to ensure that this term is small);
LΦn − inf_{Φ∈C} LΦ ≤ 2 sup_{Φ∈C} |LⁿΦ − LΦ|, where LⁿΦ = (1/n) ∑_{i=1}^{n} R(Φ(xi), yi): governed by the generalization capability of C (i.e., in the worst case, the empirical error must be close to the true error: C must not be too rich to ensure that this term is small).
40. Outline
1 Introduction
Background and notations
Underfitting / Overfitting
Consistency
2 CART and random forests
Introduction to CART
Learning
Prediction
Overview of random forests
Bootstrap/Bagging
Random forest
3 SVM
4 (not deep) Neural networks
Seminal references
Multi-layer perceptrons
Theoretical properties of perceptrons
Learning perceptrons
Learning in practice
43. Overview
CART: Classification And Regression Trees, introduced by [Breiman et al., 1984].
Advantages
classification OR regression (i.e., Y can be a numeric variable or a factor);
nonparametric method: no prior assumption needed;
can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method);
provides an intuitive interpretation.
Drawbacks
requires a large training dataset to be efficient;
as a consequence, trees are often too simple to provide accurate predictions.
44. Example
X = (Gender, Age, Height) and Y = Weight
[Figure: a tree whose root splits on Height < 1.60m vs Height > 1.60m; lower nodes split on Gender and Age, ending in five leaves with predictions Y1, . . . , Y5. Annotations point out the root, a split, a node, and a leaf (terminal node).]
45. CART learning process
Algorithm
Start from root
repeat
move to a “new” node
if the node is homogeneous or small enough then
STOP
else
split the node into two child nodes with maximal “homogeneity”
end if
until all nodes are processed
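The loop above can be sketched for a regression tree on a single numeric input, with variance as the homogeneity measure and a minimum node size as the stopping criterion (an illustrative Python sketch, not the original CART implementation):

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def build_tree(points, min_size=2):
    # points: list of (x, y) pairs assigned to the current node
    ys = [y for _, y in points]
    # stop if the node is small enough or perfectly homogeneous
    if len(points) < min_size or variance(ys) == 0.0:
        return {"leaf": True, "value": sum(ys) / len(ys)}
    # try every split "x < threshold" and keep the one minimizing
    # the size-weighted variance of the two child nodes
    best = None
    xs = sorted({x for x, _ in points})
    for t in xs[1:]:
        left = [p for p in points if p[0] < t]
        right = [p for p in points if p[0] >= t]
        score = (len(left) * variance([y for _, y in left])
                 + len(right) * variance([y for _, y in right]))
        if best is None or score < best[0]:
            best = (score, t, left, right)
    if best is None:  # no possible split (all x equal): make a leaf
        return {"leaf": True, "value": sum(ys) / len(ys)}
    _, t, left, right = best
    return {"leaf": False, "threshold": t,
            "left": build_tree(left, min_size),
            "right": build_tree(right, min_size)}

def predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["value"]

data = [(1, 1.0), (2, 1.2), (3, 5.0), (4, 5.2)]
tree = build_tree(data)
print(predict(tree, 1.5))  # 1.0 (mean of the matching leaf)
```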
50. Further details
Homogeneity?
if Y is a numeric variable: variance of the (yi)i for the observations assigned to the node;
if Y is a factor: node purity, the % of observations assigned to the node whose Y values are not in the node majority class (the Gini index is also sometimes used).
Stopping criteria?
minimum node size (generally 1 or 5)
minimum node purity or variance
maximum tree depth
Hyperparameters can be tuned by cross-validation using a grid search.
An alternative approach is pruning...
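The homogeneity measures named above, plus the Gini index, computed for the observations assigned to one node (illustrative Python sketch):

```python
from collections import Counter

def node_variance(ys):
    # numeric Y: variance of the y values in the node
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def node_impurity(labels):
    # factor Y: share of observations outside the node majority class
    counts = Counter(labels)
    return 1 - max(counts.values()) / len(labels)

def gini_index(labels):
    # factor Y: 1 minus the sum of squared class proportions
    counts = Counter(labels)
    return 1 - sum((c / len(labels)) ** 2 for c in counts.values())

print(node_impurity(["a", "a", "a", "b"]))  # 0.25
print(gini_index(["a", "a", "a", "b"]))     # 0.375
```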
53. Making new predictions
A new observation xnew is assigned to a leaf (straightforward);
the corresponding prediction ŷnew is
if Y is numeric: the mean value of the observations (training set) assigned to the same leaf;
if Y is a factor: the majority class of the observations (training set) assigned to the same leaf.
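Both leaf predictions are one-liners once the training observations of the leaf are known (illustrative Python sketch):

```python
from collections import Counter

def leaf_prediction_regression(leaf_ys):
    # numeric Y: mean of the leaf's training values
    return sum(leaf_ys) / len(leaf_ys)

def leaf_prediction_classification(leaf_labels):
    # factor Y: majority class of the leaf's training labels
    return Counter(leaf_labels).most_common(1)[0][0]

print(leaf_prediction_regression([1.0, 2.0, 3.0]))      # 2.0
print(leaf_prediction_classification(["F", "M", "F"]))  # F
```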
56. Advantages/Drawbacks
Random Forest: introduced by [Breiman, 2001]
Advantages
classification OR regression (i.e., Y can be a numeric variable or a factor)
nonparametric method (no prior assumption needed) and accurate
can deal with a large number of input variables, either numeric variables or factors
can deal with small samples
Drawbacks
black box model
so far supported by only a few mathematical results (consistency, ...)
60. Basic description
A fact: when the sample size is small, you might be unable to estimate the model properly.
This issue is commonly tackled by bootstrapping and, more specifically, bagging (Bootstrap Aggregating) ⇒ it reduces the variance of the estimator.
Bagging: combination of simple (and individually weak) regression (or classification) functions.
Random forest = bagging of CART trees.
63. Bootstrap
Bootstrap sample: random sampling (with replacement) of the training dataset; samples have the same size as the original dataset.
General (and robust) approach to solve several problems, e.g., estimating confidence intervals (of the mean of X with no prior assumption on the distribution of X):
1 build P bootstrap samples from (xi)i
2 use them to compute P estimates of the mean
3 the confidence interval is based on the percentiles of the empirical distribution of these estimates
Also useful to estimate p-values, residuals, ...
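The three steps above can be sketched as a percentile bootstrap confidence interval for the mean (illustrative Python sketch with toy data):

```python
import random

def bootstrap_ci_mean(xs, P=1000, level=0.95, seed=0):
    rng = random.Random(seed)
    estimates = []
    for _ in range(P):
        # bootstrap sample: draw with replacement, same size as the data
        sample = [rng.choice(xs) for _ in xs]
        estimates.append(sum(sample) / len(sample))
    estimates.sort()
    # percentiles of the empirical distribution of the P estimates
    lo = estimates[int((1 - level) / 2 * P)]
    hi = estimates[int((1 + level) / 2 * P) - 1]
    return lo, hi

data = [2.1, 2.5, 1.9, 3.0, 2.4, 2.8, 2.2, 2.6]
lo, hi = bootstrap_ci_mean(data)
print(lo <= sum(data) / len(data) <= hi)  # True: the sample mean lies inside
```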
69. Bagging
Average the estimates of the regression (or the classification) function obtained from B bootstrap samples.
Bagging with regression trees
1: for b = 1, . . . , B do
2:   build a bootstrap sample ξb
3:   train a regression tree on ξb, giving φ̂b
4: end for
5: estimate the regression function by Φ̂n(x) = (1/B) ∑_{b=1}^{B} φ̂b(x)
For classification, the predicted class is the majority vote class.
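Step 5 and its classification counterpart reduce to averaging or voting over the B fitted predictors (illustrative Python sketch, with the trees stood in for by plain functions):

```python
from collections import Counter

def bagging_predict_regression(predictors, x):
    # average of the B individual predictions
    preds = [phi(x) for phi in predictors]
    return sum(preds) / len(preds)

def bagging_predict_classification(predictors, x):
    # majority vote over the B individual predictions
    votes = [phi(x) for phi in predictors]
    return Counter(votes).most_common(1)[0][0]

# three hypothetical fitted trees, encoded here as constant functions
trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 3.0]
print(bagging_predict_regression(trees, 0))  # 2.0

classifiers = [lambda x: "M", lambda x: "F", lambda x: "M"]
print(bagging_predict_classification(classifiers, 0))  # M
```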
72. Random forests
CART bagging with additional variations:
1 each node is based on a random (and different) subset of q variables (an advisable choice for q is √p for classification and p/3 for regression);
2 each tree is fully developed (overfitted).
Hyperparameters
those of the CART algorithm;
those that are specific to the random forest: q, number of trees.
Random forests are not very sensitive to hyperparameter settings: the default value for q and 500/1000 trees should work in most cases.
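The per-node randomization can be sketched as follows: at every node, the split is searched over a fresh random subset of q of the p variables (illustrative Python sketch; `default_q` encodes the advisable choices above):

```python
import math
import random

def default_q(p, task):
    # advisable defaults: sqrt(p) for classification, p/3 for regression
    return max(1, round(math.sqrt(p)) if task == "classification" else p // 3)

def candidate_variables(p, q, rng):
    # drawn anew at each node of each tree
    return rng.sample(range(p), q)

rng = random.Random(0)
q = default_q(9, "classification")
print(q)                                    # 3
print(len(candidate_variables(9, q, rng)))  # 3
```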
75. Additional tools
OOB (Out-Of-Bag) error: error computed on the observations not included in the “bag”.
Stabilization of the OOB error is a good indication that there are enough trees in the forest.
Importance of a variable (to help interpretation): for a given variable Xj,
1: randomize the values of the variable
2: make predictions from this new dataset
3: the importance is the mean decrease in accuracy (MSE or misclassification rate)
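The three importance steps can be sketched for one variable, with a hypothetical fitted model passed in as a plain function of the rows (illustrative Python sketch):

```python
import random

def permutation_importance(model, rows, ys, j, seed=0):
    # error before and after randomizing column j; the increase is the importance
    def error(rs):
        return sum(model(r) != y for r, y in zip(rs, ys)) / len(rs)

    base = error(rows)
    rng = random.Random(seed)
    col = [r[j] for r in rows]
    rng.shuffle(col)  # step 1: randomize the values of variable j
    permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
    return error(permuted) - base  # steps 2-3: predict, measure the decrease

# toy model that only looks at variable 0
model = lambda r: "pos" if r[0] > 0 else "neg"
rows = [[1, 9], [2, 8], [-1, 7], [-2, 6]]
ys = ["pos", "pos", "neg", "neg"]
print(permutation_importance(model, rows, ys, 1))  # 0.0: variable 1 is unused
```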
78. Basic introduction
Binary classification problem: X ∈ X and Y ∈ {−1; 1}
A training set is given: (x1, y1), . . . , (xn, yn)
SVM is a method based on kernels. It is a universally consistent method, provided that the kernel is universal [Steinwart, 2002].
Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.
82. Optimal margin classification
[Figure: linearly separable data, the separating hyperplane defined by w, the margin 1/‖w‖₂, and the support vectors.]
w is chosen such that:
min_w ‖w‖² (the margin is the largest),
under the constraints: yi(⟨w, xi⟩ + b) ≥ 1, 1 ≤ i ≤ n (the separation between the two classes is perfect).
⇒ ensures a good generalization capability.
86. Soft margin classification
[Figure: almost linearly separable data, the hyperplane defined by w, the margin 1/‖w‖₂, and the support vectors.]
w is chosen such that:
min_{w,ξ} ‖w‖² + C ∑_{i=1}^{n} ξi (the margin is the largest),
under the constraints: yi(⟨w, xi⟩ + b) ≥ 1 − ξi and ξi ≥ 0, 1 ≤ i ≤ n
(the separation between the two classes is almost perfect).
⇒ allowing a few errors improves the richness of the class.
90. Non linear SVM
[Figure: a non linear map Ψ sends the original space X into a feature space H where the two classes become linearly separable.]
w ∈ H is chosen such that (P_{C,H}):
min_{w,ξ} ‖w‖²_H + C ∑_{i=1}^{n} ξi (the margin in the feature space is the largest),
under the constraints: yi(⟨w, Ψ(xi)⟩_H + b) ≥ 1 − ξi and ξi ≥ 0, 1 ≤ i ≤ n
(the separation between the two classes in the feature space is almost perfect).
93. SVM from different points of view
A regularization problem: (P_{C,H}) ⇔
(P²_{λ,H}): min_{w∈H} (1/n) ∑_{i=1}^{n} R(f_w(xi), yi) [error term] + λ‖w‖²_H [penalization term],
where f_w(x) = ⟨Ψ(x), w⟩_H and R(ŷ, y) = max(0, 1 − ŷy) (hinge loss function).
[Figure: errors versus ŷ for y = 1 — blue: hinge loss; green: misclassification error.]
A dual problem: (P_{C,H}) ⇔
(D_{C,X}): max_{α∈Rⁿ} ∑_{i=1}^{n} αi − ∑_{i=1}^{n} ∑_{j=1}^{n} αiαjyiyjK(xi, xj),
with ∑_{i=1}^{n} αiyi = 0 and 0 ≤ αi ≤ C, 1 ≤ i ≤ n.
There is no need to know Ψ and H:
choose a function K with a few good properties;
use it as the dot product in H: ∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H.
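The hinge loss R(ŷ, y) = max(0, 1 − ŷy) and the misclassification error it upper-bounds (the two curves of the figure) are direct to implement (illustrative Python sketch):

```python
def hinge_loss(y_hat, y):
    # max(0, 1 - y_hat * y): penalizes margins smaller than 1
    return max(0.0, 1.0 - y_hat * y)

def misclassification_error(y_hat, y):
    # 0/1 error: counts only the sign disagreements
    return 1.0 if y_hat * y <= 0 else 0.0

for y_hat in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(y_hat, hinge_loss(y_hat, 1), misclassification_error(y_hat, 1))
```

Note that the hinge loss is convex and dominates the 0/1 error, which is what makes the optimization problem tractable.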
95. Which kernels?
Minimum properties that a kernel should fulfil
symmetry: K(u, u′) = K(u′, u)
positivity: ∀ N ∈ N, ∀ (αi)i ∈ Rᴺ, ∀ (xi)i ∈ Xᴺ, ∑_{i,j} αiαjK(xi, xj) ≥ 0.
[Aronszajn, 1950]: ∃ a Hilbert space (H, ⟨., .⟩_H) and a function Ψ : X → H such that:
∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H
Examples
the Gaussian kernel: ∀ x, x′ ∈ Rᵈ, K(x, x′) = e^{−γ‖x−x′‖²} (it is universal on every bounded subset of Rᵈ);
the linear kernel: ∀ x, x′ ∈ Rᵈ, K(x, x′) = xᵀx′ (it is not universal).
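The two example kernels, with a numerical spot-check of symmetry and of the positivity condition ∑_{i,j} αiαjK(xi, xj) ≥ 0 on one random draw (an illustrative Python sketch — a finite check, not a proof):

```python
import math
import random

def gaussian_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(5)]
alphas = [rng.uniform(-1, 1) for _ in range(5)]

for K in (gaussian_kernel, linear_kernel):
    assert abs(K(xs[0], xs[1]) - K(xs[1], xs[0])) < 1e-12  # symmetry
    quad = sum(alphas[i] * alphas[j] * K(xs[i], xs[j])
               for i in range(5) for j in range(5))
    print(quad >= -1e-12)  # True: positivity holds for this draw
```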
96. In summary, how is the solution written?
Φn(x) = ∑ᵢ αiyiK(xi, x)
where only a few αi ≠ 0. The xi such that αi ≠ 0 are the support vectors!
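Evaluating this solution only requires the support vectors, i.e., the training points with αi ≠ 0 (illustrative Python sketch with hypothetical support vectors, not a real fit):

```python
import math

def svm_decision(x, support, kernel):
    # support: list of (alpha_i, y_i, x_i) with alpha_i != 0
    return sum(a * y * kernel(xi, x) for a, y, xi in support)

def gaussian(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# hypothetical support vectors, not from a real fit
support = [(0.5, 1, [0.0, 0.0]), (0.5, -1, [2.0, 2.0])]
score = svm_decision([0.1, 0.1], support, gaussian)
print(1 if score > 0 else -1)  # 1: the point is close to the positive SV
```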
97. I’m almost dead with all this stuff on my mind!!!
What in practice?
data(iris)
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species,
     pch=19)
legend("topleft", pch=19, col=c(2, 3),
       legend=c("versicolor", "virginica"))
98. I’m almost dead with all this stuff on my mind!!!
What in practice?
library(e1071)
res.tune <- tune.svm(Species ~ ., data=iris, kernel="linear",
cost = 2^(-1:4))
# Parameter tuning of ’svm’:
# - sampling method: 10fold cross validation
# - best parameters:
# cost
# 0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4),
# kernel = "linear")
# Parameters:
# SVM-Type: C-classification
# SVM-Kernel: linear
# cost: 0.5
# gamma: 0.25
# Number of Support Vectors: 21
99. I’m almost dead with all this stuff on my mind!!!
What in practice?
table(res.tune$best.model$fitted, iris$Species)
% setosa versicolor virginica
% setosa 0 0 0
% versicolor 0 45 0
% virginica 0 5 50
plot(res.tune$best.model, data=iris, Petal.Width~Petal.Length,
slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
100. I’m almost dead with all this stuff on my mind!!!
What in practice?
res.tune <- tune.svm(Species ~ ., data=iris, gamma = 2^(-1:1),
cost = 2^(2:4))
# Parameter tuning of ’svm’:
# - sampling method: 10fold cross validation
# - best parameters:
# gamma cost
# 0.5 4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1),
# cost = 2^(2:4))
# Parameters:
# SVM-Type: C-classification
# SVM-Kernel: radial
# cost: 4
# gamma: 0.5
# Number of Support Vectors: 32
101. I’m almost dead with all this stuff on my mind!!!
What in practice?
table(res.tune$best.model$fitted, iris$Species)
% setosa versicolor virginica
% setosa 0 0 0
% versicolor 0 49 0
% virginica 0 1 50
plot(res.tune$best.model, data=iris, Petal.Width~Petal.Length,
slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
105. What are (artificial) neural networks?
Common properties
(artificial) “Neural networks”: general name for supervised and unsupervised methods developed in (vague) analogy to the brain
combination (network) of simple elements (neurons)
[Figure: example of graphical representation — a network of neurons connecting INPUTS to OUTPUTS.]
106. Different types of neural networks
A neural network is defined by:
1 the network structure;
2 the neuron type.
Standard examples
Multilayer perceptrons (MLP) (French: perceptron multi-couches):
dedicated to supervised problems (classification and regression);
Radial basis function networks (RBF): same purpose but based on
local smoothing;
Self-organizing maps (SOM, also sometimes called Kohonen’s maps)
or topographic maps: dedicated to unsupervised problems
(clustering), self-organized;
. . .
In this talk, focus on MLP.
111. MLP: Advantages/Drawbacks
Advantages
classification OR regression (i.e., Y can be a numeric variable or a
factor);
nonparametric method: flexible;
good theoretical properties.
Drawbacks
hard to train (high computational cost, especially when d is large);
overfits easily;
“black box” models (hard to interpret).
113. References
Advised references:
[Bishop, 1995, Ripley, 1996]: an overview of the topic from a learning
(more than statistical) perspective;
[Devroye et al., 1996, Györfi et al., 2002]: dedicated chapters
presenting statistical properties of perceptrons.
114. Analogy to the brain
1 a neuron collects signals from neighboring neurons through its
dendrites;
2 when the total signal is above a given threshold, the neuron is
activated...
3 ... and a signal is sent to other neurons through the axon.
Connections which frequently lead to activating a neuron are
reinforced (they tend to have an increasing impact on the destination
neuron).
117. First model of artificial neuron
[Mc Culloch and Pitts, 1943, Rosenblatt, 1958, Rosenblatt, 1962]
[Figure: a single neuron combining inputs x^(1), ..., x^(p) with
weights w_1, ..., w_p and a bias w_0 into the output f(x)]
f : x ∈ R^p → 1_{ Σ_{j=1}^p w_j x^(j) + w_0 ≥ 0 }
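The threshold unit above is a one-line computation; a minimal sketch (Python is used here for illustration, not the talk's R code):

```python
import numpy as np

def neuron(x, w, w0):
    """Threshold unit: outputs 1 if w.x + w0 >= 0, else 0."""
    return 1 if np.dot(w, x) + w0 >= 0 else 0

# With w = (1, 1) and w0 = -1.5, the unit computes the logical AND
w, w0 = np.array([1.0, 1.0]), -1.5
outputs = [neuron(np.array(x), w, w0) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # prints [0, 0, 0, 1]
```

The weights and bias here are arbitrary choices that happen to realize AND; the historical perceptron learns them from data.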
118. (artificial) Perceptron
Layers
MLPs have one input layer (x ∈ R^p), one output layer (y ∈ R, or
taking values in {1, ..., K − 1}) and several hidden layers;
no connections within a layer;
connections between two consecutive layers (feedforward).
Example (regression, y ∈ R):
[Figure: a 2-hidden-layer MLP, from INPUTS x = (x^(1), ..., x^(p))
through Layer 1 (weights w^(1)_jk) and Layer 2 to the OUTPUT y]
123. A neuron in MLP
[Figure: a hidden neuron combining inputs v_1, v_2, v_3 with weights
w_1, w_2, w_3 and a bias (French: biais) w_0]
Standard activation functions (French: fonctions de transfert /
d’activation)
Biologically inspired, the Heaviside function:
h(z) = 0 if z < 0, 1 otherwise.
Main issue with the Heaviside function: not continuous!
The identity:
h(z) = z.
But the identity activation function gives a linear model if used with
one hidden layer: not flexible enough.
The logistic function:
h(z) = 1 / (1 + exp(−z)).
Another popular activation function (useful to model positive real
numbers), the rectified linear unit (ReLU):
h(z) = max(0, z).
General sigmoid: a nondecreasing function h : R → R such that
lim_{z→+∞} h(z) = 1 and lim_{z→−∞} h(z) = 0.
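The four concrete activation functions listed above, as a quick sketch (Python for illustration):

```python
import numpy as np

def heaviside(z):
    """Biologically inspired, but not continuous at 0."""
    return np.where(z < 0, 0.0, 1.0)

def identity(z):
    """Gives a linear model when used with one hidden layer."""
    return z

def logistic(z):
    """Smooth sigmoid with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear: useful to model positive real numbers."""
    return np.maximum(0.0, z)
```

Note that the logistic function is a sigmoid in the general sense above: it is nondecreasing with limits 0 and 1 at −∞ and +∞.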
130. Focus on one-hidden-layer perceptrons
Regression case
[Figure: a one-hidden-layer perceptron with inputs x^(1), ..., x^(p),
first-layer weights w^(1), hidden biases w^(0)_1, ..., w^(0)_Q,
second-layer weights w^(2) and output f(x)]
f(x) = Σ_{k=1}^Q w^(2)_k h_k( xᵀ w^(1)_k + w^(0)_k ) + w^(2)_0,
with h_k a (logistic) sigmoid
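The formula for f(x) translates directly into code; a minimal sketch with random weights (Python for illustration; the variable names are ours):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_one_hidden(x, W1, w0, w2, w2_0):
    """f(x) = sum_k w2[k] * h(x . W1[k] + w0[k]) + w2_0, h the logistic sigmoid."""
    hidden = logistic(W1 @ x + w0)   # the Q hidden-unit activations
    return float(w2 @ hidden + w2_0)

# A random network with p = 3 inputs and Q = 4 hidden neurons
rng = np.random.default_rng(0)
p, Q = 3, 4
W1, w0 = rng.normal(size=(Q, p)), rng.normal(size=Q)
w2, w2_0 = rng.normal(size=Q), 0.5
y_hat = mlp_one_hidden(rng.normal(size=p), W1, w0, w2, w2_0)
```

Since each hidden activation lies in (0, 1), the output is always bounded by Σ_k |w^(2)_k| + |w^(2)_0|.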
131. Focus on one-hidden-layer perceptrons
Binary classification case
[Figure: the same network with output ψ(x)]
ψ(x) = h_0( Σ_{k=1}^Q w^(2)_k h_k( xᵀ w^(1)_k + w^(0)_k ) + w^(2)_0 ),
with h_0 the logistic sigmoid or the identity;
decision with:
f(x) = 0 if ψ(x) < 1/2, 1 otherwise.
133. Focus on one-hidden-layer perceptrons
Extension to any classification problem in {1, ..., K − 1}
[Figure: the same network with output ψ(x)]
Straightforward extension to multiple classes with a multiple-output
perceptron (number of output units equal to K) and a maximum
probability rule for the decision.
134. Theoretical properties of perceptrons
This section answers two questions:
1 can we approximate any function g : [0, 1]^p → R arbitrarily well
with a perceptron?
2 when a perceptron is trained with i.i.d. observations from an
arbitrary random variable pair (X, Y), is it consistent? (i.e., does it
reach the minimum possible error asymptotically when the number of
observations grows to infinity?)
136. Illustration of the universal approximation property
Simple examples
a function to approximate: g : [0, 1] → R, x → sin(1 / (x + 0.1));
trying to approximate this function (how this is performed is
explained later in this talk) with MLPs having different numbers of
neurons on their hidden layer.
138. Universal property from a theoretical point of view
Set of MLPs with a given size:
P_Q(h) = { x ∈ R^p → Σ_{k=1}^Q w^(2)_k h( xᵀ w^(1)_k + w^(0)_k ) + w^(2)_0 :
w^(2)_k, w^(0)_k ∈ R, w^(1)_k ∈ R^p }
Set of all MLPs: P(h) = ∪_{Q∈N} P_Q(h)
Universal approximation [Pinkus, 1999]
If h is a non-polynomial continuous function, then, for any continuous
function g : [0, 1]^p → R and any ε > 0, there exists f ∈ P(h) such that:
sup_{x∈[0,1]^p} |f(x) − g(x)| ≤ ε.
141. Remarks on universal approximation
continuity of the activation function is not required (see
[Devroye et al., 1996] for a result with arbitrary sigmoids);
other versions of this property are given in
[Hornik, 1991, Hornik, 1993, Stinchcombe, 1999] for different
functional spaces for g;
none of the spaces P_Q(h), for a fixed Q, has this property;
this result can be used to show that perceptrons are consistent
whenever Q log(n)/n → 0 as n → +∞
[Farago and Lugosi, 1993, Devroye et al., 1996].
142. Empirical error minimization
Given i.i.d. observations of (X, Y), (X_i, Y_i)_i, how to choose the
weights w?
[Figure: a one-hidden-layer perceptron with output f_w(x)]
Standard approach: minimize the empirical L2 risk:
R̂_n(w) = Σ_{i=1}^n [f_w(X_i) − Y_i]^2
with
Y_i ∈ R for the regression case;
Y_i ∈ {0, 1} for the classification case, with the associated decision
rule x → 1_{f_w(x) ≥ 1/2}.
But: R̂_n(w) is not convex in w ⇒ general optimization problem
145. Optimization with gradient descent
Method: initialize the weights (randomly or with some prior
knowledge), w(0) ∈ R^{Qp+2Q+1}
Batch approach: for t = 1, ..., T,
w(t + 1) = w(t) − µ(t) ∇_w R̂_n(w(t));
Online (or stochastic) approach: write
R̂_n(w) = Σ_{i=1}^n [f_w(X_i) − Y_i]^2, with E_i = [f_w(X_i) − Y_i]^2,
and, for t = 1, ..., T, randomly pick i ∈ {1, ..., n} and update:
w(t + 1) = w(t) − µ(t) ∇_w E_i(w(t)).
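The two update rules can be compared on a toy problem; in this sketch a linear f_w stands in for the perceptron to keep the gradients short, and the constant step size µ is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
w_star = np.array([2.0, -1.0])
Y = X @ w_star                      # noiseless toy targets

def grad_Ei(w, i):
    """Gradient of E_i(w) = (f_w(X_i) - Y_i)^2 for a linear f_w."""
    return 2.0 * (X[i] @ w - Y[i]) * X[i]

mu = 0.003                          # constant step size (arbitrary choice)
w_batch, w_sgd = np.zeros(p), np.zeros(p)
for t in range(200):
    # batch: gradient of the full empirical risk (sum over all i)
    w_batch = w_batch - mu * sum(grad_Ei(w_batch, i) for i in range(n))
    # online/stochastic: gradient of one randomly picked E_i
    w_sgd = w_sgd - mu * grad_Ei(w_sgd, int(rng.integers(n)))
```

On this well-conditioned toy problem the batch iterates converge to w_star; the stochastic iterates also move toward it but more noisily, one observation at a time.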
147. Discussion about practical choices for this approach
the batch version converges (from an optimization point of view) to a
local minimum of the error for a good choice of µ(t), but convergence
can be slow;
the stochastic version is usually quite inefficient but is useful for
large datasets (n large);
more efficient algorithms exist to solve the optimization task; the
one implemented in the R package nnet uses higher-order derivatives
(BFGS algorithm);
in all cases, the solutions returned are, at best, local minima which
strongly depend on the initialization: using more than one
initialization state is advised.
148. Gradient backpropagation method
[Rumelhart and Mc Clelland, 1986]
The gradient backpropagation (French: rétropropagation du gradient)
principle is used to easily calculate gradients in perceptrons (or in
other types of neural networks).
This way, stochastic gradient descent alternates:
a forward step, which aims at calculating the outputs for all
observations X_i given a value of the weights w;
a backward step, in which the gradient backpropagation principle is
used to obtain the gradient at the current weights w.
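For the one-hidden-layer regression perceptron with squared error, the forward and backward steps look as follows (a sketch; the variable names are ours):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, w0, w2, w2_0):
    """One observation, squared error E = (f_w(x) - y)^2.
    Returns the output and the gradients of E w.r.t. all weights."""
    # forward step: compute the output from the current weights
    a = W1 @ x + w0              # hidden pre-activations
    h = logistic(a)              # hidden activations
    f = float(w2 @ h + w2_0)     # network output
    # backward step: propagate dE/df back through the layers
    dE_df = 2.0 * (f - y)
    g_w2, g_w2_0 = dE_df * h, dE_df
    dE_da = dE_df * w2 * h * (1.0 - h)   # logistic'(a) = h * (1 - h)
    g_W1, g_w0 = np.outer(dE_da, x), dE_da
    return f, (g_W1, g_w0, g_w2, g_w2_0)
```

When implementing backpropagation by hand, checking the analytic gradients against finite differences is a standard sanity check.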
157. Initialization and stopping of the training algorithm
1 How to initialize weights? Standard choices: w^(1)_jk ∼ N(0, 1/√p)
and w^(2)_k ∼ N(0, 1/√Q).
In the R package nnet, weights are sampled uniformly in [−0.5, 0.5],
or in [−1/max_i X^(j)_i, 1/max_i X^(j)_i] if X^(j) is large.
2 When to stop the algorithm? (gradient descent or alike) Standard
choices:
a bounded number of iterations T;
a target value of the error R̂_n(w);
a target value of the evolution R̂_n(w(t)) − R̂_n(w(t + 1)).
In the R package nnet, a combination of the three criteria is used
and is tunable.
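The quoted Gaussian initialization is easy to implement; here 1/√p and 1/√Q are read as standard deviations, which is an assumption (the slide's N(0, ·) notation does not say which parameter is meant):

```python
import numpy as np

def init_weights(p, Q, rng):
    """w^(1)_jk ~ N(0, 1/sqrt(p)) and w^(2)_k ~ N(0, 1/sqrt(Q)),
    with the second argument read as a standard deviation (assumption)."""
    W1 = rng.normal(0.0, 1.0 / np.sqrt(p), size=(Q, p))
    w2 = rng.normal(0.0, 1.0 / np.sqrt(Q), size=Q)
    return W1, w2

W1, w2 = init_weights(4, 10000, np.random.default_rng(0))
```

Scaling the spread down with p and Q keeps the hidden pre-activations and the output from blowing up when the layers are wide.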
160. Strategies to avoid overfitting
Properly tune Q with a CV or a bootstrap estimation of the
generalization ability of the method;
Early stopping: for Q large enough, use a part of the data as a
validation set and stop the training (gradient descent) when the
empirical error calculated on this dataset starts to increase;
Weight decay: for Q large enough, penalize the empirical risk with a
function of the weights, e.g.,
R̂_n(w) + λ wᵀw;
Noise injection: modify the input data with a random noise during
the training.
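Early stopping, the second strategy above, can be sketched generically; `step` and `risk_val` are hypothetical callables standing for one gradient-descent update and the validation-set error:

```python
def train_with_early_stopping(w, step, risk_val, T=1000, patience=10):
    """Run at most T updates; stop when the validation error has not
    improved for `patience` consecutive steps; return the best weights."""
    best_w, best_err, since_best = w, risk_val(w), 0
    for t in range(T):
        w = step(w)
        err = risk_val(w)
        if err < best_err:
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_w

# Toy check: w descends from 5 by 0.5 per step, validation error (w - 1)^2
w_best = train_with_early_stopping(5.0, lambda w: w - 0.5,
                                   lambda w: (w - 1.0) ** 2, patience=2)
print(w_best)  # prints 1.0
```

Returning the best weights seen (rather than the last iterate) is what makes the validation set act as the stopping criterion.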
164. References
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Chapman and Hall, Boca Raton, Florida, USA.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, NY, USA.
Farago, A. and Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146–1151.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, NY, USA.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.
Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069–1072.
Mc Culloch, W. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115–133.
Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC, USA.
Rumelhart, D. and Mc Clelland, J. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA, USA.
Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.
Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.