A short introduction to statistical learning
Nathalie Villa-Vialaneix
nathalie.villa-vialaneix@inra.fr
http://www.nathalievilla.org
GT “SPARKGEN”
March 12th, 2018 - INRA, Toulouse
Nathalie Villa-Vialaneix | Introduction to statistical learning 1/58
Outline
1 Introduction
Background and notations
Underfitting / Overfitting
Consistency
2 CART and random forests
Introduction to CART
Learning
Prediction
Overview of random forests
Bootstrap/Bagging
Random forest
3 SVM
4 (not deep) Neural networks
Seminal references
Multi-layer perceptrons
Theoretical properties of perceptrons
Learning perceptrons
Learning in practice
Nathalie Villa-Vialaneix | Introduction to statistical learning 2/58
1 Introduction
Nathalie Villa-Vialaneix | Introduction to statistical learning 3/58
Background
Purpose: predict Y from X;
What we have: n observations of (X, Y): (x1, y1), . . . , (xn, yn);
What we want: estimate unknown Y from new X: xn+1, . . . , xm.
X can be:
numeric variables;
or factors;
or a combination of numeric variables and factors.
Y can be:
a numeric variable (Y ∈ R) ⇒ (supervised) regression [régression];
a factor ⇒ (supervised) classification [discrimination].
Nathalie Villa-Vialaneix | Introduction to statistical learning 4/58
Basics
From the observations (xi, yi)i, define a machine Φn such that:
ŷnew = Φn(xnew).
if Y is numeric, Φn is called a regression function [fonction de régression];
if Y is a factor, Φn is called a classifier [classifieur];
Φn is said to be trained or learned from the observations (xi, yi)i.
Desirable properties
accuracy to the observations: predictions made on known data are close to the observed values;
generalization ability: predictions made on new data are also accurate.
Conflicting objectives!!
Nathalie Villa-Vialaneix | Introduction to statistical learning 5/58
Underfitting/Overfitting [sous/sur-apprentissage]
(sequence of figures)
the function x → y to be estimated;
observations we might have;
observations we do have;
first estimation from the observations: underfitting;
second estimation from the observations: accurate estimation;
third estimation from the observations: overfitting;
summary.
Nathalie Villa-Vialaneix | Introduction to statistical learning 6/58
Errors
training error (measures the accuracy to the observations)
if y is a factor: misclassification rate
#{i : ŷi ≠ yi, i = 1, . . . , n} / n
if y is numeric: mean square error (MSE)
(1/n) Σ_{i=1}^n (ŷi − yi)²
or root mean square error (RMSE = √MSE) or pseudo-R²: 1 − MSE/Var((yi)i)
test error: a way to prevent overfitting (and to estimate the generalization error) is simple validation
1 split the data into training/test sets (usually 80%/20%)
2 train Φn from the training dataset
3 calculate the test error on the remaining (test) data
Nathalie Villa-Vialaneix | Introduction to statistical learning 7/58
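A minimal R sketch of simple validation, assuming a data frame d with a numeric response y (all object names are illustrative, not from the slides):
set.seed(42)
n <- nrow(d)
train_id <- sample(1:n, size = round(0.8 * n))   # 80% training set
d_train <- d[train_id, ]
d_test  <- d[-train_id, ]
model <- lm(y ~ ., data = d_train)               # any regression method would do here
pred_train <- predict(model, newdata = d_train)
pred_test  <- predict(model, newdata = d_test)
mse_train <- mean((pred_train - d_train$y)^2)    # training error (MSE)
mse_test  <- mean((pred_test - d_test$y)^2)      # test error, estimates the generalization error
pseudo_r2 <- 1 - mse_test / var(d_test$y)        # pseudo-R2 as defined above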
Example
(figures: observations, training/test datasets, training/test errors, summary)
Nathalie Villa-Vialaneix | Introduction to statistical learning 8/58
Bias / Variance trade-off
This problem is also related to the well known bias / variance trade-off:
bias: error that comes from erroneous assumptions in the learning algorithm (average error of the predictor);
variance: error that comes from sensitivity to small fluctuations in the training set (variance of the predictor).
Overall error: E(MSE) = Bias² + Variance
Nathalie Villa-Vialaneix | Introduction to statistical learning 9/58
Consistency in the parametric/non parametric case
Example in the parametric framework (linear methods)
an assumption is made on the form of the relation between X and Y:
Y = βᵀX + ε
β is estimated from the observations (x1, y1), . . . , (xn, yn) by a given method which calculates an estimate βn.
The estimation is said to be consistent if βn → β as n → +∞, under (possibly) technical assumptions on X, ε, Y.
Example in the nonparametric framework
the form of the relation between X and Y is unknown:
Y = Φ(X) + ε
Φ is estimated from the observations (x1, y1), . . . , (xn, yn) by a given method which calculates an estimate Φn.
The estimation is said to be consistent if Φn → Φ as n → +∞, under (possibly) technical assumptions on X, ε, Y.
Nathalie Villa-Vialaneix | Introduction to statistical learning 10/58
Consistency from the statistical learning perspective
[Vapnik, 1995]
Question: Are we really interested in estimating Φ or...
... rather in having the smallest prediction error?
Statistical learning perspective: a method that builds a machine Φn from the observations is said to be (universally) consistent if, given a risk function R : R × R → R+ (which calculates an error),
E(R(Φn(X), Y)) → inf_{Φ: X→R} E(R(Φ(X), Y)) as n → +∞,
for any distribution of (X, Y) ∈ X × R.
Definitions: L* = inf_{Φ: X→R} E(R(Φ(X), Y)) and LΦ = E(R(Φ(X), Y)).
Nathalie Villa-Vialaneix | Introduction to statistical learning 11/58
Desirable properties from a mathematical perspective
Simplified framework: X ∈ X and Y ∈ {−1, 1} (binary classification)
Learning process: choose a machine Φn in a class of functions C ⊂ {Φ : X → R} (e.g., C is the set of all functions that can be built using an SVM).
Error decomposition
LΦn − L* ≤ [LΦn − inf_{Φ∈C} LΦ] + [inf_{Φ∈C} LΦ − L*]
with
inf_{Φ∈C} LΦ − L*: the richness of C (i.e., C must be rich enough to ensure that this term is small);
LΦn − inf_{Φ∈C} LΦ ≤ 2 sup_{Φ∈C} |LnΦ − LΦ|, where LnΦ = (1/n) Σ_{i=1}^n R(Φ(xi), yi): the generalization capability of C (i.e., in the worst case, the empirical error must be close to the true error: C must not be too rich to ensure that this term is small).
Nathalie Villa-Vialaneix | Introduction to statistical learning 12/58
2 CART and random forests
Nathalie Villa-Vialaneix | Introduction to statistical learning 13/58
Overview
CART: Classification And Regression Trees, introduced by [Breiman et al., 1984].
Advantages
classification OR regression (i.e., Y can be a numeric variable or a factor);
non parametric method: no prior assumption needed;
can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method);
provide an intuitive interpretation.
Drawbacks
require a large training dataset to be efficient;
as a consequence, are often too simple to provide accurate predictions.
Nathalie Villa-Vialaneix | Introduction to statistical learning 14/58
Example
X = (Gender, Age, Height) and Y = Weight
(figure: a tree whose root is first split on Height (<1.60m / >1.60m), with further splits on Gender and Age, ending in five leaves Y1, . . . , Y5)
Vocabulary: root, split, node, leaf (terminal node).
Nathalie Villa-Vialaneix | Introduction to statistical learning 15/58
CART learning process
Algorithm
Start from root
repeat
move to a “new” node
if the node is homogeneous or small enough then
STOP
else
split the node into two child nodes with maximal “homogeneity”
end if
until all nodes are processed
Nathalie Villa-Vialaneix | Introduction to statistical learning 16/58
Further details
Homogeneity?
if Y is a numeric variable: variance of the (yi)i for the observations assigned to the node;
if Y is a factor: node purity, i.e., % of observations assigned to the node whose Y value is not the node majority class (the Gini index is also sometimes used).
Stopping criteria?
Minimum node size (generally 1 or 5)
Minimum node purity or variance
Maximum tree depth
Hyperparameters can be tuned by cross-validation using a grid search.
An alternative approach is pruning...
Nathalie Villa-Vialaneix | Introduction to statistical learning 17/58
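A hedged R sketch of CART learning with the rpart package, mapping the stopping criteria above to rpart.control arguments (the data frame d and the response y are illustrative, not from the slides):
library(rpart)
# minsplit/minbucket control the minimum node size, maxdepth the tree depth,
# cp (complexity parameter) drives cost-complexity pruning
ctrl <- rpart.control(minsplit = 5, maxdepth = 10, cp = 0.01)
tree <- rpart(y ~ ., data = d, method = "anova", control = ctrl)   # method = "class" for a factor y
printcp(tree)   # cross-validated error for each value of cp
tree_pruned <- prune(tree, cp = tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"])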
Making new predictions
A new observation xnew is assigned to a leaf (straightforward);
the corresponding predicted ŷnew is:
if Y is numeric, the mean value of the observations (training set) assigned to the same leaf;
if Y is a factor, the majority class of the observations (training set) assigned to the same leaf.
Nathalie Villa-Vialaneix | Introduction to statistical learning 18/58
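Continuing the rpart sketch above, predictions for new observations (new_d is an illustrative data frame with the same columns as d):
y_hat <- predict(tree_pruned, newdata = new_d)   # mean of the leaf (regression tree)
# for a classification tree, type = "class" returns the leaf majority class:
# y_hat <- predict(tree_pruned, newdata = new_d, type = "class")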
Advantages/Drawbacks
Random forest: introduced by [Breiman, 2001]
Advantages
classification OR regression (i.e., Y can be a numeric variable or a factor)
non parametric method (no prior assumption needed) and accurate
can deal with a large number of input variables, either numeric variables or factors
can deal with small samples
Drawbacks
black box model
only supported by a few mathematical results (consistency, ...) so far
Nathalie Villa-Vialaneix | Introduction to statistical learning 19/58
Basic description
A fact: when the sample size is small, you might be unable to estimate properly.
This issue is commonly tackled by bootstrapping and, more specifically, by bagging (Bootstrap Aggregating) ⇒ it reduces the variance of the estimator.
Bagging: combination of simple (and individually inefficient) regression (or classification) functions.
Random forest: CART bagging (with additional variations, described below).
Nathalie Villa-Vialaneix | Introduction to statistical learning 20/58
Bootstrap
Bootstrap sample: random sampling (with replacement) from the training dataset - bootstrap samples have the same size as the original dataset.
General (and robust) approach to solve several problems, e.g., estimating confidence intervals (of the mean of X, say) with no prior assumption on the distribution of X:
1 build P bootstrap samples from (xi)i
2 use them to compute P estimates
3 the confidence interval is based on the percentiles of the empirical distribution of these estimates
Also useful to estimate p-values, residuals, ...
Nathalie Villa-Vialaneix | Introduction to statistical learning 21/58
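A minimal R sketch of a bootstrap percentile confidence interval for the mean (x is an illustrative numeric vector; base R only):
set.seed(42)
x <- rnorm(30, mean = 5, sd = 2)      # illustrative sample
P <- 1000                             # number of bootstrap samples
boot_means <- replicate(P, mean(sample(x, size = length(x), replace = TRUE)))
quantile(boot_means, probs = c(0.025, 0.975))   # 95% percentile confidence interval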
Bagging
Average the estimates of the regression (or classification) function obtained from B bootstrap samples.
Bagging with regression trees
1: for b = 1, . . . , B do
2: build a bootstrap sample ξb
3: train a regression tree φ̂b from ξb
4: end for
5: estimate the regression function by Φ̂n(x) = (1/B) Σ_{b=1}^B φ̂b(x)
For classification, the predicted class is the majority vote class.
Nathalie Villa-Vialaneix | Introduction to statistical learning 22/58
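A hedged R sketch of bagging regression trees, following the algorithm above (d is an illustrative data frame with a numeric response y, new_d holds new observations):
library(rpart)
B <- 100
trees <- vector("list", B)
for (b in 1:B) {
  boot_id <- sample(1:nrow(d), size = nrow(d), replace = TRUE)   # bootstrap sample xi_b
  trees[[b]] <- rpart(y ~ ., data = d[boot_id, ], method = "anova")
}
# aggregation: average the B tree predictions
preds <- sapply(trees, predict, newdata = new_d)   # matrix with one column per tree
y_hat <- rowMeans(preds)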
Random forests
CART bagging with additional variations:
1 each node split is based on a random (and different) subset of q variables (an advisable choice for q is √p for classification and p/3 for regression);
2 each tree is fully developed (overfitted).
Hyperparameters
those of the CART algorithm
those that are specific to the random forest: q, number of trees
Random forests are not very sensitive to hyperparameter settings: default values for q and 500/1000 trees should work in most cases.
Nathalie Villa-Vialaneix | Introduction to statistical learning 23/58
Additional tools
OOB (Out-Of-Bag) error: error computed on the observations not included in the “bag”;
stabilization of the OOB error is a good indication that there are enough trees in the forest.
Importance of a variable (to help interpretation): for a given variable Xj,
1: randomize the values of the variable
2: make predictions from this new dataset
3: the importance is the mean decrease in accuracy (MSE or misclassification rate)
Nathalie Villa-Vialaneix | Introduction to statistical learning 24/58
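A hedged R sketch with the randomForest package illustrating q (argument mtry), the number of trees, the OOB error and variable importance (d and y are illustrative, regression case):
library(randomForest)
p <- ncol(d) - 1                          # number of input variables
rf <- randomForest(y ~ ., data = d,
                   mtry = floor(p / 3),   # q: p/3 for regression, sqrt(p) for classification
                   ntree = 500,
                   importance = TRUE)
plot(rf)            # OOB error versus the number of trees (checks stabilization)
rf$mse[rf$ntree]    # OOB estimate of the MSE (regression case)
importance(rf)      # permutation-based variable importance (mean decrease in accuracy)
varImpPlot(rf)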
3 SVM
Nathalie Villa-Vialaneix | Introduction to statistical learning 25/58
Basic introduction
Binary classification problem: X ∈ H and Y ∈ {−1; 1}
A training set is given: (x1, y1), . . . , (xn, yn)
SVM is a method based on kernels. It is a universally consistent method, provided that the kernel is universal [Steinwart, 2002].
Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.
Nathalie Villa-Vialaneix | Introduction to statistical learning 26/58
Optimal margin classification
(figure: linearly separable data, separating hyperplane with normal vector w, margin 1/‖w‖₂, support vectors on the margin boundaries)
w is chosen such that:
min_w ‖w‖² (the margin is the largest),
under the constraints: yi(⟨w, xi⟩ + b) ≥ 1, 1 ≤ i ≤ n (the separation between the two classes is perfect).
⇒ ensures a good generalization capability.
Nathalie Villa-Vialaneix | Introduction to statistical learning 27/58
Soft margin classification
(figure: almost linearly separable data, separating hyperplane with normal vector w, margin 1/‖w‖₂, support vectors)
w is chosen such that:
min_{w,ξ} ‖w‖² + C Σ_{i=1}^n ξi (the margin is the largest),
under the constraints: yi(⟨w, xi⟩ + b) ≥ 1 − ξi, 1 ≤ i ≤ n, and ξi ≥ 0, 1 ≤ i ≤ n
(the separation between the two classes is almost perfect).
⇒ allowing a few errors improves the richness of the class.
Nathalie Villa-Vialaneix | Introduction to statistical learning 28/58
Non linear SVM
(figure: data that are not linearly separable in the original space X are mapped by a non linear function Ψ into a feature space H where they become linearly separable)
w ∈ H is chosen such that (P_{C,H}):
min_{w,ξ} ‖w‖²_H + C Σ_{i=1}^n ξi (the margin in the feature space is the largest),
under the constraints: yi(⟨w, Ψ(xi)⟩_H + b) ≥ 1 − ξi, 1 ≤ i ≤ n, and ξi ≥ 0, 1 ≤ i ≤ n
(the separation between the two classes in the feature space is almost perfect).
Nathalie Villa-Vialaneix | Introduction to statistical learning 29/58
SVM from different points of view
A regularization problem: (P_{C,H}) ⇔
(P²_{λ,H}): min_{w∈H} (1/n) Σ_{i=1}^n R(fw(xi), yi) (error term) + λ ‖w‖²_H (penalization term),
where fw(x) = ⟨Ψ(x), w⟩_H and R(ŷ, y) = max(0, 1 − ŷy) (hinge loss function)
(figure: errors versus ŷ for y = 1; blue: hinge loss; green: misclassification error)
A dual problem: (P_{C,H}) ⇔
(D_{C,X}): max_{α∈Rⁿ} Σ_{i=1}^n αi − Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj ⟨Ψ(xi), Ψ(xj)⟩_H,
with Σ_{i=1}^n αi yi = 0 and 0 ≤ αi ≤ C, 1 ≤ i ≤ n.
There is no need to know Ψ and H:
choose a function K with a few good properties;
use it as the dot product in H: ∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H,
so that the dual problem only involves K(xi, xj).
Nathalie Villa-Vialaneix | Introduction to statistical learning 30/58
Which kernels?
Minimum properties that a kernel should fulfil
symmetry: K(u, u′) = K(u′, u)
positivity: ∀ N ∈ N, ∀ (αi)i ⊂ R^N, ∀ (xi)i ⊂ X^N, Σ_{i,j} αi αj K(xi, xj) ≥ 0.
[Aronszajn, 1950]: there exist a Hilbert space (H, ⟨., .⟩_H) and a function Ψ : X → H such that:
∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H
Examples
the Gaussian kernel: ∀ x, x′ ∈ R^d, K(x, x′) = e^{−γ‖x−x′‖²} (it is universal on every bounded subset of R^d);
the linear kernel: ∀ x, x′ ∈ R^d, K(x, x′) = xᵀx′ (it is not universal).
Nathalie Villa-Vialaneix | Introduction to statistical learning 31/58
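A minimal R sketch (base R only, illustrative) computing a Gaussian kernel matrix K(xi, xj) = exp(−γ‖xi − xj‖²) and checking the two properties above:
X <- as.matrix(iris[iris$Species != "setosa", c("Petal.Length", "Petal.Width")])
gamma <- 0.5
D2 <- as.matrix(dist(X))^2     # squared Euclidean distances ||xi - xj||^2
K <- exp(-gamma * D2)          # Gaussian kernel matrix
# sanity checks: symmetry and positivity (eigenvalues >= 0 up to numerical error)
all.equal(K, t(K))
min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)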
In summary, what does the solution look like?
Φn(x) = Σ_i αi yi K(xi, x)
where only a few αi ≠ 0. The indices i such that αi ≠ 0 are the support vectors!
Nathalie Villa-Vialaneix | Introduction to statistical learning 32/58
I’m almost dead with all this stuff on my mind!!!
What in practice?
data(iris)
iris <- iris[iris$Species%in%c("versicolor","virginica"),]
plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species ,
pch=19)
legend("topleft", pch=19, col=c(2,3),
legend=c("versicolor", "virginica"))
Nathalie Villa-Vialaneix | Introduction to statistical learning 33/58
I’m almost dead with all this stuff on my mind!!!
What in practice?
library(e1071)
res.tune <- tune.svm(Species ~ ., data=iris, kernel="linear",
cost = 2^(-1:4))
# Parameter tuning of ’svm’:
# - sampling method: 10-fold cross validation
# - best parameters:
# cost
# 0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4),
# kernel = "linear")
# Parameters:
# SVM-Type: C-classification
# SVM-Kernel: linear
# cost: 0.5
# gamma: 0.25
# Number of Support Vectors: 21
Nathalie Villa-Vialaneix | Introduction to statistical learning 34/58
I’m almost dead with all this stuff on my mind!!!
What in practice?
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         45         0
#   virginica       0          5        50
plot(res.tune$best.model, data=iris, Petal.Width~Petal.Length,
slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
Nathalie Villa-Vialaneix | Introduction to statistical learning 35/58
I’m almost dead with all this stuff on my mind!!!
What in practice?
res.tune <- tune.svm(Species ~ ., data=iris, gamma = 2^(-1:1),
cost = 2^(2:4))
# Parameter tuning of ’svm’:
# - sampling method: 10-fold cross validation
# - best parameters:
# gamma cost
# 0.5 4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1),
# cost = 2^(2:4))
# Parameters:
# SVM-Type: C-classification
# SVM-Kernel: radial
# cost: 4
# gamma: 0.5
# Number of Support Vectors: 32
Nathalie Villa-Vialaneix | Introduction to statistical learning 36/58
I’m almost dead with all this stuff on my mind!!!
What in practice?
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         49         0
#   virginica       0          1        50
plot(res.tune$best.model, data=iris, Petal.Width~Petal.Length,
slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
Nathalie Villa-Vialaneix | Introduction to statistical learning 37/58
4 (not deep) Neural networks
Nathalie Villa-Vialaneix | Introduction to statistical learning 38/58
What are (artificial) neural networks?
Common properties
(artificial) “Neural networks”: general name for supervised and unsupervised methods developed in (vague) analogy to the brain;
combination (network) of simple elements (neurons).
(figure: example of graphical representation, with input units on the left and output units on the right)
Nathalie Villa-Vialaneix | Introduction to statistical learning 39/58
Different types of neural networks
A neural network is defined by:
1 the network structure;
2 the neuron type.
Standard examples
Multilayer perceptrons (MLP) [perceptron multi-couches]: dedicated to supervised problems (classification and regression);
Radial basis function networks (RBF): same purpose but based on local smoothing;
Self-organizing maps (SOM, also sometimes called Kohonen’s maps) or topographic maps: dedicated to unsupervised problems (clustering), self-organized;
. . .
In this talk, focus on MLP.
Nathalie Villa-Vialaneix | Introduction to statistical learning 40/58
MLP: Advantages/Drawbacks
Advantages
classification OR regression (i.e., Y can be a numeric variable or a factor);
non parametric method: flexible;
good theoretical properties.
Drawbacks
hard to train (high computational cost, especially when d is large);
overfit easily;
“black box” models (hard to interpret).
Nathalie Villa-Vialaneix | Introduction to statistical learning 41/58
References
Advised references:
[Bishop, 1995, Ripley, 1996]: an overview of the topic from a learning (more than statistical) perspective;
[Devroye et al., 1996, Györfi et al., 2002]: present, in dedicated chapters, statistical properties of perceptrons.
Nathalie Villa-Vialaneix | Introduction to statistical learning 42/58
Analogy to the brain
1 a neuron collects signals from neighboring neurons through its dendrites;
2 when the total signal is above a given threshold, the neuron is activated...
3 ... and a signal is sent to other neurons through the axon.
Connections which frequently lead to activating a neuron are reinforced (tend to have an increasing impact on the destination neuron).
Nathalie Villa-Vialaneix | Introduction to statistical learning 43/58
First model of artificial neuron
[Mc Culloch and Pitts, 1943, Rosenblatt, 1958, Rosenblatt, 1962]
(figure: inputs x(1), x(2), . . . , x(p) with weights w1, w2, . . . , wp and a bias w0 are summed and thresholded to produce f(x))
f : x ∈ R^p → 1_{Σ_{j=1}^p wj x(j) + w0 ≥ 0}
Nathalie Villa-Vialaneix | Introduction to statistical learning 44/58
(artificial) Perceptron
Layers
MLP have one input layer (x ∈ R^p), one output layer (y ∈ R, or taking values in {1, . . . , K − 1}) and several hidden layers;
no connections within a layer;
connections between two consecutive layers (feedforward).
Example (regression, y ∈ R):
(figure: inputs x = (x(1), . . . , x(p)) → Layer 1 (weights w(1)_jk) → Layer 2 → output y: a 2-hidden-layer MLP)
Nathalie Villa-Vialaneix | Introduction to statistical learning 45/58
A neuron in MLP
(figure: a neuron receives inputs v1, v2, v3 with weights w1, w2, w3, adds a bias w0 [biais], and applies an activation function to the weighted sum)
Standard activation functions [fonctions de transfert / d’activation]
Biologically inspired: Heaviside function, h(z) = 0 if z < 0, 1 otherwise.
Main issue with the Heaviside function: not continuous!
Identity: h(z) = z.
But the identity activation function gives a linear model if used with one hidden layer: not flexible enough.
Logistic function: h(z) = 1 / (1 + exp(−z)).
Another popular activation function (useful to model positive real numbers): rectified linear (ReLU), h(z) = max(0, z).
General sigmoid: a nondecreasing function h : R → R such that lim_{z→+∞} h(z) = 1 and lim_{z→−∞} h(z) = 0.
Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
Focus on one-hidden-layer perceptrons
Regression case
(figure: inputs x(1), . . . , x(p), hidden layer with weights w(1) and biases w(0)_1, . . . , w(0)_Q, output weights w(2), output f(x))
f(x) = Σ_{k=1}^Q w(2)_k hk(xᵀ w(1)_k + w(0)_k) + w(2)_0, with hk a (logistic) sigmoid.
Binary classification case
ψ(x) = h0( Σ_{k=1}^Q w(2)_k hk(xᵀ w(1)_k + w(0)_k) + w(2)_0 ), with h0 a logistic sigmoid or the identity;
decision with: f(x) = 0 if ψ(x) < 1/2, 1 otherwise.
Extension to any classification problem in {1, . . . , K − 1}
Straightforward extension to multiple classes with a multiple-output perceptron (number of output units equal to K) and a maximum probability rule for the decision.
Nathalie Villa-Vialaneix | Introduction to statistical learning 47/58
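A minimal R sketch (illustrative names, not from the slides) of the forward pass for the one-hidden-layer regression formula above, with logistic hidden activations:
sigmoid <- function(z) 1 / (1 + exp(-z))
mlp_forward <- function(x, W1, b1, w2, b2) {
  # x: input vector of length p; W1: p x Q weight matrix; b1: Q hidden biases
  # w2: Q output weights; b2: output bias
  a1 <- as.vector(t(W1) %*% x + b1)   # pre-activations of the hidden layer
  z1 <- sigmoid(a1)                   # hidden layer outputs h_k(.)
  sum(w2 * z1) + b2                   # f(x), regression output
}
# illustrative usage with p = 3 inputs and Q = 4 hidden neurons
set.seed(1)
p <- 3; Q <- 4
W1 <- matrix(rnorm(p * Q, sd = 1 / sqrt(p)), nrow = p)
b1 <- rnorm(Q); w2 <- rnorm(Q, sd = 1 / sqrt(Q)); b2 <- 0
mlp_forward(c(0.2, -1, 0.5), W1, b1, w2, b2)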
Theoretical properties of perceptrons
This section answers two questions:
1 can we approximate any function g : [0, 1]^p → R arbitrarily well with a perceptron?
2 when a perceptron is trained with i.i.d. observations from an arbitrary random variable pair (X, Y), is it consistent? (i.e., does it reach the minimum possible error asymptotically when the number of observations grows to infinity?)
Nathalie Villa-Vialaneix | Introduction to statistical learning 48/58
Illustration of the universal approximation property
Simple examples
a function to approximate: g : x ∈ [0, 1] → sin(1/(x + 0.1));
trying to approximate this function (how this is performed is explained later in this talk) with MLPs having different numbers of neurons in their hidden layer.
Nathalie Villa-Vialaneix | Introduction to statistical learning 49/58
Universal property from a theoretical point of view
Set of MLPs with a given size:
P_Q(h) = { x ∈ R^p → Σ_{k=1}^Q w(2)_k h(xᵀ w(1)_k + w(0)_k) + w(2)_0 : w(2)_k, w(0)_k ∈ R, w(1)_k ∈ R^p }
Set of all MLPs: P(h) = ∪_{Q∈N} P_Q(h)
Universal approximation [Pinkus, 1999]
If h is a non polynomial continuous function, then, for any continuous function g : [0, 1]^p → R and any ε > 0, there exists f ∈ P(h) such that:
sup_{x∈[0,1]^p} |f(x) − g(x)| ≤ ε.
Nathalie Villa-Vialaneix | Introduction to statistical learning 50/58
Remarks on universal approximation
continuity of the activation function is not required (see [Devroye et al., 1996] for a result with arbitrary sigmoids);
other versions of this property are given in [Hornik, 1991, Hornik, 1993, Stinchcombe, 1999] for different functional spaces for g;
none of the spaces P_Q(h), for a fixed Q, has this property;
this result can be used to show that perceptrons are consistent whenever Q log(n)/n → 0 as n → +∞ [Farago and Lugosi, 1993, Devroye et al., 1996].
Nathalie Villa-Vialaneix | Introduction to statistical learning 51/58
Empirical error minimization
Given i.i.d. observations (Xi, Yi) of (X, Y), how to choose the weights w?
(figure: one-hidden-layer perceptron with weights w(1), w(2) and biases w(0)_1, . . . , w(0)_Q, computing fw(x))
Standard approach: minimize the empirical L2 risk:
R̂n(w) = Σ_{i=1}^n [fw(Xi) − Yi]²
with
Yi ∈ R for the regression case;
Yi ∈ {0, 1} for the classification case, with the associated decision rule x → 1_{fw(x) ≥ 1/2}.
But: R̂n(w) is not convex in w ⇒ general optimization problem
Nathalie Villa-Vialaneix | Introduction to statistical learning 52/58
Optimization with gradient descent
Method: initialize (randomly or with some prior knowledge) the weights w(0) ∈ R^{Qp+2Q+1}
Batch approach: for t = 1, . . . , T,
w(t + 1) = w(t) − μ(t) ∇w R̂n(w(t));
online (or stochastic) approach: write
R̂n(w) = Σ_{i=1}^n [fw(Xi) − Yi]², and denote Ei := [fw(Xi) − Yi]²;
then, for t = 1, . . . , T, randomly pick i ∈ {1, . . . , n} and update:
w(t + 1) = w(t) − μ(t) ∇w Ei(w(t)).
Nathalie Villa-Vialaneix | Introduction to statistical learning 53/58
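A minimal R sketch of the two update rules above, on a simple linear least-squares problem rather than an MLP so the gradient stays explicit (all names are illustrative; mu is a fixed step size here):
set.seed(1)
n <- 200; X <- cbind(1, rnorm(n)); Y <- X %*% c(1, 2) + rnorm(n, sd = 0.3)
grad_all <- function(w) 2 * t(X) %*% (X %*% w - Y)                 # gradient of sum_i [f_w(X_i) - Y_i]^2
grad_one <- function(w, i) 2 * X[i, ] * (sum(X[i, ] * w) - Y[i])   # gradient of E_i
mu <- 1e-3
w_batch <- c(0, 0)
for (t in 1:200) w_batch <- w_batch - mu * grad_all(w_batch)       # batch updates
w_sgd <- c(0, 0)
for (t in 1:5000) {
  i <- sample(1:n, 1)                                              # stochastic updates
  w_sgd <- w_sgd - mu * grad_one(w_sgd, i)
}
w_batch; w_sgd   # both should end up close to (1, 2)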
Discussion about practical choices for this approach
the batch version converges (from an optimization point of view) to a local minimum of the error for a good choice of μ(t), but convergence can be slow;
the stochastic version is usually very inefficient but is useful for large datasets (n large);
more efficient algorithms exist to solve the optimization task; the one implemented in the R package nnet uses higher order derivatives (BFGS algorithm);
in all cases, the solutions returned are, at best, local minima which strongly depend on the initialization: using more than one initialization state is advised.
Nathalie Villa-Vialaneix | Introduction to statistical learning 54/58
Gradient backpropagation method
[Rumelhart and Mc Clelland, 1986]
The gradient backpropagation [rétropropagation du gradient] principle is used to easily calculate gradients in perceptrons (or in other types of neural networks).
This way, stochastic gradient descent alternates:
a forward step, which aims at calculating the outputs for all observations Xi given a value of the weights w;
a backward step, in which the gradient backpropagation principle is used to obtain the gradient for the current weights w.
Nathalie Villa-Vialaneix | Introduction to statistical learning 55/58
Backpropagation in practice
(figure: one-hidden-layer perceptron with inputs x(1), . . . , x(p), hidden weights w(1) and biases w(0)_1, . . . , w(0)_Q, output weights w(2), output fw(x))
initialize the weights
Forward step: for all k, calculate a(1)_k = Xiᵀ w(1)_k + w(0)_k
Forward step: for all k, calculate z(1)_k = hk(a(1)_k)
Forward step: calculate a(2) = Σ_{k=1}^Q w(2)_k z(1)_k + w(2)_0, the output being fw(Xi) = h0(a(2)) (h0 is the identity in the regression case)
Backward step: calculate ∂Ei/∂w(2)_k = δ(2) × z(1)_k with
δ(2) = ∂Ei/∂a(2) = ∂[h0(a(2)) − Yi]²/∂a(2) = 2 h0′(a(2)) × [h0(a(2)) − Yi]
Backward step: ∂Ei/∂w(1)_kj = δ(1)_k × Xi(j) with
δ(1)_k = ∂Ei/∂a(1)_k = ∂Ei/∂a(2) × ∂a(2)/∂a(1)_k = δ(2) × w(2)_k × hk′(a(1)_k)
Backward step: ∂Ei/∂w(0)_k = δ(1)_k
Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
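A hedged R sketch of one forward/backward pass for a single observation, following the formulas above (regression case with identity output h0 and logistic hidden activations; all names are illustrative):
sigmoid <- function(z) 1 / (1 + exp(-z))
backprop_one <- function(x, y, W1, b1, w2, b2) {
  # forward step
  a1 <- as.vector(t(W1) %*% x + b1)     # a^(1)_k = x' w^(1)_k + w^(0)_k
  z1 <- sigmoid(a1)                     # z^(1)_k = h_k(a^(1)_k)
  a2 <- sum(w2 * z1) + b2               # a^(2); output f_w(x) = a^(2) since h0 = identity
  # backward step
  delta2 <- 2 * (a2 - y)                # delta^(2) = dE_i/da^(2), with h0'(a) = 1
  grad_w2 <- delta2 * z1                # dE_i/dw^(2)_k = delta^(2) z^(1)_k
  grad_b2 <- delta2                     # dE_i/dw^(2)_0
  delta1 <- delta2 * w2 * z1 * (1 - z1) # delta^(1)_k = delta^(2) w^(2)_k h_k'(a^(1)_k)
  grad_W1 <- outer(x, delta1)           # dE_i/dw^(1)_kj = delta^(1)_k x^(j), as a p x Q matrix
  grad_b1 <- delta1                     # dE_i/dw^(0)_k = delta^(1)_k
  list(W1 = grad_W1, b1 = grad_b1, w2 = grad_w2, b2 = grad_b2)
}
Summing these gradients over i gives the batch update of the previous slides; using a single randomly chosen i gives the stochastic update.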
Initialization and stopping of the training algorithm
1 How to initialize the weights? Standard choices: w(1)_jk ∼ N(0, 1/√p) and w(2)_k ∼ N(0, 1/√Q).
In the R package nnet, weights are sampled uniformly in [−0.5, 0.5], or in [−1/maxi Xi(j), 1/maxi Xi(j)] if X(j) is large.
2 When to stop the algorithm (gradient descent or alike)? Standard choices:
a bounded number of iterations T;
a target value for the error R̂n(w);
a target value for the evolution R̂n(w(t)) − R̂n(w(t + 1)).
In the R package nnet, a combination of the three criteria is used and is tunable.
Nathalie Villa-Vialaneix | Introduction to statistical learning 57/58
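A hedged R sketch with the nnet package showing how these choices surface as arguments (d and y are illustrative; size is the number Q of hidden neurons):
library(nnet)
set.seed(3)                  # the random weight initialization makes results vary
fit <- nnet(y ~ ., data = d,
            size = 5,        # Q, number of hidden neurons
            linout = TRUE,   # identity output for regression (logistic output otherwise)
            rang = 0.5,      # initial weights drawn uniformly in [-rang, rang]
            maxit = 500,     # bounded number of iterations T
            abstol = 1e-4,   # stop if the fitting criterion gets below this value
            reltol = 1e-8)   # stop if its relative decrease gets below this value
# several random initializations are advised; keep the fit with the smallest fit$value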
Strategies to avoid overfitting
Properly tune Q with a CV or a bootstrap estimation of the generalization ability of the method.
Early stopping: for Q large enough, use a part of the data as a validation set and stop the training (gradient descent) when the empirical error calculated on this dataset starts to increase.
Weight decay: for Q large enough, penalize the empirical risk with a function of the weights, e.g.,
R̂n(w) + λ wᵀw
Noise injection: modify the input data with a random noise during the training.
Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
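A hedged R sketch of tuning Q (size) and the weight decay λ (decay) with a simple validation split, in the spirit of the strategies above (d and y are illustrative):
library(nnet)
set.seed(4)
n <- nrow(d)
train_id <- sample(1:n, size = round(0.8 * n))
grid <- expand.grid(size = c(2, 5, 10), decay = c(0, 0.01, 0.1))
val_mse <- apply(grid, 1, function(par) {
  fit <- nnet(y ~ ., data = d[train_id, ], size = par["size"],
              decay = par["decay"], linout = TRUE, maxit = 500, trace = FALSE)
  pred <- predict(fit, newdata = d[-train_id, ])
  mean((pred - d$y[-train_id])^2)      # validation MSE
})
grid[which.min(val_mse), ]             # selected (size, decay)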
References
Aronszajn, N. (1950).
Theory of reproducing kernels.
Transactions of the American Mathematical Society, 68(3):337–404.
Bishop, C. (1995).
Neural Networks for Pattern Recognition.
Oxford University Press, New York, USA.
Breiman, L. (2001).
Random forests.
Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984).
Classification and Regression Trees.
Chapman and Hall, Boca Raton, Florida, USA.
Devroye, L., Györfi, L., and Lugosi, G. (1996).
A Probabilistic Theory of Pattern Recognition.
Springer-Verlag, New York, NY, USA.
Farago, A. and Lugosi, G. (1993).
Strong universal consistency of neural network classifiers.
IEEE Transactions on Information Theory, 39(4):1146–1151.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002).
A Distribution-Free Theory of Nonparametric Regression.
Springer-Verlag, New York, NY, USA.
Hornik, K. (1991).
Approximation capabilities of multilayer feedforward networks.
Neural Networks, 4(2):251–257.
Hornik, K. (1993).
Some new results on neural network approximation.
Neural Networks, 6(8):1069–1072.
Mc Culloch, W. and Pitts, W. (1943).
A logical calculus of ideas immanent in nervous activity.
Bulletin of Mathematical Biophysics, 5(4):115–133.
Pinkus, A. (1999).
Approximation theory of the MLP model in neural networks.
Acta Numerica, 8:143–195.
Ripley, B. (1996).
Pattern Recognition and Neural Networks.
Cambridge University Press.
Rosenblatt, F. (1958).
The perceptron: a probabilistic model for information storage and organization in the brain.
Psychological Review, 65:386–408.
Rosenblatt, F. (1962).
Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.
Spartan Books, Washington, DC, USA.
Rumelhart, D. and Mc Clelland, J. (1986).
Parallel Distributed Processing: Explorations in the Microstructure of Cognition.
MIT Press, Cambridge, MA, USA.
Steinwart, I. (2002).
Support vector machines are universally consistent.
Journal of Complexity, 18:768–791.
Stinchcombe, M. (1999).
Neural network approximation of continuous functionals and continuous functions on compactifications.
Neural Network, 12(3):467–477.
Vapnik, V. (1995).
The Nature of Statistical Learning Theory.
Springer Verlag, New York, USA.
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Prediction
 
Learning from (dis)similarity data
Learning from (dis)similarity dataLearning from (dis)similarity data
Learning from (dis)similarity data
 
Investigating the 3D structure of the genome with Hi-C data analysis
Investigating the 3D structure of the genome with Hi-C data analysisInvestigating the 3D structure of the genome with Hi-C data analysis
Investigating the 3D structure of the genome with Hi-C data analysis
 
About functional SIR
About functional SIRAbout functional SIR
About functional SIR
 
Convolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernelsConvolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernels
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Kernel methods in machine learning
Kernel methods in machine learningKernel methods in machine learning
Kernel methods in machine learning
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Prototype-based models in machine learning
Prototype-based models in machine learningPrototype-based models in machine learning
Prototype-based models in machine learning
 
The Advancement and Challenges in Computational Physics - Phdassistance
The Advancement and Challenges in Computational Physics - PhdassistanceThe Advancement and Challenges in Computational Physics - Phdassistance
The Advancement and Challenges in Computational Physics - Phdassistance
 
Kernel Methods and Relational Learning in Computational Biology
Kernel Methods and Relational Learning in Computational BiologyKernel Methods and Relational Learning in Computational Biology
Kernel Methods and Relational Learning in Computational Biology
 
RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING
RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNINGRECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING
RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
 
Learning for Optimization: EDAs, probabilistic modelling, or ...
Learning for Optimization: EDAs, probabilistic modelling, or ...Learning for Optimization: EDAs, probabilistic modelling, or ...
Learning for Optimization: EDAs, probabilistic modelling, or ...
 
Biehl hanze-2021
Biehl hanze-2021Biehl hanze-2021
Biehl hanze-2021
 
Intuition – Based Teaching Mathematics for Engineers
Intuition – Based Teaching Mathematics for EngineersIntuition – Based Teaching Mathematics for Engineers
Intuition – Based Teaching Mathematics for Engineers
 
ICPC08a.ppt
ICPC08a.pptICPC08a.ppt
ICPC08a.ppt
 
DESIGN SUITABLE FEED FORWARD NEURAL NETWORK TO SOLVE TROESCH'S PROBLEM
DESIGN SUITABLE FEED FORWARD NEURAL NETWORK TO SOLVE TROESCH'S PROBLEMDESIGN SUITABLE FEED FORWARD NEURAL NETWORK TO SOLVE TROESCH'S PROBLEM
DESIGN SUITABLE FEED FORWARD NEURAL NETWORK TO SOLVE TROESCH'S PROBLEM
 
WILF2011 - slides
WILF2011 - slidesWILF2011 - slides
WILF2011 - slides
 

Ähnlich wie A short introduction to statistical learning

A short introduction to statistical learning
A short introduction to statistical learningA short introduction to statistical learning
A short introduction to statistical learningtuxette
 
A comparison of learning methods to predict N2O fluxes and N leaching
A comparison of learning methods to predict N2O fluxes and N leachingA comparison of learning methods to predict N2O fluxes and N leaching
A comparison of learning methods to predict N2O fluxes and N leachingtuxette
 
About functional SIR
About functional SIRAbout functional SIR
About functional SIRtuxette
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade offVARUN KUMAR
 
Machine learning algorithms and business use cases
Machine learning algorithms and business use casesMachine learning algorithms and business use cases
Machine learning algorithms and business use casesSridhar Ratakonda
 
Introduction to Machine Learning Lectures
Introduction to Machine Learning LecturesIntroduction to Machine Learning Lectures
Introduction to Machine Learning Lecturesssuserfece35
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataArthur Charpentier
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data sciencepujashri1975
 
1608 probability and statistics in engineering
1608 probability and statistics in engineering1608 probability and statistics in engineering
1608 probability and statistics in engineeringDr Fereidoun Dejahang
 
13ClassifierPerformance.pdf
13ClassifierPerformance.pdf13ClassifierPerformance.pdf
13ClassifierPerformance.pdfssuserdce5c21
 
Multiple comparison problem
Multiple comparison problemMultiple comparison problem
Multiple comparison problemJiri Haviger
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...Jeongmin Cha
 
Machine Learning, Financial Engineering and Quantitative Investing
Machine Learning, Financial Engineering and Quantitative InvestingMachine Learning, Financial Engineering and Quantitative Investing
Machine Learning, Financial Engineering and Quantitative InvestingShengyuan Wang Steven
 
Unit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdf
Unit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdfUnit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdf
Unit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdfAravindS199
 

Ähnlich wie A short introduction to statistical learning (20)

A short introduction to statistical learning
A short introduction to statistical learningA short introduction to statistical learning
A short introduction to statistical learning
 
A comparison of learning methods to predict N2O fluxes and N leaching
A comparison of learning methods to predict N2O fluxes and N leachingA comparison of learning methods to predict N2O fluxes and N leaching
A comparison of learning methods to predict N2O fluxes and N leaching
 
About functional SIR
About functional SIRAbout functional SIR
About functional SIR
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
 
MUMS Opening Workshop - Model Uncertainty and Uncertain Quantification - Merl...
MUMS Opening Workshop - Model Uncertainty and Uncertain Quantification - Merl...MUMS Opening Workshop - Model Uncertainty and Uncertain Quantification - Merl...
MUMS Opening Workshop - Model Uncertainty and Uncertain Quantification - Merl...
 
Machine learning algorithms and business use cases
Machine learning algorithms and business use casesMachine learning algorithms and business use cases
Machine learning algorithms and business use cases
 
Introduction to Machine Learning Lectures
Introduction to Machine Learning LecturesIntroduction to Machine Learning Lectures
Introduction to Machine Learning Lectures
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big data
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
 
Basen Network
Basen NetworkBasen Network
Basen Network
 
1608 probability and statistics in engineering
1608 probability and statistics in engineering1608 probability and statistics in engineering
1608 probability and statistics in engineering
 
Lausanne 2019 #1
Lausanne 2019 #1Lausanne 2019 #1
Lausanne 2019 #1
 
13ClassifierPerformance.pdf
13ClassifierPerformance.pdf13ClassifierPerformance.pdf
13ClassifierPerformance.pdf
 
Bayesian_Decision_Theory-3.pdf
Bayesian_Decision_Theory-3.pdfBayesian_Decision_Theory-3.pdf
Bayesian_Decision_Theory-3.pdf
 
Multiple comparison problem
Multiple comparison problemMultiple comparison problem
Multiple comparison problem
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...
 
2019 Fall Series: Postdoc Seminars - Special Guest Lecture, There is a Kernel...
2019 Fall Series: Postdoc Seminars - Special Guest Lecture, There is a Kernel...2019 Fall Series: Postdoc Seminars - Special Guest Lecture, There is a Kernel...
2019 Fall Series: Postdoc Seminars - Special Guest Lecture, There is a Kernel...
 
Machine Learning, Financial Engineering and Quantitative Investing
Machine Learning, Financial Engineering and Quantitative InvestingMachine Learning, Financial Engineering and Quantitative Investing
Machine Learning, Financial Engineering and Quantitative Investing
 
Bayesian computation with INLA
Bayesian computation with INLABayesian computation with INLA
Bayesian computation with INLA
 
Unit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdf
Unit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdfUnit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdf
Unit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdf
 

Mehr von tuxette

Racines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathsRacines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathstuxette
 
Méthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènesMéthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènestuxette
 
Méthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiquesMéthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiquestuxette
 
Projets autour de l'Hi-C
Projets autour de l'Hi-CProjets autour de l'Hi-C
Projets autour de l'Hi-Ctuxette
 
Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?tuxette
 
Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...tuxette
 
ASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquesASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquestuxette
 
Autour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeanAutour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeantuxette
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...tuxette
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquestuxette
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...tuxette
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...tuxette
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation datatuxette
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?tuxette
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysistuxette
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricestuxette
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelstuxette
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICStuxette
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICStuxette
 
A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNNtuxette
 

Mehr von tuxette (20)

Racines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathsRacines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en maths
 
Méthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènesMéthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènes
 
Méthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiquesMéthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiques
 
Projets autour de l'Hi-C
Projets autour de l'Hi-CProjets autour de l'Hi-C
Projets autour de l'Hi-C
 
Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?
 
Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...
 
ASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquesASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiques
 
Autour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeanAutour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWean
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation data
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysis
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatrices
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction models
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICS
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICS
 
A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNN
 

Kürzlich hochgeladen

Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 

Kürzlich hochgeladen (20)

Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 

A short introduction to statistical learning

  • 1. A short introduction to statistical learning Nathalie Villa-Vialaneix nathalie.villa-vialaneix@inra.fr http://www.nathalievilla.org GT “SPARKGEN” March 12th, 2018 - INRA, Toulouse Nathalie Villa-Vialaneix | Introduction to statistical learning 1/58
  • 2. Outline 1 Introduction Background and notations Underfitting / Overfitting Consistency 2 CART and random forests Introduction to CART Learning Prediction Overview of random forests Bootstrap/Bagging Random forest 3 SVM 4 (not deep) Neural networks Seminal references Multi-layer perceptrons Theoretical properties of perceptrons Learning perceptrons Learning in practice Nathalie Villa-Vialaneix | Introduction to statistical learning 2/58
  • 3. Outline 1 Introduction Background and notations Underfitting / Overfitting Consistency 2 CART and random forests Introduction to CART Learning Prediction Overview of random forests Bootstrap/Bagging Random forest 3 SVM 4 (not deep) Neural networks Seminal references Multi-layer perceptrons Theoretical properties of perceptrons Learning perceptrons Learning in practice Nathalie Villa-Vialaneix | Introduction to statistical learning 3/58
  • 4. Background Purpose: predict Y from X; Nathalie Villa-Vialaneix | Introduction to statistical learning 4/58
  • 5. Background Purpose: predict Y from X; What we have: n observations of (X, Y), (x1, y1), . . . , (xn, yn); Nathalie Villa-Vialaneix | Introduction to statistical learning 4/58
  • 6. Background Purpose: predict Y from X; What we have: n observations of (X, Y), (x1, y1), . . . , (xn, yn); What we want: estimate unknown Y from new X: xn+1, . . . , xm. Nathalie Villa-Vialaneix | Introduction to statistical learning 4/58
  • 7. Background Purpose: predict Y from X; What we have: n observations of (X, Y), (x1, y1), . . . , (xn, yn); What we want: estimate unknown Y from new X: xn+1, . . . , xm. X can be: numeric variables; or factors; or a combination of numeric variables and factors. Nathalie Villa-Vialaneix | Introduction to statistical learning 4/58
  • 8. Background Purpose: predict Y from X; What we have: n observations of (X, Y), (x1, y1), . . . , (xn, yn); What we want: estimate unknown Y from new X: xn+1, . . . , xm. X can be: numeric variables; or factors; or a combination of numeric variables and factors. Y can be: a numeric variable (Y ∈ R) ⇒ (supervised) regression régression; a factor ⇒ (supervised) classification discrimination. Nathalie Villa-Vialaneix | Introduction to statistical learning 4/58
  • 9. Basics From (xi, yi)i, definition of a machine, Φn s.t.: ˆynew = Φn (xnew). Nathalie Villa-Vialaneix | Introduction to statistical learning 5/58
  • 10. Basics From (xi, yi)i, definition of a machine, Φn s.t.: ˆynew = Φn (xnew). if Y is numeric, Φn is called a regression function fonction de classification; if Y is a factor, Φn is called a classifier classifieur; Nathalie Villa-Vialaneix | Introduction to statistical learning 5/58
  • 11. Basics From (xi, yi)i, definition of a machine, Φn s.t.: ˆynew = Φn (xnew). if Y is numeric, Φn is called a regression function fonction de classification; if Y is a factor, Φn is called a classifier classifieur; Φn is said to be trained or learned from the observations (xi, yi)i. Nathalie Villa-Vialaneix | Introduction to statistical learning 5/58
  • 12. Basics From (xi, yi)i, definition of a machine, Φn s.t.: ˆynew = Φn (xnew). if Y is numeric, Φn is called a regression function fonction de classification; if Y is a factor, Φn is called a classifier classifieur; Φn is said to be trained or learned from the observations (xi, yi)i. Desirable properties accuracy to the observations: predictions made on known data are close to observed values; Nathalie Villa-Vialaneix | Introduction to statistical learning 5/58
  • 13. Basics From (xi, yi)i, definition of a machine, Φn s.t.: ˆynew = Φn (xnew). if Y is numeric, Φn is called a regression function fonction de classification; if Y is a factor, Φn is called a classifier classifieur; Φn is said to be trained or learned from the observations (xi, yi)i. Desirable properties accuracy to the observations: predictions made on known data are close to observed values; generalization ability: predictions made on new data are also accurate. Nathalie Villa-Vialaneix | Introduction to statistical learning 5/58
  • 14. Basics From (xi, yi)i, definition of a machine, Φn s.t.: ˆynew = Φn (xnew). if Y is numeric, Φn is called a regression function fonction de classification; if Y is a factor, Φn is called a classifier classifieur; Φn is said to be trained or learned from the observations (xi, yi)i. Desirable properties accuracy to the observations: predictions made on known data are close to observed values; generalization ability: predictions made on new data are also accurate. Conflicting objectives!! Nathalie Villa-Vialaneix | Introduction to statistical learning 5/58
  • 15. Underfitting/Overfitting sous/sur - apprentissage Function x → y to be estimated Nathalie Villa-Vialaneix | Introduction to statistical learning 6/58
  • 16. Underfitting/Overfitting sous/sur - apprentissage Observations we might have Nathalie Villa-Vialaneix | Introduction to statistical learning 6/58
  • 17. Underfitting/Overfitting sous/sur - apprentissage Observations we do have Nathalie Villa-Vialaneix | Introduction to statistical learning 6/58
  • 18. Underfitting/Overfitting sous/sur - apprentissage First estimation from the observations: underfitting Nathalie Villa-Vialaneix | Introduction to statistical learning 6/58
  • 19. Underfitting/Overfitting sous/sur - apprentissage Second estimation from the observations: accurate estimation Nathalie Villa-Vialaneix | Introduction to statistical learning 6/58
  • 20. Underfitting/Overfitting sous/sur - apprentissage Third estimation from the observations: overfitting Nathalie Villa-Vialaneix | Introduction to statistical learning 6/58
  • 21. Underfitting/Overfitting sous/sur - apprentissage Summary Nathalie Villa-Vialaneix | Introduction to statistical learning 6/58
  • 22. Errors training error (measures the accuracy to the observations) Nathalie Villa-Vialaneix | Introduction to statistical learning 7/58
  • 23. Errors training error (measures the accuracy to the observations) if y is a factor: misclassification rate {ˆyi yi, i = 1, . . . , n} n Nathalie Villa-Vialaneix | Introduction to statistical learning 7/58
  • 24. Errors training error (measures the accuracy to the observations) if y is a factor: misclassification rate {ˆyi yi, i = 1, . . . , n} n if y is numeric: mean square error (MSE) 1 n n i=1 (ˆyi − yi)2 Nathalie Villa-Vialaneix | Introduction to statistical learning 7/58
  • 25. Errors training error (measures the accuracy to the observations) if y is a factor: misclassification rate {ˆyi yi, i = 1, . . . , n} n if y is numeric: mean square error (MSE) 1 n n i=1 (ˆyi − yi)2 or root mean square error (RMSE) or pseudo-R2 : 1−MSE/Var((yi)i) Nathalie Villa-Vialaneix | Introduction to statistical learning 7/58
  • 26. Errors training error (measures the accuracy to the observations) if y is a factor: misclassification rate {ˆyi yi, i = 1, . . . , n} n if y is numeric: mean square error (MSE) 1 n n i=1 (ˆyi − yi)2 or root mean square error (RMSE) or pseudo-R2 : 1−MSE/Var((yi)i) test error: a way to prevent overfitting (estimates the generalization error) is the simple validation Nathalie Villa-Vialaneix | Introduction to statistical learning 7/58
  • 27. Errors training error (measures the accuracy to the observations) if y is a factor: misclassification rate {ˆyi yi, i = 1, . . . , n} n if y is numeric: mean square error (MSE) 1 n n i=1 (ˆyi − yi)2 or root mean square error (RMSE) or pseudo-R2 : 1−MSE/Var((yi)i) test error: a way to prevent overfitting (estimates the generalization error) is the simple validation 1 split the data into training/test sets (usually 80%/20%) 2 train Φn from the training dataset 3 calculate the test error from the remaining data Nathalie Villa-Vialaneix | Introduction to statistical learning 7/58
  • 28. Example Observations Nathalie Villa-Vialaneix | Introduction to statistical learning 8/58
  • 29. Example Training/Test datasets Nathalie Villa-Vialaneix | Introduction to statistical learning 8/58
  • 30. Example Training/Test errors Nathalie Villa-Vialaneix | Introduction to statistical learning 8/58
  • 31. Example Summary Nathalie Villa-Vialaneix | Introduction to statistical learning 8/58
  • 32. Bias / Variance trade-off This problem is also related to the well known bias / variance trade-off bias: error that comes from erroneous assumptions in the learning algorithm (average error of the predictor) variance: error that comes from sensitivity to small fluctuations in the training set (variance of the predictor) Nathalie Villa-Vialaneix | Introduction to statistical learning 9/58
  • 33. Bias / Variance trade-off This problem is also related to the well known bias / variance trade-off bias: error that comes from erroneous assumptions in the learning algorithm (average error of the predictor) variance: error that comes from sensitivity to small fluctuations in the training set (variance of the predictor) Overall error is: E(MSE) = Bias2 + Variance Nathalie Villa-Vialaneix | Introduction to statistical learning 9/58
  • 34. Consistency in the parametric/non parametric case Example in the parametric framework (linear methods) an assumption is made on the form of the relation between X and Y: Y = βT X + β is estimated from the observations (x1, y1), . . . , (xn, yn) by a given method which calculates a βn . The estimation is said to be consistent if βn n→+∞ −−−−−−→ β under (possibly) technical assumptions on X, , Y. Nathalie Villa-Vialaneix | Introduction to statistical learning 10/58
  • 35. Consistency in the parametric/non parametric case Example in the nonparametric framework the form of the relation between X and Y is unknown: Y = Φ(X) + Φ is estimated from the observations (x1, y1), . . . , (xn, yn) by a given method which calculates a Φn . The estimation is said to be consistent if Φn n→+∞ −−−−−−→ Φ under (possibly) technical assumptions on X, , Y. Nathalie Villa-Vialaneix | Introduction to statistical learning 10/58
  • 36. Consistency from the statistical learning perspective [Vapnik, 1995] Question: Are we really interested in estimating Φ or... Nathalie Villa-Vialaneix | Introduction to statistical learning 11/58
  • 37. Consistency from the statistical learning perspective [Vapnik, 1995] Question: Are we really interested in estimating Φ or... ... rather in having the smallest prediction error? Statistical learning perspective: a method that builds a machine Φn from the observations is said to be (universally) consistent if, given a risk function R : R × R → R+ (which calculates an error), E (R(Φn (X), Y)) n→+∞ −−−−−−→ inf Φ:X→R E (R(Φ(X), Y)) , for any distribution of (X, Y) ∈ X × R. Definitions: L∗ = infΦ:X→R E (R(Φ(X), Y)) and LΦ = E (R(Φ(X), Y)). Nathalie Villa-Vialaneix | Introduction to statistical learning 11/58
  • 38. Desirable properties from a mathematical perspective Simplified framework: X ∈ X and Y ∈ {−1, 1} (binary classification) Learning process: choose a machine Φn in a class of functions C ⊂ {Φ : X → R} (e.g., C is the set of all functions that can be build using a SVM). Error decomposition LΦn − L∗ ≤ LΦn − inf Φ∈C LΦ + inf Φ∈C LΦ − L∗ with infΦ∈C LΦ − L∗ is the richness of C (i.e., C must be rich to ensure that this term is small); Nathalie Villa-Vialaneix | Introduction to statistical learning 12/58
  • 39. Desirable properties from a mathematical perspective Simplified framework: X ∈ X and Y ∈ {−1, 1} (binary classification) Learning process: choose a machine Φn in a class of functions C ⊂ {Φ : X → R} (e.g., C is the set of all functions that can be build using a SVM). Error decomposition LΦn − L∗ ≤ LΦn − inf Φ∈C LΦ + inf Φ∈C LΦ − L∗ with infΦ∈C LΦ − L∗ is the richness of C (i.e., C must be rich to ensure that this term is small); LΦn − infΦ∈C LΦ ≤ 2 supΦ∈C |Ln Φ − LΦ|, Ln Φ = 1 n n i=1 R(Φ(xi), yi) is the generalization capability of C (i.e., in the worst case, the empirical error must be close to the true error: C must not be too rich to ensure that this term is small). Nathalie Villa-Vialaneix | Introduction to statistical learning 12/58
  • 40. Outline 1 Introduction Background and notations Underfitting / Overfitting Consistency 2 CART and random forests Introduction to CART Learning Prediction Overview of random forests Bootstrap/Bagging Random forest 3 SVM 4 (not deep) Neural networks Seminal references Multi-layer perceptrons Theoretical properties of perceptrons Learning perceptrons Learning in practice Nathalie Villa-Vialaneix | Introduction to statistical learning 13/58
  • 41. Overview CART: Classification And Regression Trees introduced by [Breiman et al., 1984]. Nathalie Villa-Vialaneix | Introduction to statistical learning 14/58
  • 42. Overview CART: Classification And Regression Trees introduced by [Breiman et al., 1984]. Advantages classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method: no prior assumption needed; can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method); provide an intuitive interpretation. Nathalie Villa-Vialaneix | Introduction to statistical learning 14/58
  • 43. Overview CART: Classification And Regression Trees introduced by [Breiman et al., 1984]. Advantages classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method: no prior assumption needed; can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method); provide an intuitive interpretation. Drawbacks require a large training dataset to be efficient; as a consequence, are often too simple to provide accurate predictions. Nathalie Villa-Vialaneix | Introduction to statistical learning 14/58
  • 44. Example X = (Gender, Age, Height) and Y = Weight Y1 Y2 Y3 Y4 Y5 Height<1.60m Height>1.60m Gender=M Gender=F Age<30 Age>30 Gender=M Gender=F Root a split a node a leaf (terminal node) Nathalie Villa-Vialaneix | Introduction to statistical learning 15/58
  • 45. CART learning process Algorithm Start from root repeat move to a “new” node if the node is homogeneous or small enough then STOP else split the node into two child nodes with maximal “homogeneity” end if until all nodes are processed Nathalie Villa-Vialaneix | Introduction to statistical learning 16/58
  • 46. Further details Homogeneity? if Y is a numeric variable, variance of (yi)i for the observations assigned to the node (Gini index is also sometimes used); Nathalie Villa-Vialaneix | Introduction to statistical learning 17/58
  • 47. Further details Homogeneity? if Y is a numeric variable, variance of (yi)i for the observations assigned to the node (Gini index is also sometimes used); if Y is a factor, node purity: % of observations assigned to the node whose Y values are not the node majority class. Nathalie Villa-Vialaneix | Introduction to statistical learning 17/58
  • 48. Further details Homogeneity? if Y is a numeric variable, variance of (yi)i for the observations assigned to the node (Gini index is also sometimes used); if Y is a factor, node purity: % of observations assigned to the node whose Y values are not the node majority class. Stopping criteria? Minimum size node (generally 1 or 5) Minimum node purity or variance Maximum tree depth Nathalie Villa-Vialaneix | Introduction to statistical learning 17/58
  • 49. Further details Homogeneity? if Y is a numeric variable, variance of (yi)i for the observations assigned to the node (Gini index is also sometimes used); if Y is a factor, node purity: % of observations assigned to the node whose Y values are not the node majority class. Stopping criteria? Minimum size node (generally 1 or 5) Minimum node purity or variance Maximum tree depth Hyperparameters can be tuned by cross-validation using a grid search. Nathalie Villa-Vialaneix | Introduction to statistical learning 17/58
  • 50. Further details Homogeneity? if Y is a numeric variable, variance of (yi)i for the observations assigned to the node (Gini index is also sometimes used); if Y is a factor, node purity: % of observations assigned to the node whose Y values are not the node majority class. Stopping criteria? Minimum size node (generally 1 or 5) Minimum node purity or variance Maximum tree depth Hyperparameters can be tuned by cross-validation using a grid search. An alternative approach is pruning... Nathalie Villa-Vialaneix | Introduction to statistical learning 17/58
  • 51. Making new predictions A new observation, xnew is assigned to a leaf (straightforward) Nathalie Villa-Vialaneix | Introduction to statistical learning 18/58
  • 52. Making new predictions A new observation, xnew is assigned to a leaf (straightforward) the corresponding predicted ˆynew is if Y is numeric, the mean value of the observations (training set) assigned to the same leaf Nathalie Villa-Vialaneix | Introduction to statistical learning 18/58
  • 53. Making new predictions A new observation, xnew is assigned to a leaf (straightforward) the corresponding predicted ˆynew is if Y is numeric, the mean value of the observations (training set) assigned to the same leaf if Y is a factor, the majority class of the observations (training set) assigned to the same leaf Nathalie Villa-Vialaneix | Introduction to statistical learning 18/58
  • 54. Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001] Nathalie Villa-Vialaneix | Introduction to statistical learning 19/58
  • 55. Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001] Advantages classification OR regression (i.e., Y can be a numeric variable or a factor) non parametric method (no prior assumption needed) and accurate can deal with a large number of input variables, either numeric variables or factors can deal with small samples Nathalie Villa-Vialaneix | Introduction to statistical learning 19/58
  • 56. Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001] Advantages classification OR regression (i.e., Y can be a numeric variable or a factor) non parametric method (no prior assumption needed) and accurate can deal with a large number of input variables, either numeric variables or factors can deal with small samples Drawbacks black box model is only supported by a few mathematical results (consistency...) until now Nathalie Villa-Vialaneix | Introduction to statistical learning 19/58
  • 57. Basic description A fact: When the sample size is small, you might be unable to estimate properly Nathalie Villa-Vialaneix | Introduction to statistical learning 20/58
  • 58. Basic description A fact: When the sample size is small, you might be unable to estimate properly This issue is commonly tackled by bootstrapping and, more specifically, bagging (Boostrap Aggregating) ⇒ it reduces the variance of the estimator Nathalie Villa-Vialaneix | Introduction to statistical learning 20/58
  • 59. Basic description A fact: When the sample size is small, you might be unable to estimate properly This issue is commonly tackled by bootstrapping and, more specifically, bagging (Boostrap Aggregating) ⇒ it reduces the variance of the estimator Bagging: combination of simple (and underefficient) regression (or classification) functions Nathalie Villa-Vialaneix | Introduction to statistical learning 20/58
  • 60. Basic description A fact: When the sample size is small, you might be unable to estimate properly This issue is commonly tackled by bootstrapping and, more specifically, bagging (Boostrap Aggregating) ⇒ it reduces the variance of the estimator Bagging: combination of simple (and underefficient) regression (or classification) functions Random forest CART bagging Nathalie Villa-Vialaneix | Introduction to statistical learning 20/58
  • 61. Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset - samples have the same size than the original dataset Nathalie Villa-Vialaneix | Introduction to statistical learning 21/58
  • 62. Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset - samples have the same size than the original dataset General (and robust) approach to solve several problems: Nathalie Villa-Vialaneix | Introduction to statistical learning 21/58
  • 63. Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset - samples have the same size than the original dataset General (and robust) approach to solve several problems: Estimating confidence intervals (of X with no prior assumption on the distribution of X) 1 Build P bootstrap samples from (xi)i 2 Use them to estimate X P times 3 The confidence interval is based on the percentiles of the empirical distribution of X Nathalie Villa-Vialaneix | Introduction to statistical learning 21/58
  • 64. Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset - samples have the same size than the original dataset General (and robust) approach to solve several problems: Estimating confidence intervals (of X with no prior assumption on the distribution of X) Nathalie Villa-Vialaneix | Introduction to statistical learning 21/58
  • 65. Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset - samples have the same size than the original dataset General (and robust) approach to solve several problems: Estimating confidence intervals (of X with no prior assumption on the distribution of X) Nathalie Villa-Vialaneix | Introduction to statistical learning 21/58
  • 66. Bootstrap Bootstrap sample: random sampling (with replacement) of the training dataset - samples have the same size than the original dataset General (and robust) approach to solve several problems: Estimating confidence intervals (of X with no prior assumption on the distribution of X) Also useful to estimate p-values, residuals, ... Nathalie Villa-Vialaneix | Introduction to statistical learning 21/58
  • 67. Bagging Average the estimates of the regression (or the classification) function obtained from B bootstrap samples Nathalie Villa-Vialaneix | Introduction to statistical learning 22/58
  • 68. Bagging Average the estimates of the regression (or the classification) function obtained from B bootstrap samples Bagging with regression trees 1: for b = 1, . . . , B do 2: Build a bootstrap sample ξb 3: Train a regression tree from ξb, ˆφb 4: end for 5: Estimate the regression function by ˆΦn (x) = 1 B B b=1 ˆφb(x) Nathalie Villa-Vialaneix | Introduction to statistical learning 22/58
  • 69. Bagging Average the estimates of the regression (or the classification) function obtained from B bootstrap samples Bagging with regression trees 1: for b = 1, . . . , B do 2: Build a bootstrap sample ξb 3: Train a regression tree from ξb, ˆφb 4: end for 5: Estimate the regression function by ˆΦn (x) = 1 B B b=1 ˆφb(x) For classification, the predicted class is the majority vote class. Nathalie Villa-Vialaneix | Introduction to statistical learning 22/58
  • 70. Random forests CART bagging with additional variations 1 each node is based on a random (and different) subset of q variables (an advisable choice for q is √ p for classification and p/3 for regression) Nathalie Villa-Vialaneix | Introduction to statistical learning 23/58
  • 71. Random forests CART bagging with additional variations 1 each node is based on a random (and different) subset of q variables (an advisable choice for q is √ p for classification and p/3 for regression) 2 the tree is fully developed (overfitted) Hyperparameters those of the CART algorithm those that are specific to the random forest: q, number of trees Nathalie Villa-Vialaneix | Introduction to statistical learning 23/58
  • 72. Random forests CART bagging with additional variations 1 each node is based on a random (and different) subset of q variables (an advisable choice for q is √ p for classification and p/3 for regression) 2 the tree is fully developed (overfitted) Hyperparameters those of the CART algorithm those that are specific to the random forest: q, number of trees Random forests are not very sensitive to hyper-parameter settings: default values for q and 500/1000 trees should work in most cases. Nathalie Villa-Vialaneix | Introduction to statistical learning 23/58
  • 73. Additional tools OOB (Out-Of-Bag) error: error based on the observations not included in the “bag” Nathalie Villa-Vialaneix | Introduction to statistical learning 24/58
  • 74. Additional tools OOB (Out-Of-Bag) error: error based on the observations not included in the “bag” Stabilization of the OOB error is a good indication that there are enough trees in the forest Nathalie Villa-Vialaneix | Introduction to statistical learning 24/58
  • 75. Additional tools OOB (Out-Of-Bag) error: error based on the observations not included in the “bag” Importance of a variable to help interpretation: for a given variable Xj 1: randomize the values of the variable 2: make predictions from this new dataset 3: the importance is the mean decrease in accuracy (MSE or misclassification rate) Nathalie Villa-Vialaneix | Introduction to statistical learning 24/58
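A sketch with randomForest: the OOB error curve helps check that the forest has enough trees, and importance = TRUE gives the permutation-based mean decrease in accuracy described above; the number of trees is an arbitrary choice.
library(randomForest)
set.seed(2)
rf <- randomForest(Species ~ ., data = iris, ntree = 1000, importance = TRUE)
rf$err.rate[nrow(rf$err.rate), "OOB"]   # final OOB error
plot(rf)                                # OOB error versus number of trees (stabilization check)
importance(rf, type = 1)                # mean decrease in accuracy (permutation importance)
varImpPlot(rf)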
  • 76. Outline 1 Introduction Background and notations Underfitting / Overfitting Consistency 2 CART and random forests Introduction to CART Learning Prediction Overview of random forests Bootstrap/Bagging Random forest 3 SVM 4 (not deep) Neural networks Seminal references Multi-layer perceptrons Theoretical properties of perceptrons Learning perceptrons Learning in practice Nathalie Villa-Vialaneix | Introduction to statistical learning 25/58
  • 77. Basic introduction Binary classification problem: $X \in \mathcal{H}$ and $Y \in \{-1; 1\}$ A training set is given: $(x_1, y_1), \dots, (x_n, y_n)$ Nathalie Villa-Vialaneix | Introduction to statistical learning 26/58
  • 78. Basic introduction Binary classification problem: $X \in \mathcal{H}$ and $Y \in \{-1; 1\}$ A training set is given: $(x_1, y_1), \dots, (x_n, y_n)$ SVM is a method based on kernels. It is a universally consistent method, provided that the kernel is universal [Steinwart, 2002]. Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal. Nathalie Villa-Vialaneix | Introduction to statistical learning 26/58
  • 79. Optimal margin classification Nathalie Villa-Vialaneix | Introduction to statistical learning 27/58
  • 80. Optimal margin classification Nathalie Villa-Vialaneix | Introduction to statistical learning 27/58
  • 81. Optimal margin classification (figure: separating hyperplane with normal vector $w$, margin $\frac{1}{\|w\|_2}$, support vectors on the margin) Nathalie Villa-Vialaneix | Introduction to statistical learning 27/58
  • 82. Optimal margin classification (figure: separating hyperplane with normal vector $w$, margin $\frac{1}{\|w\|_2}$, support vectors on the margin) $w$ is chosen such that: $\min_w \|w\|^2$ (the margin is the largest), under the constraints: $y_i(\langle w, x_i\rangle + b) \geq 1$, $1 \leq i \leq n$ (the separation between the two classes is perfect). ⇒ ensures a good generalization capability. Nathalie Villa-Vialaneix | Introduction to statistical learning 27/58
  • 83. Soft margin classification Nathalie Villa-Vialaneix | Introduction to statistical learning 28/58
  • 84. Soft margin classification Nathalie Villa-Vialaneix | Introduction to statistical learning 28/58
  • 85. Soft margin classification (figure: separating hyperplane with normal vector $w$, margin $\frac{1}{\|w\|_2}$, support vectors on the margin) Nathalie Villa-Vialaneix | Introduction to statistical learning 28/58
  • 86. Soft margin classification (figure: separating hyperplane with normal vector $w$, margin $\frac{1}{\|w\|_2}$, support vectors on the margin) $w$ is chosen such that: $\min_{w,\xi} \|w\|^2 + C\sum_{i=1}^{n}\xi_i$ (the margin is the largest), under the constraints: $y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i$, $1 \leq i \leq n$, and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes is almost perfect). ⇒ allowing a few errors enriches the class of admissible classifiers. Nathalie Villa-Vialaneix | Introduction to statistical learning 28/58
  • 87. Non linear SVM Original space X Nathalie Villa-Vialaneix | Introduction to statistical learning 29/58
  • 88. Non linear SVM Original space X Feature space H Ψ (non linear) Nathalie Villa-Vialaneix | Introduction to statistical learning 29/58
  • 89. Non linear SVM Original space X Feature space H Ψ (non linear) Nathalie Villa-Vialaneix | Introduction to statistical learning 29/58
  • 90. Non linear SVM Original space $X$; feature space $\mathcal{H}$; $\Psi$ (non linear) $w \in \mathcal{H}$ is chosen such that $(P_{C,\mathcal{H}})$: $\min_{w,\xi} \|w\|^2_{\mathcal{H}} + C\sum_{i=1}^{n}\xi_i$ (the margin in the feature space is the largest), under the constraints: $y_i(\langle w, \Psi(x_i)\rangle_{\mathcal{H}} + b) \geq 1 - \xi_i$, $1 \leq i \leq n$, and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes in the feature space is almost perfect). Nathalie Villa-Vialaneix | Introduction to statistical learning 29/58
  • 91. SVM from different points of view A regularization problem: $(P_{C,\mathcal{H}}) \Leftrightarrow (P^2_{\lambda,\mathcal{H}}): \min_{w \in \mathcal{H}} \underbrace{\frac{1}{n}\sum_{i=1}^{n} R(f_w(x_i), y_i)}_{\text{error term}} + \underbrace{\lambda \|w\|^2_{\mathcal{H}}}_{\text{penalization term}}$, where $f_w(x) = \langle \Psi(x), w\rangle_{\mathcal{H}}$ and $R(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (hinge loss function) errors versus $\hat{y}$ for $y = 1$: blue, hinge loss; green, misclassification error. Nathalie Villa-Vialaneix | Introduction to statistical learning 30/58
  • 92. SVM from different points of view A regularization problem: $(P_{C,\mathcal{H}}) \Leftrightarrow (P^2_{\lambda,\mathcal{H}}): \min_{w \in \mathcal{H}} \underbrace{\frac{1}{n}\sum_{i=1}^{n} R(f_w(x_i), y_i)}_{\text{error term}} + \underbrace{\lambda \|w\|^2_{\mathcal{H}}}_{\text{penalization term}}$, where $f_w(x) = \langle \Psi(x), w\rangle_{\mathcal{H}}$ and $R(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (hinge loss function) A dual problem: $(P_{C,\mathcal{H}}) \Leftrightarrow (D_{C,X}): \max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \langle \Psi(x_i), \Psi(x_j)\rangle_{\mathcal{H}}$, with $\sum_{i=1}^{n}\alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $1 \leq i \leq n$. Nathalie Villa-Vialaneix | Introduction to statistical learning 30/58
  • 93. SVM from different points of view A regularization problem: $(P_{C,\mathcal{H}}) \Leftrightarrow (P^2_{\lambda,\mathcal{H}}): \min_{w \in \mathcal{H}} \underbrace{\frac{1}{n}\sum_{i=1}^{n} R(f_w(x_i), y_i)}_{\text{error term}} + \underbrace{\lambda \|w\|^2_{\mathcal{H}}}_{\text{penalization term}}$, where $f_w(x) = \langle \Psi(x), w\rangle_{\mathcal{H}}$ and $R(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (hinge loss function) A dual problem: $(P_{C,\mathcal{H}}) \Leftrightarrow (D_{C,X}): \max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j)$, with $\sum_{i=1}^{n}\alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $1 \leq i \leq n$. There is no need to know $\Psi$ and $\mathcal{H}$: choose a function $K$ with a few good properties and use it as the dot product in $\mathcal{H}$: $\forall\, u, v \in X$, $K(u, v) = \langle \Psi(u), \Psi(v)\rangle_{\mathcal{H}}$. Nathalie Villa-Vialaneix | Introduction to statistical learning 30/58
  • 94. Which kernels? Minimum properties that a kernel should fulfil: symmetry: $K(u, u') = K(u', u)$; positivity: $\forall\, N \in \mathbb{N}$, $\forall\, (\alpha_i)_i \subset \mathbb{R}^N$, $\forall\, (x_i)_i \subset X^N$, $\sum_{i,j}\alpha_i\alpha_j K(x_i, x_j) \geq 0$. [Aronszajn, 1950]: $\exists$ a Hilbert space $(\mathcal{H}, \langle\cdot, \cdot\rangle_{\mathcal{H}})$ and a function $\Psi: X \to \mathcal{H}$ such that: $\forall\, u, v \in X$, $K(u, v) = \langle \Psi(u), \Psi(v)\rangle_{\mathcal{H}}$ Nathalie Villa-Vialaneix | Introduction to statistical learning 31/58
  • 95. Which kernels? Minimum properties that a kernel should fulfil: symmetry: $K(u, u') = K(u', u)$; positivity: $\forall\, N \in \mathbb{N}$, $\forall\, (\alpha_i)_i \subset \mathbb{R}^N$, $\forall\, (x_i)_i \subset X^N$, $\sum_{i,j}\alpha_i\alpha_j K(x_i, x_j) \geq 0$. [Aronszajn, 1950]: $\exists$ a Hilbert space $(\mathcal{H}, \langle\cdot, \cdot\rangle_{\mathcal{H}})$ and a function $\Psi: X \to \mathcal{H}$ such that: $\forall\, u, v \in X$, $K(u, v) = \langle \Psi(u), \Psi(v)\rangle_{\mathcal{H}}$ Examples: the Gaussian kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = e^{-\gamma\|x - x'\|^2}$ (it is universal on every bounded subset of $\mathbb{R}^d$); the linear kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = x^\top x'$ (it is not universal). Nathalie Villa-Vialaneix | Introduction to statistical learning 31/58
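As an illustration (not in the original slides), both kernels above can be computed explicitly in R on a small data matrix; the value of gamma and the use of the first ten iris rows are arbitrary.
X <- as.matrix(iris[1:10, 1:4])              # 10 observations, d = 4
gamma <- 0.5
K_linear <- tcrossprod(X)                    # K(x, x') = x^T x'
D2 <- as.matrix(dist(X))^2                   # squared Euclidean distances
K_gauss <- exp(-gamma * D2)                  # K(x, x') = exp(-gamma * ||x - x'||^2)
# both matrices are symmetric and positive (semi-)definite:
min(eigen(K_gauss, symmetric = TRUE)$values)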
  • 96. In summary, what does the solution look like? $\Phi^n(x) = \sum_{i} \alpha_i y_i K(x_i, x)$, where only a few $\alpha_i \neq 0$. The indices $i$ such that $\alpha_i \neq 0$ are the support vectors! Nathalie Villa-Vialaneix | Introduction to statistical learning 32/58
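With e1071, the fitted svm object exposes exactly these quantities; a hedged sketch (component names as documented in e1071, kernel and cost values arbitrary): $index gives the indices of the support vectors, $SV the support vectors themselves and $coefs the coefficients $\alpha_i y_i$.
library(e1071)
iris2 <- iris[iris$Species %in% c("versicolor", "virginica"), ]
iris2$Species <- droplevels(iris2$Species)
fit <- svm(Species ~ ., data = iris2, kernel = "radial", cost = 1)
fit$index        # indices i of the support vectors (alpha_i != 0)
nrow(fit$SV)     # number of support vectors
head(fit$coefs)  # corresponding coefficients alpha_i * y_i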
  • 97. I’m almost dead with all this stuff on my mind!!! What do we do in practice?
data(iris)
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3), legend = c("versicolor", "virginica"))
Nathalie Villa-Vialaneix | Introduction to statistical learning 33/58
  • 98. I’m almost dead with all this stuff on my mind!!! What do we do in practice?
library(e1071)
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear", cost = 2^(-1:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#   cost
#    0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4), kernel = "linear")
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: linear
#   cost: 0.5
#   gamma: 0.25
# Number of Support Vectors: 21
Nathalie Villa-Vialaneix | Introduction to statistical learning 34/58
  • 99. I’m almost dead with all this stuff on my mind!!! What do we do in practice?
table(res.tune$best.model$fitted, iris$Species)
%              setosa versicolor virginica
%   setosa          0          0         0
%   versicolor      0         45         0
%   virginica       0          5        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
Nathalie Villa-Vialaneix | Introduction to statistical learning 35/58
  • 100. I’m almost dead with all this stuff on my mind!!! What do we do in practice?
res.tune <- tune.svm(Species ~ ., data = iris, gamma = 2^(-1:1), cost = 2^(2:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#   gamma cost
#     0.5    4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1), cost = 2^(2:4))
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: radial
#   cost: 4
#   gamma: 0.5
# Number of Support Vectors: 32
Nathalie Villa-Vialaneix | Introduction to statistical learning 36/58
  • 101. I’m almost dead with all this stuff on my mind!!! What do we do in practice?
table(res.tune$best.model$fitted, iris$Species)
%              setosa versicolor virginica
%   setosa          0          0         0
%   versicolor      0         49         0
%   virginica       0          1        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
Nathalie Villa-Vialaneix | Introduction to statistical learning 37/58
  • 102. Outline 1 Introduction Background and notations Underfitting / Overfitting Consistency 2 CART and random forests Introduction to CART Learning Prediction Overview of random forests Bootstrap/Bagging Random forest 3 SVM 4 (not deep) Neural networks Seminal references Multi-layer perceptrons Theoretical properties of perceptrons Learning perceptrons Learning in practice Nathalie Villa-Vialaneix | Introduction to statistical learning 38/58
  • 103. What are (artificial) neural networks? Common properties (artificial) “Neural networks”: general name for supervised and unsupervised methods developed in (vague) analogy to the brain Nathalie Villa-Vialaneix | Introduction to statistical learning 39/58
  • 104. What are (artificial) neural networks? Common properties (artificial) “Neural networks”: general name for supervised and unsupervised methods developed in (vague) analogy to the brain combination (network) of simple elements (neurons) Nathalie Villa-Vialaneix | Introduction to statistical learning 39/58
  • 105. What are (artificial) neural networks? Common properties (artificial) “Neural networks”: general name for supervised and unsupervised methods developed in (vague) analogy to the brain combination (network) of simple elements (neurons) Example of graphical representation: INPUTS OUTPUTS Nathalie Villa-Vialaneix | Introduction to statistical learning 39/58
  • 106. Different types of neural networks A neural network is defined by: 1 the network structure; 2 the neuron type. Nathalie Villa-Vialaneix | Introduction to statistical learning 40/58
  • 107. Different types of neural networks A neural network is defined by: 1 the network structure; 2 the neuron type. Standard examples Multilayer perceptrons (MLP) Perceptron multi-couches: dedicated to supervised problems (classification and regression); Nathalie Villa-Vialaneix | Introduction to statistical learning 40/58
  • 108. Different types of neural networks A neural network is defined by: 1 the network structure; 2 the neuron type. Standard examples Multilayer perceptrons (MLP) Perceptron multi-couches: dedicated to supervised problems (classification and regression); Radial basis function networks (RBF): same purpose but based on local smoothing; Nathalie Villa-Vialaneix | Introduction to statistical learning 40/58
  • 109. Different types of neural networks A neural network is defined by: 1 the network structure; 2 the neuron type. Standard examples Multilayer perceptrons (MLP) Perceptron multi-couches: dedicated to supervised problems (classification and regression); Radial basis function networks (RBF): same purpose but based on local smoothing; Self-organizing maps (SOM also sometimes called Kohonen’s maps) or Topographic maps: dedicated to unsupervised problems (clustering), self-organized; . . . Nathalie Villa-Vialaneix | Introduction to statistical learning 40/58
  • 110. Different types of neural networks A neural network is defined by: 1 the network structure; 2 the neuron type. Standard examples Multilayer perceptrons (MLP) Perceptron multi-couches: dedicated to supervised problems (classification and regression); Radial basis function networks (RBF): same purpose but based on local smoothing; Self-organizing maps (SOM also sometimes called Kohonen’s maps) or Topographic maps: dedicated to unsupervised problems (clustering), self-organized; . . . In this talk, focus on MLP. Nathalie Villa-Vialaneix | Introduction to statistical learning 40/58
  • 111. MLP: Advantages/Drawbacks Advantages classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method: flexible; good theoretical properties. Nathalie Villa-Vialaneix | Introduction to statistical learning 41/58
  • 112. MLP: Advantages/Drawbacks Advantages classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method: flexible; good theoretical properties. Drawbacks hard to train (high computational cost, especially when d is large); overfit easily; “black box” models (hard to interpret). Nathalie Villa-Vialaneix | Introduction to statistical learning 41/58
  • 113. References Advised references: [Bishop, 1995, Ripley, 1996] overview of the topic from a learning (more than statistical) perspective [Devroye et al., 1996, Györfi et al., 2002] in dedicated chapters present statistical properties of perceptrons Nathalie Villa-Vialaneix | Introduction to statistical learning 42/58
  • 114. Analogy to the brain 1 a neuron collects signals from neighboring neurons through its dendrites connections which frequently lead to activating a neuron are reinforced (they tend to have an increasing impact on the destination neuron) Nathalie Villa-Vialaneix | Introduction to statistical learning 43/58
  • 115. Analogy to the brain 1 a neuron collects signals from neighboring neurons through its dendrites 2 when the total signal is above a given threshold, the neuron is activated connections which frequently lead to activating a neuron are reinforced (they tend to have an increasing impact on the destination neuron) Nathalie Villa-Vialaneix | Introduction to statistical learning 43/58
  • 116. Analogy to the brain 1 a neuron collects signals from neighboring neurons through its dendrites 2 when the total signal is above a given threshold, the neuron is activated 3 ... and a signal is sent to other neurons through the axon connections which frequently lead to activating a neuron are reinforced (they tend to have an increasing impact on the destination neuron) Nathalie Villa-Vialaneix | Introduction to statistical learning 43/58
  • 117. First model of artificial neuron [Mc Culloch and Pitts, 1943, Rosenblatt, 1958, Rosenblatt, 1962] (figure: inputs $x^{(1)}, \dots, x^{(p)}$, weights $w_1, \dots, w_p$, bias $w_0$, summation $\Sigma$, output $f(x)$) $f: x \in \mathbb{R}^p \mapsto \mathbb{1}_{\left\{\sum_{j=1}^{p} w_j x^{(j)} + w_0 \geq 0\right\}}$ Nathalie Villa-Vialaneix | Introduction to statistical learning 44/58
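The threshold unit above translates directly into R (a sketch; the weight values below are arbitrary):
mcculloch_pitts <- function(x, w, w0) {
  as.numeric(sum(w * x) + w0 >= 0)   # 1 if the weighted sum plus bias is nonnegative, 0 otherwise
}
mcculloch_pitts(x = c(1, 0, 1), w = c(0.5, -0.3, 0.2), w0 = -0.6)  # returns 1 (0.7 - 0.6 >= 0)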
  • 118. (artificial) Perceptron Layers MLP have one input layer ($x \in \mathbb{R}^p$), one output layer ($y \in \mathbb{R}$ or $y \in \{1, \dots, K-1\}$) and several hidden layers; no connections within a layer; connections between two consecutive layers (feedforward). Example (regression, $y \in \mathbb{R}$): INPUTS $x = (x^{(1)}, \dots, x^{(p)})$ Nathalie Villa-Vialaneix | Introduction to statistical learning 45/58
  • 119. (artificial) Perceptron Layers MLP have one input layer ($x \in \mathbb{R}^p$), one output layer ($y \in \mathbb{R}$ or $y \in \{1, \dots, K-1\}$) and several hidden layers; no connections within a layer; connections between two consecutive layers (feedforward). Example (regression, $y \in \mathbb{R}$): INPUTS $x = (x^{(1)}, \dots, x^{(p)})$ Layer 1 weights $w^{(1)}_{jk}$ Nathalie Villa-Vialaneix | Introduction to statistical learning 45/58
  • 120. (artificial) Perceptron Layers MLP have one input layer ($x \in \mathbb{R}^p$), one output layer ($y \in \mathbb{R}$ or $y \in \{1, \dots, K-1\}$) and several hidden layers; no connections within a layer; connections between two consecutive layers (feedforward). Example (regression, $y \in \mathbb{R}$): INPUTS $x = (x^{(1)}, \dots, x^{(p)})$ Layer 1 Nathalie Villa-Vialaneix | Introduction to statistical learning 45/58
  • 121. (artificial) Perceptron Layers MLP have one input layer ($x \in \mathbb{R}^p$), one output layer ($y \in \mathbb{R}$ or $y \in \{1, \dots, K-1\}$) and several hidden layers; no connections within a layer; connections between two consecutive layers (feedforward). Example (regression, $y \in \mathbb{R}$): INPUTS $x = (x^{(1)}, \dots, x^{(p)})$ Layer 1 Layer 2 Nathalie Villa-Vialaneix | Introduction to statistical learning 45/58
  • 122. (artificial) Perceptron Layers MLP have one input layer ($x \in \mathbb{R}^p$), one output layer ($y \in \mathbb{R}$ or $y \in \{1, \dots, K-1\}$) and several hidden layers; no connections within a layer; connections between two consecutive layers (feedforward). Example (regression, $y \in \mathbb{R}$): INPUTS $x = (x^{(1)}, \dots, x^{(p)})$ Layer 1 Layer 2 $y$ OUTPUTS (a 2-hidden-layer MLP) Nathalie Villa-Vialaneix | Introduction to statistical learning 45/58
  • 123. A neuron in MLP (figure: inputs $v_1, v_2, v_3$ multiplied by weights $w_1, w_2, w_3$ and summed with a bias biais $w_0$) Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
  • 124. A neuron in MLP (figure: inputs $v_1, v_2, v_3$ multiplied by weights $w_1, w_2, w_3$ and summed with a bias $w_0$) Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
  • 125. A neuron in MLP (figure: inputs $v_1, v_2, v_3$ multiplied by weights $w_1, w_2, w_3$ and summed with a bias $w_0$) Standard activation functions fonctions de transfert / d’activation Biologically inspired: Heaviside function $h(z) = 0$ if $z < 0$, $1$ otherwise. Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
  • 126. A neuron in MLP (figure: inputs $v_1, v_2, v_3$ multiplied by weights $w_1, w_2, w_3$ and summed with a bias $w_0$) Standard activation functions Main issue with the Heaviside function: it is not continuous! Identity: $h(z) = z$ Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
  • 127. A neuron in MLP (figure: inputs $v_1, v_2, v_3$ multiplied by weights $w_1, w_2, w_3$ and summed with a bias $w_0$) Standard activation functions But the identity activation function gives a linear model if used with one hidden layer: not flexible enough. Logistic function: $h(z) = \frac{1}{1 + \exp(-z)}$ Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
  • 128. A neuron in MLP (figure: inputs $v_1, v_2, v_3$ multiplied by weights $w_1, w_2, w_3$ and summed with a bias $w_0$) Standard activation functions Another popular activation function (useful to model positive real numbers): Rectified Linear Unit (ReLU), $h(z) = \max(0, z)$ Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
  • 129. A neuron in MLP (figure: inputs $v_1, v_2, v_3$ multiplied by weights $w_1, w_2, w_3$ and summed with a bias $w_0$) General sigmoid sigmoid: a nondecreasing function $h: \mathbb{R} \to \mathbb{R}$ such that $\lim_{z \to +\infty} h(z) = 1$ and $\lim_{z \to -\infty} h(z) = 0$ Nathalie Villa-Vialaneix | Introduction to statistical learning 46/58
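The standard activation functions of the previous slides, written as plain R functions (a sketch; the plotting range is arbitrary):
heaviside <- function(z) as.numeric(z >= 0)      # biologically inspired, but not continuous
identity_act <- function(z) z                    # gives a linear model with one hidden layer
logistic <- function(z) 1 / (1 + exp(-z))        # smooth sigmoid with values in (0, 1)
relu <- function(z) pmax(0, z)                   # rectified linear unit
curve(logistic, from = -6, to = 6)               # one example of a general sigmoid (limits 0 and 1)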
  • 130. Focus on one-hidden-layer perceptrons Regression case (figure: inputs $x^{(1)}, \dots, x^{(p)}$, first-layer weights $w^{(1)}$, biases $w^{(0)}_1, \dots, w^{(0)}_Q$, second-layer weights $w^{(2)}$, output $f(x)$) $f(x) = \sum_{k=1}^{Q} w^{(2)}_k h_k\left(\langle x, w^{(1)}_k\rangle + w^{(0)}_k\right) + w^{(2)}_0$, with $h_k$ a (logistic) sigmoid Nathalie Villa-Vialaneix | Introduction to statistical learning 47/58
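A direct R transcription of this formula for given weights (a sketch; the weight values, Q = 3 and p = 2 are arbitrary):
one_hidden_mlp <- function(x, W1, w0, w2, w2_0, h = function(z) 1 / (1 + exp(-z))) {
  # W1: Q x p matrix of first-layer weights, w0: length-Q biases
  # w2: length-Q second-layer weights, w2_0: output bias
  a1 <- as.vector(W1 %*% x) + w0    # a_k = <x, w_k^(1)> + w_k^(0)
  z1 <- h(a1)                       # hidden-unit outputs
  sum(w2 * z1) + w2_0               # f(x)
}
set.seed(3)
Q <- 3; p <- 2
one_hidden_mlp(x = c(0.5, -1),
               W1 = matrix(rnorm(Q * p), nrow = Q),
               w0 = rnorm(Q), w2 = rnorm(Q), w2_0 = 0)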
  • 131. Focus on one-hidden-layer perceptrons Binary classification case (same architecture, output $\psi(x)$) $\psi(x) = h_0\left(\sum_{k=1}^{Q} w^{(2)}_k h_k\left(\langle x, w^{(1)}_k\rangle + w^{(0)}_k\right) + w^{(2)}_0\right)$ with $h_0$ the logistic sigmoid or the identity. Nathalie Villa-Vialaneix | Introduction to statistical learning 47/58
  • 132. Focus on one-hidden-layer perceptrons Binary classification case (same architecture, output $\psi(x)$) decision with: $f(x) = 0$ if $\psi(x) < 1/2$ and $1$ otherwise Nathalie Villa-Vialaneix | Introduction to statistical learning 47/58
  • 133. Focus on one-hidden-layer perceptrons Extension to any classification problem in $\{1, \dots, K-1\}$ Straightforward extension to multiple classes with a multiple-output perceptron (number of output units equal to K) and a maximum probability rule for the decision. Nathalie Villa-Vialaneix | Introduction to statistical learning 47/58
  • 134. Theoretical properties of perceptrons This section answers two questions: 1 can we approximate any function $g: [0,1]^p \to \mathbb{R}$ arbitrarily well with a perceptron? Nathalie Villa-Vialaneix | Introduction to statistical learning 48/58
  • 135. Theoretical properties of perceptrons This section answers two questions: 1 can we approximate any function $g: [0,1]^p \to \mathbb{R}$ arbitrarily well with a perceptron? 2 when a perceptron is trained with i.i.d. observations from an arbitrary random variable pair (X, Y), is it consistent? (i.e., does it reach the minimum possible error asymptotically when the number of observations grows to infinity?) Nathalie Villa-Vialaneix | Introduction to statistical learning 48/58
  • 136. Illustration of the universal approximation property Simple example: a function to approximate: $g: x \in [0, 1] \mapsto \sin\left(\frac{1}{x + 0.1}\right)$ Nathalie Villa-Vialaneix | Introduction to statistical learning 49/58
  • 137. Illustration of the universal approximation property Simple example: a function to approximate: $g: x \in [0, 1] \mapsto \sin\left(\frac{1}{x + 0.1}\right)$ trying to approximate this function (how this is performed is explained later in this talk) with MLPs having different numbers of neurons on their hidden layer Nathalie Villa-Vialaneix | Introduction to statistical learning 49/58
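This illustration can be reproduced with the nnet package (a sketch; the grid of x values, the hidden sizes and maxit are arbitrary choices):
library(nnet)
set.seed(4)
x <- seq(0, 1, length.out = 200)
y <- sin(1 / (x + 0.1))
plot(x, y, type = "l", lwd = 2)                 # the function to approximate
for (Q in c(2, 5, 15)) {                        # increasing hidden-layer sizes
  fit <- nnet(x = matrix(x, ncol = 1), y = y, size = Q,
              linout = TRUE, maxit = 500, trace = FALSE)
  lines(x, predict(fit, matrix(x, ncol = 1)), col = Q, lty = 2)
}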
  • 138. Universal property from a theoretical point of view Set of MLPs with a given size: $\mathcal{P}_Q(h) = \left\{ x \in \mathbb{R}^p \mapsto \sum_{k=1}^{Q} w^{(2)}_k h\left(\langle x, w^{(1)}_k\rangle + w^{(0)}_k\right) + w^{(2)}_0 : w^{(2)}_k, w^{(0)}_k \in \mathbb{R},\ w^{(1)}_k \in \mathbb{R}^p \right\}$ Nathalie Villa-Vialaneix | Introduction to statistical learning 50/58
  • 139. Universal property from a theoretical point of view Set of MLPs with a given size: $\mathcal{P}_Q(h) = \left\{ x \in \mathbb{R}^p \mapsto \sum_{k=1}^{Q} w^{(2)}_k h\left(\langle x, w^{(1)}_k\rangle + w^{(0)}_k\right) + w^{(2)}_0 : w^{(2)}_k, w^{(0)}_k \in \mathbb{R},\ w^{(1)}_k \in \mathbb{R}^p \right\}$ Set of all MLPs: $\mathcal{P}(h) = \cup_{Q \in \mathbb{N}} \mathcal{P}_Q(h)$ Nathalie Villa-Vialaneix | Introduction to statistical learning 50/58
  • 140. Universal property from a theoretical point of view Set of MLPs with a given size: $\mathcal{P}_Q(h) = \left\{ x \in \mathbb{R}^p \mapsto \sum_{k=1}^{Q} w^{(2)}_k h\left(\langle x, w^{(1)}_k\rangle + w^{(0)}_k\right) + w^{(2)}_0 : w^{(2)}_k, w^{(0)}_k \in \mathbb{R},\ w^{(1)}_k \in \mathbb{R}^p \right\}$ Set of all MLPs: $\mathcal{P}(h) = \cup_{Q \in \mathbb{N}} \mathcal{P}_Q(h)$ Universal approximation [Pinkus, 1999] If h is a non-polynomial continuous function, then, for any continuous function $g: [0,1]^p \to \mathbb{R}$ and any $\epsilon > 0$, $\exists\, f \in \mathcal{P}(h)$ such that: $\sup_{x \in [0,1]^p} |f(x) - g(x)| \leq \epsilon$. Nathalie Villa-Vialaneix | Introduction to statistical learning 50/58
  • 141. Remarks on universal approximation continuity of the activation function is not required (see [Devroye et al., 1996] for a result with arbitrary sigmoids); other versions of this property are given in [Hornik, 1991, Hornik, 1993, Stinchcombe, 1999] for different functional spaces for g; none of the spaces $\mathcal{P}_Q(h)$, for a fixed Q, has this property; this result can be used to show that perceptrons are consistent whenever $\frac{Q \log(n)}{n} \xrightarrow{n \to +\infty} 0$ [Farago and Lugosi, 1993, Devroye et al., 1996] Nathalie Villa-Vialaneix | Introduction to statistical learning 51/58
  • 142. Empirical error minimization Given i.i.d. observations of (X, Y), $(X_i, Y_i)$, how to choose the weights w? (figure: one-hidden-layer perceptron computing $f_w(x)$) Nathalie Villa-Vialaneix | Introduction to statistical learning 52/58
  • 143. Empirical error minimization Given i.i.d. observations of (X, Y), $(X_i, Y_i)$, how to choose the weights w? Standard approach: minimize the empirical L2 risk: $\hat{R}_n(w) = \sum_{i=1}^{n} [f_w(X_i) - Y_i]^2$ with $Y_i \in \mathbb{R}$ for the regression case and $Y_i \in \{0, 1\}$ for the classification case, with the associated decision rule $x \mapsto \mathbb{1}_{\{f_w(x) \geq 1/2\}}$. Nathalie Villa-Vialaneix | Introduction to statistical learning 52/58
  • 144. Empirical error minimization Given i.i.d. observations of (X, Y), $(X_i, Y_i)$, how to choose the weights w? Standard approach: minimize the empirical L2 risk: $\hat{R}_n(w) = \sum_{i=1}^{n} [f_w(X_i) - Y_i]^2$ with $Y_i \in \mathbb{R}$ for the regression case and $Y_i \in \{0, 1\}$ for the classification case, with the associated decision rule $x \mapsto \mathbb{1}_{\{f_w(x) \geq 1/2\}}$. But: $\hat{R}_n(w)$ is not convex in w ⇒ general (non-convex) optimization problem Nathalie Villa-Vialaneix | Introduction to statistical learning 52/58
  • 145. Optimization with gradient descent Method: initialize (randomly or with some prior knowledge) the weights $w(0) \in \mathbb{R}^{Qp + 2Q + 1}$ Batch approach: for $t = 1, \dots, T$, $w(t+1) = w(t) - \mu(t)\, \nabla_w \hat{R}_n(w(t))$; Nathalie Villa-Vialaneix | Introduction to statistical learning 53/58
  • 146. Optimization with gradient descent Method: initialize (randomly or with some prior knowledge) the weights $w(0) \in \mathbb{R}^{Qp + 2Q + 1}$ Batch approach: for $t = 1, \dots, T$, $w(t+1) = w(t) - \mu(t)\, \nabla_w \hat{R}_n(w(t))$; online (or stochastic) approach: write $\hat{R}_n(w) = \sum_{i=1}^{n} \underbrace{[f_w(X_i) - Y_i]^2}_{= E_i}$ and, for $t = 1, \dots, T$, randomly pick $i \in \{1, \dots, n\}$ and update: $w(t+1) = w(t) - \mu(t)\, \nabla_w E_i(w(t))$. Nathalie Villa-Vialaneix | Introduction to statistical learning 53/58
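A compact R sketch (not in the slides) of the two update rules on the empirical L2 risk, using a plain linear model $f_w(x) = w^\top x$ as a stand-in so that the gradient has a closed form; the data, step sizes and numbers of iterations are arbitrary.
set.seed(5)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), nrow = n)
y <- X %*% c(1, -2, 0.5) + rnorm(n, sd = 0.1)
grad_full <- function(w) as.vector(2 * t(X) %*% (X %*% w - y))   # gradient of sum_i (f_w(X_i) - Y_i)^2
w_batch <- rep(0, p)
for (t in 1:200) w_batch <- w_batch - 0.001 * grad_full(w_batch)          # batch update
w_sgd <- rep(0, p)
for (t in 1:2000) {
  i <- sample(n, 1)                                                        # randomly pick one observation
  grad_i <- as.vector(2 * X[i, ] * (sum(X[i, ] * w_sgd) - y[i]))           # gradient of E_i
  w_sgd <- w_sgd - 0.01 / sqrt(t) * grad_i                                 # online (stochastic) update
}
cbind(w_batch, w_sgd)   # both should be close to the true coefficients (1, -2, 0.5)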
  • 147. Discussion about practical choices for this approach the batch version converges (from an optimization point of view) to a local minimum of the error for a good choice of $\mu(t)$, but convergence can be slow; the stochastic version is usually much less efficient but is useful for large datasets (n large); more efficient algorithms exist to solve the optimization task: the one implemented in the R package nnet uses (approximate) second-order information (BFGS algorithm); in all cases, the solutions returned are, at best, local minima that strongly depend on the initialization: using more than one initialization state is advised Nathalie Villa-Vialaneix | Introduction to statistical learning 54/58
  • 148. Gradient backpropagation method [Rumelhart and Mc Clelland, 1986] The gradient backpropagation rétropropagation du gradient principle is used to easily calculate gradients in perceptrons (or in other types of neural network): Nathalie Villa-Vialaneix | Introduction to statistical learning 55/58
  • 149. Gradient backpropagation method [Rumelhart and Mc Clelland, 1986] The gradient backpropagation rétropropagation du gradient principle is used to easily calculate gradients in perceptrons (or in other types of neural network): This way, stochastic gradient descent alternates: a forward step which aims at calculating outputs from all observations Xi given a value of the weights w a backward step in which the gradient backpropagation principle is used to obtain the gradient for the current weights w Nathalie Villa-Vialaneix | Introduction to statistical learning 55/58
  • 150. Backpropagation in practice (figure: one-hidden-layer perceptron with inputs $x^{(1)}, \dots, x^{(p)}$, weights $w^{(1)}$, $w^{(2)}$, biases $w^{(0)}_1, \dots, w^{(0)}_Q$, output $f_w(x)$) initialize the weights Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
  • 151. Backpropagation in practice Forward step: for all k, calculate $a^{(1)}_k = \langle X_i, w^{(1)}_k\rangle + w^{(0)}_k$ Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
  • 152. Backpropagation in practice Forward step: for all k, calculate $z^{(1)}_k = h_k(a^{(1)}_k)$ Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
  • 153. Backpropagation in practice Forward step: calculate $f_w(x) = a^{(2)} = \sum_{k=1}^{Q} w^{(2)}_k z^{(1)}_k + w^{(2)}_0$ Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
  • 154. Backpropagation in practice Backward step: calculate $\frac{\partial E_i}{\partial w^{(2)}_k} = \delta^{(2)} \times z^{(1)}_k$ with $\delta^{(2)} = \frac{\partial E_i}{\partial a^{(2)}} = \frac{\partial [h_0(a^{(2)}) - Y_i]^2}{\partial a^{(2)}} = 2\, h_0'(a^{(2)}) \times [h_0(a^{(2)}) - Y_i]$ Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
  • 155. Backpropagation in practice Backward step: $\frac{\partial E_i}{\partial w^{(1)}_{kj}} = \delta^{(1)}_k \times X^{(j)}_i$ with $\delta^{(1)}_k = \frac{\partial E_i}{\partial a^{(1)}_k} = \frac{\partial E_i}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial a^{(1)}_k} = \delta^{(2)} \times w^{(2)}_k\, h_k'(a^{(1)}_k)$ Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
  • 156. Backpropagation in practice Backward step: $\frac{\partial E_i}{\partial w^{(0)}_k} = \delta^{(1)}_k$ Nathalie Villa-Vialaneix | Introduction to statistical learning 56/58
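The forward and backward steps above, written out in R for a one-hidden-layer perceptron with logistic hidden units and identity output ($h_0(z) = z$, so $h_0'(z) = 1$); this is an illustrative sketch of the gradient of $E_i$ for a single observation $(X_i, Y_i)$, not the nnet implementation, and the weight values are arbitrary.
logistic <- function(z) 1 / (1 + exp(-z))
backprop_one_obs <- function(Xi, Yi, W1, w0, w2, w2_0) {
  # forward step
  a1 <- as.vector(W1 %*% Xi) + w0          # a_k^(1) = <X_i, w_k^(1)> + w_k^(0)
  z1 <- logistic(a1)                       # z_k^(1) = h_k(a_k^(1))
  a2 <- sum(w2 * z1) + w2_0                # a^(2) = f_w(X_i) (identity output)
  # backward step
  delta2 <- 2 * (a2 - Yi)                  # delta^(2) = dE_i/da^(2) (h_0 = identity, h_0' = 1)
  delta1 <- delta2 * w2 * z1 * (1 - z1)    # delta_k^(1) = delta^(2) * w_k^(2) * h_k'(a_k^(1))
  list(dW2   = delta2 * z1,                # dE_i/dw_k^(2)
       dW2_0 = delta2,                     # dE_i/dw_0^(2)
       dW1   = outer(delta1, Xi),          # dE_i/dw_kj^(1) = delta_k^(1) * X_i^(j)
       dW0   = delta1)                     # dE_i/dw_k^(0)
}
set.seed(6)
Q <- 4; p <- 3
backprop_one_obs(Xi = rnorm(p), Yi = 0.7,
                 W1 = matrix(rnorm(Q * p), Q), w0 = rnorm(Q),
                 w2 = rnorm(Q), w2_0 = 0)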
  • 157. Initialization and stopping of the training algorithm 1 How to initialize the weights? Standard choices: $w^{(1)}_{jk} \sim \mathcal{N}(0, 1/\sqrt{p})$ and $w^{(2)}_k \sim \mathcal{N}(0, 1/\sqrt{Q})$ Nathalie Villa-Vialaneix | Introduction to statistical learning 57/58
  • 158. Initialization and stopping of the training algorithm 1 How to initialize the weights? Standard choices: $w^{(1)}_{jk} \sim \mathcal{N}(0, 1/\sqrt{p})$ and $w^{(2)}_k \sim \mathcal{N}(0, 1/\sqrt{Q})$ 2 When to stop the algorithm? (gradient descent or alike) Standard choices: bounded T; target value of the error $\hat{R}_n(w)$; target value of the decrease $\hat{R}_n(w(t)) - \hat{R}_n(w(t+1))$ Nathalie Villa-Vialaneix | Introduction to statistical learning 57/58
  • 159. Initialization and stopping of the training algorithm 1 How to initialize the weights? Standard choices: $w^{(1)}_{jk} \sim \mathcal{N}(0, 1/\sqrt{p})$ and $w^{(2)}_k \sim \mathcal{N}(0, 1/\sqrt{Q})$ In the R package nnet, weights are sampled uniformly in $[-0.5, 0.5]$, or in $\left[-\frac{1}{\max_i |X^{(j)}_i|}, \frac{1}{\max_i |X^{(j)}_i|}\right]$ if $X^{(j)}$ is large. 2 When to stop the algorithm? (gradient descent or alike) Standard choices: bounded T; target value of the error $\hat{R}_n(w)$; target value of the decrease $\hat{R}_n(w(t)) - \hat{R}_n(w(t+1))$ In the R package nnet, a combination of the three criteria is used and is tunable. Nathalie Villa-Vialaneix | Introduction to statistical learning 57/58
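With nnet, the stopping criteria of the previous slide are exposed as arguments (a sketch; the values below are arbitrary): maxit bounds T, abstol is a target value of the fitting criterion, reltol a target value of its relative decrease, and rang controls the range of the uniform initial weights.
library(nnet)
data(iris)
set.seed(7)
fit <- nnet(Species ~ ., data = iris, size = 5,
            rang = 0.5,        # initial weights drawn uniformly in [-rang, rang]
            maxit = 200,       # bound on the number of iterations T
            abstol = 1e-4,     # stop if the fitting criterion falls below this value
            reltol = 1e-8,     # stop if the relative decrease of the criterion is below this value
            trace = FALSE)
table(predict(fit, iris, type = "class"), iris$Species)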
  • 160. Strategies to avoid overfitting Properly tune Q with a CV or a bootstrap estimation of the generalization ability of the method Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
  • 161. Strategies to avoid overfitting Properly tune Q with a CV or a bootstrap estimation of the generalization ability of the method Early stopping: for Q large enough, use a part of the data as a validation set and stop the training (gradient descent) when the empirical error calculated on this dataset starts to increase Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
  • 162. Strategies to avoid overfitting Properly tune Q with a CV or a bootstrap estimation of the generalization ability of the method Early stopping: for Q large enough, use a part of the data as a validation set and stop the training (gradient descent) when the empirical error calculated on this dataset starts to increase Weight decay: for Q large enough, penalize the empirical risk with a function of the weights, e.g., $\hat{R}_n(w) + \lambda\, w^\top w$ Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
  • 163. Strategies to avoid overfitting Properly tune Q with a CV or a bootstrap estimation of the generalization ability of the method Early stopping: for Q large enough, use a part of the data as a validation set and stop the training (gradient descent) when the empirical error calculated on this dataset starts to increase Weight decay: for Q large enough, penalize the empirical risk with a function of the weights, e.g., $\hat{R}_n(w) + \lambda\, w^\top w$ Noise injection: modify the input data with a random noise during the training Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
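Weight decay and the tuning of Q can be handled together, e.g. with nnet's decay argument and a small cross-validated grid search; a hedged sketch, where the grids of size and decay values are arbitrary and tune.nnet from e1071 is one convenient wrapper among others.
library(e1071)   # provides tune.nnet
library(nnet)
set.seed(8)
tuned <- tune.nnet(Species ~ ., data = iris,
                   size  = c(2, 5, 10),          # candidate values for Q
                   decay = c(0, 0.01, 0.1),      # weight-decay penalties lambda
                   trace = FALSE)
summary(tuned)            # cross-validated error for each (size, decay) pair
tuned$best.parameters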
  • 164. References
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, NY, USA.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Chapman and Hall, Boca Raton, FL, USA.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, NY, USA.
Farago, A. and Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146–1151.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, NY, USA.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.
Hornik, K. (1993).
Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
  • 165. Some new results on neural network approximation. Neural Networks, 6(8):1069–1072.
Mc Culloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115–133.
Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC, USA.
Rumelhart, D. and Mc Clelland, J. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA, USA.
Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.
Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477.
Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58
  • 166. Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA.
Nathalie Villa-Vialaneix | Introduction to statistical learning 58/58