Copyright © Andrew W. Moore Slide 1
Cross-validation for detecting and preventing overfitting
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Note to other teachers and users of
these slides. Andrew would be delighted
if you found this source material useful in
giving your own lectures. Feel free to use
these slides verbatim, or to modify them
to fit your own needs. PowerPoint
originals are available. If you make use
of a significant portion of these slides in
your own lecture, please include this
message, or the following link to the
source repository of Andrew’s tutorials:
http://www.cs.cmu.edu/~awm/tutorials .
Comments and corrections gratefully
received.
Copyright © Andrew W. Moore Slide 2
A Regression Problem
[Plot: y vs. x]
y = f(x) + noise
Can we learn f from this data?
Let’s consider three methods…
Copyright © Andrew W. Moore Slide 3
Linear Regression
[Plot: y vs. x]
Copyright © Andrew W. Moore Slide 4
Linear Regression
Univariate linear regression with a constant term:

    X   Y
    3   7
    1   3
    :   :

    X = [3, 1, ...]^T    y = [7, 3, ...]^T
    x1 = (3) ...    y1 = 7 ...

Originally discussed in the previous Andrew Lecture: “Neural Nets”
Copyright © Andrew W. Moore Slide 5
Linear Regression
Univariate linear regression with a constant term:

    X   Y
    3   7
    1   3
    :   :

    X = [3, 1, ...]^T    y = [7, 3, ...]^T
    x1 = (3) ...    y1 = 7 ...

    Z = [ 1  3 ]    y = [ 7 ]
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    z1 = (1, 3) ...    zk = (1, xk)    y1 = 7 ...
Copyright © Andrew W. Moore Slide 6
Linear Regression
Univariate linear regression with a constant term:

    X   Y
    3   7
    1   3
    :   :

    X = [3, 1, ...]^T    y = [7, 3, ...]^T
    x1 = (3) ...    y1 = 7 ...

    Z = [ 1  3 ]    y = [ 7 ]
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    z1 = (1, 3) ...    zk = (1, xk)    y1 = 7 ...

    β = (Z^T Z)^(-1) (Z^T y)
    y_est = β0 + β1 x
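A minimal numpy sketch of this computation (the two extra data points and all variable names are illustrative, not from the lecture):

    import numpy as np

    # Toy data in the spirit of the slide: x1 = 3, y1 = 7, plus made-up points.
    x = np.array([3.0, 1.0, 4.0, 2.0])
    y = np.array([7.0, 3.0, 8.0, 5.0])

    # Each row of Z is zk = (1, xk); the column of ones carries the constant term.
    Z = np.column_stack([np.ones_like(x), x])

    # Normal equations: beta = (Z^T Z)^(-1) (Z^T y).
    # np.linalg.solve is used rather than an explicit inverse, for numerical stability.
    beta = np.linalg.solve(Z.T @ Z, Z.T @ y)

    y_est = beta[0] + beta[1] * x   # y_est = beta0 + beta1 * x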
Copyright © Andrew W. Moore Slide 7
Quadratic Regression
[Plot: y vs. x]
Copyright © Andrew W. Moore Slide 8
Quadratic Regression
    X   Y
    3   7
    1   3
    :   :

    X = [3, 1, ...]^T    y = [7, 3, ...]^T
    x1 = (3) ...    y1 = 7 ...

    Z = [ 1  3  9 ]    y = [ 7 ]
        [ 1  1  1 ]        [ 3 ]
        [ :  :  : ]        [ : ]

    zk = (1, xk, xk^2)

    β = (Z^T Z)^(-1) (Z^T y)
    y_est = β0 + β1 x + β2 x^2
Much more about this in the future Andrew Lecture: “Favorite Regression Algorithms”
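The same sketch extends to quadratic regression by adding an x^2 column to Z (again an illustration under the same made-up data, not code from the lecture):

    import numpy as np

    x = np.array([3.0, 1.0, 4.0, 2.0])   # hypothetical inputs
    y = np.array([7.0, 3.0, 8.0, 5.0])   # hypothetical outputs

    # Each row of Z is zk = (1, xk, xk^2).
    Z = np.column_stack([np.ones_like(x), x, x ** 2])

    beta = np.linalg.solve(Z.T @ Z, Z.T @ y)   # beta = (Z^T Z)^(-1) (Z^T y)
    y_est = Z @ beta                           # beta0 + beta1*x + beta2*x^2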
Copyright © Andrew W. Moore Slide 9
Join-the-dots
[Plot: y vs. x]
Also known as piecewise linear nonparametric regression if that makes you feel better
Copyright © Andrew W. Moore Slide 10
Which is best?
[Two plots: y vs. x, one per fitting method]
Why not choose the method with the
best fit to the data?
Copyright © Andrew W. Moore Slide 11
What do we really want?
[Two plots: y vs. x, one per fitting method]
Why not choose the method with the
best fit to the data?
“How well are you going to predict
future data drawn from the same
distribution?”
Copyright © Andrew W. Moore Slide 12
The test set method
[Plot: y vs. x]
1. Randomly choose
30% of the data to be in a
test set
2. The remainder is a
training set
Copyright © Andrew W. Moore Slide 13
The test set method
[Plot: y vs. x]
1. Randomly choose
30% of the data to be in a
test set
2. The remainder is a
training set
3. Perform your
regression on the training
set
(Linear regression example)
Copyright © Andrew W. Moore Slide 14
The test set method
[Plot: y vs. x]
1. Randomly choose
30% of the data to be in a
test set
2. The remainder is a
training set
3. Perform your
regression on the training
set
4. Estimate your future
performance with the test
set
(Linear regression example)
Mean Squared Error = 2.4
Copyright © Andrew W. Moore Slide 15
The test set method
[Plot: y vs. x]
1. Randomly choose
30% of the data to be in a
test set
2. The remainder is a
training set
3. Perform your
regression on the training
set
4. Estimate your future
performance with the test
set
(Quadratic regression example)
Mean Squared Error = 0.9
Copyright © Andrew W. Moore Slide 16
The test set method
[Plot: y vs. x]
1. Randomly choose
30% of the data to be in a
test set
2. The remainder is a
training set
3. Perform your
regression on the training
set
4. Estimate your future
performance with the test
set
(Join the dots example)
Mean Squared Error = 2.2
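The four steps above, as a sketch; fit and predict are assumed hooks standing in for any of the three regressors (the MSE figures quoted on these slides come from the lecture’s own data, which is not reproduced here):

    import numpy as np

    def test_set_mse(x, y, fit, predict, test_frac=0.3, seed=0):
        """Steps 1-4: randomly hold out 30% as a test set, train on the
        remainder, and estimate future performance by the test-set MSE."""
        rng = np.random.default_rng(seed)
        test = np.zeros(len(x), dtype=bool)
        test[rng.choice(len(x), size=int(test_frac * len(x)), replace=False)] = True
        model = fit(x[~test], y[~test])                 # regress on the training set
        residuals = y[test] - predict(model, x[test])
        return np.mean(residuals ** 2)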
Copyright © Andrew W. Moore Slide 17
The test set method
Good news:
• Very very simple
• Can then simply choose the method with the best test-set score
Bad news:
• What’s the downside?
Copyright © Andrew W. Moore Slide 18
The test set method
Good news:
• Very very simple
• Can then simply choose the method with the best test-set score
Bad news:
• Wastes data: we get an estimate of the best method to apply to 30% less data
• If we don’t have much data, our test-set might just be lucky or unlucky

We say the “test-set estimator of performance has high variance”
Copyright © Andrew W. Moore Slide 19
LOOCV (Leave-one-out Cross Validation)
[Plot: y vs. x]
For k=1 to R
1. Let (xk,yk) be the kth record
Copyright © Andrew W. Moore Slide 20
LOOCV (Leave-one-out Cross Validation)
[Plot: y vs. x]
For k=1 to R
1. Let (xk,yk) be the kth record
2. Temporarily remove (xk,yk)
from the dataset
Copyright © Andrew W. Moore Slide 21
LOOCV (Leave-one-out Cross Validation)
[Plot: y vs. x]
For k=1 to R
1. Let (xk,yk) be the kth record
2. Temporarily remove (xk,yk)
from the dataset
3. Train on the remaining R-1
datapoints
Copyright © Andrew W. Moore Slide 22
LOOCV (Leave-one-out Cross Validation)
For k=1 to R
1. Let (xk,yk) be the kth record
2. Temporarily remove (xk,yk)
from the dataset
3. Train on the remaining R-1
datapoints
4. Note your error on (xk,yk)
[Plot: y vs. x]
Copyright © Andrew W. Moore Slide 23
LOOCV (Leave-one-out Cross Validation)
For k=1 to R
1. Let (xk,yk) be the kth record
2. Temporarily remove (xk,yk)
from the dataset
3. Train on the remaining R-1
datapoints
4. Note your error on (xk,yk)
When you’ve done all points,
report the mean error.
[Plot: y vs. x]
Copyright © Andrew W. Moore Slide 24
LOOCV (Leave-one-out Cross Validation)
For k=1 to R
1. Let (xk,yk) be the kth record
2. Temporarily remove (xk,yk) from the dataset
3. Train on the remaining R-1 datapoints
4. Note your error on (xk,yk)
When you’ve done all points, report the mean error.

[Nine plots: y vs. x, one per held-out point]

MSE_LOOCV = 2.12
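The loop above as a sketch, with the same assumed fit/predict hooks as in the test-set example:

    import numpy as np

    def loocv_mse(x, y, fit, predict):
        """Leave-one-out cross validation: remove each record in turn, train
        on the remaining R-1 datapoints, and report the mean squared error."""
        sq_errors = []
        for k in range(len(x)):
            keep = np.arange(len(x)) != k                 # everything but record k
            model = fit(x[keep], y[keep])
            e = y[k] - predict(model, x[k:k + 1])[0]      # error on the held-out point
            sq_errors.append(e ** 2)
        return np.mean(sq_errors)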
Copyright © Andrew W. Moore Slide 25
LOOCV for Quadratic Regression
For k=1 to R
1. Let (xk,yk) be the kth record
2. Temporarily remove (xk,yk) from the dataset
3. Train on the remaining R-1 datapoints
4. Note your error on (xk,yk)
When you’ve done all points, report the mean error.

[Nine plots: y vs. x, one per held-out point]

MSE_LOOCV = 0.962
Copyright © Andrew W. Moore Slide 26
LOOCV for Join The Dots
For k=1 to R
1. Let (xk,yk) be the kth record
2. Temporarily remove (xk,yk) from the dataset
3. Train on the remaining R-1 datapoints
4. Note your error on (xk,yk)
When you’ve done all points, report the mean error.

[Nine plots: y vs. x, one per held-out point]

MSE_LOOCV = 3.33
Copyright © Andrew W. Moore Slide 27
Which kind of Cross Validation?
Test-set:
  Downside: variance (an unreliable estimate of future performance)
  Upside: cheap
Leave-one-out:
  Downside: expensive; has some weird behavior
  Upside: doesn’t waste data

…can we get the best of both worlds?
Copyright © Andrew W. Moore Slide 28
k-fold Cross Validation
[Plot: y vs. x, points colored by partition]
Randomly break the dataset into k
partitions (in our example we’ll have k=3
partitions colored Red Green and Blue)
Copyright © Andrew W. Moore Slide 29
k-fold Cross Validation
[Plot: y vs. x, points colored by partition]
Randomly break the dataset into k
partitions (in our example we’ll have k=3
partitions colored Red Green and Blue)
For the red partition: Train on all the
points not in the red partition. Find
the test-set sum of errors on the red
points.
Copyright © Andrew W. Moore Slide 30
k-fold Cross Validation
[Plot: y vs. x, points colored by partition]
Randomly break the dataset into k
partitions (in our example we’ll have k=3
partitions colored Red Green and Blue)
For the red partition: Train on all the
points not in the red partition. Find
the test-set sum of errors on the red
points.
For the green partition: Train on all the
points not in the green partition.
Find the test-set sum of errors on
the green points.
Copyright © Andrew W. Moore Slide 31
k-fold Cross Validation
[Plot: y vs. x, points colored by partition]
Randomly break the dataset into k
partitions (in our example we’ll have k=3
partitions colored Red Green and Blue)
For the red partition: Train on all the
points not in the red partition. Find
the test-set sum of errors on the red
points.
For the green partition: Train on all the
points not in the green partition.
Find the test-set sum of errors on
the green points.
For the blue partition: Train on all the
points not in the blue partition. Find
the test-set sum of errors on the
blue points.
Copyright © Andrew W. Moore Slide 32
k-fold Cross Validation
[Plot: y vs. x, points colored by partition]
Randomly break the dataset into k
partitions (in our example we’ll have k=3
partitions colored Red Green and Blue)
For the red partition: Train on all the
points not in the red partition. Find
the test-set sum of errors on the red
points.
For the green partition: Train on all the
points not in the green partition.
Find the test-set sum of errors on
the green points.
For the blue partition: Train on all the
points not in the blue partition. Find
the test-set sum of errors on the
blue points.
Then report the mean error
Linear Regression: MSE_3FOLD = 2.05
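The same procedure as a sketch; k=3 matches the Red/Green/Blue example, and fit/predict are the assumed hooks used in the earlier sketches:

    import numpy as np

    def kfold_mse(x, y, fit, predict, k=3, seed=0):
        """Randomly break the data into k partitions; for each partition, train
        on all points outside it and sum the squared errors on the points in it."""
        rng = np.random.default_rng(seed)
        part = rng.permutation(len(x)) % k        # a random partition label per point
        total = 0.0
        for p in range(k):
            held_out = part == p
            model = fit(x[~held_out], y[~held_out])
            total += np.sum((y[held_out] - predict(model, x[held_out])) ** 2)
        return total / len(x)                     # then report the mean error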
Copyright © Andrew W. Moore Slide 33
k-fold Cross Validation
[Plot: y vs. x, points colored by partition]
Randomly break the dataset into k
partitions (in our example we’ll have k=3
partitions colored Red Green and Blue)
For the red partition: Train on all the
points not in the red partition. Find
the test-set sum of errors on the red
points.
For the green partition: Train on all the
points not in the green partition.
Find the test-set sum of errors on
the green points.
For the blue partition: Train on all the
points not in the blue partition. Find
the test-set sum of errors on the
blue points.
Then report the mean error
Quadratic Regression: MSE_3FOLD = 1.11
Copyright © Andrew W. Moore Slide 34
k-fold Cross Validation
[Plot: y vs. x, points colored by partition]
Randomly break the dataset into k
partitions (in our example we’ll have k=3
partitions colored Red Green and Blue)
For the red partition: Train on all the
points not in the red partition. Find
the test-set sum of errors on the red
points.
For the green partition: Train on all the
points not in the green partition.
Find the test-set sum of errors on
the green points.
For the blue partition: Train on all the
points not in the blue partition. Find
the test-set sum of errors on the
blue points.
Then report the mean error
Join-the-dots: MSE_3FOLD = 2.93
Copyright © Andrew W. Moore Slide 35
Which kind of Cross Validation?
Test-set:
  Downside: variance (an unreliable estimate of future performance)
  Upside: cheap
Leave-one-out:
  Downside: expensive; has some weird behavior
  Upside: doesn’t waste data
10-fold:
  Downside: wastes 10% of the data; 10 times more expensive than test set
  Upside: only wastes 10%, and only 10 times more expensive instead of R times
3-fold:
  Downside: wastier than 10-fold; expensivier than test set
  Upside: slightly better than test-set
R-fold:
  Identical to Leave-one-out
Copyright © Andrew W. Moore Slide 36
Which kind of Cross Validation?
Test-set:
  Downside: variance (an unreliable estimate of future performance)
  Upside: cheap
Leave-one-out:
  Downside: expensive; has some weird behavior
  Upside: doesn’t waste data
10-fold:
  Downside: wastes 10% of the data; 10 times more expensive than test set
  Upside: only wastes 10%, and only 10 times more expensive instead of R times
3-fold:
  Downside: wastier than 10-fold; expensivier than test set
  Upside: slightly better than test-set
R-fold:
  Identical to Leave-one-out

But note: One of Andrew’s joys in life is algorithmic tricks for making these cheap
Copyright © Andrew W. Moore Slide 37
CV-based Model Selection
• We’re trying to decide which algorithm to use.
• We train each machine and make a table…
    fi    TRAINERR    10-FOLD-CV-ERR    Choice
    f1
    f2
    f3                                    ⌦
    f4
    f5
    f6
Copyright © Andrew W. Moore Slide 38
CV-based Model Selection
• Example: Choosing number of hidden units in a one-
hidden-layer neural net.
• Step 1: Compute 10-fold CV error for six different model
classes:
    Algorithm         TRAINERR    10-FOLD-CV-ERR    Choice
    0 hidden units
    1 hidden unit
    2 hidden units                                    ⌦
    3 hidden units
    4 hidden units
    5 hidden units
• Step 2: Whichever model class gave best CV score: train it
with all the data, and that’s the predictive model you’ll use.
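Steps 1 and 2 as a sketch; model_classes and cv_score are illustrative names (cv_score could be the kfold_mse routine above), not an API from the lecture:

    def select_and_refit(x, y, model_classes, cv_score):
        """Step 1: compute the CV error of every candidate model class.
        Step 2: retrain the winner on ALL the data and return it."""
        errs = {name: cv_score(x, y, fit, predict)
                for name, (fit, predict) in model_classes.items()}
        best = min(errs, key=errs.get)            # lowest CV error wins
        fit, predict = model_classes[best]
        return best, fit(x, y)                    # the predictive model you'll use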
Copyright © Andrew W. Moore Slide 39
CV-based Model Selection
• Example: Choosing “k” for a k-nearest-neighbor regression.
• Step 1: Compute LOOCV error for six different model
classes:
• Step 2: Whichever model class gave best CV score: train it
with all the data, and that’s the predictive model you’ll use.
    Algorithm    TRAINERR    LOOCV-ERR    Choice
    K=1
    K=2
    K=3
    K=4                                    ⌦
    K=5
    K=6
Copyright © Andrew W. Moore Slide 40
CV-based Model Selection
• Example: Choosing “k” for a k-nearest-neighbor regression.
• Step 1: Compute LOOCV error for six different model classes:

    Algorithm    TRAINERR    LOOCV-ERR    Choice
    K=1
    K=2
    K=3
    K=4                                    ⌦
    K=5
    K=6

• Step 2: Whichever model class gave best CV score: train it with all the data, and that’s the predictive model you’ll use.

Q: Why did we use 10-fold-CV for neural nets and LOOCV for k-nearest neighbor?
A: The reason is computational. For k-NN (and all other nonparametric methods) LOOCV happens to be as cheap as regular predictions.

Q: And why stop at K=6?
A: No good reason, except it looked like things were getting worse as K was increasing.

Q: Are we guaranteed that a local optimum of K vs LOOCV will be the global optimum?
A: Sadly, no. And in fact, the relationship can be very bumpy.

Q: What should we do if we are depressed at the expense of doing LOOCV for K=1 through 1000?
A: Idea One: K=1, K=2, K=4, K=8, K=16, K=32, K=64 … K=1024. Idea Two: Hillclimbing from an initial guess at K.
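The computational point in the first answer can be seen in a sketch for 1-D k-NN regression: leaving record k out just means excluding it from its own neighbor search, so no retraining is needed (a brute-force illustration, not an optimized implementation):

    import numpy as np

    def knn_loocv_mse(x, y, k):
        """LOOCV for k-NN regression: predict each point from its k nearest
        neighbors excluding itself; a single pass over the data suffices."""
        sq_errors = []
        for i in range(len(x)):
            dist = np.abs(x - x[i])
            dist[i] = np.inf                      # temporarily remove record i
            neighbors = np.argsort(dist)[:k]
            sq_errors.append((y[i] - y[neighbors].mean()) ** 2)
        return np.mean(sq_errors)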
Copyright © Andrew W. Moore Slide 41
CV-based Model Selection
• Can you think of other decisions we can ask Cross
Validation to make for us, based on other machine learning
algorithms in the class so far?
Copyright © Andrew W. Moore Slide 42
CV-based Model Selection
• Can you think of other decisions we can ask Cross
Validation to make for us, based on other machine learning
algorithms in the class so far?
• Degree of polynomial in polynomial regression
• Whether to use full, diagonal or spherical Gaussians in a Gaussian
Bayes Classifier.
• The Kernel Width in Kernel Regression
• The Kernel Width in Locally Weighted Regression
• The Bayesian Prior in Bayesian Regression
These involve
choosing the value of a
real-valued parameter.
What should we do?
Copyright © Andrew W. Moore Slide 43
CV-based Model Selection
• Can you think of other decisions we can ask Cross
Validation to make for us, based on other machine learning
algorithms in the class so far?
• Degree of polynomial in polynomial regression
• Whether to use full, diagonal or spherical Gaussians in a Gaussian
Bayes Classifier.
• The Kernel Width in Kernel Regression
• The Kernel Width in Locally Weighted Regression
• The Bayesian Prior in Bayesian Regression
These involve
choosing the value of a
real-valued parameter.
What should we do?
Idea One: Consider a discrete set of values
(often best to consider a set of values with
exponentially increasing gaps, as in the K-NN
example).
Idea Two: Compute ∂LOOCV/∂parameter and then do gradient descent.
Copyright © Andrew W. Moore Slide 44
CV-based Model Selection
• Can you think of other decisions we can ask Cross
Validation to make for us, based on other machine learning
algorithms in the class so far?
• Degree of polynomial in polynomial regression
• Whether to use full, diagonal or spherical Gaussians in a Gaussian
Bayes Classifier.
• The Kernel Width in Kernel Regression
• The Kernel Width in Locally Weighted Regression
• The Bayesian Prior in Bayesian Regression
These involve
choosing the value of a
real-valued parameter.
What should we do?
Idea One: Consider a discrete set of values
(often best to consider a set of values with
exponentially increasing gaps, as in the K-NN
example).
Idea Two: Compute ∂LOOCV/∂parameter and then do gradient descent.
Also: The scale factors of a non-parametric distance metric
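Idea One as a sketch, for example for a kernel width: score an exponentially spaced grid of candidate values and keep the best (loocv_score is an assumed callable built from whichever CV routine fits the model in question):

    # Exponentially increasing gaps, as suggested above (hypothetical range).
    candidates = [2.0 ** i for i in range(-5, 6)]   # 1/32, 1/16, ..., 16, 32

    def pick_parameter(candidates, loocv_score):
        """Return the candidate value with the lowest LOOCV error."""
        return min(candidates, key=loocv_score)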
Copyright © Andrew W. Moore Slide 45
CV-based Algorithm Choice
• Example: Choosing which regression algorithm to use
• Step 1: Compute 10-fold-CV error for six different model classes:

    Algorithm       TRAINERR    10-fold-CV-ERR    Choice
    1-NN
    10-NN
    Linear Reg’n
    LWR, KW=0.5
    LWR, KW=0.1
    Quad reg’n                                      ⌦

• Step 2: Whichever algorithm gave best CV score: train it with all the data, and that’s the predictive model you’ll use.
Copyright © Andrew W. Moore Slide 46
Alternatives to CV-based model selection
• Model selection methods:
1. Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. VC-dimension (Vapnik-Chervonenkis Dimension): only directly applicable to choosing classifiers
(AIC, BIC and VC-dimension are described in a future Lecture.)
Copyright © Andrew W. Moore Slide 47
Which model selection method is best?
1. (CV) Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. (SRMVC) Structural Risk Minimization with VC-dimension
• AIC, BIC and SRMVC advantage: you only need the training
error.
• CV error might have more variance
• SRMVC is wildly conservative
• Asymptotically AIC and Leave-one-out CV should be the same
• Asymptotically BIC and carefully chosen k-fold should be the same
• You want BIC if you want the best structure instead of the best
predictor (e.g. for clustering or Bayes Net structure finding)
• Many alternatives---including proper Bayesian approaches.
• It’s an emotional issue.
Copyright © Andrew W. Moore Slide 48
Other Cross-validation issues
• Can do “leave all pairs out” or “leave-all-n-tuples-out” if feeling resourceful.
• Some folks do k-folds in which each fold is
an independently-chosen subset of the data
• Do you know what AIC and BIC are?
If so…
• LOOCV behaves like AIC asymptotically.
• k-fold behaves like BIC if you choose k carefully
If not…
• Nyardely nyardely nyoo nyoo
Copyright © Andrew W. Moore Slide 49
Cross-Validation for regression
• Choosing the number of hidden units in a
neural net
• Feature selection (see later)
• Choosing a polynomial degree
• Choosing which regressor to use
Copyright © Andrew W. Moore Slide 50
Supervising Gradient Descent
• This is a weird but common use of Test-set
validation
• Suppose you have a neural net with too
many hidden units. It will overfit.
• As gradient descent progresses, maintain a
graph of MSE-testset-error vs. Iteration
[Plot: Mean Squared Error vs. Iteration of Gradient Descent, with Training Set and Test Set curves; annotation at the test-set minimum: “Use the weights you found on this iteration”]
Copyright © Andrew W. Moore Slide 51
Supervising Gradient Descent
• This is a weird but common use of Test-set
validation
• Suppose you have a neural net with too
many hidden units. It will overfit.
• As gradient descent progresses, maintain a
graph of MSE-testset-error vs. Iteration
[Plot: Mean Squared Error vs. Iteration of Gradient Descent, with Training Set and Test Set curves; annotation at the test-set minimum: “Use the weights you found on this iteration”]
Relies on an intuition that a not-fully-
minimized set of weights is somewhat like
having fewer parameters.
Works pretty well in practice, apparently
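A sketch of this use of test-set validation, often called early stopping; train_step and mse are assumed hooks into your neural-net code, not functions from the lecture:

    import copy

    def supervise_gradient_descent(w, train_step, mse, test_x, test_y, iters=1000):
        """Run gradient descent while tracking test-set MSE; return the
        weights from the iteration where the test set looked best."""
        best_err, best_w = float("inf"), copy.deepcopy(w)
        for _ in range(iters):
            w = train_step(w)                     # one gradient-descent step
            err = mse(w, test_x, test_y)
            if err < best_err:                    # new test-set minimum
                best_err, best_w = err, copy.deepcopy(w)
        return best_w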
Copyright © Andrew W. Moore Slide 52
Cross-validation for classification
• Instead of computing the sum squared
errors on a test set, you should compute…
Copyright © Andrew W. Moore Slide 53
Cross-validation for classification
• Instead of computing the sum squared
errors on a test set, you should compute…
The total number of misclassifications on
a testset.
Copyright © Andrew W. Moore Slide 54
Cross-validation for classification
• Instead of computing the sum squared
errors on a test set, you should compute…
The total number of misclassifications on
a testset.
• What’s LOOCV of 1-NN?
• What’s LOOCV of 3-NN?
• What’s LOOCV of 22-NN?
Copyright © Andrew W. Moore Slide 55
Cross-validation for classification
• Instead of computing the sum squared
errors on a test set, you should compute…
The total number of misclassifications on
a testset.
• But there’s a more sensitive alternative:
Compute
log P(all test outputs|all test inputs, your model)
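Both scores as sketches; predict_label and predict_proba are assumed hooks into your classifier:

    import numpy as np

    def misclassifications(model, predict_label, x_test, y_test):
        """The total number of misclassifications on a testset."""
        return int(np.sum(predict_label(model, x_test) != y_test))

    def test_log_likelihood(model, predict_proba, x_test, y_test):
        """log P(all test outputs | all test inputs, your model), assuming the
        test points are conditionally independent given the model."""
        p = predict_proba(model, x_test, y_test)  # P(y_i | x_i) for each test point
        return float(np.sum(np.log(p)))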
Copyright © Andrew W. Moore Slide 56
Cross-Validation for classification
• Choosing the pruning parameter for decision
trees
• Feature selection (see later)
• What kind of Gaussian to use in a Gaussian-
based Bayes Classifier
• Choosing which classifier to use
Copyright © Andrew W. Moore Slide 57
Cross-Validation for density estimation
• Compute the sum of log-likelihoods of test
points
Example uses:
• Choosing what kind of Gaussian assumption
to use
• Choose the density estimator
• NOT Feature selection (testset density will
almost always look better with fewer
features)
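A sketch for the Gaussian case, using scipy for the density (choosing between full, diagonal or spherical Gaussians would just change how cov is estimated):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_test_log_likelihood(x_train, x_test):
        """Fit a full-covariance Gaussian to the training points and return
        the sum of log-likelihoods of the test points under it."""
        mean = x_train.mean(axis=0)
        cov = np.cov(x_train, rowvar=False)
        return float(multivariate_normal(mean, cov).logpdf(x_test).sum())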
Copyright © Andrew W. Moore Slide 58
Feature Selection
• Suppose you have a learning algorithm LA
and a set of input attributes { X1 , X2 .. Xm }
• You expect that LA will only find some
subset of the attributes useful.
• Question: How can we use cross-validation
to find a useful subset?
• Four ideas:
• Forward selection
• Backward elimination
• Hill Climbing
• Stochastic search (Simulated Annealing or GAs)
Another fun area in which
Andrew has spent a lot of his
wild youth
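Forward selection, the first of the four ideas, as a sketch driven by cross validation (cv_error is an assumed scoring callable, e.g. built on the kfold_mse routine above):

    def forward_selection(X, y, cv_error):
        """Greedily add whichever attribute most improves the CV score;
        stop when no remaining attribute helps."""
        selected, best_err = [], float("inf")
        remaining = list(range(X.shape[1]))
        while remaining:
            trial = {j: cv_error(X[:, selected + [j]], y) for j in remaining}
            j_best = min(trial, key=trial.get)
            if trial[j_best] >= best_err:         # no attribute helps: stop
                break
            best_err = trial[j_best]
            selected.append(j_best)
            remaining.remove(j_best)
        return selected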
Copyright © Andrew W. Moore Slide 59
Very serious warning
• Intensive use of cross validation can overfit.
• How?
• What can be done about it?
Copyright © Andrew W. Moore Slide 60
Very serious warning
• Intensive use of cross validation can overfit.
• How?
• Imagine a dataset with 50 records and 1000
attributes.
• You try 1000 linear regression models, each one
using one of the attributes.
• What can be done about it?
Copyright © Andrew W. Moore Slide 61
Very serious warning
• Intensive use of cross validation can overfit.
• How?
• Imagine a dataset with 50 records and 1000
attributes.
• You try 1000 linear regression models, each one
using one of the attributes.
• The best of those 1000 looks good!
• What can be done about it?
Copyright © Andrew W. Moore Slide 62
Very serious warning
• Intensive use of cross validation can overfit.
• How?
• Imagine a dataset with 50 records and 1000
attributes.
• You try 1000 linear regression models, each one
using one of the attributes.
• The best of those 1000 looks good!
• But you realize it would have looked good even if the
output had been purely random!
• What can be done about it?
• Hold out an additional testset before doing any model
selection. Check the best model performs well even
on the additional testset.
• Or: Randomization Testing
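The scenario above, made concrete as a sketch (50 records, 1000 attributes, and an output that is purely random), showing why the best of 1000 single-attribute fits can look deceptively good:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 1000))   # 50 records, 1000 attributes
    y = rng.normal(size=50)           # purely random output

    def single_attribute_mse(x_col, y):
        """Fit a one-attribute linear regression and return its training MSE."""
        beta = np.polyfit(x_col, y, 1)
        return np.mean((y - np.polyval(beta, x_col)) ** 2)

    best = min(range(X.shape[1]), key=lambda j: single_attribute_mse(X[:, j], y))
    # The winner looks good purely through selection; only an additional
    # held-out testset (or randomization testing) exposes it as noise.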
Copyright © Andrew W. Moore Slide 63
What you should know
• Why you can’t use “training-set-error” to
estimate the quality of your learning
algorithm on your data.
• Why you can’t use “training set error” to
choose the learning algorithm
• Test-set cross-validation
• Leave-one-out cross-validation
• k-fold cross-validation
• Feature selection methods
• CV for classification, regression & densities