VC-dimension for characterizing classifiers

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

               Copyright © 2001, Andrew W. Moore                            Nov 20th, 2001




                                   A learning machine
 • A learning machine f takes an input x and
   transforms it, somehow using weights α, into a
   predicted output yest = +/- 1


     α is some vector of adjustable parameters

     [Diagram: input x → machine f (parameters α) → output yest]




 Copyright © 2001, Andrew W. Moore                                              VC-dimension: Slide 2
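For concreteness, here is a minimal Python sketch (not part of the original slides) of such a learning machine: a linear classifier whose adjustable parameters α are a weight vector w and an offset b.

```python
import numpy as np

def f(x, w, b):
    """A learning machine: maps input x to a predicted label y_est in {-1, +1},
    using adjustable parameters alpha = (w, b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# One 2-d input, classified under one particular parameter setting.
x = np.array([0.5, -1.0])
y_est = f(x, w=np.array([1.0, 2.0]), b=0.3)   # -> -1
```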




    Examples

    [Diagram: x → f (parameters α) → yest]

                f(x,b) = sign(x.x – b)

    [Scatter plot of labeled points; legend: one marker denotes +1, the other denotes -1]




Copyright © 2001, Andrew W. Moore                   VC-dimension: Slide 3




    Examples

    [Diagram: x → f (parameters α) → yest]

                f(x,w) = sign(x.w)

    [Scatter plot of labeled points; legend: one marker denotes +1, the other denotes -1]




Copyright © 2001, Andrew W. Moore                   VC-dimension: Slide 4




    Examples

    [Diagram: x → f (parameters α) → yest]

                f(x,w,b) = sign(x.w+b)

    [Scatter plot of labeled points; legend: one marker denotes +1, the other denotes -1]




Copyright © 2001, Andrew W. Moore                   VC-dimension: Slide 5




How do we characterize “power”?
• Different machines have different amounts of
  “power”.
• Tradeoff between:
   • More power: Can model more complex
     classifiers but might overfit.
   • Less power: Not going to overfit, but restricted in
     what it can model.
• How do we characterize the amount of power?




Copyright © 2001, Andrew W. Moore                   VC-dimension: Slide 6




Some definitions
•   Given some machine f
•   And under the assumption that all training points (xk ,yk ) were drawn i.i.d
    from some distribution.
•   And under the assumption that future test points will be drawn from the
    same distribution
•   Define
                                           1                 Probabilit y of
                  R(α ) = TESTERR (α ) = E  y − f ( x,α )  =
                                           2               Misclassif ication
     Official terminology           Terminology we’ll use




Copyright © 2001, Andrew W. Moore                                        VC-dimension: Slide 7




                           Some definitions
•   Given some machine f
•   And under the assumption that all training points (xk ,yk ) were drawn i.i.d
    from some distribution.
•   And under the assumption that future test points will be drawn from the
    same distribution
•   Define
                                           1                 Probabilit y of
                  R(α ) = TESTERR (α ) = E  y − f ( x,α )  =
                                           2               Misclassif ication
     Official terminology           Terminology we’ll use

                                                                      Fraction Training
                                             1 R1
                                               ∑ 2 yk − f ( xk ,α ) = Set misclassif ied
         R emp (α ) = TRAINERR(α ) =
                                             R k =1

                                         R = #training set
                                           data points

Copyright © 2001, Andrew W. Moore                                        VC-dimension: Slide 8
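As an aside (my own illustration, not from the slides): TRAINERR is just the fraction of training points the machine gets wrong, so with ±1 labels it can be computed directly from the definition above.

```python
import numpy as np

def trainerr(f, X, y, *alpha):
    """R_emp(alpha): fraction of the R training points misclassified.
    X holds one training input per row, y the +/-1 labels, and
    *alpha the machine's adjustable parameters."""
    predictions = np.array([f(x, *alpha) for x in X])
    # 0.5*|y - f(x,alpha)| is exactly 1 for a misclassified point and 0 otherwise
    return float(np.mean(0.5 * np.abs(y - predictions)))
```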




Vapnik-Chervonenkis dimension
                 1                                               1R1
                                                                     ∑ y k − f ( x k ,α )
TESTERR (α ) = E  y − f ( x, α )            TRAINERR(α ) =
                 2                                               R k =1 2
•   Given some machine f, let h be its VC dimension.
•   h is a measure of f’s power (h does not depend on the choice of training set)
•   Vapnik showed that with probability 1-η


        TESTERR(α) ≤ TRAINERR(α) + √( [h(log(2R/h) + 1) − log(η/4)] / R )

                           This gives us a way to estimate the error on
                           future data based only on the training error
                           and the VC-dimension of f




Copyright © 2001, Andrew W. Moore                                         VC-dimension: Slide 9
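A sketch of the right-hand side of Vapnik's bound as a Python function (assuming the square-root form of the VC-confidence term shown above; the function name and defaults are mine, not the slides'):

```python
import math

def vc_bound(train_err, h, R, eta=0.05):
    """Upper bound on TESTERR that holds with probability 1 - eta,
    given training error, VC dimension h, and R training points."""
    vc_confidence = math.sqrt((h * (math.log(2 * R / h) + 1) - math.log(eta / 4)) / R)
    return train_err + vc_confidence

# e.g. vc_bound(train_err=0.05, h=3, R=1000) gives a (typically loose) guarantee.
```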




    What VC-dimension is used for
                 1                                               1R1
                                                                     ∑ y k − f ( x k ,α )
TESTERR (α ) = E  y − f ( x, α )            TRAINERR(α ) =
                 2                                               R k =1 2
•   Given some machine f, let h be its VC dimension.
•   h is a measure of f’s power.
•   Vapnik showed that with probability 1-η

        TESTERR(α) ≤ TRAINERR(α) + √( [h(log(2R/h) + 1) − log(η/4)] / R )

    [Overlaid on the slide: But given machine f, how do we define and compute h?]
                           This gives us a way to estimate the error on
                           future data based only on the training error
                           and the VC-dimension of f




Copyright © 2001, Andrew W. Moore                                        VC-dimension: Slide 10




Shattering
•   Machine f can shatter a set of points x1, x2 .. xr if and only if…
         For every possible training set of the form (x1,y1) , (x2,y2 ) ,… (xr ,yr)
             …There exists some value of α that gets zero training error.




There are 2^r such training sets to consider, each with a different combination of +1’s and –1’s for the y’s.




Copyright © 2001, Andrew W. Moore                                    VC-dimension: Slide 11




                                    Shattering
•   Machine f can shatter a set of points x1, x2 .. xr if and only if…
      For every possible training set of the form (x1,y1) , (x2,y2 ) ,… (xr ,yr)
      …There exists some value of α that gets zero training error.
•   Question: Can the following f shatter the following points?




                                      f(x,w) = sign(x.w)




Copyright © 2001, Andrew W. Moore                                    VC-dimension: Slide 12




Shattering
•    Machine f can shatter a set of points x1, x2 .. xr if and only if…
       For every possible training set of the form (x1,y1) , (x2,y2 ) ,… (xr ,yr)
       …There exists some value of α that gets zero training error.
•    Question: Can the following f shatter the following points?




                                        f(x,w) = sign(x.w)

 •   Answer: No problem. There are four training sets to consider




     w=(0,1)                 w=(-2,3)        w=(2,-3)          w=(0,-1)
Copyright © 2001, Andrew W. Moore                                         VC-dimension: Slide 13
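Shattering can also be checked mechanically: enumerate all 2^r labelings and, for each one, search for parameters giving zero training error. A brute-force sketch follows (the two points and the grid of candidate weight vectors below are my own illustrative choices, since the slide's figure is not reproduced here):

```python
import itertools
import numpy as np

def can_shatter(f, points, candidate_params):
    """True iff every one of the 2^r labelings of `points` is achieved with zero
    training error by at least one parameter setting in `candidate_params`."""
    for labels in itertools.product([-1, +1], repeat=len(points)):
        if not any(all(f(x, *params) == y for x, y in zip(points, labels))
                   for params in candidate_params):
            return False
    return True

f = lambda x, w: 1 if np.dot(w, x) >= 0 else -1            # f(x,w) = sign(x.w)
points = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
candidates = [(np.array([a, b]),) for a in (-1, 0, 1) for b in (-1, 0, 1)]
print(can_shatter(f, points, candidates))                   # True: two points, four labelings
```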




                                    Shattering
•    Machine f can shatter a set of points x1, x2 .. xr if and only if…
       For every possible training set of the form (x1,y1) , (x2,y2 ) ,… (xr ,yr)
       …There exists some value of α that gets zero training error.
•    Question: Can the following f shatter the following points?




                                        f(x,b) = sign(x.x-b)




Copyright © 2001, Andrew W. Moore                                         VC-dimension: Slide 14




Shattering
•       Machine f can shatter a set of points x1, x2 .. xr if and only if…
          For every possible training set of the form (x1,y1) , (x2,y2 ) ,… (xr ,yr)
          …There exists some value of α that gets zero training error.
•       Question: Can the following f shatter the following points?




                                       f(x,b) = sign(x.x-b)

    •    Answer: No way my friend.




Copyright © 2001, Andrew W. Moore                                      VC-dimension: Slide 15




             Definition of VC dimension
Given machine f, the VC-dimension h is
          The maximum number of points that can be
          arranged so that f shatters them.
Example: What’s VC dimension of f(x,b) = sign(x.x-b)




Copyright © 2001, Andrew W. Moore                                      VC-dimension: Slide 16




VC dim of trivial circle
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: What’s VC dimension of f(x,b) = sign(x.x-b)
   Answer = 1: we can’t even shatter two points! (but it’s
   clear we can shatter 1)




Copyright © 2001, Andrew W. Moore                 VC-dimension: Slide 17




                       Reformulated circle
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: For 2-d inputs, what’s VC dimension of f(x,q,b) =
  sign(qx.x-b)




Copyright © 2001, Andrew W. Moore                 VC-dimension: Slide 18




Reformulated circle
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: What’s VC dimension of f(x,q,b) = sign(qx.x-b)




  •   Answer = 2
                                        q,b are -ve




Copyright © 2001, Andrew W. Moore                     VC-dimension: Slide 19




                       Reformulated circle
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: What’s VC dimension of f(x,q,b) = sign(qx.x-b)




  •   Answer = 2 (clearly can’t do 3)
                                        q,b are -ve




Copyright © 2001, Andrew W. Moore                     VC-dimension: Slide 20




VC dim of separating line
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: For 2-d inputs, what’s VC-dim of f(x,w,b) = sign(w.x+b)?
Well, can f shatter these three points?




Copyright © 2001, Andrew W. Moore                                         VC-dimension: Slide 21




                 VC dim of line machine
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: For 2-d inputs, what’s VC-dim of f(x,w,b) = sign(w.x+b)?
Well, can f shatter these three points?


                                    Yes, of course.
                                    All -ve or all +ve is trivial
                                    One +ve can be picked off by a line
                                    One -ve can be picked off too.




Copyright © 2001, Andrew W. Moore                                         VC-dimension: Slide 22




VC dim of line machine
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: For 2-d inputs, what’s VC-dim of f(x,w,b) = sign(w.x+b)?
Well, can we find four points that f can shatter?




Copyright © 2001, Andrew W. Moore                                       VC-dimension: Slide 23




                 VC dim of line machine
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: For 2-d inputs, what’s VC-dim of f(x,w,b) = sign(w.x+b)?
Well, can we find four points that f can shatter?



                                    Can always draw six lines between pairs of four
                                    points.




Copyright © 2001, Andrew W. Moore                                       VC-dimension: Slide 24




VC dim of line machine
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: For 2-d inputs, what’s VC-dim of f(x,w,b) = sign(w.x+b)?
Well, can we find four points that f can shatter?



                                    Can always draw six lines between pairs of four
                                    points.
                                    Two of those lines will cross.




Copyright © 2001, Andrew W. Moore                                          VC-dimension: Slide 25




                 VC dim of line machine
Given machine f, the VC-dimension h is
        The maximum number of points that can be
        arranged so that f shatters them.
Example: For 2-d inputs, what’s VC-dim of f(x,w,b) = sign(w.x+b)?
Well, can we find four points that f can shatter?



                                    Can always draw six lines between pairs of four
                                    points.
                                    Two of those lines will cross.
                                    If we put points linked by the crossing lines in the
                                    same class they can’t be linearly separated
                                    So a line can shatter 3 points but not 4
                                    So VC-dim of Line Machine is 3
Copyright © 2001, Andrew W. Moore                                          VC-dimension: Slide 26




VC dim of linear classifiers in m-dimensions
If input space is m-dimensional and if f is sign(w.x-b), what is
    the VC-dimension?
Proof that h >= m: Show that m points can be shattered
Can you guess how?




Copyright © 2001, Andrew W. Moore                                                   VC-dimension: Slide 27




   VC dim of linear classifiers in m-dimensions
If input space is m-dimensional and if f is sign(w.x-b), what is
    the VC-dimension?
Proof that h >= m: Show that m points can be shattered
Define m input points thus:

        x1 = (1,0,0,…,0)
        x2 = (0,1,0,…,0)
        :
        xm = (0,0,0,…,1)            So xk [j] = 1 if k=j and 0 otherwise


Let y1, y2 ,… ym , be any one of the 2m combinations of class
  labels.
Guess how we can define w1 , w2,… wm and b to ensure
  sign(w·xk + b) = yk for all k?  Note:

        sign(w·xk + b) = sign( b + Σ_{j=1}^{m} wj·xk[j] )
Copyright © 2001, Andrew W. Moore                                                   VC-dimension: Slide 28




VC dim of linear classifiers in m-dimensions
If input space is m-dimensional and if f is sign(w.x-b), what is
    the VC-dimension?
Proof that h >= m: Show that m points can be shattered
Define m input points thus:

        x1 = (1,0,0,…,0)
        x2 = (0,1,0,…,0)
        :
        xm = (0,0,0,…,1)            So xk [j] = 1 if k=j and 0 otherwise


Let y1, y2 ,… ym , be any one of the 2m combinations of class
  labels.
Guess how we can define w1 , w2,… wm and b to ensure
  sign(w·xk + b) = yk for all k?  Note:

        sign(w·xk + b) = sign( b + Σ_{j=1}^{m} wj·xk[j] )

Answer: b = 0 and wk = yk for all k.
Copyright © 2001, Andrew W. Moore                                          VC-dimension: Slide 29
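A quick numerical check of this construction (a sketch, not part of the slides): with the m unit-basis points, setting w = y and b = 0 gets every one of the 2^m labelings exactly right.

```python
import itertools
import numpy as np

m = 4
X = np.eye(m)                       # x_k is the k-th unit vector, so x_k[j] = 1 iff k == j
for y in itertools.product([-1, 1], repeat=m):
    y = np.array(y)
    w, b = y, 0.0                   # the construction from the slide: w_k = y_k, b = 0
    assert np.array_equal(np.sign(X @ w + b), y)   # zero training error for this labeling
print("All", 2 ** m, "labelings are realized, so these m points are shattered.")
```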




   VC dim of linear classifiers in m-dimensions
If input space is m-dimensional and if f is sign(w.x-b), what is
    the VC-dimension?
• Now we know that h >= m
• In fact, h=m+1
• Proof that h >= m+1 is easy
• Proof that h < m+2 is moderate




Copyright © 2001, Andrew W. Moore                                          VC-dimension: Slide 30




What does VC-dim measure?
    • Is it the number of parameters?
                 Related but not really the same.
    • I can create a machine with one numeric
      parameter that really encodes 7 parameters
      (How?)
    • And I can create a machine with 7 parameters
      which has a VC-dim of 1 (How?)
    •   Andrew’s private opinion: it often is the number of parameters that counts.




    Copyright © 2001, Andrew W. Moore                                           VC-dimension: Slide 31




                 Structural Risk Minimization
    •   Let φ(f) = the set of functions representable by f.
    •   Suppose  φ(f1) ⊆ φ(f2) ⊆ … ⊆ φ(fn)
    •   Then  h(f1) ≤ h(f2) ≤ … ≤ h(fn)   (Hey, can you formally prove this?)
    •   We’re trying to decide which machine to use.
    •   We train each machine and make a table…

        TESTERR(α) ≤ TRAINERR(α) + √( [h(log(2R/h) + 1) − log(η/4)] / R )

     i | fi | TRAINERR | VC-Conf | Probable upper bound on TESTERR | Choice
     1 | f1 |          |         |                                 |
     2 | f2 |          |         |                                 |
     3 | f3 |          |         |                                 |
     4 | f4 |          |         |                                 |
     5 | f5 |          |         |                                 |
     6 | f6 |          |         |                                 |
    (A ✓ in the Choice column marks the machine selected.)
    Copyright © 2001, Andrew W. Moore                                           VC-dimension: Slide 32
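A sketch of how such a table could be filled in programmatically, reusing the bound from earlier (the machine names and numbers below are made up purely for illustration):

```python
import math

def vc_bound(train_err, h, R, eta=0.05):
    return train_err + math.sqrt((h * (math.log(2 * R / h) + 1) - math.log(eta / 4)) / R)

def srm_table(machines, R):
    """machines: list of (name, TRAINERR, VC-dimension); pick the smallest upper bound."""
    rows = [(name, err, h, vc_bound(err, h, R)) for name, err, h in machines]
    chosen = min(rows, key=lambda row: row[3])
    for name, err, h, bound in rows:
        mark = "   <-- choice" if name == chosen[0] else ""
        print(f"{name}: TRAINERR={err:.3f}  h={h}  probable upper bound={bound:.3f}{mark}")

srm_table([("f1", 0.20, 2), ("f2", 0.10, 5), ("f3", 0.02, 40)], R=500)
```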




Using VC-dimensionality
    That’s what VC-dimensionality is about
    People have worked hard to find VC-dimension for..
        • Decision Trees
        • Perceptrons
        • Neural Nets
        • Decision Lists
        • Support Vector Machines
        • And many many more
    All with the goals of
        1. Understanding which learning machines are more or
            less powerful under which circumstances
        2. Using Structural Risk Minimization to choose the best learning machine
    Copyright © 2001, Andrew W. Moore                                 VC-dimension: Slide 33




     Alternatives to VC-dim-based model selection
    • What could we do instead of the scheme below?




        TESTERR(α) ≤ TRAINERR(α) + √( [h(log(2R/h) + 1) − log(η/4)] / R )

     i | fi | TRAINERR | VC-Conf | Probable upper bound on TESTERR | Choice
     1 | f1 |          |         |                                 |
     2 | f2 |          |         |                                 |
     3 | f3 |          |         |                                 |
     4 | f4 |          |         |                                 |
     5 | f5 |          |         |                                 |
     6 | f6 |          |         |                                 |
    (A ✓ in the Choice column marks the machine selected.)
    Copyright © 2001, Andrew W. Moore                                 VC-dimension: Slide 34




Alternatives to VC-dim-based model selection
     • What could we do instead of the scheme below?
        1. Cross-validation




     i | fi | TRAINERR | 10-FOLD-CV-ERR | Choice
     1 | f1 |          |                |
     2 | f2 |          |                |
     3 | f3 |          |                |
     4 | f4 |          |                |
     5 | f5 |          |                |
     6 | f6 |          |                |
    (A ✓ in the Choice column marks the machine selected.)

     Copyright © 2001, Andrew W. Moore                                    VC-dimension: Slide 35
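A minimal sketch of 10-fold cross-validation error for one machine (here `train` is a hypothetical stand-in for whatever fitting procedure each fi uses; it must return a predict function):

```python
import numpy as np

def kfold_cv_error(train, X, y, k=10):
    """Average misclassification rate over k held-out folds.
    train(X_train, y_train) must return a function predict(X) -> array of +/-1 labels."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for held_out in folds:
        keep = np.ones(len(X), dtype=bool)
        keep[held_out] = False
        predict = train(X[keep], y[keep])
        errors.append(np.mean(predict(X[held_out]) != y[held_out]))
    return float(np.mean(errors))
```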




      Alternatives to VC-dim-based model selection
     • What could we do instead of the scheme below?
        1. Cross-validation
        2. AIC (Akaike Information Criterion)

           AICSCORE = LL(Data | MLE params) − (# parameters)

        [Side note: As the amount of data goes to infinity, AIC promises* to select the model
         that’ll have the best likelihood for future data.   *Subject to about a million caveats.]

     i | fi | LOGLIKE(TRAINERR) | #parameters | AIC | Choice
     1 | f1 |                   |             |     |
     2 | f2 |                   |             |     |
     3 | f3 |                   |             |     |
     4 | f4 |                   |             |     |
     5 | f5 |                   |             |     |
     6 | f6 |                   |             |     |
    (A ✓ in the Choice column marks the machine selected.)
     Copyright © 2001, Andrew W. Moore                                    VC-dimension: Slide 36




Alternatives to VC-dim-based model selection
    • What could we do instead of the scheme below?
        1. Cross-validation
        2. AIC (Akaike Information Criterion)
        3. BIC (Bayesian Information Criterion)

           BICSCORE = LL(Data | MLE params) − (#params / 2) · log R

        [Side note: As the amount of data goes to infinity, BIC promises* to select the model
         that the data was generated from. More conservative than AIC.   *Another million caveats.]

     i | fi | LOGLIKE(TRAINERR) | #parameters | BIC | Choice
     1 | f1 |                   |             |     |
     2 | f2 |                   |             |     |
     3 | f3 |                   |             |     |
     4 | f4 |                   |             |     |
     5 | f5 |                   |             |     |
     6 | f6 |                   |             |     |
    (A ✓ in the Choice column marks the machine selected.)
    Copyright © 2001, Andrew W. Moore                                VC-dimension: Slide 37
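Both scores are one-liners once the maximized log-likelihood is in hand; a sketch matching the formulas above (function names are mine). In both cases the model with the largest score is preferred.

```python
import math

def aic_score(loglike, n_params):
    """AICSCORE = LL(Data | MLE params) - (# parameters)."""
    return loglike - n_params

def bic_score(loglike, n_params, R):
    """BICSCORE = LL(Data | MLE params) - (#params / 2) * log(R), where R = # data points."""
    return loglike - 0.5 * n_params * math.log(R)
```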




      Which model selection method is best?
            1.     (CV) Cross-validation
            2.     AIC (Akaike Information Criterion)
            3.     BIC (Bayesian Information Criterion)
             4.     (SRMVC) Structural Risk Minimization with VC-dimension
    • AIC, BIC and SRMVC have the advantage that you only need the
      training error.
    • CV error might have more variance
    • SRMVC is wildly conservative
    • Asymptotically AIC and Leave-one-out CV should be the same
    • Asymptotically BIC and a carefully chosen k-fold should be the same
    • BIC is what you want if you want the best structure instead of the best
      predictor (e.g. for clustering or Bayes Net structure finding)
    • Many alternatives to the above including proper Bayesian approaches.
    • It’s an emotional issue.
    Copyright © 2001, Andrew W. Moore                                VC-dimension: Slide 38




Extra Comments
• Beware: that second “VC-confidence” term is
  usually very very conservative (at least hundreds
  of times larger than the empirical overfitting effect).
• An excellent tutorial on VC-dimension and Support
  Vector Machines (which we’ll be studying soon):
            C.J.C. Burges. A tutorial on support vector machines
             for pattern recognition. Data Mining and Knowledge
             Discovery, 2(2):955-974, 1998.
             http://citeseer.nj.nec.com/burges98tutorial.html




Copyright © 2001, Andrew W. Moore                     VC-dimension: Slide 39




                  What you should know
• The definition of a learning machine: f(x,α )
• The definition of Shattering
• Be able to work through simple examples of
  shattering
• The definition of VC-dimension
• Be able to work through simple examples of VC-
  dimension
• Structural Risk Minimization for model selection
• Awareness of other model selection methods



Copyright © 2001, Andrew W. Moore                     VC-dimension: Slide 40




