DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011

Lecture 7
Classification and Prediction
Naïve Bayes, Regression and SVM

by Kritsada Sriphaew (sriphaew.k AT gmail.com)

1
Topics
 Statistical Modeling: Naïve Bayes Classification
   sparseness problem
   missing value
   numeric attributes
 Regression
        Linear Regression
        Regression Tree
    Support Vector Machine


 2                            Data Warehousing and Data Mining by Kritsada Sriphaew
Statistical Modeling
 “Opposite” of 1R: use all the attributes
 Two assumptions: Attributes are
        equally important
        statistically independent (given the class value)
 This means that knowledge about the value of a
  particular attribute doesn’t tell us anything about the
  value of another attribute (if the class is known)
 Although based on assumptions that are almost never
  correct, this scheme works well in practice!

 3                                                    Classification – Naïve Bayes
An Example: Evaluating the Weather Attributes (Revised)

 Weather data (nominal attributes):

   Outlook    Temp.   Humidity   Windy   Play
   sunny      hot     high       false   no
   sunny      hot     high       true    no
   overcast   hot     high       false   yes
   rainy      mild    high       false   yes
   rainy      cool    normal     false   yes
   rainy      cool    normal     true    no
   overcast   cool    normal     true    yes
   sunny      mild    high       false   no
   sunny      cool    normal     false   yes
   rainy      mild    normal     false   yes
   sunny      mild    normal     true    yes
   overcast   mild    high       true    yes
   overcast   hot     normal     false   yes
   rainy      mild    high       true    no

 1R rules and their errors:

   Attribute   Rule               Error   Total Error
   Outlook     sunny → no          2/5       4/14
               overcast → yes      0/4
               rainy → yes         2/5
   Temp.       hot → no*           2/4       5/14
               mild → yes          2/6
               cool → yes          1/4
   Humidity    high → no           3/7       4/14
               normal → yes        1/7
   Windy       false → yes         2/8       5/14
               true → no*          3/6

 1R chooses the attribute that produces rules with the smallest
 number of errors, i.e., rule 1 (Outlook) or rule 3 (Humidity)

 4                                                               Classification – Naïve Bayes
Probabilities for the Weather Data

 [Figure: frequency counts and conditional probabilities of each attribute
  value for play = yes and play = no, plus the class priors (the
  probabilistic model)]
5                             Classification – Naïve Bayes
Bayes’s Rule
 Probability of event H given evidence E:

              p(H | E) = p(E | H) · p(H) / p(E)
 A priori probability of H: p(H)
   Probability of event before evidence has
    been seen
 A posteriori probability of H: p(H|E)
   Probability of event after evidence has been
    seen
6                                               Classification – Naïve Bayes
Naïve Bayes for Classification
       Classification learning: what’s the probability of the class given
        an instance?
         Evidence E = instance
         Event H = class value for instance

        Naïve Bayes assumption: "independent feature model", i.e.,
        the presence (or absence) of a particular attribute (or
        feature) of a class is unrelated to the presence (or absence)
        of any other attribute, therefore:
                                  p(H | E) = p(E | H) · p(H) / p(E)

          p(H | E1, E2, ..., En) = p(E1 | H) · p(E2 | H) · ... · p(En | H) · p(H) / p(E)
    7                                                                 Classification – Naïve Bayes
Naïve Bayes for Classification




 p(play = y | outlook = s, temp = c, humid = h, windy = t)

      p(out = s | pl = y) · p(te = c | pl = y) · p(hu = h | pl = y) · p(wi = t | pl = y) · p(pl = y)
   =  ---------------------------------------------------------------------------------------------
                                   p(out = s, te = c, hu = h, wi = t)

            (2/9) · (3/9) · (3/9) · (3/9) · (9/14)
   =  --------------------------------------------------
            p(out = s, te = c, hu = h, wi = t)
8                                                                            Classification – Naïve Bayes
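
 A minimal Python sketch of the computation above, using the conditional
 probabilities read off the weather-data counts; the normalization step shows
 how the denominator p(out = s, te = c, hu = h, wi = t) cancels out:

    # Likelihoods p(attribute value | class) and class priors from the weather data
    p_given_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9,
                   "humidity=high": 3/9, "windy=true": 3/9}
    p_given_no  = {"outlook=sunny": 3/5, "temp=cool": 1/5,
                   "humidity=high": 4/5, "windy=true": 3/5}
    prior = {"yes": 9/14, "no": 5/14}

    # Unnormalized posteriors: product of the likelihoods times the prior
    score_yes = prior["yes"]
    for p in p_given_yes.values():
        score_yes *= p
    score_no = prior["no"]
    for p in p_given_no.values():
        score_no *= p

    # Normalizing makes the evidence term p(E) cancel out
    total = score_yes + score_no
    print(f"p(play=yes | E) = {score_yes / total:.3f}")   # about 0.205
    print(f"p(play=no  | E) = {score_no  / total:.3f}")   # about 0.795
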
The Sparseness Problem
(The “zero-frequency problem”)

        What if an attribute value never occurs with a particular class value (e.g.,
         "Outlook = overcast" for class "no")?
         Probability will be zero! P(outlook=overcast|play=no) = 0
         A posteriori probability will also be zero! (No matter how likely the
          other values are!)
          P(play=no|outlook=overcast, temp=cool, humidity=high, windy=true) = 0

       Remedy: add 1 to the count for every attribute value-class
        combination (Laplace estimator)
       Result: probabilities will never be zero! (also: stabilizes probability
        estimates)

    9                                                                    Classification – Naïve Bayes
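
 A minimal Python sketch of the Laplace estimator described above, applied to
 the counts of "outlook" within class "no" (sunny = 3, overcast = 0, rainy = 2):

    counts = {"sunny": 3, "overcast": 0, "rainy": 2}

    def laplace_probs(counts, k=1):
        """p(value | class) with add-k smoothing: add k to every count."""
        total = sum(counts.values()) + k * len(counts)
        return {v: (c + k) / total for v, c in counts.items()}

    print(laplace_probs(counts))
    # {'sunny': 0.5, 'overcast': 0.125, 'rainy': 0.375} -- no zero probabilities
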
Modified Probability Estimates
    In some cases adding a constant different from 1 might
     be more appropriate
    Example: attribute outlook for class yes
     We can use equal weights, or unequal weights, as long as the
      weights sum to 1 (that is, p1 + p2 + p3 = 1)
        Equal weight                          Normalized weight (p1 + p2 + p3 = 1)

        sunny    = (2 + m/3) / (9 + m)        sunny    = (2 + m·p1) / (9 + m)

        overcast = (4 + m/3) / (9 + m)        overcast = (4 + m·p2) / (9 + m)

        rainy    = (3 + m/3) / (9 + m)        rainy    = (3 + m·p3) / (9 + m)
10                                                  Classification – Naïve Bayes
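
 A minimal Python sketch of the weighted estimator above for attribute
 "outlook" within class "yes" (counts 2, 4 and 3 out of 9); with equal prior
 weights of 1/3 and m = 3 it reduces to the Laplace estimator:

    counts = {"sunny": 2, "overcast": 4, "rainy": 3}

    def m_estimate(counts, m, priors):
        """p(value | class) smoothed with m virtual examples split by prior weights."""
        total = sum(counts.values())
        return {v: (c + m * priors[v]) / (total + m) for v, c in counts.items()}

    print(m_estimate(counts, m=3, priors={v: 1/3 for v in counts}))
    # {'sunny': 0.25, 'overcast': 0.4166..., 'rainy': 0.3333...}
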
Missing Value Problem
 Training: instance is not included in the frequency
  count for attribute value-class combination
 Classification: attribute will be omitted from
  calculation




 11                                        Classification – Naïve Bayes
Dealing with Numeric Attributes
 Common assumption: attributes have a normal or
  Gaussian probability distribution (given the class)
 The probability density function for the normal
  distribution is defined by:
        The sample mean:            μ = (x1 + x2 + ... + xn) / n

        The standard deviation:     σ = sqrt( Σi (xi − μ)² / (n − 1) )

        The density function f(x):

                  f(x) = 1 / (√(2π) σ) · e^( −(x − μ)² / (2σ²) )

 12                                        Classification – Naïve Bayes
An Example: Evaluating the Weather
Attributes (Numeric)
          Outlook    Temp.   Humidity   Windy          Play
           sunny      85        85      false           no
           sunny      80        90      true            no
          overcast    83        86      false           yes
           rainy      70        96      false           yes
           rainy      68        80      false           yes
           rainy      65        70      true            no
          overcast    64        65      true            yes
           sunny      72        95      false           no
           sunny      69        70      false           yes
           rainy      75        80      false           yes
           sunny      75        70      true            yes
          overcast    72        90      true            yes
          overcast    81        75      false           yes
           rainy      71        91      true            no
13                                       Classification – Naïve Bayes
Statistics for the Weather Data




    Example for density value (using the means and standard deviations
     computed from the numeric weather data):

          f(temperature = 66 | yes) = 1/(√(2π)·6.2) · e^( −(66 − 73)² / (2·6.2²) )  = 0.0340

          f(humidity = 90 | no)     = 1/(√(2π)·9.7) · e^( −(90 − 86.2)² / (2·9.7²) ) = 0.0380

14                                                                        Classification – Naïve Bayes
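
 A minimal Python sketch of the density calculation above (the means and
 standard deviations 73, 6.2 and 86.2, 9.7 are those computed from the
 numeric weather data):

    import math

    def gaussian_density(x, mean, std):
        """Normal probability density f(x) for a given mean and standard deviation."""
        return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

    print(round(gaussian_density(66, 73.0, 6.2), 4))    # about 0.0340  (temperature = 66 | yes)
    print(round(gaussian_density(90, 86.2, 9.7), 4))    # about 0.0381  (humidity = 90 | no)
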
Classify a New Case

     Classify a new case (if any values are missing, in both training and
      classification, simply omit them from the calculation)

 [Figure: the new case we would like to predict, and the class probabilities
  obtained from the conditional probabilities and priors above]

15                                                 L6: Statistical Classification Approach
Probability Densities
    Relationship between probability and density: the probability that a
     numeric attribute lies within a small interval of width ε around x is
     approximately ε · f(x)

  But: this doesn't change the calculation of a posteriori
   probabilities because the factor ε cancels out
  Exact relationship:

              P(a ≤ x ≤ b) = ∫ab f(t) dt

 16                                        Classification – Naïve Bayes
Discussion of Naïve Bayes
 Naïve Bayes works surprisingly well
  (even if independence assumption is clearly violated)
 Why? Because classification doesn’t require accurate
  probability estimates as long as maximum probability
    is assigned to correct class
 However: adding too many redundant attributes will
  cause problems (e.g., identical attributes)
 Note also: many numeric attributes are not normally
  distributed (→ use kernel density estimators instead)

 17                                      Classification – Naïve Bayes
General Bayesian Classification
 Probabilistic learning: Calculate explicit probabilities
  for hypotheses; among the most practical approaches
  to certain types of learning problems
 Incremental: Each training example can incrementally
  increase/decrease the probability that a hypothesis
  is correct. Prior knowledge can be combined with
  observed data.
 Probabilistic prediction: Predict multiple hypotheses,
  weighted by their probabilities

 18                                        Classification – Naïve Bayes
Bayesian Theorem
    Given training data D, posteriori probability of a hypothesis h,
     P(h|D) follows the Bayes theorem
                          P(h|D) = P(D|h) P(h) / P(D)

     MAP (maximum a posteriori) hypothesis

              h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

    Difficulty: need initial knowledge of many probabilities, significant
     computational cost
    If assume P(hi) = P(hj) then method can further simplify, and
     choose the Maximum Likelihood (ML) hypothesis
              h_ML = argmax_{hi∈H} P(hi|D) = argmax_{hi∈H} P(D|hi)



    19                                                          Classification – Naïve Bayes
Naïve Bayes Classifiers
    Assumption: attributes are conditionally independent:
              c_MAP = argmax_{ci∈C} P(ci | {v1, v2, ..., vJ})

                    = argmax_{ci∈C} P(ci) · Π_{j=1..J} P(vj | ci)



    Greatly reduces the computation cost, only count the class distribution.
    However, it is seldom satisfied in practice, as attributes (variables) are often
     correlated.
    Attempts to overcome this limitation:
      Bayesian networks, that combine Bayesian reasoning with causal relationships
        between attributes
       Decision trees, which reason on one attribute at a time, considering the most
         important attributes first
       Association rules, which predict a class from several attributes jointly


    20                                                                         Classification – Naïve Bayes
Bayesian Belief Network
(An Example)

 [Figure: a directed acyclic graph over the variables Storm, BusTourGroup,
  Lightning, Campfire, Thunder and ForestFire; Campfire's parents are
  Storm and BusTourGroup]

 The conditional probability table (CPT) for the variable Campfire:

                 (S,B)   (S,~B)   (~S,B)   (~S,~B)
           C      0.4      0.1      0.8      0.2
          ~C      0.6      0.9      0.2      0.8

   •  The network represents a set of conditional independence assertions.
   •  Directed acyclic graph (also called a Bayes net)
  Attributes (variables) are often correlated.
  Each variable is conditionally independent of its nondescendants, given its
   immediate predecessors (parents)

    21                                                                Classification – Naïve Bayes
Bayesian Belief Network
(Dependence and Independence)

 [Figure: the same network, with the values 0.7 and 0.85 shown at the root
  nodes Storm and BusTourGroup, and the CPT for Campfire:]

                 (S,B)   (S,~B)   (~S,B)   (~S,~B)
           C      0.4      0.1      0.8      0.2
          ~C      0.6      0.9      0.2      0.8

    Represents the joint probability distribution over all variables, e.g.,
     P(Storm, BusTourGroup, ..., ForestFire)
    In general,

              P(y1, y2, ..., yn) = Π_{i=1..n} P(yi | Parents(Yi))

     where Parents(Yi) denotes the immediate predecessors of Yi in the graph
    So, the joint distribution is fully defined by the graph plus the tables
     P(yi | Parents(Yi))
    22                                                                        Classification – Naïve Bayes
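
 A minimal Python sketch of the factored joint distribution above, reading the
 two numbers at the root nodes as P(Storm) = 0.7 and P(BusTourGroup) = 0.85
 (an assumption about the figure) and using the Campfire CPT; the descendants
 of Campfire marginalize away:

    p_storm, p_bus = 0.7, 0.85
    p_campfire = {(True, True): 0.4, (True, False): 0.1,
                  (False, True): 0.8, (False, False): 0.2}   # P(C = true | S, B)

    def p_joint(storm, bus, campfire):
        """P(Storm, BusTourGroup, Campfire) = P(S) * P(B) * P(C | S, B)."""
        ps = p_storm if storm else 1 - p_storm
        pb = p_bus if bus else 1 - p_bus
        pc = p_campfire[(storm, bus)]
        return ps * pb * (pc if campfire else 1 - pc)

    print(p_joint(True, True, True))    # 0.7 * 0.85 * 0.4 = 0.238
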
Bayesian Belief Network
(Inference in Bayes Nets)

    Infer the values of one or more network variables, given
     observed values of others
        Bayes net contains all information needed for this inference
        If only one variable with unknown value, it is easy to infer it.
        In general case, the problem is NP hard.
        Anyway, there are three types of inference.

          Top-down inference:      p(Campfire|Storm)
          Bottom-up inference:     p(Storm|Campfire)
          Hybrid inference:        p(BusTourGroup|Storm,Campfire)



    23                                                       Classification – Naïve Bayes
Bayesian Belief Network
(Training Bayesian Belief Networks)
 Several variants of this learning task
        Network structure might be known or unknown
        Training examples might provide values of all network variables, or
         just some
     If the structure is known and all variables are observed
       Then it is as easy as training a Naïve Bayes classifier
     If the structure is known but only some variables are observed, e.g., we
      observe ForestFire, Storm, BusTourGroup and Thunder, but not Lightning or
      Campfire
         Use gradient ascent
         Converge to the network h that maximizes P(D|h)


    24                                                      Classification – Naïve Bayes
Numerical Modeling: Regression
 Numerical model is used for prediction
 Counterparts exist for all schemes that we previously
  discussed
       Decision trees, statistical models, etc.
   All classification schemes can be applied to
    regression problems using discretization
       Prediction: weighted average of intervals’ midpoints
        (weighted according to class probabilities)
    Regression is more difficult than classification (i.e., it is
     evaluated by mean squared error rather than percent correct)

 25                                                   Prediction – Regression
Linear Regression
 Work most naturally with numeric attributes
 Standard technique for numeric prediction
        Outcome is a linear combination of the attributes:

             Y = Σ_{j=0..k} wj xj = w0 x0 + w1 x1 + w2 x2 + ... + wk xk

  Weights are calculated from the training data
        Predicted value for the first training instance X(1):

             Y(1) = Σ_{j=0..k} wj xj(1) = w0 x0(1) + w1 x1(1) + ... + wk xk(1)

 26                                                    Prediction – Regression
Minimize the Squared Error (I)
  The k+1 coefficients are chosen so that the squared error
   on the training data is minimized
  Squared error:

             Σ_{i=1..n} ( y(i) − Σ_{j=0..k} wj xj(i) )²

    The coefficients can be derived using standard matrix
     operations
        Can be done if there are more instances than attributes
         (roughly speaking)
        If there are fewer instances than attributes, there are many solutions
        Minimizing the absolute error is more difficult!

 27                                                   Prediction – Regression
Minimize the Squared Error (II)

 In matrix form, the same objective is

      min_w  Σ_{i=1..n} ( y(i) − Σ_{j=0..k} wj xj(i) )²   =   min_w  || Y − X w ||²

 where

      Y = [ y(1), y(2), ..., y(n) ]ᵀ                      (n × 1)

      X = [ x0(1)  x1(1)  ...  xk(1) ]
          [ x0(2)  x1(2)  ...  xk(2) ]                    (n × (k+1))
          [  ...    ...   ...   ...  ]
          [ x0(n)  x1(n)  ...  xk(n) ]

      w = [ w0, w1, ..., wk ]ᵀ                            ((k+1) × 1)


28                                                         Prediction – Regression
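
 A minimal numpy sketch (numpy and the synthetic data are assumptions; the
 slide only states the matrix form) of solving min ||Y − Xw||² via the normal
 equations XᵀX w = Xᵀ Y:

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 50, 3
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])   # column of ones for x0
    true_w = np.array([2.0, 1.0, -3.0, 0.5])
    Y = X @ true_w + 0.1 * rng.normal(size=n)                   # synthetic targets

    w = np.linalg.solve(X.T @ X, X.T @ Y)                       # normal equations
    print(w)                                                    # close to [2.0, 1.0, -3.0, 0.5]
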
Example: Find the linear regression of the salary data

 Salary data (X = {x1}, Y):

   Years of experience (x1)   Salary in $1000s (y)
            3                        30
            8                        57
            9                        64
           13                        72
            3                        36
            6                        43
           11                        59
           21                        90
            1                        20
           16                        83

 General model:   Y = Σ_{j=0..k} wj xj = w0 x0 + w1 x1 + w2 x2 + ... + wk xk

 For simplicity, x0 = 1, therefore   Y = w0 + w1 x1

 With the method of least squared error:

      w1 = Σ_{i=1..s} (x1i − mean(x1)) (yi − mean(y)) / Σ_{i=1..s} (x1i − mean(x1))²  ≈ 3.5

      w0 = mean(y) − w1 · mean(x1)  ≈ 23.55

 where mean(x1) = 9.1, mean(y) = 55.4, and s = number of training instances = 10

 The predicted line is therefore estimated as   Y = 23.55 + 3.5 x1

 Prediction for X = 10:   Y = 23.55 + 3.5(10) = 58.55
     29                                                                             Prediction – Regression
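
 A minimal numpy sketch checking the worked example above (numpy is an
 assumption; the slide computes the sums by hand):

    import numpy as np

    x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16], dtype=float)       # years of experience
    y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float) # salary in $1000s

    w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    w0 = y.mean() - w1 * x.mean()
    print(round(w1, 2), round(w0, 2))   # about 3.54 and 23.21; the slide rounds w1 to 3.5
                                        # first, which gives w0 = 55.4 - 3.5 * 9.1 = 23.55
    print(round(w0 + w1 * 10, 2))       # prediction for 10 years of experience, about 58.6
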
Classification using Linear Regression (One against the Others)

    Any regression technique can be used for classification
        Training: perform a regression for each class, setting the
         output to 1 for training instances that belong to the class and
         0 for those that do not
        Prediction: predict the class whose model gives the
         largest output value (membership value)
    For linear regression, this is known as multi-response
     linear regression
        For example, if the data has three classes {A, B, C}:
         Linear Regression Model 1: predict 1 for class A and 0 for not A
         Linear Regression Model 2: predict 1 for class B and 0 for not B
         Linear Regression Model 3: predict 1 for class C and 0 for not C
30                                                                 Prediction – Regression
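
 A minimal scikit-learn sketch of multi-response linear regression as
 described above (the library and the synthetic three-class data are
 assumptions, not part of the lecture):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    # Synthetic data: three classes A, B, C centred at (0,0), (3,0) and (0,3)
    X = rng.normal(size=(90, 2)) + np.repeat([[0, 0], [3, 0], [0, 3]], 30, axis=0)
    y = np.repeat(["A", "B", "C"], 30)

    models = {}
    for cls in np.unique(y):
        target = (y == cls).astype(float)          # 1 for this class, 0 otherwise
        models[cls] = LinearRegression().fit(X, target)

    def predict(point):
        scores = {cls: m.predict([point])[0] for cls, m in models.items()}
        return max(scores, key=scores.get)         # class with the largest membership value

    print(predict([3.1, -0.2]))                    # expected "B": near class B's centre
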
Classification using Linear Regression (Pairwise
Regression)
    Another way of using regression for classification:
        Build a regression function for every pair of classes, using only
         instances from these two classes
        An output of +1 is assigned to one member of the pair, an
         output of -1 to the other
    Prediction is done by voting
        The class that receives the most votes is predicted
        Alternative: "don't know" if there is no agreement
    More likely to be accurate, but more expensive
        For example, if the data has three classes {A, B, C}:
         Linear Regression Model 1: predict +1 for class A and -1 for class B
         Linear Regression Model 2: predict +1 for class A and -1 for class C
         Linear Regression Model 3: predict +1 for class B and -1 for class C
31                                                                  Prediction – Regression
Regression Tree and Model Tree

 A regression tree is a decision tree with averaged numeric values at the leaves.
 A model tree is a tree whose leaves contain linear regressions.

 CPU performance data (predict PRP from the other attributes):

       cycle   main memory (Kb)   cache   channels    perfor-
       time     min      max      (Kb)    min   max   mance
       MYCT    MMIN     MMAX      CACH   CHMIN CHMAX   PRP
  1     125     256      6000      256     16   128    198
  2      29    8000     32000       32      8    32    269
  3      29    8000     32000       32      8    32    220
  4      29    8000     32000       32      8    32    172
  5      29    8000     16000       32      8    16    132
  ...   ...     ...       ...      ...    ...   ...    ...
  207   125    2000      8000        0      2    14     52
  208   480     512      8000       32      0     0     67
  209   480    1000      4000        0      0     0     45

 Linear regression over the whole data set:

   PRP = -55.9 + 0.0489 MYCT + 0.153 MMIN + 0.0056 MMAX
         + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX

 [Figure: a regression tree for the CPU data, splitting first on CHMIN (<=7.5 / >7.5)
  and then on CACH, MMAX, MMIN and MYCT, with averaged PRP values such as
  19.3 (28/8.7%), 29.8 (37/8.18%), 75.7 (10/24.6%), 133 (16/28.8%) and 783 (5/35.9%)
  at the leaves]

 [Figure: a model tree for the same data, splitting on CHMIN, CACH and MMAX,
  with the linear models LM1-LM6 at the leaves:]

   LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
   LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
   LM3: PRP = 38.1 + 0.12 MMIN
   LM4: PRP = 19.5 + 0.02 MMAX + 0.698 CACH + 0.969 CHMAX
   LM5: PRP = 285 + 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
   LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX

 32                                                            Prediction – Regression
Support Vector Machine (SVM)
    SVM is related to statistical learning theory
    SVM was first introduced in 1992 [1] by Vladimir Vapnik, a Soviet
     Union researcher
    SVM becomes popular because of its success in handwritten digit
     recognition
      1.1% test error rate for SVM. This is the same as the error rates of a
        carefully constructed neural network, LeNet 4.
     SVM is now regarded as an important example of "kernel methods",
      one of the key areas in machine learning.
     SVM is popularly used in classification tasks



    33                                                      Support Vector Machines
What is a good Decision Boundary?

    A two-class, linearly separable classification problem
    Many decision boundaries!
         The Perceptron algorithm can be used to find such a boundary
         Different algorithms have been proposed
    Are all decision boundaries equally good?

 [Figure: points of Class 1 and Class 2 with several possible separating lines]

    34                                              Support Vector Machines
Examples of Bad Decision Boundaries

 [Figure: two separable point clouds (Class 1 and Class 2) with decision
  boundaries that pass too close to one of the classes, compared with the
  best boundary, which leaves room on both sides]

35                                  Support Vector Machines
Large-margin Decision Boundary

     The decision boundary should be as far away from the
      data of both classes as possible
       We should maximize the margin, m
       The distance between the origin and the line wᵀx = k is k/||w||
       Hence the margin is  m = 2/||w||

 [Figure: Class 1 and Class 2 separated by the hyperplane wᵀx + b = 0, with
  the supporting hyperplanes wᵀx + b = 1 and wᵀx + b = -1 a distance m apart;
  the weight vector w is normal to the boundary]

 36                                             Support Vector Machines
 2                       2 4 6
                       w1       w2    b   1   w1   w2         b b b   1  1 1
Example                               4
                                      4
                                                               4 2 3
                                                                                                      2 / 3
                       w1       w2    b   1                                                w       
                                      2                                                             2 / 3
                                     6 
                       w1       w2    b   1                                                b  5
                                      3                  Distance between 2 hyperplanes
                                                                                                       3 2
                                                                             2                    m
7
                                      Class 2
                                                                       m                               2
                                                                          || w ||
6
5

4                                w                                          supports

3
2
         Class 1                                               w x  b 1
                                                                   T

                                      m
1
0
     0     1   2   3         4       5     6     7            w xb  0
                                                                 T



37
     w x  b  1
         T
                                                      Support Vector Machines
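
 A minimal numpy sketch checking the worked example above: the three
 support-vector equalities form a 3x3 linear system in (w1, w2, b)
 (numpy is an assumption; the slide solves it by hand):

    import numpy as np

    # Rows are [x1, x2, 1]; the last column multiplies b. Right-hand side is the label.
    A = np.array([[2.0, 4.0, 1.0],
                  [4.0, 2.0, 1.0],
                  [6.0, 3.0, 1.0]])
    y = np.array([-1.0, -1.0, 1.0])

    w1, w2, b = np.linalg.solve(A, y)
    margin = 2.0 / np.hypot(w1, w2)

    print(w1, w2, b)     # 2/3, 2/3, -5
    print(margin)        # 3 * sqrt(2) / 2, about 2.12
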
Example

 Best boundary:   m = 2/||w||

 Solve => maximize m, or equivalently minimize ||w||

 As we also want to prevent data points from falling into the margin, we add
 the following constraint for each point i:

      wᵀxᵢ + b ≥ 1     for xᵢ of the first class, and
      wᵀxᵢ + b ≤ -1    for xᵢ of the second class

 For n points, this can be rewritten as:

      yᵢ (wᵀxᵢ + b) ≥ 1    for all 1 ≤ i ≤ n

 [Figure: the same data with the decision boundary wᵀx + b = 0 and the
  margin hyperplanes wᵀx + b = ±1]

 38                                             Support Vector Machines
Primal form

    Previously, the problem is difficult to solve because it depends on
     ||w||, the norm of w, which involves a square root
    We alter the objective by substituting ||w|| with (1/2)||w||²
     (the factor of 1/2 being used for mathematical convenience)
    This is called a "Quadratic programming (QP) optimization" problem:

      Minimize in (w, b):                   (1/2) ||w||²

      subject to (for any i = 1, ..., n):   yᵢ (wᵀxᵢ + b) ≥ 1

    How to solve this optimization, and more information on SVM, e.g., the
     dual form and kernels, can be found in ref [1]

 [Figure: the same two-class data with margin m and boundary wᵀx + b = 0]

 [1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor
     & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net

 39                                             Support Vector Machines
Extension to Non-linear Decision Boundary
    So far, we have only considered large-margin classifier with a linear
     decision boundary
    How to generalize it to become nonlinear?
    Key idea: transform xi to a higher dimensional space to “make life
     easier”
         Input space: the space where the points xi are located
         Feature space: the space of f(xi) after transformation
     Why transform?
         A linear operation in the feature space is equivalent to a non-linear operation
          in the input space
         Classification can become easier with a proper transformation. In the XOR
          problem, for example, adding a new feature x1·x2 makes the problem linearly
          separable

    40                                                                  Support Vector Machines
Transforming the Data

 [Figure: a mapping f(·) takes each point from the input space to a feature
  space; classes that are not linearly separable in the input space become
  separable after the transformation]

                 Note: in practice the feature space is of higher dimension
                 than the input space

    Computation in the feature space can be costly because it is high
     dimensional
      The feature space is typically infinite-dimensional!
    The kernel trick can help (more info. in ref [1])

 [1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor
     & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net

 41                                             Support Vector Machines
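
 A minimal scikit-learn sketch of the XOR remark above: the four XOR-labelled
 points are not linearly separable in the input space, but appending the
 product feature x1·x2 lets a linear SVM separate them perfectly (scikit-learn
 is an assumption, not part of the lecture):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
    y = np.array([-1, 1, 1, -1])                    # XOR labels

    # Feature space: append x1 * x2 as a third coordinate
    X_feat = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

    clf = SVC(kernel="linear", C=1000.0).fit(X_feat, y)
    print(clf.score(X_feat, y))                     # 1.0 -- separable in the feature space
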
Why SVM Works?

     The feature space is often very high dimensional. Why don't we suffer
      from the curse of dimensionality?
     A classifier in a high-dimensional space has many parameters and is
      hard to estimate
     Vapnik argues that the fundamental problem is not the number of
      parameters to be estimated. Rather, the problem is the flexibility
      of the classifier
     Typically, a classifier with many parameters is very flexible, but there
      are also exceptions
       Let xi = 10^-i, where i ranges from 1 to n. The one-parameter classifier
         sign(sin(a·x)) can classify all xi correctly for every possible
         combination of class labels on the xi
       This 1-parameter classifier is very flexible


    42                                                      Support Vector Machines
Why SVM Works?
    Vapnik argues that the flexibility of a classifier should not be
     characterized by the number of parameters, but by the flexibility
     (capacity) of a classifier
        This is formalized by the “VC-dimension” of a classifier
    Consider a linear classifier in two-dimensional space
    If we have three training data points, no matter how those points
     are labeled, we can classify them perfectly




    43                                                 Support Vector Machines
VC-dimension
    However, if we have four points, we can find a labeling such that
     the linear classifier fails to be perfect




    We can see that 3 is the critical number
    The VC-dimension of a linear classifier in a 2D space is 3
     because, if we have 3 points in the training set, perfect
     classification is always possible irrespective of the labeling,
     whereas for 4 points, perfect classification can be impossible


    44                                                   Support Vector Machines
Other Aspects of SVM
    How to use SVM for multi-class classification?
      Original SVM is for binary classification
      One can change the QP formulation to become multi-class
      More often, multiple binary classifiers are combined
      One can train multiple one-versus-the-rest classifiers, or combine multiple pairwise
       classifiers “intelligently”
    How to interpret the SVM discriminant function value as probability?
      By performing logistic regression on the SVM output of a set of data (validation
       set) that is not used for training
     Some SVM software (like libsvm) has these features built in
         A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
         Some implementations (such as LIBSVM) can handle multi-class classification
         SVMlight is one of the earliest implementations of SVM
         Several Matlab toolboxes for SVM are also available


    45                                                                      Support Vector Machines
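
 A minimal scikit-learn sketch (an assumption; the slide itself mentions
 libsvm and SVMlight) of the points above: combining binary SVMs for a
 3-class problem, and Platt-style probability calibration of the outputs:

    from sklearn import datasets, svm
    from sklearn.multiclass import OneVsRestClassifier

    X, y = datasets.load_iris(return_X_y=True)                   # a 3-class data set

    ovo = svm.SVC(kernel="rbf").fit(X, y)                        # libsvm-style pairwise (one-vs-one)
    ovr = OneVsRestClassifier(svm.SVC(kernel="rbf")).fit(X, y)   # explicit one-vs-rest
    print(ovo.score(X, y), ovr.score(X, y))

    # Probability estimates via logistic (Platt) calibration on internal cross-validation
    prob_svm = svm.SVC(kernel="rbf", probability=True).fit(X, y)
    print(prob_svm.predict_proba(X[:1]))                         # class membership probabilities
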
Strengths and Weaknesses of SVM
   Strengths
       Training is relatively easy
           No local optima, unlike in neural networks
       It scales relatively well to high dimensional data
       Tradeoff between classifier complexity and error can be
        controlled explicitly
       Non-traditional data like strings and trees can be used as
        input to SVM, instead of feature vectors
   Weaknesses
       Need to choose a “good” kernel function.
46                                                   Support Vector Machines
Example: Predicting a class label
using naïve Bayesian classification
           RID   age     income   student    credit_rating   Class: buys_computer
           1     <=30    high     no         fair            no
           2     <=30    high     no         excellent       no
           3     31…40   high     no         fair            yes
           4     >40     medium   no         fair            yes
           5     >40     low      yes        fair            yes
           6     >40     low      yes        excellent       no
           7     31…40   low      yes        excellent       yes
           8     <=30    medium   no         fair            no
           9     <=30    low      yes        fair            yes
           10    >40     medium   yes        fair            yes
           11    <=30    medium   yes        excellent       yes
           12    31…40   medium   no         excellent       yes
           13    31…40   high     yes        fair            yes
           14    >40     medium   no         excellent       no
 Unknown
 sample    15    <=30    medium   yes        fair            ?

 47
                                       Data Warehousing and Data Mining by Kritsada Sriphaew
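
 A minimal Python sketch of the naïve Bayes computation for the table above,
 with RIDs 1-14 as training data and RID 15 = (age <= 30, income = medium,
 student = yes, credit_rating = fair) as the unknown sample:

    from collections import defaultdict

    train = [  # (age, income, student, credit_rating, buys_computer) for RIDs 1-14
        ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31...40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
        ("31...40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
    ]
    x_new = ("<=30", "medium", "yes", "fair")            # RID 15

    class_count = defaultdict(int)                       # class -> count
    value_count = defaultdict(int)                       # (attribute index, value, class) -> count
    for *attrs, cls in train:
        class_count[cls] += 1
        for i, v in enumerate(attrs):
            value_count[(i, v, cls)] += 1

    scores = {}
    for cls, n_cls in class_count.items():
        score = n_cls / len(train)                       # prior P(class)
        for i, v in enumerate(x_new):
            score *= value_count[(i, v, cls)] / n_cls    # P(value | class)
        scores[cls] = score

    print(scores)                       # P(X|no)P(no) ~ 0.007, P(X|yes)P(yes) ~ 0.028
    print(max(scores, key=scores.get))  # 'yes' -- predicted class for RID 15
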
Exercise:
Outlook   Temperature   Humidity Windy Play
Sunny     Hot           High     False   N    Use the naïve Bayes classifier
                                              to predict the class of the
Sunny     Hot           High     True    N
                                              unknown data samples below
Overcast Hot            High     False   Y
Rainy     Mild          High     False   Y
Rainy     Cool          Normal   False   Y
Rainy     Cool          Normal   True    N
Overcast Cool           Normal   True    Y
Sunny     Mild          High     False   N
Sunny     Cool          Normal   False   Y
Rainy     Mild          Normal   False   Y
Sunny     Mild          Normal   True    Y
Overcast Hot            Normal   False   Y
Overcast Mild           High     True    Y
Rainy     Mild          High     True    N
Sunny     Cool          Normal   False            Unknown data samples
Rainy     Mild          High     False

 48

More Related Content

Viewers also liked

Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAswathy S Nair
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseYang Li
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache KylinYang Li
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseHBaseCon
 
References - sql injection
References - sql injection References - sql injection
References - sql injection Mohammed
 
Oracle-Mengendalikan User
Oracle-Mengendalikan UserOracle-Mengendalikan User
Oracle-Mengendalikan Useridnats
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)SANG WON PARK
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 

Viewers also liked (17)

Datawarehouse and OLAP
Datawarehouse and OLAPDatawarehouse and OLAP
Datawarehouse and OLAP
 
Dbm630_lecture02-03
Dbm630_lecture02-03Dbm630_lecture02-03
Dbm630_lecture02-03
 
Dbm630 lecture05
Dbm630 lecture05Dbm630 lecture05
Dbm630 lecture05
 
Dbm630 lecture09
Dbm630 lecture09Dbm630 lecture09
Dbm630 lecture09
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache Kylin
 
Datacube
DatacubeDatacube
Datacube
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
References
References References
References
 
References - sql injection
References - sql injection References - sql injection
References - sql injection
 
Testing
TestingTesting
Testing
 
Data cubes
Data cubesData cubes
Data cubes
 
Oracle-Mengendalikan User
Oracle-Mengendalikan UserOracle-Mengendalikan User
Oracle-Mengendalikan User
 
MPLS
MPLSMPLS
MPLS
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 

Similar to Dbm630 lecture07

2009 naive bayes classifiers unknown
2009 naive bayes classifiers   unknown2009 naive bayes classifiers   unknown
2009 naive bayes classifiers unknownGeorge Ang
 
Dwdm naive bayes_ankit_gadgil_027
Dwdm naive bayes_ankit_gadgil_027Dwdm naive bayes_ankit_gadgil_027
Dwdm naive bayes_ankit_gadgil_027ankitgadgil
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networksguestfee8698
 
Module 4 bayes classification
Module 4 bayes classificationModule 4 bayes classification
Module 4 bayes classificationSatishH5
 

Similar to Dbm630 lecture07 (6)

2009 naive bayes classifiers unknown
2009 naive bayes classifiers   unknown2009 naive bayes classifiers   unknown
2009 naive bayes classifiers unknown
 
Dwdm naive bayes_ankit_gadgil_027
Dwdm naive bayes_ankit_gadgil_027Dwdm naive bayes_ankit_gadgil_027
Dwdm naive bayes_ankit_gadgil_027
 
Navies bayes
Navies bayesNavies bayes
Navies bayes
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networks
 
Module 4 bayes classification
Module 4 bayes classificationModule 4 bayes classification
Module 4 bayes classification
 

More from Tokyo Institute of Technology (11)

Lecture 4 online and offline business model generation
Lecture 4 online and offline business model generationLecture 4 online and offline business model generation
Lecture 4 online and offline business model generation
 
Lecture 4: Brand Creation
Lecture 4: Brand CreationLecture 4: Brand Creation
Lecture 4: Brand Creation
 
Lecture3 ExperientialMarketing
Lecture3 ExperientialMarketingLecture3 ExperientialMarketing
Lecture3 ExperientialMarketing
 
Lecture3 Tools and Content Creation
Lecture3 Tools and Content CreationLecture3 Tools and Content Creation
Lecture3 Tools and Content Creation
 
Lecture2: Innovation Workshop
Lecture2: Innovation WorkshopLecture2: Innovation Workshop
Lecture2: Innovation Workshop
 
Lecture0: introduction Online Marketing
Lecture0: introduction Online MarketingLecture0: introduction Online Marketing
Lecture0: introduction Online Marketing
 
Lecture2: Marketing and Social Media
Lecture2: Marketing and Social MediaLecture2: Marketing and Social Media
Lecture2: Marketing and Social Media
 
Lecture1: E-Commerce Business Model
Lecture1: E-Commerce Business ModelLecture1: E-Commerce Business Model
Lecture1: E-Commerce Business Model
 
Lecture0: Introduction Social Commerce
Lecture0: Introduction Social CommerceLecture0: Introduction Social Commerce
Lecture0: Introduction Social Commerce
 
Dbm630 lecture06
Dbm630 lecture06Dbm630 lecture06
Dbm630 lecture06
 
Coursesyllabus_dbm630
Coursesyllabus_dbm630Coursesyllabus_dbm630
Coursesyllabus_dbm630
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
Dbm630 lecture07

  • 7. Naïve Bayes for Classification  Classification learning: what is the probability of the class given an instance?  Evidence E = instance; event H = class value for the instance.  Naïve Bayes assumption: the "independent feature model", i.e. the presence (or absence) of a particular attribute (or feature) of a class is unrelated to the presence (or absence) of any other attribute, therefore p(H|E) = p(E|H) p(H) / p(E) becomes p(H | E1, E2, ..., En) = p(E1|H) × p(E2|H) × ... × p(En|H) × p(H) / p(E). 7 Classification – Naïve Bayes
  • 8. Naïve Bayes for Classification  p(play = yes | outlook = sunny, temp = cool, humid = high, windy = true) = p(out = s | pl = y) × p(te = c | pl = y) × p(hu = h | pl = y) × p(wi = t | pl = y) × p(pl = y) / p(out = s, te = c, hu = h, wi = t) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / p(out = s, te = c, hu = h, wi = t). 8 Classification – Naïve Bayes
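To make the arithmetic concrete, here is a minimal Python sketch of the same calculation; the play = yes factors are the ones on the slide, and the play = no factors are read off the same frequency counts in the weather table.

```python
# Naive Bayes numerators for the new day:
# outlook=sunny, temperature=cool, humidity=high, windy=true
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # likelihoods * prior for play=yes
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # likelihoods * prior for play=no

# Normalize so the two posteriors sum to 1; p(E) cancels out
total = p_yes + p_no
print(f"p(play=yes | E) = {p_yes / total:.3f}")   # about 0.205
print(f"p(play=no  | E) = {p_no  / total:.3f}")   # about 0.795
```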
  • 9. The Sparseness Problem (The “zero-frequency problem”)  What if an attribute value doesn’t occur with every class value (e. g. “Outlook = overcast” for class “no”)?  Probability will be zero! P(outlook=overcast|play=no) = 0  A posteriori probability will also be zero! (No matter how likely the other values are!) P(play=no|outlook=overcast, temp=cool, humidity=high, windy=true) = 0  Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)  Result: probabilities will never be zero! (also: stabilizes probability estimates) 9 Classification – Naïve Bayes
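Below is a small, self-contained sketch of a categorical naïve Bayes learner with this kind of additive smoothing; it is illustrative code, not the lecture's own implementation, and the pseudocount k generalizes the Laplace estimator (k = 1) toward the weighted scheme on the next slide.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels, k=1.0):
    """Count-based naive Bayes; k is the additive pseudocount (k=1: Laplace)."""
    classes = Counter(labels)                            # class frequencies
    counts = defaultdict(lambda: defaultdict(Counter))   # counts[attr][class][value]
    values = defaultdict(set)                            # distinct values per attribute
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            counts[j][c][v] += 1
            values[j].add(v)
    return classes, counts, values, k

def predict_nb(model, row):
    classes, counts, values, k = model
    n = sum(classes.values())
    scores = {}
    for c, nc in classes.items():
        score = nc / n                                   # class prior
        for j, v in enumerate(row):
            num = counts[j][c][v] + k                    # smoothed count
            den = nc + k * len(values[j])                # never zero
            score *= num / den
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}
```

Trained on the 14 weather instances, predict_nb on ("sunny", "cool", "high", "true") returns the smoothed counterpart of the posterior worked out on slide 8.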
  • 10. Modified Probability Estimates  In some cases adding a constant different from 1 might be more appropriate.  Example: the attribute outlook for class yes.  We can apply an equal weight, or the weights don't need to be equal, as long as they sum to 1 (that is, p1 + p2 + p3 = 1).  Equal weight: sunny → (2 + m/3)/(9 + m), overcast → (4 + m/3)/(9 + m), rainy → (3 + m/3)/(9 + m).  Normalized weights (p1 + p2 + p3 = 1): sunny → (2 + m·p1)/(9 + m), overcast → (4 + m·p2)/(9 + m), rainy → (3 + m·p3)/(9 + m). 10 Classification – Naïve Bayes
  • 11. Missing Value Problem  Training: instance is not included in the frequency count for attribute value-class combination  Classification: attribute will be omitted from calculation 11 Classification – Naïve Bayes
  • 12. Dealing with Numeric Attributes  Common assumption: attributes have a normal or Gaussian probability distribution (given the class).  The probability density function for the normal distribution is defined by the sample mean μ = (1/n) Σ_{i=1}^{n} xi, the standard deviation σ = sqrt( (1/(n−1)) Σ_{i=1}^{n} (xi − μ)² ), and the density function f(x) = 1/(√(2π) σ) · e^{−(x−μ)²/(2σ²)}. 12 Classification – Naïve Bayes
  • 13. An Example: Evaluating the Weather Attributes (Numeric)
    Outlook    Temp.  Humidity  Windy  Play
    sunny      85     85        false  no
    sunny      80     90        true   no
    overcast   83     86        false  yes
    rainy      70     96        false  yes
    rainy      68     80        false  yes
    rainy      65     70        true   no
    overcast   64     65        true   yes
    sunny      72     95        false  no
    sunny      69     70        false  yes
    rainy      75     80        false  yes
    sunny      75     70        true   yes
    overcast   72     90        true   yes
    overcast   81     75        false  yes
    rainy      71     91        true   no
    13 Classification – Naïve Bayes
  • 14. Statistics for the Weather Data  Example density values: f(temperature = 66 | yes) = 1/(√(2π)·6.2) · e^{−(66−73)²/(2·6.2²)} = 0.0340 and f(humidity = 90 | no) = 1/(√(2π)·9.7) · e^{−(90−86.2)²/(2·9.7²)} = 0.0380. 14 Classification – Naïve Bayes
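These two densities can be checked quickly in Python; the means and standard deviations (73 and 6.2 for temperature given yes, 86.2 and 9.7 for humidity given no) are the ones quoted on the slide.

```python
import math

def gaussian(x, mu, sigma):
    """Normal probability density used for numeric attributes."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian(66, mu=73.0, sigma=6.2))    # ~0.034, f(temperature=66 | yes)
print(gaussian(90, mu=86.2, sigma=9.7))    # ~0.038, f(humidity=90 | no)
```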
  • 15. Classify a New Case  Classify a new case (if there are any missing values, in either training or classification, omit them).  The slide shows the case we would like to predict. 15 L6: Statistical Classification Approach
  • 16. Probability Densities  Relationship between probability and density: Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε · f(c) for a small interval of width ε around c.  But this doesn't change the calculation of a posteriori probabilities, because the factor ε cancels out.  Exact relationship: Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt. 16 Classification – Naïve Bayes
  • 17. Discussion of Naïve Bayes  Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated).  Why?  Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class.  However, adding too many redundant attributes will cause problems (e.g. identical attributes).  Note also that many numeric attributes are not normally distributed (→ kernel density estimators). 17 Classification – Naïve Bayes
  • 18. General Bayesian Classification  Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems  Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.  Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities 18 Classification – Naïve Bayes
  • 19. Bayesian Theorem  Given training data D, the posteriori probability of a hypothesis h, P(h|D), follows Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D).  The MAP (maximum a posteriori) hypothesis is h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h).  Difficulty: this needs initial knowledge of many probabilities and has a significant computational cost.  If we assume P(hi) = P(hj) for all i and j, the method simplifies further and chooses the maximum likelihood (ML) hypothesis h_ML = argmax_{h∈H} P(D|h). 19 Classification – Naïve Bayes
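A tiny illustration of the difference between the two hypotheses; the likelihoods and priors below are invented placeholder numbers, not values from the lecture.

```python
likelihood = {"h1": 0.20, "h2": 0.35, "h3": 0.10}   # P(D|h), illustrative values
prior      = {"h1": 0.50, "h2": 0.10, "h3": 0.40}   # P(h), illustrative values

h_ml  = max(likelihood, key=likelihood.get)                      # argmax P(D|h)
h_map = max(likelihood, key=lambda h: likelihood[h] * prior[h])  # argmax P(D|h)P(h)
print(h_ml, h_map)   # h2 (ML) vs. h1 (MAP): the prior changes the winner
```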
  • 20. Naïve Bayes Classifiers  Assumption: attributes are conditionally independent given the class: c_MAP = argmax_{ci∈C} P(ci | {v1, v2, ..., vJ}) = argmax_{ci∈C} P(ci) ∏_{j=1}^{J} P(vj | ci).  This greatly reduces the computation cost: only the class distribution and per-class value counts are needed.  However, the assumption is seldom satisfied in practice, as attributes (variables) are often correlated.  Attempts to overcome this limitation include: Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes; decision trees, which reason on one attribute at a time, considering the most important attributes first; and association rules, which reason about a class using several attributes. 20 Classification – Naïve Bayes
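For comparison, a hedged sketch using a library implementation; it assumes scikit-learn (0.22 or later, which provides CategoricalNB) is installed, and the ordinal encoding of the weather attributes is chosen here purely for illustration.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Weather data: outlook, temperature, humidity, windy  ->  play
X_raw = [["sunny", "hot", "high", "false"], ["sunny", "hot", "high", "true"],
         ["overcast", "hot", "high", "false"], ["rainy", "mild", "high", "false"],
         ["rainy", "cool", "normal", "false"], ["rainy", "cool", "normal", "true"],
         ["overcast", "cool", "normal", "true"], ["sunny", "mild", "high", "false"],
         ["sunny", "cool", "normal", "false"], ["rainy", "mild", "normal", "false"],
         ["sunny", "mild", "normal", "true"], ["overcast", "mild", "high", "true"],
         ["overcast", "hot", "normal", "false"], ["rainy", "mild", "high", "true"]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)                 # categorical values -> integer codes
clf = CategoricalNB(alpha=1.0)               # alpha=1.0 is the Laplace estimator
clf.fit(X, y)
new_day = enc.transform([["sunny", "cool", "high", "true"]])
print(clf.predict(new_day), clf.predict_proba(new_day))
```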
  • 21. Bayesian Belief Network (An Example)  The network (also called a Bayes net) is a directed acyclic graph over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder and ForestFire, and it represents a set of conditional independence assertions: each variable is conditionally independent of its non-descendants given its immediate predecessors.  Attributes (variables) are often correlated.  The conditional probability table (CPT) for the variable Campfire is: P(C | S, B) = 0.4, P(C | S, ¬B) = 0.1, P(C | ¬S, B) = 0.8, P(C | ¬S, ¬B) = 0.2, with P(¬C | ·) given by the complements 0.6, 0.9, 0.2 and 0.8. 21 Classification – Naïve Bayes
  • 22. Bayesian Belief Network (Dependence and Independence)  The same network, with the CPT for Campfire as on the previous slide; the figure also annotates the root nodes Storm and BusTourGroup with prior probabilities 0.7 and 0.85.  The network represents the joint probability distribution over all variables, e.g. P(Storm, BusTourGroup, ..., ForestFire).  In general, P(y1, y2, ..., yn) = ∏_{i=1}^{n} P(yi | Parents(Yi)), where Parents(Yi) denotes the immediate predecessors of Yi in the graph.  So the joint distribution is fully defined by the graph plus the tables P(yi | Parents(Yi)). 22 Classification – Naïve Bayes
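A small sketch of evaluating the factored joint distribution for part of this network; the Campfire CPT values are the ones on the slide, while the root priors (0.7 for Storm, 0.85 for BusTourGroup) are read off the figure and the query itself is only an illustration.

```python
# P(Storm, BusTourGroup, Campfire) = P(S) * P(B) * P(C | S, B)
p_storm = 0.7
p_bus = 0.85
p_campfire_given = {                 # CPT for Campfire from the slide
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm: bool, bus: bool, campfire: bool) -> float:
    """Joint probability of an assignment to S, B, C via the network's factorization."""
    p_s = p_storm if storm else 1 - p_storm
    p_b = p_bus if bus else 1 - p_bus
    p_c = p_campfire_given[(storm, bus)]
    if not campfire:
        p_c = 1 - p_c
    return p_s * p_b * p_c

print(joint(storm=True, bus=False, campfire=True))   # 0.7 * 0.15 * 0.1 = 0.0105
```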
  • 23. Bayesian Belief Network (Inference in Bayes Nets)  Infer the values of one or more network variables, given observed values of others.  The Bayes net contains all the information needed for this inference.  If only one variable has an unknown value, it is easy to infer it; in the general case the problem is NP-hard.  Three types of inference are distinguished: top-down inference, e.g. p(Campfire | Storm); bottom-up inference, e.g. p(Storm | Campfire); and hybrid inference, e.g. p(BusTourGroup | Storm, Campfire). 23 Classification – Naïve Bayes
  • 24. Bayesian Belief Network (Training Bayesian Belief Networks)  There are several variants of this learning task: the network structure might be known or unknown, and the training examples might provide values of all network variables or just some.  If the structure is known and all variables are observed, training is as easy as training a naïve Bayes classifier.  If the structure is known but only some variables are observed, e.g. we observe ForestFire, Storm, BusTourGroup and Thunder but not Lightning and Campfire, use gradient ascent, which converges to the network h that maximizes P(D|h). 24 Classification – Naïve Bayes
  • 25. Numerical Modeling: Regression  Numerical model is used for prediction  Counterparts exist for all schemes that we previously discussed  Decision trees, statistical models, etc.  All classification schemes can be applied to regression problems using discretization  Prediction: weighted average of intervals’ midpoints (weighted according to class probabilities)  Regression is more difficult than classification (i. e. percent correct vs. mean squared error) 25 Prediction – Regression
  • 26. Linear Regression  Works most naturally with numeric attributes.  Standard technique for numeric prediction.  The outcome is a linear combination of the attributes: Y = Σ_{j=0}^{k} wj xj = w0 x0 + w1 x1 + w2 x2 + ... + wk xk.  The weights are calculated from the training data.  Predicted value for the first instance X^(1): Y^(1) = Σ_{j=0}^{k} wj xj^(1) = w0 x0^(1) + w1 x1^(1) + ... + wk xk^(1). 26 Prediction – Regression
  • 27. Minimize the Squared Error (I)  The k+1 coefficients are chosen so that the squared error on the training data is minimized.  Squared error: Σ_{i=1}^{n} ( y^(i) − Σ_{j=0}^{k} wj xj^(i) )².  The coefficients can be derived using standard matrix operations.  This can be done if there are more instances than attributes (roughly speaking); if there are fewer instances, there are many solutions.  Minimization of the absolute error is more difficult! 27 Prediction – Regression
  • 28. Minimize the Squared Error (II)  In matrix form, Y ≈ X w, and the objective is min_w Σ_{i=1}^{n} ( y^(i) − Σ_{j=0}^{k} wj xj^(i) )² = min_w || Y − X w ||², where Y is the n×1 vector of target values, X is the n×(k+1) matrix whose i-th row is (x0^(i), x1^(i), ..., xk^(i)) including the constant column x0, and w is the (k+1)×1 weight vector. 28 Prediction – Regression
  • 29. Example: Find the linear regression of the salary data  X = {x1} (years of experience), Y = salary (in $1000s); for simplicity x0 = 1, so Y = w0 + w1 x1.  Data (x1, Y): (3, 30), (8, 57), (9, 64), (13, 72), (3, 36), (6, 43), (11, 59), (21, 90), (1, 20), (16, 83); s = number of training instances = 10, mean(x1) = 9.1 and mean(y) = 55.4.  With the method of least squared error, w1 = Σ_{i=1}^{s} (x1^(i) − mean(x1)) (y^(i) − mean(y)) / Σ_{i=1}^{s} (x1^(i) − mean(x1))² = 3.5 and w0 = mean(y) − w1 · mean(x1) = 23.55.  The predicted line is therefore estimated by Y = 23.55 + 3.5 x1, and the prediction for X = 10 is Y = 23.55 + 3.5(10) = 58.55. 29 Prediction – Regression
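This fit can be reproduced with NumPy (assumed available); note that the slide rounds w1 to 3.5 before deriving w0 = 23.55, so the exact least-squares numbers differ slightly.

```python
import numpy as np

x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16], dtype=float)        # years of experience
y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float)  # salary in $1000s

w1, w0 = np.polyfit(x, y, deg=1)    # exact least-squares line: w1 ~ 3.54, w0 ~ 23.2
print(round(w1, 2), round(w0, 2))
print(round(w0 + w1 * 10, 2))       # prediction for 10 years of experience, ~58.6
```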
  • 30. Classification using Linear Regression (One with the Others)  Any regression technique can be used for classification.  Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that do not.  Prediction: predict the class corresponding to the model with the largest output value (membership value).  For linear regression, this is known as multi-response linear regression.  For example, if the data has three classes {A, B, C}: Model 1 predicts 1 for class A and 0 for not A; Model 2 predicts 1 for class B and 0 for not B; Model 3 predicts 1 for class C and 0 for not C. 30 Prediction – Regression
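A compact sketch of multi-response linear regression; NumPy is assumed available and the toy three-class data set is invented purely to show the mechanics.

```python
import numpy as np

# Toy 2-D data with three classes, invented for illustration only
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 8.5], [9.0, 1.0], [8.5, 1.5]])
labels = np.array(["A", "A", "B", "B", "C", "C"])

Xb = np.hstack([np.ones((len(X), 1)), X])           # add a constant column for w0
classes = np.unique(labels)
W = []
for c in classes:                                   # one regression per class
    target = (labels == c).astype(float)            # 1 for the class, 0 otherwise
    w, *_ = np.linalg.lstsq(Xb, target, rcond=None)
    W.append(w)
W = np.array(W)                                     # shape: (n_classes, n_features + 1)

x_new = np.array([1.0, 7.5, 1.2])                   # new instance with constant term
print(classes[np.argmax(W @ x_new)])                # class with largest membership value
```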
  • 31. Classification using Linear Regression (Pairwise Regression)  Another way of using regression for classification: build a regression function for every pair of classes, using only instances from those two classes.  An output of +1 is assigned to one member of the pair, an output of −1 to the other.  Prediction is done by voting: the class that receives the most votes is predicted.  Alternative: output "don't know" if there is no agreement.  This is more likely to be accurate, but more expensive.  For example, if the data has three classes {A, B, C}: Model 1 predicts +1 for class A and −1 for class B; Model 2 predicts +1 for class A and −1 for class C; Model 3 predicts +1 for class B and −1 for class C. 31 Prediction – Regression
  • 32. Regression Tree and Model Tree  A regression tree is a decision tree with averaged numeric values at the leaves.  A model tree is a tree whose leaves contain linear regressions.  The slide illustrates both on the CPU performance data (attributes MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX; target PRP): a single global linear regression PRP = −55.9 + 0.0489 MYCT + 0.153 MMIN + 0.0056 MMAX + 0.6410 CACH − 0.2700 CHMIN + 1.480 CHMAX; a regression tree that splits on CHMIN, CACH, MMAX, MMIN and MYCT with averaged PRP values at the leaves; and a model tree with linear models at the leaves, e.g. LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN; LM2: PRP = 20.3 + 0.004 MMIN − 3.99 CHMIN + 0.946 CHMAX; LM3: PRP = 38.1 + 0.12 MMIN; LM4: PRP = 19.5 + 0.02 MMAX + 0.698 CACH + 0.969 CHMAX; LM5: PRP = 285 + 1.46 MYCT + 1.02 CACH − 9.39 CHMIN; LM6: PRP = −65.8 + 0.03 MMIN − 2.94 CHMIN + 4.98 CHMAX. 32 Prediction – Regression
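As a rough code illustration of the regression-tree idea (scikit-learn assumed available; it offers DecisionTreeRegressor but no M5-style model tree, so only the averaged-leaf variant is shown, on synthetic data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                 # two synthetic numeric attributes
y = 3.0 * X[:, 0] + np.where(X[:, 1] > 5, 20.0, 0.0) + rng.normal(0, 1, 200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)   # leaves hold averaged target values
print(export_text(tree, feature_names=["x0", "x1"]))  # the learned splits
print(tree.predict([[2.0, 7.0]]))                     # prediction = mean of the matching leaf
```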
  • 33. Support Vector Machine (SVM)  SVM is related to statistical learning theory.  SVM was first introduced in 1992 [1] by Vladimir Vapnik, a researcher from the Soviet Union.  SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4.  SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.  SVM is widely used for classification tasks. 33 Support Vector Machines
  • 34. What is a good Decision Boundary?  Consider a two-class, linearly separable classification problem (the figure shows a scatter of Class 1 and Class 2 points).  Many decision boundaries are possible!  The perceptron algorithm can be used to find such a boundary, and different algorithms have been proposed.  Are all decision boundaries equally good? 34 Support Vector Machines
  • 35. Examples of Bad Decision Boundaries  (Figure: two scatter plots of Class 1 and Class 2 points; boundaries that pass very close to the points of one class are bad, while the boundary labeled BEST stays far from both classes.) 35 Support Vector Machines
  • 36. Large-margin Decision Boundary  The decision boundary should be as far away from the data of both classes as possible, i.e. we should maximize the margin m.  The distance between the origin and the line wᵀx = k is k / ||w||.  The margin between the hyperplanes wᵀx + b = 1 and wᵀx + b = −1, with the decision boundary wᵀx + b = 0 between them, is m = 2 / ||w||. 36 Support Vector Machines
  • 37. Example  The slide plots a small two-class data set and works a numeric example: the support vectors on each side satisfy wᵀx + b = +1 and wᵀx + b = −1 (with the decision boundary wᵀx + b = 0 between them); solving these equations at the support vectors gives w and b, and the distance between the two hyperplanes is the margin m = 2 / ||w||. 37 Support Vector Machines
  • 38. Best boundary: Example  Solve by maximizing m = 2 / ||w||, or equivalently by minimizing ||w||.  As we also want to prevent data points from falling into the margin, we add the following constraints for each point i: wᵀxi + b ≥ +1 for xi of the first class, and wᵀxi + b ≤ −1 for xi of the second class.  For n points this can be rewritten as yi (wᵀxi + b) ≥ 1 for all 1 ≤ i ≤ n. 38 Support Vector Machines
  • 39. The primal form above is difficult to solve directly because it depends on ||w||, the norm of w, which involves a square root.  We alter the objective by substituting ||w|| with (1/2) ||w||² (the factor of 1/2 being used for mathematical convenience) without changing the solution.  This yields a quadratic programming (QP) optimization problem: minimize over (w, b) the quantity (1/2) wᵀw, subject to yi (wᵀxi + b) ≥ 1 for all i = 1, ..., n.  How to solve this optimization, and more information on SVM, e.g. the dual form and kernels, can be found in ref [1]. [1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net 39 Support Vector Machines
  • 40. Extension to Non-linear Decision Boundary  So far we have only considered a large-margin classifier with a linear decision boundary; how do we generalize it to become non-linear?  Key idea: transform xi to a higher-dimensional space to "make life easier".  Input space: the space in which the points xi are located.  Feature space: the space of f(xi) after the transformation.  Why transform?  A linear operation in the feature space is equivalent to a non-linear operation in the input space, and classification can become easier with a proper transformation; in the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable. 40 Support Vector Machines
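A tiny NumPy demonstration of that XOR remark, with data and weights invented for illustration: in the original two features the classes are not linearly separable, but after appending the product feature x1·x2 a single linear rule separates them.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
y = np.array([0, 1, 1, 0])                                    # XOR labels

# Transformed (feature) space: append the product x1*x2 as a third coordinate
X_feat = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

# In the feature space the linear rule  x1 + x2 - 2*(x1*x2) >= 0.5  reproduces XOR
w, b = np.array([1.0, 1.0, -2.0]), -0.5
pred = (X_feat @ w + b >= 0).astype(int)
print(pred, (pred == y).all())   # [0 1 1 0] True
```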
  • 41. Transforming the Data  (Figure: points in the input space are mapped by a transformation f(·) into the feature space.)  Note: in practice the feature space is of higher dimension than the input space.  Computation in the feature space can be costly because it is high dimensional; the feature space is typically infinite-dimensional!  The kernel trick can help (more info. in ref [1]). [1] Support Vector Machines and other kernel-based learning methods, John Shawe-Taylor & Nello Cristianini, Cambridge University Press, 2000. http://www.support-vector.net 41 Support Vector Machines
  • 42. Why SVM Works  The feature space is often very high dimensional.  Why don't we have the curse of dimensionality?  A classifier in a high-dimensional space has many parameters and is hard to estimate.  Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of the classifier.  Typically a classifier with many parameters is very flexible, but there are also exceptions: let xi = 10^(-i), where i ranges from 1 to n; the one-parameter classifier sign(sin(a·x)) can classify all xi correctly for every possible combination of class labels on the xi, so this one-parameter classifier is very flexible. 42 Support Vector Machines
  • 43. Why SVM works?  Vapnik argues that the flexibility of a classifier should not be characterized by the number of parameters, but by the flexibility (capacity) of a classifier  This is formalized by the “VC-dimension” of a classifier  Consider a linear classifier in two-dimensional space  If we have three training data points, no matter how those points are labeled, we can classify them perfectly 43 Support Vector Machines
  • 44. VC-dimension  However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect  We can see that 3 is the critical number  The VC-dimension of a linear classifier in a 2D space is 3 because, if we have 3 points in the training set, perfect classification is always possible irrespective of the labeling, whereas for 4 points, perfect classification can be impossible 44 Support Vector Machines
  • 45. Other Aspects of SVM  How can SVM be used for multi-class classification?  The original SVM is a binary classifier.  One can change the QP formulation to handle multiple classes, but more often multiple binary classifiers are combined: one can train multiple one-versus-the-rest classifiers, or combine multiple pairwise classifiers "intelligently".  How can the SVM discriminant function value be interpreted as a probability?  By performing logistic regression on the SVM output of a set of data (a validation set) that is not used for training.  Some SVM software (like libsvm) has these features built in.  A list of SVM implementations can be found at http://www.kernel-machines.org/software.html; some implementations (such as LIBSVM) can handle multi-class classification, SVMlight is one of the earliest implementations of SVM, and several Matlab toolboxes for SVM are also available. 45 Support Vector Machines
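A hedged usage sketch with scikit-learn (assumed installed), whose SVC class wraps LIBSVM; probability=True fits the logistic (Platt-style) calibration mentioned above, and the two-blob data set is synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),    # class 0 blob
               rng.normal(loc=3.0, scale=1.0, size=(50, 2))])   # class 1 blob
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0, probability=True)   # nonlinear kernel, soft margin
clf.fit(X, y)
print(len(clf.support_))                           # number of support vectors
print(clf.predict([[1.5, 1.5]]), clf.predict_proba([[1.5, 1.5]]))
```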
  • 46. Strengths and Weaknesses of SVM  Strengths: training is relatively easy; there are no local optima, unlike in neural networks; it scales relatively well to high-dimensional data; the tradeoff between classifier complexity and error can be controlled explicitly; non-traditional data such as strings and trees can be used as input to SVM instead of feature vectors.  Weaknesses: the need to choose a "good" kernel function. 46 Support Vector Machines
  • 47. Example: Predicting a class label using naïve Bayesian classification
    RID  age    income  student  credit_rating  Class: buys_computer
    1    <=30   high    no       fair           no
    2    <=30   high    no       excellent      no
    3    31…40  high    no       fair           yes
    4    >40    medium  no       fair           yes
    5    >40    low     yes      fair           yes
    6    >40    low     yes      excellent      no
    7    31…40  low     yes      excellent      yes
    8    <=30   medium  no       fair           no
    9    <=30   low     yes      fair           yes
    10   >40    medium  yes      fair           yes
    11   <=30   medium  yes      excellent      yes
    12   31…40  medium  no       excellent      yes
    13   31…40  high    yes      fair           yes
    14   >40    medium  no       excellent      no
    15   <=30   medium  yes      fair           ?   (unknown sample)
    47 Data Warehousing and Data Mining by Kritsada Sriphaew
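A self-contained script that carries out this example with the plain (unsmoothed) counting scheme described earlier; it was written for illustration and simply prints the two class scores so the larger one can be read off.

```python
rows = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31-40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31-40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no",  "excellent", "yes"),
    ("31-40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
unknown = ("<=30", "medium", "yes", "fair")   # RID 15

for cls in ("yes", "no"):
    subset = [r for r in rows if r[-1] == cls]
    score = len(subset) / len(rows)                          # prior P(class)
    for j, v in enumerate(unknown):                          # multiply conditionals
        score *= sum(1 for r in subset if r[j] == v) / len(subset)
    print(cls, round(score, 5))                              # yes ~0.0282, no ~0.0069
```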
  • 48. Exercise: Use the naïve Bayesian classifier to predict the class (Play) of the unknown data samples below.
    Outlook   Temperature  Humidity  Windy  Play
    Sunny     Hot          High      False  N
    Sunny     Hot          High      True   N
    Overcast  Hot          High      False  Y
    Rainy     Mild         High      False  Y
    Rainy     Cool         Normal    False  Y
    Rainy     Cool         Normal    True   N
    Overcast  Cool         Normal    True   Y
    Sunny     Mild         High      False  N
    Sunny     Cool         Normal    False  Y
    Rainy     Mild         Normal    False  Y
    Sunny     Mild         Normal    True   Y
    Overcast  Hot          Normal    False  Y
    Overcast  Mild         High      True   Y
    Rainy     Mild         High      True   N
    Unknown data samples:
    Sunny     Cool         Normal    False  ?
    Rainy     Mild         High      False  ?
    48