Introduction to Statistical Machine Learning

                         Christfried Webers

                      Statistical Machine Learning Group
                                     NICTA
                                      and
                College of Engineering and Computer Science
                      The Australian National University


                            Canberra
                       February – June 2011



 (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")



Part VII
Linear Classification 1

Classification
Generalised Linear Model
Inference and Decision
Discriminant Functions
Fisher's Linear Discriminant
The Perceptron Algorithm
Classification

Goal: Given input data x, assign it to one of K discrete classes C_k,
where k = 1, ..., K.
Equivalently: divide the input space into different decision regions.

Figure: Petal length [cm] against sepal length [cm] for iris flowers
(Iris Setosa, Iris Versicolor, Iris Virginica).
How to represent binary class labels?

Class labels are no longer real values as in regression, but a discrete set.
Two classes: t ∈ {0, 1}
(t = 1 represents class C_1 and t = 0 represents class C_2.)
We can interpret the value of t as the probability of class C_1, with only
two values possible for the probability, 0 or 1.
Note: Other conventions for mapping classes to integers are possible; check
the setup.
How to represent multi-class labels?

If there are more than two classes (K > 2), we call it a multi-class setup.
Often used: the 1-of-K coding scheme, in which t is a vector of length K
whose entries are all 0 except for t_j = 1, where j is the index of the class
C_j to be encoded.
Example: Given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded
as the target vector

    t = (0, 1, 0, 0, 0)^T

Note: Other conventions for mapping multi-class labels to integers are
possible; check the setup.
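As a small illustration (a NumPy sketch, not taken from the slides; note it
indexes classes from 0 rather than 1), the 1-of-K encoding can be computed as:

```python
import numpy as np

def one_of_K(labels, K):
    """Encode integer class labels 0, ..., K-1 as 1-of-K target vectors."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# Class C_2 corresponds to index 1 when counting from 0:
print(one_of_K([1], 5))   # [[0. 1. 0. 0. 0.]]
```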




Linear Model

Idea: Use again a linear model as in regression: y(x, w) is a linear function
of the parameters w,

    y(x_n, w) = w^T φ(x_n)

But generally y(x_n, w) ∈ R, while the class labels are discrete.
Example: Which class is y(x, w) = 0.71623?
Generalised Linear Model

Apply a mapping f : R → Z to the linear model to get the discrete class
labels.
Generalised linear model:

    y(x_n, w) = f(w^T φ(x_n))

Activation function: f(·)
Link function: f^{-1}(·)

Figure: Example of an activation function, f(z) = sign(z).
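As a minimal sketch (an illustration, not the slides' code; the names Phi and
glm_predict are assumptions), the generalised linear model with the sign
activation can be written as:

```python
import numpy as np

def glm_predict(w, Phi):
    """y(x_n) = f(w^T phi(x_n)) with activation f = sign.

    Phi is the N x M design matrix whose row n is the feature vector phi(x_n);
    sign(0) is mapped to +1 by the convention chosen here.
    """
    a = Phi @ w                       # linear model w^T phi(x_n) for each n
    return np.where(a >= 0, 1, -1)    # discrete class labels in {-1, +1}
```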


Three Models for Decision Problems

In increasing order of complexity:

Discriminant Functions
    Find a discriminant function f(x) which maps each input directly onto a
    class label.

Discriminative Models
    1. Solve the inference problem of determining the posterior class
       probabilities p(C_k | x).
    2. Use decision theory to assign each new x to one of the classes.

Generative Models
    1. Solve the inference problem of determining the class-conditional
       probabilities p(x | C_k).
    2. Also, infer the prior class probabilities p(C_k).
    3. Use Bayes' theorem to find the posterior p(C_k | x).
    4. Alternatively, model the joint distribution p(x, C_k) directly.
    5. Use decision theory to assign each new x to one of the classes.
Two Classes

Definition
A discriminant is a function that maps from an input vector x to one of K
classes, denoted by C_k.

Consider first two classes (K = 2).
Construct a linear function of the inputs x,

    y(x) = w^T x + w_0

such that x is assigned to class C_1 if y(x) ≥ 0, and to class C_2 otherwise.
w is the weight vector.
w_0 is the bias (sometimes −w_0 is called the threshold).
Two Classes

The decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane in the
D-dimensional input space (the decision surface).
w is orthogonal to any vector lying in the decision surface.
Proof: Assume x_A and x_B are two points lying in the decision surface. Then

    0 = y(x_A) − y(x_B) = w^T (x_A − x_B)
Two Classes

The normal distance from the origin to the decision surface is

    w^T x / ‖w‖ = −w_0 / ‖w‖

Figure: Geometry of the linear discriminant in two dimensions. The decision
surface y(x) = 0 is orthogonal to w, its displacement from the origin is
−w_0/‖w‖, and y(x)/‖w‖ gives the signed distance of a point x from the
surface.
Two Classes

The value of y(x) gives a signed measure of the perpendicular distance r of
the point x from the decision surface, r = y(x)/‖w‖.
Writing x = x⊥ + r w/‖w‖, where x⊥ is the orthogonal projection of x onto the
decision surface,

    y(x) = w^T (x⊥ + r w/‖w‖) + w_0 = r w^T w/‖w‖ + w^T x⊥ + w_0 = r ‖w‖,

since w^T x⊥ + w_0 = y(x⊥) = 0.

Figure: The same geometry as on the previous slide, highlighting the signed
distance y(x)/‖w‖ of x from the decision surface.
Two Classes

More compact notation: add an extra dimension to the input space and set its
value to x_0 = 1.
Also define w̃ = (w_0, w) and x̃ = (1, x). Then

    y(x) = w̃^T x̃

The decision surface is now a D-dimensional hyperplane in the
(D + 1)-dimensional expanded input space.
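A minimal sketch of this augmentation (assuming the inputs are stacked as the
rows of a matrix X):

```python
import numpy as np

def augment(X):
    """Prepend the extra dimension x_0 = 1 to every input row of X."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[2.0, 3.0], [0.5, -1.0]])
print(augment(X))   # [[ 1.   2.   3. ]  [ 1.   0.5 -1. ]]
```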




Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K − 1
one-versus-the-rest classifiers?

Figure: Two one-versus-the-rest discriminants (C_1 versus not C_1, and C_2
versus not C_2) leave a region of input space (marked "?") whose
classification is ambiguous.
Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K(K − 1)/2
one-versus-one classifiers?

Figure: Three one-versus-one discriminants (C_1 vs. C_2, C_1 vs. C_3, and
C_2 vs. C_3) also leave a region (marked "?") whose classification is
ambiguous.
Multi-Class

Number of classes K > 2.
Solution: use K linear functions

    y_k(x) = w_k^T x + w_k0

Assign input x to class C_k if y_k(x) > y_j(x) for all j ≠ k.
The decision boundary between classes C_k and C_j is given by
y_k(x) = y_j(x).

Figure: The resulting decision regions R_i, R_j, R_k are singly connected and
convex: any point x on the line segment between x_A and x_B, both in R_k,
also lies in R_k.
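A sketch of this decision rule (assuming the augmented notation introduced
earlier; the names W and X_tilde are illustrative):

```python
import numpy as np

def classify(W, X_tilde):
    """Assign each input to the class k with the largest y_k(x).

    W: (D+1) x K matrix whose column k holds (w_k0, w_k^T)^T.
    X_tilde: N x (D+1) matrix of augmented inputs (1, x^T).
    Returns class indices 0, ..., K-1.
    """
    Y = X_tilde @ W               # Y[n, k] = y_k(x_n)
    return np.argmax(Y, axis=1)   # y_k(x) > y_j(x) for all j != k
```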




Least Squares for Classification

Regression with a linear function of the model parameters and minimisation of
a sum-of-squares error function resulted in a closed-form solution for the
parameter values.
Is this also possible for classification?
Given input data x belonging to one of K classes C_k.
Use the 1-of-K binary coding scheme.
Each class is described by its own linear model,

    y_k(x) = w_k^T x + w_k0,        k = 1, ..., K
Least Squares for Classification

With the conventions

    w̃_k = (w_k0, w_k^T)^T ∈ R^{D+1}
    x̃   = (1, x^T)^T ∈ R^{D+1}
    W̃   = (w̃_1, ..., w̃_K) ∈ R^{(D+1)×K}

we get for the (vector-valued) discriminant function

    y(x) = W̃^T x̃ ∈ R^K.

For a new input x, the class is then given by the index of the largest value
in the vector y(x).
Determine W

Given a training set {x_n, t_n}, n = 1, ..., N, where t_n is the class in the
1-of-K coding scheme.
Define a matrix T whose row n corresponds to t_n^T.
The sum-of-squares error can now be written as

    E_D(W̃) = (1/2) tr{ (X̃W̃ − T)^T (X̃W̃ − T) }

The minimum of E_D(W̃) is reached for

    W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃^† T

where X̃^† is the pseudo-inverse of X̃.
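A minimal sketch of this closed-form fit (assuming the rows of X_tilde are
augmented inputs and the rows of T are 1-of-K target vectors):

```python
import numpy as np

def fit_least_squares(X_tilde, T):
    """Closed-form least-squares solution W = pinv(X_tilde) @ T."""
    return np.linalg.pinv(X_tilde) @ T   # (X^T X)^{-1} X^T T

def predict(W, X_tilde):
    """Class index = position of the largest component of y(x) = W^T x."""
    return np.argmax(X_tilde @ W, axis=1)
```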



Discriminant Function for Multi-Class

The discriminant function y(x) is therefore

    y(x) = W̃^T x̃ = T^T (X̃^†)^T x̃,

where X̃ is given by the training data, and x̃ is the new (augmented) input.
Interesting property: if every t_n satisfies the same linear constraint
a^T t_n + b = 0, then the prediction y(x) obeys the same constraint,

    a^T y(x) + b = 0.

For the 1-of-K coding scheme, the sum of all components of t_n is one, and
therefore all components of y(x) will sum to one. BUT: the components are not
probabilities, as they are not constrained to the interval (0, 1).
Deficiencies of the Least Squares Approach

Magenta curve: decision boundary for the least squares approach. (Green
curve: decision boundary for the logistic regression model described later.)

Figure: Two-class synthetic data without (left) and with (right) additional
points at the bottom right. The extra points shift the least-squares boundary
substantially, illustrating its sensitivity to outliers, while the logistic
regression boundary is hardly affected.
Deficiencies of the Least Squares Approach

Magenta curve: decision boundary for the least squares approach. (Green
curve: decision boundary for the logistic regression model described later.)

Figure: Synthetic data from three classes. Least squares assigns only a very
small region of input space to one of the classes, whereas logistic
regression separates all three classes well.
Fisher's Linear Discriminant

View linear classification as dimensionality reduction:

    y(x) = w^T x

If y ≥ −w_0, assign class C_1; otherwise C_2.
But there are many projections from a D-dimensional input space onto one
dimension.
Projection always means loss of information.
For classification we want to preserve the class separation in one dimension.
Can we find a projection which maximally preserves the class separation?
Fisher's Linear Discriminant

Figure: Samples from two classes in a two-dimensional input space and their
histograms when projected onto two different one-dimensional spaces.
Fisher's Linear Discriminant - First Try

Given N_1 input data of class C_1 and N_2 input data of class C_2, calculate
the centres of the two classes,

    m_1 = (1/N_1) Σ_{n∈C_1} x_n,        m_2 = (1/N_2) Σ_{n∈C_2} x_n

Choose w so as to maximise the projection of the class means onto w,

    m_1 − m_2 = w^T (m_1 − m_2)

Problem: classes with non-uniform covariance can still overlap strongly under
this projection.

Figure: Two well-separated class means whose projections onto the line
joining them overlap considerably because of the class covariances.
Fisher's Linear Discriminant

Measure also the within-class variance for each class,

    s_k^2 = Σ_{n∈C_k} (y_n − m_k)^2

where y_n = w^T x_n.
Maximise the Fisher criterion

    J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2)

Figure: The projection chosen by the Fisher criterion trades mean separation
against within-class variance and separates the two classes well.
Fisher's Linear Discriminant

The Fisher criterion can be rewritten as

    J(w) = (w^T S_B w) / (w^T S_W w)

S_B is the between-class covariance,

    S_B = (m_2 − m_1)(m_2 − m_1)^T

S_W is the within-class covariance,

    S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T
Fisher's Linear Discriminant

The Fisher criterion

    J(w) = (w^T S_B w) / (w^T S_W w)

has a maximum for Fisher's linear discriminant,

    w ∝ S_W^{−1} (m_2 − m_1)

Fisher's linear discriminant is NOT itself a discriminant, but it can be used
to construct one by choosing a threshold y_0 in the projection space.
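A minimal sketch of the two-class Fisher direction (assuming X1 and X2 are
arrays holding one sample per row for C_1 and C_2; the function name is
illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher direction w ∝ S_W^{-1} (m_2 - m_1) for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class covariance (unnormalised sums, as on the slide)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)   # only the direction matters
```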




Fisher's Discriminant For Multi-Class

Assume that the dimensionality D of the input space is greater than the
number K of classes.
Use D' > 1 linear 'features' y_k = w_k^T x and write everything in vector
form (no bias involved!),

    y = W^T x.

The within-class covariance is then the sum of the covariances for all K
classes,

    S_W = Σ_{k=1}^K S_k

where

    S_k = Σ_{n∈C_k} (x_n − m_k)(x_n − m_k)^T,
    m_k = (1/N_k) Σ_{n∈C_k} x_n
Fisher's Discriminant For Multi-Class

Between-class covariance:

    S_B = Σ_{k=1}^K (m_k − m)(m_k − m)^T,

where m is the total mean of the input data,

    m = (1/N) Σ_{n=1}^N x_n.

One possible way to define a function of W which is large when the
between-class covariance is large and the within-class covariance is small is

    J(W) = tr{ (W S_W W^T)^{−1} (W S_B W^T) }

The maximum of J(W) is determined by the D' eigenvectors of S_W^{−1} S_B with
the largest eigenvalues.
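A sketch of the multi-class case (the names are illustrative; the
eigenvectors are taken via scipy.linalg.eig, since S_W^{-1} S_B is in general
not symmetric):

```python
import numpy as np
from scipy.linalg import eig

def fisher_multiclass(class_data, n_features):
    """Leading n_features eigenvectors of S_W^{-1} S_B (n_features <= K - 1).

    class_data: list of (N_k x D) arrays, one per class.
    """
    m = np.vstack(class_data).mean(axis=0)       # total mean of the input data
    D = m.shape[0]
    S_W, S_B = np.zeros((D, D)), np.zeros((D, D))
    for X_k in class_data:
        m_k = X_k.mean(axis=0)
        S_W += (X_k - m_k).T @ (X_k - m_k)       # within-class covariance S_k
        S_B += np.outer(m_k - m, m_k - m)        # between-class term, as on the slide
    vals, vecs = eig(np.linalg.solve(S_W, S_B))  # eigenproblem for S_W^{-1} S_B
    order = np.argsort(-vals.real)               # largest eigenvalues first
    return vecs[:, order[:n_features]].real      # columns are the directions w_k
```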
Fisher's Discriminant For Multi-Class

How many linear 'features' can one find with this method?
S_B has rank at most K − 1, because it is a sum of K rank-one matrices which
are linked by a global constraint via the total mean m.
Therefore the projection onto the subspace spanned by S_B cannot yield more
than K − 1 linear features.
The Perceptron Algorithm

Frank Rosenblatt (1928 - 1971)
"Principles of neurodynamics: Perceptrons and the theory of brain mechanisms"
(Spartan Books, 1962)
The Perceptron Algorithm

The perceptron ("Mark 1") was the first computer that could learn new skills
by trial and error.
The Perceptron Algorithm

Two-class model.
Create a feature vector φ(x) by a fixed nonlinear transformation of the
input x.
Generalised linear model

    y(x) = f(w^T φ(x))

with φ(x) containing some bias element φ_0(x) = 1.
Nonlinear activation function:

    f(a) = +1 if a ≥ 0,   −1 if a < 0

Target coding for the perceptron:

    t = +1 if x ∈ C_1,   −1 if x ∈ C_2
The Perceptron Algorithm - Error Function

    Idea: Minimise the total number of misclassified patterns.
    Problem: As a function of w, this count is piecewise constant,
    and therefore its gradient is zero almost everywhere.
    Better idea: Using the (−1, +1) target coding scheme, we
    want all patterns to satisfy w^T φ(x_n) t_n > 0.
    Perceptron Criterion: Add the errors over the set of
    misclassified patterns M

                        E_P(w) = − Σ_{n ∈ M} w^T φ(x_n) t_n

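A minimal sketch of the criterion, assuming the feature vectors are stacked in a matrix Phi of shape (N, D) and the targets t take values in {−1, +1}; the names Phi and t are assumptions for illustration.

    import numpy as np

    def perceptron_criterion(w, Phi, t):
        # w^T phi_n t_n for every pattern; positive means correctly classified.
        margins = (Phi @ w) * t
        # Patterns violating w^T phi(x_n) t_n > 0 form the misclassified set M.
        misclassified = margins <= 0
        # E_P(w) = - sum over n in M of w^T phi_n t_n
        return -np.sum(margins[misclassified])

Since every summand over M is non-positive before the sign flip, E_P(w) ≥ 0, and E_P(w) = 0 exactly when no pattern is misclassified.
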
Perceptron - Stochastic Gradient Descent

    Perceptron Criterion (with notation φ_n = φ(x_n))

                        E_P(w) = − Σ_{n ∈ M} w^T φ_n t_n

    One iteration at step τ:
      1   Choose a misclassified training pair (x_n, t_n).
      2   Update the weight vector w by

                        w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ_n t_n

    where the gradient is taken of the chosen pattern's contribution
    −w^T φ_n t_n to E_P(w).
    As y(x, w) does not depend on the norm of w, one can set η = 1:

                        w^(τ+1) = w^(τ) + φ_n t_n

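The full training loop is then a few lines. This is a sketch under the assumption that one sweeps repeatedly over the data, updating on each misclassified pattern, until a full sweep makes no update; the sweep schedule and the max_sweeps cap are assumptions, while the update rule is the one above.

    import numpy as np

    def train_perceptron(Phi, t, max_sweeps=1000):
        # Phi: (N, D) feature matrix with bias column; t: (N,) targets in {-1, +1}.
        w = np.zeros(Phi.shape[1])
        for _ in range(max_sweeps):
            updated = False
            for phi_n, t_n in zip(Phi, t):
                if (w @ phi_n) * t_n <= 0:   # misclassified pattern
                    w = w + phi_n * t_n      # w^(tau+1) = w^(tau) + phi_n t_n, eta = 1
                    updated = True
            if not updated:                  # separating w found; E_P(w) = 0
                break
        return w
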
The Perceptron Algorithm - Update 1

    Update of the perceptron weights from a misclassified pattern
    (shown in green in the figure):

                        w^(τ+1) = w^(τ) + φ_n t_n

    [Figure: two panels with axes from −1 to 1 in both dimensions,
    showing the decision boundary before and after the update.]

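To make the picture concrete in numbers, here is one update on an assumed toy pattern (the values are illustrative, not taken from the figure): a pattern with t_n = +1 sits on the wrong side of the boundary, and a single step with η = 1 moves it to the correct side.

    import numpy as np

    w = np.array([1.0, -0.5])                # current weights (bias omitted for brevity)
    phi_n, t_n = np.array([0.3, 0.8]), 1.0   # misclassified: w^T phi_n t_n = -0.1 < 0

    w_new = w + phi_n * t_n                  # the update from the slide
    print(w_new @ phi_n * t_n)               # 0.63 > 0: now correctly classified

Note that a single step is not guaranteed to correct the chosen pattern; it succeeds here because ‖φ_n‖² = 0.73 exceeds the size of the violated margin, 0.1.
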
The Perceptron Algorithm - Update 2

    Update of the perceptron weights from a further misclassified
    pattern (shown in green in the figure):

                        w^(τ+1) = w^(τ) + φ_n t_n

    [Figure: two panels with axes from −1 to 1 in both dimensions,
    showing the decision boundary before and after the second update.]

The Perceptron Algorithm - Convergence

    Does the algorithm converge?
    For a single update step

      −(w^(τ+1))^T φ_n t_n = −(w^(τ))^T φ_n t_n − (φ_n t_n)^T (φ_n t_n)
                           < −(w^(τ))^T φ_n t_n

    because (φ_n t_n)^T (φ_n t_n) = ‖φ_n t_n‖² > 0 (strictly positive,
    since the bias element φ_0(x) = 1 guarantees φ_n ≠ 0).
    BUT: contributions to the error from the other misclassified
    patterns might have increased.
    AND: some correctly classified patterns might now be
    misclassified.
    Perceptron Convergence Theorem: If the training set is
    linearly separable, the perceptron algorithm is guaranteed
    to find a solution in a finite number of steps.
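
The theorem can be checked empirically. The sketch below generates toy data that is linearly separable by construction (all names and the data-generation scheme are assumptions) and counts the updates until every pattern satisfies w^T φ_n t_n > 0, which the theorem guarantees happens after finitely many steps.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(100, 2))
    Phi = np.hstack([np.ones((100, 1)), X])      # bias feature phi_0 = 1
    w_true = np.array([0.1, 1.0, -1.0])
    t = np.where(Phi @ w_true >= 0, 1.0, -1.0)   # separable by construction

    w, updates = np.zeros(3), 0
    while np.any((Phi @ w) * t <= 0):            # some pattern still misclassified
        n = np.flatnonzero((Phi @ w) * t <= 0)[0]
        w = w + Phi[n] * t[n]                    # perceptron update, eta = 1
        updates += 1
    print(updates, "updates until the training set is separated")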
