Introduction to Statistical Machine Learning

Christfried Webers

Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University

Canberra
February – June 2011

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part VIII

Linear Classification 2

    Probabilistic Generative Models
    Continuous Input
    Discrete Features
    Probabilistic Discriminative Models
    Logistic Regression
    Iterative Reweighted Least Squares
    Laplace Approximation
    Bayesian Logistic Regression
Three Models for Decision Problems

In increasing order of complexity:

    Discriminant Functions: find a discriminant function f(x) which maps each
    input directly onto a class label.

    Discriminative Models
      1. Solve the inference problem of determining the posterior class
         probabilities p(C_k | x).
      2. Use decision theory to assign each new x to one of the classes.

    Generative Models
      1. Solve the inference problem of determining the class-conditional
         probabilities p(x | C_k).
      2. Also, infer the prior class probabilities p(C_k).
      3. Use Bayes' theorem to find the posterior p(C_k | x).
      4. Alternatively, model the joint distribution p(x, C_k) directly.
      5. Use decision theory to assign each new x to one of the classes.
Probabilistic Generative Models

    Generative approach: model the class-conditional densities p(x | C_k) and
    priors p(C_k) to calculate the posterior probability for class C_1

$$ p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a(x))} = \sigma(a(x)) $$

    where a and the logistic sigmoid function σ(a) are given by

$$ a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = \ln \frac{p(x, C_1)}{p(x, C_2)}, \qquad \sigma(a) = \frac{1}{1 + \exp(-a)}. $$
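To make this concrete, here is a minimal sketch (not from the slides) of computing p(C_1 | x) as the sigmoid of the log-odds a(x), assuming one-dimensional Gaussian class-conditionals and made-up priors:

```python
import numpy as np
from scipy.stats import norm

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior_c1(x, prior1=0.4, prior2=0.6):
    # log-odds a(x) = ln[p(x|C1) p(C1)] - ln[p(x|C2) p(C2)];
    # the 1-d Gaussians N(0, 1) and N(2, 1) are illustrative assumptions
    a = (norm.logpdf(x, loc=0.0, scale=1.0) + np.log(prior1)
         - (norm.logpdf(x, loc=2.0, scale=1.0) + np.log(prior2)))
    return sigmoid(a)

print(posterior_c1(0.0))   # well above 0.5: x is near the class C1 mean
print(posterior_c1(2.0))   # well below 0.5: x is near the class C2 mean
```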
Logistic Sigmoid

    The logistic sigmoid function $\sigma(a) = \frac{1}{1 + \exp(-a)}$
    "Squashing function" because it maps the real axis into the finite
    interval (0, 1)
    $\sigma(-a) = 1 - \sigma(a)$
    Derivative: $\frac{d\sigma}{da} = \sigma(a)\,\sigma(-a) = \sigma(a)\,(1 - \sigma(a))$
    Inverse is called the logit function: $a(\sigma) = \ln \frac{\sigma}{1 - \sigma}$

[Figure: the logistic sigmoid σ(a) (left) and the logit a(σ) (right).]
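These identities are easy to verify numerically; a small sketch, assuming NumPy:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    # inverse of the sigmoid: a(sigma) = ln(sigma / (1 - sigma))
    return np.log(s / (1.0 - s))

a = 1.3
# symmetry: sigma(-a) = 1 - sigma(a)
assert np.isclose(sigmoid(-a), 1.0 - sigmoid(a))
# derivative: d sigma/da = sigma(a)(1 - sigma(a)), checked by finite differences
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.isclose(numeric, sigmoid(a) * (1.0 - sigmoid(a)))
# the logit inverts the sigmoid
assert np.isclose(logit(sigmoid(a)), a)
```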
Probabilistic Generative Models - Multiclass

    The normalised exponential is given by

$$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$

    where
$$ a_k = \ln \big( p(x \mid C_k)\, p(C_k) \big). $$

    Also called the softmax function, as it is a smoothed version of the max
    function.
    Example: if $a_k \gg a_j$ for all $j \neq k$, then $p(C_k \mid x) \approx 1$
    and $p(C_j \mid x) \approx 0$.
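A minimal softmax sketch; subtracting max(a) before exponentiating is a standard numerical-stability trick (an implementation detail, not from the slides), and it does not change the result:

```python
import numpy as np

def softmax(a):
    """Normalised exponential over the activations a_k."""
    e = np.exp(a - np.max(a))   # shift for numerical stability
    return e / e.sum()

a = np.array([10.0, 1.0, 0.5])
p = softmax(a)
print(p)          # first entry is close to 1, the others close to 0
print(p.sum())    # 1.0
```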
Probabil. Generative Model - Continuous Input

    Assume the class-conditional probabilities are Gaussian and all classes
    share the same covariance. What can we say about the posterior
    probabilities?

$$ \begin{aligned} p(x \mid C_k) &= \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\} \\ &= \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} x^T \Sigma^{-1} x \right\} \exp\left\{ \mu_k^T \Sigma^{-1} x - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \right\} \end{aligned} $$

    where we have separated the quadratic term in x from the linear term.
Probabil. Generative Model - Continuous Input

    For two classes
$$ p(C_1 \mid x) = \sigma(a(x)) $$
    and a(x) is

$$ a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = \ln \frac{\exp\left\{ \mu_1^T \Sigma^{-1} x - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 \right\}}{\exp\left\{ \mu_2^T \Sigma^{-1} x - \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 \right\}} + \ln \frac{p(C_1)}{p(C_2)} $$

    Therefore
$$ p(C_1 \mid x) = \sigma(w^T x + w_0) $$
    where
$$ w = \Sigma^{-1} (\mu_1 - \mu_2), \qquad w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}. $$
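A short sketch of these formulas, with made-up parameters:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def linear_discriminant(mu1, mu2, Sigma, prior1, prior2):
    """w and w0 for p(C1 | x) = sigmoid(w^T x + w0), shared covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return w, w0

# illustrative parameters, not from the slides
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, w0 = linear_discriminant(mu1, mu2, Sigma, prior1=0.5, prior2=0.5)
x = np.array([0.5, 0.2])
print(sigmoid(w @ x + w0))   # posterior p(C1 | x)
```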
Probabil. Generative Model - Continuous Input

[Figure: class-conditional densities for two classes (left); posterior
probability p(C_1 | x) (right). Note that the posterior is a logistic sigmoid
of a linear function of x.]
General Case - K Classes, Shared Covariance

    Use the normalised exponential

$$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$

    where
$$ a_k = \ln \big( p(x \mid C_k)\, p(C_k) \big) $$

    to get a linear function of x
$$ a_k(x) = w_k^T x + w_{k0} $$
    where
$$ w_k = \Sigma^{-1} \mu_k, \qquad w_{k0} = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k). $$
General Case - K Classes, Different Covariance

    If each class-conditional probability has a different covariance, the
    quadratic terms $-\frac{1}{2} x^T \Sigma_k^{-1} x$ no longer cancel out.
    We get a quadratic discriminant.

[Figure: example decision boundaries arising from class-specific covariances.]
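For illustration, a sketch of the resulting quadratic discriminant $a_k(x) = \ln p(x \mid C_k) + \ln p(C_k)$ for Gaussians with class-specific covariances (parameters made up; the class with the largest $a_k(x)$ is chosen):

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """a_k(x) = ln p(x | C_k) + ln p(C_k) for a Gaussian class-conditional;
    with a class-specific Sigma this is quadratic in x."""
    D = len(mu)
    diff = x - mu
    log_gauss = (-0.5 * D * np.log(2.0 * np.pi)
                 - 0.5 * np.linalg.slogdet(Sigma)[1]
                 - 0.5 * diff @ np.linalg.inv(Sigma) @ diff)
    return log_gauss + np.log(prior)

# made-up two-class example with different covariances
x = np.array([0.5, -0.2])
a1 = quadratic_discriminant(x, np.zeros(2), np.eye(2), 0.5)
a2 = quadratic_discriminant(x, np.ones(2), 2.0 * np.eye(2), 0.5)
print("assign to C1" if a1 > a2 else "assign to C2")
```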
Maximum Likelihood Solution

    Given the functional form of the class-conditional densities p(x | C_k),
    can we determine the parameters µ and Σ?
    Not without data ;-)
    Given also a data set (x_n, t_n) for n = 1, ..., N. (Using the coding
    scheme where t_n = 1 corresponds to class C_1 and t_n = 0 denotes
    class C_2.)
    Assume the class-conditional densities to be Gaussian with the same
    covariance, but different means.
    Denote the prior probability p(C_1) = π, and therefore p(C_2) = 1 − π.
    Then

$$ p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) $$
$$ p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) $$
Maximum Likelihood Solution

    Thus the likelihood for the whole data set X and t is given by

$$ p(\mathbf{t}, X \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[ \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) \big]^{t_n} \big[ (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) \big]^{1 - t_n} $$

    Maximise the log likelihood.
    The term depending on π is

$$ \sum_{n=1}^{N} \big( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big) $$

    which is maximal for

$$ \pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} $$

    where N_1 is the number of data points in class C_1 (and N_2 in class C_2).
Maximum Likelihood Solution

    Similarly, we can maximise the log likelihood (and thereby the likelihood
    p(t, X | π, µ_1, µ_2, Σ)) with respect to the means µ_1 and µ_2, and get

$$ \mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n\, x_n, \qquad \mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n $$

    For each class, these are the means of all input vectors assigned to that
    class.
Maximum Likelihood Solution

    Finally, the log likelihood ln p(t, X | π, µ_1, µ_2, Σ) can be maximised
    with respect to the covariance Σ, resulting in

$$ \Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T $$
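Collecting the maximum likelihood formulas for π, µ_1, µ_2 and Σ into one sketch (NumPy assumed; the function and variable names are mine):

```python
import numpy as np

def fit_generative_gaussian(X, t):
    """ML estimates for the two-class shared-covariance model.
    X: (N, D) inputs; t: (N,) labels with 1 -> C1, 0 -> C2."""
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N                                   # prior p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1       # mean of class C1 points
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2 # mean of class C2 points
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2         # weighted average
    return pi, mu1, mu2, Sigma
```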
Discrete Features - Naive Bayes

    Assume the input space consists of discrete features, in the simplest
    case x_i ∈ {0, 1}.
    For a D-dimensional input space, a general distribution would be
    represented by a table with 2^D entries.
    Together with the normalisation constraint, these are 2^D − 1 independent
    variables.
    Grows exponentially with the number of features.
    The Naive Bayes assumption is that all features conditioned on the
    class C_k are independent of each other:

$$ p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i} $$
Discrete Features - Naive Bayes

    With the naive Bayes assumption

$$ p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i} $$

    we can then again find the factors a_k in the normalised exponential

$$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$

    as a linear function of the x_i:

$$ a_k(x) = \sum_{i=1}^{D} \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right\} + \ln p(C_k). $$
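A sketch combining the naive Bayes factors a_k with the normalised exponential (all parameters made up for illustration):

```python
import numpy as np

def naive_bayes_posterior(x, mu, priors):
    """Posterior p(C_k | x) for binary features under naive Bayes.
    x: (D,) vector of 0/1 features; mu: (K, D) Bernoulli parameters mu_ki;
    priors: (K,) class priors p(C_k)."""
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
    e = np.exp(a - a.max())          # softmax over the a_k
    return e / e.sum()

# toy example with D = 3 features and K = 2 classes
mu = np.array([[0.9, 0.8, 0.1],
               [0.2, 0.3, 0.7]])
print(naive_bayes_posterior(np.array([1, 1, 0]), mu, np.array([0.5, 0.5])))
```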
Probabilistic Discriminative Models

    Maximise a likelihood function defined through the conditional
    distribution p(C_k | x) directly.
    Discriminative training.
    Typically fewer parameters to be determined.
    As we learn the posterior p(C_k | x) directly, prediction may be better
    than with a generative model whose class-conditional density assumptions
    p(x | C_k) poorly approximate the true distributions.
    But: discriminative models cannot create synthetic data, as p(x) is not
    modelled.
Original Input versus Feature Space

    We have used the direct input x until now.
    All classification algorithms also work if we first apply a fixed
    nonlinear transformation of the inputs using a vector of basis
    functions φ(x).
    Example: use two Gaussian basis functions centered at the green crosses
    in the input space, as sketched in the code below.

[Figure: data in the original input space (x_1, x_2) (left) and in the feature
space (φ_1, φ_2) (right), with the two basis function centers marked by green
crosses.]
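A sketch of such a feature map; the centers and the width s are illustrative choices, not taken from the figure:

```python
import numpy as np

def gaussian_basis(x, centers, s=0.5):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 s^2)) for each center c_j."""
    x = np.atleast_2d(x)
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * s ** 2))

centers = np.array([[-0.5, 0.0], [0.5, 0.0]])   # the two "green crosses"
X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(gaussian_basis(X, centers))   # each row is (phi_1(x), phi_2(x))
```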
Original Input versus Feature Space

    Linear decision boundaries in the feature space correspond to nonlinear
    decision boundaries in the input space.
    Classes which are NOT linearly separable in the input space can become
    linearly separable in the feature space.
    BUT: if classes overlap in the input space, they will also overlap in
    the feature space.
    Nonlinear features φ(x) cannot remove the overlap; but they may
    increase it!

[Figure: the same data in the input space (x_1, x_2) (left) and in the
feature space (φ_1, φ_2) (right).]
Original Input versus Feature Space

    Fixed basis functions do not adapt to the data and therefore have
    important limitations (see the discussion in Linear Regression).
    Understanding of more advanced algorithms becomes easier if we introduce
    the feature space now and use it instead of the original input space.
    Some applications use fixed features successfully by avoiding the
    limitations.
    We will therefore use φ instead of x from now on.
Logistic Regression is Classification

    Two classes, where the posterior of class C_1 is a logistic sigmoid σ()
    acting on a linear function of the feature vector φ:

$$ p(C_1 \mid \phi) = y(\phi) = \sigma(w^T \phi), \qquad p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi) $$

    The model dimension is equal to the dimension M of the feature space.
    Compare this to fitting two Gaussians:

$$ \underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}} = M(M+5)/2 $$

    For larger M, the logistic regression model has a clear advantage.
Logistic Regression is Classification

    Determine the parameters via maximum likelihood for data (φ_n, t_n),
    n = 1, ..., N, where φ_n = φ(x_n). The class membership is coded as
    t_n ∈ {0, 1}.
    Likelihood function

$$ p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} $$

    where y_n = p(C_1 | φ_n).
    Error function: the negative log likelihood, resulting in the
    cross-entropy error function

$$ E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} $$
Logistic Regression is Classification

    Error function (cross-entropy error)

$$ E(w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}, \qquad y_n = p(C_1 \mid \phi_n) = \sigma(w^T \phi_n) $$

    Gradient of the error function (using $\frac{d\sigma}{da} = \sigma(1 - \sigma)$):

$$ \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n $$

    The gradient does not contain any sigmoid function.
    For each data point, the contribution to the gradient is the product of
    the "error" y_n − t_n and the basis function vector φ_n.
    BUT: the maximum likelihood solution can exhibit over-fitting, even for
    many data points; one should then use a regularised error or MAP.
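Since Iterative Reweighted Least Squares is only introduced later in this part, here is a plain gradient-descent sketch that uses exactly the gradient above (toy data; the learning rate and iteration count are made-up choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, lr=0.1, iters=1000):
    """Minimise the cross-entropy error by plain gradient descent.
    Phi: (N, M) design matrix of features; t: (N,) labels in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)      # sum_n (y_n - t_n) phi_n
        w -= lr * grad / len(t)
    return w

# toy data: 1-d input plus a bias feature
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
Phi = np.column_stack([np.ones(100), x])
t = np.concatenate([np.zeros(50), np.ones(50)])
print(fit_logistic(Phi, t))
```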
Laplace Approximation

    Given a continuous distribution p(x) which is not Gaussian, can we
    approximate it by a Gaussian q(x)?
    Need to find a mode of p(x). Try to find a Gaussian with the same mode.

[Figure: a non-Gaussian distribution (yellow) and its Gaussian approximation
(red) (left); the negative logs of the non-Gaussian (yellow) and the Gaussian
approximation (red) (right).]
Laplace Approximation

    Assume p(z) can be written as

$$ p(z) = \frac{1}{Z} f(z) $$

    with normalisation $Z = \int f(z)\, dz$.
    Furthermore, assume Z is unknown!
    A mode of p(z) is at a point $z_0$ where $p'(z_0) = 0$.
    Taylor expansion of ln f(z) at $z_0$:

$$ \ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 $$

    where
$$ A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0} $$
Laplace Approximation

    Exponentiating

$$ \ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 $$

    we get
$$ f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. $$

    And after normalisation we get the Laplace approximation

$$ q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. $$

    Only defined for precision A > 0, as only then does p(z) have a maximum
    at $z_0$.
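A numerical sketch of the 1-d Laplace approximation, assuming an unnormalised density of the form f(z) = σ(20z + 4) exp(−z²/2) (a skewed, non-Gaussian shape like the one plotted above); the mode is found by grid search and A by finite differences:

```python
import numpy as np

def log_f(z):
    # ln f(z) = ln sigmoid(20z + 4) - z^2/2, using the stable
    # identity ln sigmoid(a) = -logaddexp(0, -a)
    return -np.logaddexp(0.0, -(20 * z + 4)) - 0.5 * z ** 2

# mode by coarse grid search (a simple sketch; any 1-d optimiser would do)
zs = np.linspace(-2, 4, 200001)
z0 = zs[np.argmax(log_f(zs))]

# A = -d^2/dz^2 ln f(z) at z0, by central finite differences
eps = 1e-4
A = -(log_f(z0 + eps) - 2 * log_f(z0) + log_f(z0 - eps)) / eps ** 2

def q(z):
    # the Laplace approximation N(z | z0, 1/A)
    return np.sqrt(A / (2 * np.pi)) * np.exp(-0.5 * A * (z - z0) ** 2)

print(z0, A, q(z0))
```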
Laplace Approximation - Vector Space

Approximate p(z) for z ∈ R^M, where

$$p(z) = \frac{1}{Z} f(z).$$

A Taylor expansion of ln f(z) around a mode z_0 gives

$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2}(z - z_0)^T A (z - z_0)$$

where the Hessian A is defined as

$$A = -\nabla\nabla \ln f(z)\,\big|_{z=z_0}.$$

The Laplace approximation of p(z) is then

$$q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\Bigl\{-\frac{1}{2}(z - z_0)^T A (z - z_0)\Bigr\} = \mathcal{N}(z \mid z_0, A^{-1}).$$
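The same construction in R^M, as a rough sketch: find a mode with a generic optimiser and estimate the Hessian by central finite differences. (An analytic Hessian, as in the logistic-regression case below, is preferable when available; laplace_approx is a hypothetical helper, assuming NumPy and SciPy.)

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approx(log_f, z_init, h=1e-4):
    """Return (z0, A) such that q(z) = N(z | z0, inv(A)); A is -Hessian of log_f at z0."""
    z0 = minimize(lambda z: -log_f(z), z_init, method="BFGS").x
    M = z0.size
    I = np.eye(M)
    A = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            # Central finite difference for d^2 log_f / dz_i dz_j, negated.
            A[i, j] = -(log_f(z0 + h * I[i] + h * I[j]) - log_f(z0 + h * I[i] - h * I[j])
                        - log_f(z0 - h * I[i] + h * I[j]) + log_f(z0 - h * I[i] - h * I[j])) / (4 * h**2)
    return z0, A

# Sanity check: for a Gaussian log f(z) = -z^T P z / 2 the approximation is exact, so A = P.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
z0, A = laplace_approx(lambda z: -0.5 * z @ P @ z, np.zeros(2))
```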
Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable.

Why? We would need to normalise the product of the prior and the likelihood, and the likelihood is itself a product of logistic sigmoid functions, one for each data point.

Evaluation of the predictive distribution is also intractable.

Therefore we will use the Laplace approximation.
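Spelled out with the notation of the following slides (Gaussian prior N(w | m_0, S_0), targets t_n ∈ {0, 1}, and y_n = σ(w^T φ_n)), the normalising constant with no closed form is

$$p(t) = \int \mathcal{N}(w \mid m_0, S_0) \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} \, dw.$$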
Bayesian Logistic Regression

Assume a Gaussian prior, because we want a Gaussian posterior:

$$p(w) = \mathcal{N}(w \mid m_0, S_0)$$

for fixed hyperparameters m_0 and S_0.

Hyperparameters are parameters of a prior distribution. In contrast to the model parameters w, they are not learned.

For a set of training data (x_n, t_n), where n = 1, . . . , N, the posterior is given by

$$p(w \mid t) \propto p(w)\, p(t \mid w)$$

where t = (t_1, . . . , t_N)^T.
Bayesian Logistic Regression

Using our previous result for the cross-entropy error function

$$E(w) = -\ln p(t \mid w) = -\sum_{n=1}^{N} \bigl\{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\bigr\}$$

we can now calculate the log of the posterior

$$p(w \mid t) \propto p(w)\, p(t \mid w),$$

using the notation y_n = σ(w^T φ_n), as

$$\ln p(w \mid t) = -\frac{1}{2}(w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \bigl\{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\bigr\} + \text{const}.$$
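A direct transcription into code, as a sketch (hypothetical names; assumes NumPy, with Phi the N × M matrix whose rows are the feature vectors φ_n):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_posterior(w, Phi, t, m0, S0_inv):
    """ln p(w | t) up to an additive constant."""
    y = sigmoid(Phi @ w)
    log_prior = -0.5 * (w - m0) @ S0_inv @ (w - m0)
    log_lik = np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
    return log_prior + log_lik

def grad_log_posterior(w, Phi, t, m0, S0_inv):
    """Gradient -S0^{-1}(w - m0) + sum_n (t_n - y_n) phi_n, used to find w_MAP."""
    y = sigmoid(Phi @ w)
    return -S0_inv @ (w - m0) + Phi.T @ (t - y)
```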
Bayesian Logistic Regression

To obtain a Gaussian approximation to

$$\ln p(w \mid t) = -\frac{1}{2}(w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \bigl\{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\bigr\}$$

1  Find w_MAP, which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: there is no closed-form solution, because y_n = σ(w^T φ_n) makes ln p(w | t) nonlinear in w, so the maximisation must be done iteratively.)

2  Calculate the second derivative of the negative log posterior to get the inverse covariance of the Laplace approximation,

$$S_N^{-1} = -\nabla\nabla \ln p(w \mid t) = S_0^{-1} + \sum_{n=1}^{N} y_n(1 - y_n)\, \phi_n \phi_n^T.$$
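Both steps in code, as a sketch reusing the helpers above. Newton's method is a natural choice here because the Hessian it needs is exactly the S_N^{-1} of step 2 (this mirrors the iterative reweighted least squares update, with the prior contributing the S_0^{-1} term):

```python
def fit_laplace(Phi, t, m0, S0_inv, n_iter=25):
    """Newton's method for w_MAP; returns (w_MAP, S_N) of q(w) = N(w | w_MAP, S_N)."""
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = -S0_inv @ (w - m0) + Phi.T @ (t - y)                  # gradient of ln p(w | t)
        SN_inv = S0_inv + (Phi * (y * (1.0 - y))[:, None]).T @ Phi   # S0^{-1} + sum_n y_n(1-y_n) phi_n phi_n^T
        w = w + np.linalg.solve(SN_inv, grad)                        # Newton ascent step
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + (Phi * (y * (1.0 - y))[:, None]).T @ Phi
    return w, np.linalg.inv(SN_inv)
```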
Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) of the posterior distribution is then

$$q(w \mid \phi) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N)$$

where

$$S_N^{-1} = -\nabla\nabla \ln p(w \mid t) = S_0^{-1} + \sum_{n=1}^{N} y_n(1 - y_n)\, \phi_n \phi_n^T.$$
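Putting it together on toy data (all names hypothetical; fit_laplace from the previous sketch). Samples drawn from q(w) are what an approximate predictive distribution would average over:

```python
rng = np.random.default_rng(0)

# Toy 1-D problem with a bias feature, phi(x) = (1, x); labels from a noisy threshold.
x = rng.normal(size=100)
t = (x + 0.3 * rng.normal(size=100) > 0.0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])

m0 = np.zeros(2)
S0_inv = np.eye(2)  # prior p(w) = N(w | 0, I)

w_map, S_N = fit_laplace(Phi, t, m0, S0_inv)
w_samples = rng.multivariate_normal(w_map, S_N, size=1000)  # draws from q(w)
print("w_MAP =", w_map)
```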
