Introduction to Statistical Machine Learning

Christfried Webers

Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University

Canberra
February – June 2011

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part VIII

Linear Classification 2

    Probabilistic Generative Models
    Continuous Input
    Discrete Features
    Probabilistic Discriminative Models
    Logistic Regression
    Iterative Reweighted Least Squares
    Laplace Approximation
    Bayesian Logistic Regression
Three Models for Decision Problems

In increasing order of complexity:

    Discriminant Functions: find a discriminant function f(x) which maps each
    input directly onto a class label.

    Discriminative Models
      1. Solve the inference problem of determining the posterior class
         probabilities p(C_k | x).
      2. Use decision theory to assign each new x to one of the classes.

    Generative Models
      1. Solve the inference problem of determining the class-conditional
         probabilities p(x | C_k).
      2. Also, infer the prior class probabilities p(C_k).
      3. Use Bayes' theorem to find the posterior p(C_k | x).
      4. Alternatively, model the joint distribution p(x, C_k) directly.
      5. Use decision theory to assign each new x to one of the classes.
Probabilistic Generative Models

    Generative approach: model the class-conditional densities p(x | C_k) and
    priors p(C_k) to calculate the posterior probability for class C_1

$$ p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a(x))} = \sigma(a(x)) $$

    where a and the logistic sigmoid function σ(a) are given by

$$ a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = \ln \frac{p(x, C_1)}{p(x, C_2)}, \qquad \sigma(a) = \frac{1}{1 + \exp(-a)}. $$
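To make this concrete, here is a minimal sketch (not from the slides) of computing p(C_1 | x) as the sigmoid of the log-odds a(x), assuming one-dimensional Gaussian class-conditionals and made-up priors:

```python
import numpy as np
from scipy.stats import norm

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior_c1(x, prior1=0.4, prior2=0.6):
    # log-odds a(x) = ln[p(x|C1) p(C1)] - ln[p(x|C2) p(C2)];
    # the 1-d Gaussians N(0, 1) and N(2, 1) are illustrative assumptions
    a = (norm.logpdf(x, loc=0.0, scale=1.0) + np.log(prior1)
         - (norm.logpdf(x, loc=2.0, scale=1.0) + np.log(prior2)))
    return sigmoid(a)

print(posterior_c1(0.0))   # well above 0.5: x is near the class C1 mean
print(posterior_c1(2.0))   # well below 0.5: x is near the class C2 mean
```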
Logistic Sigmoid

    The logistic sigmoid function $\sigma(a) = \frac{1}{1 + \exp(-a)}$
    "Squashing function" because it maps the real axis into the finite
    interval (0, 1)
    $\sigma(-a) = 1 - \sigma(a)$
    Derivative: $\frac{d\sigma}{da} = \sigma(a)\,\sigma(-a) = \sigma(a)\,(1 - \sigma(a))$
    Inverse is called the logit function: $a(\sigma) = \ln \frac{\sigma}{1 - \sigma}$

[Figure: the logistic sigmoid σ(a) (left) and the logit a(σ) (right).]
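These identities are easy to verify numerically; a small sketch, assuming NumPy:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    # inverse of the sigmoid: a(sigma) = ln(sigma / (1 - sigma))
    return np.log(s / (1.0 - s))

a = 1.3
# symmetry: sigma(-a) = 1 - sigma(a)
assert np.isclose(sigmoid(-a), 1.0 - sigmoid(a))
# derivative: d sigma/da = sigma(a)(1 - sigma(a)), checked by finite differences
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.isclose(numeric, sigmoid(a) * (1.0 - sigmoid(a)))
# the logit inverts the sigmoid
assert np.isclose(logit(sigmoid(a)), a)
```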
Probabilistic Generative Models - Multiclass

    The normalised exponential is given by

$$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$

    where
$$ a_k = \ln \big( p(x \mid C_k)\, p(C_k) \big). $$

    Also called the softmax function, as it is a smoothed version of the max
    function.
    Example: if $a_k \gg a_j$ for all $j \neq k$, then $p(C_k \mid x) \approx 1$
    and $p(C_j \mid x) \approx 0$.
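A minimal softmax sketch; subtracting max(a) before exponentiating is a standard numerical-stability trick (an implementation detail, not from the slides), and it does not change the result:

```python
import numpy as np

def softmax(a):
    """Normalised exponential over the activations a_k."""
    e = np.exp(a - np.max(a))   # shift for numerical stability
    return e / e.sum()

a = np.array([10.0, 1.0, 0.5])
p = softmax(a)
print(p)          # first entry is close to 1, the others close to 0
print(p.sum())    # 1.0
```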
Probabil. Generative Model - Continuous Input

    Assume the class-conditional probabilities are Gaussian and all classes
    share the same covariance. What can we say about the posterior
    probabilities?

$$ \begin{aligned} p(x \mid C_k) &= \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\} \\ &= \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} x^T \Sigma^{-1} x \right\} \exp\left\{ \mu_k^T \Sigma^{-1} x - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \right\} \end{aligned} $$

    where we have separated the quadratic term in x from the linear term.
Probabil. Generative Model - Continuous Input

    For two classes
$$ p(C_1 \mid x) = \sigma(a(x)) $$
    and a(x) is

$$ a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = \ln \frac{\exp\left\{ \mu_1^T \Sigma^{-1} x - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 \right\}}{\exp\left\{ \mu_2^T \Sigma^{-1} x - \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 \right\}} + \ln \frac{p(C_1)}{p(C_2)} $$

    Therefore
$$ p(C_1 \mid x) = \sigma(w^T x + w_0) $$
    where
$$ w = \Sigma^{-1} (\mu_1 - \mu_2), \qquad w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}. $$
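A short sketch of these formulas, with made-up parameters:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def linear_discriminant(mu1, mu2, Sigma, prior1, prior2):
    """w and w0 for p(C1 | x) = sigmoid(w^T x + w0), shared covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return w, w0

# illustrative parameters, not from the slides
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, w0 = linear_discriminant(mu1, mu2, Sigma, prior1=0.5, prior2=0.5)
x = np.array([0.5, 0.2])
print(sigmoid(w @ x + w0))   # posterior p(C1 | x)
```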
Probabil. Generative Model - Continuous Input

[Figure: class-conditional densities for two classes (left); posterior
probability p(C_1 | x) (right). Note that the posterior is a logistic sigmoid
of a linear function of x.]
General Case - K Classes, Shared Covariance

    Use the normalised exponential

$$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$

    where
$$ a_k = \ln \big( p(x \mid C_k)\, p(C_k) \big) $$

    to get a linear function of x
$$ a_k(x) = w_k^T x + w_{k0} $$
    where
$$ w_k = \Sigma^{-1} \mu_k, \qquad w_{k0} = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k). $$
General Case - K Classes, Different Covariance

    If each class-conditional probability has a different covariance, the
    quadratic terms $-\frac{1}{2} x^T \Sigma_k^{-1} x$ no longer cancel out.
    We get a quadratic discriminant.

[Figure: example decision boundaries arising from class-specific covariances.]
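For illustration, a sketch of the resulting quadratic discriminant $a_k(x) = \ln p(x \mid C_k) + \ln p(C_k)$ for Gaussians with class-specific covariances (parameters made up; the class with the largest $a_k(x)$ is chosen):

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """a_k(x) = ln p(x | C_k) + ln p(C_k) for a Gaussian class-conditional;
    with a class-specific Sigma this is quadratic in x."""
    D = len(mu)
    diff = x - mu
    log_gauss = (-0.5 * D * np.log(2.0 * np.pi)
                 - 0.5 * np.linalg.slogdet(Sigma)[1]
                 - 0.5 * diff @ np.linalg.inv(Sigma) @ diff)
    return log_gauss + np.log(prior)

# made-up two-class example with different covariances
x = np.array([0.5, -0.2])
a1 = quadratic_discriminant(x, np.zeros(2), np.eye(2), 0.5)
a2 = quadratic_discriminant(x, np.ones(2), 2.0 * np.eye(2), 0.5)
print("assign to C1" if a1 > a2 else "assign to C2")
```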
Maximum Likelihood Solution

    Given the functional form of the class-conditional densities p(x | C_k),
    can we determine the parameters µ and Σ?
    Not without data ;-)
    Given also a data set (x_n, t_n) for n = 1, ..., N. (Using the coding
    scheme where t_n = 1 corresponds to class C_1 and t_n = 0 denotes
    class C_2.)
    Assume the class-conditional densities to be Gaussian with the same
    covariance, but different means.
    Denote the prior probability p(C_1) = π, and therefore p(C_2) = 1 − π.
    Then

$$ p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) $$
$$ p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) $$
Maximum Likelihood Solution

    Thus the likelihood for the whole data set X and t is given by

$$ p(\mathbf{t}, X \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[ \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) \big]^{t_n} \big[ (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) \big]^{1 - t_n} $$

    Maximise the log likelihood.
    The term depending on π is

$$ \sum_{n=1}^{N} \big( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big) $$

    which is maximal for

$$ \pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} $$

    where N_1 is the number of data points in class C_1 (and N_2 in class C_2).
Maximum Likelihood Solution

    Similarly, we can maximise the log likelihood (and thereby the likelihood
    p(t, X | π, µ_1, µ_2, Σ)) with respect to the means µ_1 and µ_2, and get

$$ \mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n\, x_n, \qquad \mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n $$

    For each class, these are the means of all input vectors assigned to that
    class.
Maximum Likelihood Solution

    Finally, the log likelihood ln p(t, X | π, µ_1, µ_2, Σ) can be maximised
    with respect to the covariance Σ, resulting in

$$ \Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T $$
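Collecting the maximum likelihood formulas for π, µ_1, µ_2 and Σ into one sketch (NumPy assumed; the function and variable names are mine):

```python
import numpy as np

def fit_generative_gaussian(X, t):
    """ML estimates for the two-class shared-covariance model.
    X: (N, D) inputs; t: (N,) labels with 1 -> C1, 0 -> C2."""
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N                                   # prior p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1       # mean of class C1 points
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2 # mean of class C2 points
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2         # weighted average
    return pi, mu1, mu2, Sigma
```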
Discrete Features - Naive Bayes

    Assume the input space consists of discrete features, in the simplest
    case x_i ∈ {0, 1}.
    For a D-dimensional input space, a general distribution would be
    represented by a table with 2^D entries.
    Together with the normalisation constraint, these are 2^D − 1 independent
    variables.
    Grows exponentially with the number of features.
    The Naive Bayes assumption is that all features conditioned on the
    class C_k are independent of each other:

$$ p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i} $$
Discrete Features - Naive Bayes

    With the naive Bayes assumption

$$ p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i} $$

    we can then again find the factors a_k in the normalised exponential

$$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$

    as a linear function of the x_i:

$$ a_k(x) = \sum_{i=1}^{D} \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right\} + \ln p(C_k). $$
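A sketch combining the naive Bayes factors a_k with the normalised exponential (all parameters made up for illustration):

```python
import numpy as np

def naive_bayes_posterior(x, mu, priors):
    """Posterior p(C_k | x) for binary features under naive Bayes.
    x: (D,) vector of 0/1 features; mu: (K, D) Bernoulli parameters mu_ki;
    priors: (K,) class priors p(C_k)."""
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
    e = np.exp(a - a.max())          # softmax over the a_k
    return e / e.sum()

# toy example with D = 3 features and K = 2 classes
mu = np.array([[0.9, 0.8, 0.1],
               [0.2, 0.3, 0.7]])
print(naive_bayes_posterior(np.array([1, 1, 0]), mu, np.array([0.5, 0.5])))
```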
Probabilistic Discriminative Models

    Maximise a likelihood function defined through the conditional
    distribution p(C_k | x) directly.
    Discriminative training.
    Typically fewer parameters to be determined.
    As we learn the posterior p(C_k | x) directly, prediction may be better
    than with a generative model whose class-conditional density assumptions
    p(x | C_k) poorly approximate the true distributions.
    But: discriminative models cannot create synthetic data, as p(x) is not
    modelled.
Original Input versus Feature Space

    We have used the direct input x until now.
    All classification algorithms also work if we first apply a fixed
    nonlinear transformation of the inputs using a vector of basis
    functions φ(x).
    Example: use two Gaussian basis functions centered at the green crosses
    in the input space, as sketched in the code below.

[Figure: data in the original input space (x_1, x_2) (left) and in the feature
space (φ_1, φ_2) (right), with the two basis function centers marked by green
crosses.]
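A sketch of such a feature map; the centers and the width s are illustrative choices, not taken from the figure:

```python
import numpy as np

def gaussian_basis(x, centers, s=0.5):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 s^2)) for each center c_j."""
    x = np.atleast_2d(x)
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * s ** 2))

centers = np.array([[-0.5, 0.0], [0.5, 0.0]])   # the two "green crosses"
X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(gaussian_basis(X, centers))   # each row is (phi_1(x), phi_2(x))
```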
Original Input versus Feature Space

    Linear decision boundaries in the feature space correspond to nonlinear
    decision boundaries in the input space.
    Classes which are NOT linearly separable in the input space can become
    linearly separable in the feature space.
    BUT: if classes overlap in the input space, they will also overlap in
    the feature space.
    Nonlinear features φ(x) cannot remove the overlap; but they may
    increase it!

[Figure: the same data in the input space (x_1, x_2) (left) and in the
feature space (φ_1, φ_2) (right).]
Original Input versus Feature Space

    Fixed basis functions do not adapt to the data and therefore have
    important limitations (see the discussion in Linear Regression).
    Understanding of more advanced algorithms becomes easier if we introduce
    the feature space now and use it instead of the original input space.
    Some applications use fixed features successfully by avoiding the
    limitations.
    We will therefore use φ instead of x from now on.
Logistic Regression is Classification

    Two classes, where the posterior of class C_1 is a logistic sigmoid σ()
    acting on a linear function of the feature vector φ:

$$ p(C_1 \mid \phi) = y(\phi) = \sigma(w^T \phi), \qquad p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi) $$

    The model dimension is equal to the dimension M of the feature space.
    Compare this to fitting two Gaussians:

$$ \underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}} = M(M+5)/2 $$

    For larger M, the logistic regression model has a clear advantage.
Logistic Regression is Classification

    Determine the parameters via maximum likelihood for data (φ_n, t_n),
    n = 1, ..., N, where φ_n = φ(x_n). The class membership is coded as
    t_n ∈ {0, 1}.
    Likelihood function

$$ p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} $$

    where y_n = p(C_1 | φ_n).
    Error function: the negative log likelihood, resulting in the
    cross-entropy error function

$$ E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} $$
Logistic Regression is Classification

    Error function (cross-entropy error)

$$ E(w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}, \qquad y_n = p(C_1 \mid \phi_n) = \sigma(w^T \phi_n) $$

    Gradient of the error function (using $\frac{d\sigma}{da} = \sigma(1 - \sigma)$):

$$ \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n $$

    The gradient does not contain any sigmoid function.
    For each data point, the contribution to the gradient is the product of
    the "error" y_n − t_n and the basis function vector φ_n.
    BUT: the maximum likelihood solution can exhibit over-fitting, even for
    many data points; one should then use a regularised error or MAP.
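Since Iterative Reweighted Least Squares is only introduced later in this part, here is a plain gradient-descent sketch that uses exactly the gradient above (toy data; the learning rate and iteration count are made-up choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, lr=0.1, iters=1000):
    """Minimise the cross-entropy error by plain gradient descent.
    Phi: (N, M) design matrix of features; t: (N,) labels in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)      # sum_n (y_n - t_n) phi_n
        w -= lr * grad / len(t)
    return w

# toy data: 1-d input plus a bias feature
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
Phi = np.column_stack([np.ones(100), x])
t = np.concatenate([np.zeros(50), np.ones(50)])
print(fit_logistic(Phi, t))
```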
Laplace Approximation

    Given a continuous distribution p(x) which is not Gaussian, can we
    approximate it by a Gaussian q(x)?
    Need to find a mode of p(x). Try to find a Gaussian with the same mode.

[Figure: a non-Gaussian distribution (yellow) and its Gaussian approximation
(red) (left); the negative logs of the non-Gaussian (yellow) and the Gaussian
approximation (red) (right).]
Laplace Approximation

    Assume p(z) can be written as

$$ p(z) = \frac{1}{Z} f(z) $$

    with normalisation $Z = \int f(z)\, dz$.
    Furthermore, assume Z is unknown!
    A mode of p(z) is at a point $z_0$ where $p'(z_0) = 0$.
    Taylor expansion of ln f(z) at $z_0$:

$$ \ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 $$

    where
$$ A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0} $$
Laplace Approximation

    Exponentiating

$$ \ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 $$

    we get
$$ f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. $$

    And after normalisation we get the Laplace approximation

$$ q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. $$

    Only defined for precision A > 0, as only then does p(z) have a maximum
    at $z_0$.
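A numerical sketch of the 1-d Laplace approximation, assuming an unnormalised density of the form f(z) = σ(20z + 4) exp(−z²/2) (a skewed, non-Gaussian shape like the one plotted above); the mode is found by grid search and A by finite differences:

```python
import numpy as np

def log_f(z):
    # ln f(z) = ln sigmoid(20z + 4) - z^2/2, using the stable
    # identity ln sigmoid(a) = -logaddexp(0, -a)
    return -np.logaddexp(0.0, -(20 * z + 4)) - 0.5 * z ** 2

# mode by coarse grid search (a simple sketch; any 1-d optimiser would do)
zs = np.linspace(-2, 4, 200001)
z0 = zs[np.argmax(log_f(zs))]

# A = -d^2/dz^2 ln f(z) at z0, by central finite differences
eps = 1e-4
A = -(log_f(z0 + eps) - 2 * log_f(z0) + log_f(z0 - eps)) / eps ** 2

def q(z):
    # the Laplace approximation N(z | z0, 1/A)
    return np.sqrt(A / (2 * np.pi)) * np.exp(-0.5 * A * (z - z0) ** 2)

print(z0, A, q(z0))
```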
Laplace Approximation - Vector Space

Approximate p(z) for z ∈ R^M, where

$$p(z) = \frac{1}{Z} f(z).$$

A Taylor expansion of ln f(z) around a mode z_0 gives

$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2}(z - z_0)^T A (z - z_0)$$

where the Hessian A is defined as

$$A = -\nabla\nabla \ln f(z)\,\big|_{z=z_0}.$$

The Laplace approximation of p(z) is then

$$q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\Bigl\{-\frac{1}{2}(z - z_0)^T A (z - z_0)\Bigr\} = \mathcal{N}(z \mid z_0, A^{-1}).$$
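The same construction in R^M, as a rough sketch: find a mode with a generic optimiser and estimate the Hessian by central finite differences. (An analytic Hessian, as in the logistic-regression case below, is preferable when available; laplace_approx is a hypothetical helper, assuming NumPy and SciPy.)

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approx(log_f, z_init, h=1e-4):
    """Return (z0, A) such that q(z) = N(z | z0, inv(A)); A is -Hessian of log_f at z0."""
    z0 = minimize(lambda z: -log_f(z), z_init, method="BFGS").x
    M = z0.size
    I = np.eye(M)
    A = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            # Central finite difference for d^2 log_f / dz_i dz_j, negated.
            A[i, j] = -(log_f(z0 + h * I[i] + h * I[j]) - log_f(z0 + h * I[i] - h * I[j])
                        - log_f(z0 - h * I[i] + h * I[j]) + log_f(z0 - h * I[i] - h * I[j])) / (4 * h**2)
    return z0, A

# Sanity check: for a Gaussian log f(z) = -z^T P z / 2 the approximation is exact, so A = P.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
z0, A = laplace_approx(lambda z: -0.5 * z @ P @ z, np.zeros(2))
```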
Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable.

Why? We would need to normalise the product of the prior and the likelihood, and the likelihood is itself a product of logistic sigmoid functions, one for each data point.

Evaluation of the predictive distribution is also intractable.

Therefore we will use the Laplace approximation.
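Spelled out with the notation of the following slides (Gaussian prior N(w | m_0, S_0), targets t_n ∈ {0, 1}, and y_n = σ(w^T φ_n)), the normalising constant with no closed form is

$$p(t) = \int \mathcal{N}(w \mid m_0, S_0) \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} \, dw.$$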
Bayesian Logistic Regression

Assume a Gaussian prior, because we want a Gaussian posterior:

$$p(w) = \mathcal{N}(w \mid m_0, S_0)$$

for fixed hyperparameters m_0 and S_0.

Hyperparameters are parameters of a prior distribution. In contrast to the model parameters w, they are not learned.

For a set of training data (x_n, t_n), where n = 1, . . . , N, the posterior is given by

$$p(w \mid t) \propto p(w)\, p(t \mid w)$$

where t = (t_1, . . . , t_N)^T.
Bayesian Logistic Regression

Using our previous result for the cross-entropy error function

$$E(w) = -\ln p(t \mid w) = -\sum_{n=1}^{N} \bigl\{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\bigr\}$$

we can now calculate the log of the posterior

$$p(w \mid t) \propto p(w)\, p(t \mid w),$$

using the notation y_n = σ(w^T φ_n), as

$$\ln p(w \mid t) = -\frac{1}{2}(w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \bigl\{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\bigr\} + \text{const}.$$
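A direct transcription into code, as a sketch (hypothetical names; assumes NumPy, with Phi the N × M matrix whose rows are the feature vectors φ_n):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_posterior(w, Phi, t, m0, S0_inv):
    """ln p(w | t) up to an additive constant."""
    y = sigmoid(Phi @ w)
    log_prior = -0.5 * (w - m0) @ S0_inv @ (w - m0)
    log_lik = np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
    return log_prior + log_lik

def grad_log_posterior(w, Phi, t, m0, S0_inv):
    """Gradient -S0^{-1}(w - m0) + sum_n (t_n - y_n) phi_n, used to find w_MAP."""
    y = sigmoid(Phi @ w)
    return -S0_inv @ (w - m0) + Phi.T @ (t - y)
```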
Bayesian Logistic Regression

To obtain a Gaussian approximation to

$$\ln p(w \mid t) = -\frac{1}{2}(w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \bigl\{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\bigr\}$$

1  Find w_MAP, which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: there is no closed-form solution, because y_n = σ(w^T φ_n) makes ln p(w | t) nonlinear in w, so the maximisation must be done iteratively.)

2  Calculate the second derivative of the negative log posterior to get the inverse covariance of the Laplace approximation,

$$S_N^{-1} = -\nabla\nabla \ln p(w \mid t) = S_0^{-1} + \sum_{n=1}^{N} y_n(1 - y_n)\, \phi_n \phi_n^T.$$
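Both steps in code, as a sketch reusing the helpers above. Newton's method is a natural choice here because the Hessian it needs is exactly the S_N^{-1} of step 2 (this mirrors the iterative reweighted least squares update, with the prior contributing the S_0^{-1} term):

```python
def fit_laplace(Phi, t, m0, S0_inv, n_iter=25):
    """Newton's method for w_MAP; returns (w_MAP, S_N) of q(w) = N(w | w_MAP, S_N)."""
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = -S0_inv @ (w - m0) + Phi.T @ (t - y)                  # gradient of ln p(w | t)
        SN_inv = S0_inv + (Phi * (y * (1.0 - y))[:, None]).T @ Phi   # S0^{-1} + sum_n y_n(1-y_n) phi_n phi_n^T
        w = w + np.linalg.solve(SN_inv, grad)                        # Newton ascent step
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + (Phi * (y * (1.0 - y))[:, None]).T @ Phi
    return w, np.linalg.inv(SN_inv)
```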
Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) of the posterior distribution is then

$$q(w \mid \phi) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N)$$

where

$$S_N^{-1} = -\nabla\nabla \ln p(w \mid t) = S_0^{-1} + \sum_{n=1}^{N} y_n(1 - y_n)\, \phi_n \phi_n^T.$$
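Putting it together on toy data (all names hypothetical; fit_laplace from the previous sketch). Samples drawn from q(w) are what an approximate predictive distribution would average over:

```python
rng = np.random.default_rng(0)

# Toy 1-D problem with a bias feature, phi(x) = (1, x); labels from a noisy threshold.
x = rng.normal(size=100)
t = (x + 0.3 * rng.normal(size=100) > 0.0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])

m0 = np.zeros(2)
S0_inv = np.eye(2)  # prior p(w) = N(w | 0, I)

w_map, S_N = fit_laplace(Phi, t, m0, S0_inv)
w_samples = rng.multivariate_normal(w_map, S_N, size=1000)  # draws from q(w)
print("w_MAP =", w_map)
```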
