Introduction to Statistical Machine Learning
Linear Classification

Christfried Webers
Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University
Canberra

February – June 2011

© 2011 Christfried Webers, NICTA, The Australian National University

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part VII
Linear Classification 1

Classification
Generalised Linear Model
Inference and Decision
Discriminant Functions
Fisher's Linear Discriminant
The Perceptron Algorithm

Classification

Goal: Given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K.
Divide the input space into different decision regions.

Figure: Petal length [cm] for a given sepal length [cm] for iris flowers (Iris Setosa, Iris Versicolor, Iris Virginica).

How to represent binary class labels?

Class labels are no longer real values as in regression, but a discrete set.
Two classes: t ∈ {0, 1} (t = 1 represents class C_1 and t = 0 represents class C_2).
Can interpret the value of t as the probability of class C_1, with only two values possible for the probability, 0 or 1.
Note: Other conventions to map classes into integers are possible; check the setup.

How to represent multi-class labels?

If there are more than two classes (K > 2), we call it a multi-class setup.
Often used: the 1-of-K coding scheme, in which t is a vector of length K with all values 0 except for t_j = 1, where j encodes membership in class C_j.
Example: Given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector

    t = (0, 1, 0, 0, 0)^T

Note: Other conventions to map multi-class labels into integers are possible; check the setup.

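A minimal sketch of the 1-of-K coding in NumPy (the helper name one_hot and the 0-based class indices are illustrative choices, not from the slides):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class indices (0-based) as 1-of-K target vectors."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# Membership in class C_2 (index 1) among K = 5 classes:
print(one_hot([1], 5))   # [[0. 1. 0. 0. 0.]]
```
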
Linear Model

Idea: Use again a linear model as in regression: y(x, w) is a linear function of the parameters w,

    y(x_n, w) = w^T φ(x_n)

But generally y(x_n, w) ∈ R.
Example: Which class is y(x, w) = 0.71623?

Generalised Linear Model

Apply a mapping f : R → Z to the linear model to get the discrete class labels.
Generalised Linear Model:

    y(x_n, w) = f(w^T φ(x_n))

Activation function: f(·)
Link function: f^{-1}(·)

Figure: Example of an activation function, f(z) = sign(z).

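A small sketch of this construction with the sign activation (NumPy; the function names and example values are illustrative, not from the slides):

```python
import numpy as np

def glm_predict(x, w, phi):
    """Generalised linear model: activation f(z) = sign(z)
    applied to a linear function of the parameters w."""
    return np.sign(w @ phi(x))

# Identity features with a bias component phi_0(x) = 1:
phi = lambda x: np.concatenate(([1.0], x))
w = np.array([-0.5, 1.0, 1.0])
print(glm_predict(np.array([0.2, 0.7]), w, phi))   # 1.0 -> class C_1
```
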
Three Models for Decision Problems

In increasing order of complexity:

Discriminant Functions
1 Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2 Also, infer the prior class probabilities p(C_k).
3 Use Bayes' theorem to find the posterior p(C_k | x).
4 Alternatively, model the joint distribution p(x, C_k) directly.
5 Use decision theory to assign each new x to one of the classes.

Two Classes

Definition
A discriminant is a function that maps from an input vector x to one of K classes, denoted by C_k.

Consider first two classes (K = 2).
Construct a linear function of the inputs x,

    y(x) = w^T x + w_0

such that x is assigned to class C_1 if y(x) ≥ 0, and to class C_2 otherwise.
w is the weight vector, w_0 the bias (sometimes −w_0 is called the threshold).

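A direct rendering of this decision rule as code (NumPy; the function name is illustrative):

```python
import numpy as np

def predict_two_class(x, w, w0):
    """Assign x to class C_1 if y(x) = w^T x + w_0 >= 0, else to C_2."""
    return 1 if w @ x + w0 >= 0 else 2

print(predict_two_class(np.array([1.0, 2.0]), np.array([1.0, -1.0]), 0.5))  # 1
```
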
Two Classes

The decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane in a D-dimensional input space (the decision surface).
w is orthogonal to any vector lying in the decision surface.
Proof: Assume x_A and x_B are two points lying in the decision surface. Then

    0 = y(x_A) − y(x_B) = w^T (x_A − x_B)

Two Classes

The normal distance from the origin to the decision surface is

    w^T x / ‖w‖ = −w_0 / ‖w‖

(for any point x on the decision surface, where w^T x + w_0 = 0).

Figure: Geometry of the linear discriminant: decision surface y = 0 separating regions R_1 (y > 0) and R_2 (y < 0), weight vector w, projection x_⊥ of a point x onto the surface, and the distances y(x)/‖w‖ and −w_0/‖w‖.

Two Classes

The value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface, r = y(x)/‖w‖. Writing x = x_⊥ + r w/‖w‖,

    y(x) = w^T x_⊥ + r (w^T w)/‖w‖ + w_0 = (w^T x_⊥ + w_0) + r ‖w‖ = r ‖w‖

since x_⊥ lies on the decision surface and therefore w^T x_⊥ + w_0 = 0.

Figure: (as before) decision surface geometry with w, x_⊥, y(x)/‖w‖ and −w_0/‖w‖.

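A quick numeric check of r = y(x)/‖w‖ (NumPy; the values are illustrative):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance of x from the surface w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([3.0, 4.0]), w, w0))   # (9 + 16 - 5)/5 = 4.0
```
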
Two Classes

More compact notation: add an extra dimension to the input space and set its value to x_0 = 1.
Also define w̃ = (w_0, w) and x̃ = (1, x), so that

    y(x) = w̃^T x̃

The decision surface is now a D-dimensional hyperplane in the (D + 1)-dimensional expanded input space.

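The augmented coordinates in code (a sketch; w_tilde and augment are illustrative names):

```python
import numpy as np

def augment(X):
    """Prepend x_0 = 1 to each input so the bias is absorbed into w."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# y(x) = w_tilde^T x_tilde, with w_tilde = (w0, w):
w_tilde = np.array([-5.0, 3.0, 4.0])
X = np.array([[3.0, 4.0]])
print(augment(X) @ w_tilde)   # [20.]
```
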
Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K − 1 one-versus-the-rest classifiers?

Figure: K − 1 one-versus-the-rest classifiers ("not C_1", "not C_2") leave an ambiguous region (marked ?) between R_1, R_2 and R_3.

Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K(K − 1)/2 one-versus-one classifiers?

Figure: K(K − 1)/2 one-versus-one classifiers (C_1 vs C_2, C_1 vs C_3, C_2 vs C_3) still leave an ambiguous region (marked ?) between R_1, R_2 and R_3.

Multi-Class

Number of classes K > 2.
Solution: Use K linear functions

    y_k(x) = w_k^T x + w_{k0},    k = 1, ..., K

Assign input x to class C_k if y_k(x) > y_j(x) for all j ≠ k.
The decision boundary between classes C_k and C_j is given by

    y_k(x) = y_j(x)

Figure: Decision regions R_i, R_j, R_k of the multi-class linear discriminant; for two points x_A and x_B in the same region, every point x̂ on the line between them lies in that region too (the regions are convex).

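The assignment rule is an argmax over the K linear functions. A sketch (NumPy; names illustrative):

```python
import numpy as np

def predict_multiclass(x, W, w0):
    """Assign x to the class C_k with the largest y_k(x) = w_k^T x + w_k0.
    W: (K, D) matrix of weight vectors, w0: (K,) vector of biases."""
    return int(np.argmax(W @ x + w0))   # 0-based class index
```
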
Least Squares for Classification

Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values.
Is this also possible for classification?
Given input data x belonging to one of K classes C_k.
Use the 1-of-K binary coding scheme.
Each class is described by its own linear model:

    y_k(x) = w_k^T x + w_{k0},    k = 1, ..., K

Least Squares for Classification

With the conventions

    w̃_k = (w_{k0}, w_k^T)^T ∈ R^{D+1}
    x̃ = (1, x^T)^T ∈ R^{D+1}
    W̃ = (w̃_1, ..., w̃_K) ∈ R^{(D+1)×K}

we get for the (vector-valued) discriminant function

    y(x) = W̃^T x̃ ∈ R^K.

For a new input x, the class is then given by the index of the largest value in the vector y(x).

Determine W̃

Given a training set {x_n, t_n}, n = 1, ..., N, where t_n is the class in the 1-of-K coding scheme.
Define a matrix T whose n-th row is t_n^T.
The sum-of-squares error can now be written as

    E_D(W̃) = 1/2 tr{ (X̃W̃ − T)^T (X̃W̃ − T) }

The minimum of E_D(W̃) is reached for

    W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T

where X̃† is the pseudo-inverse of X̃.

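A sketch of this closed-form fit using NumPy's pseudo-inverse (np.linalg.pinv); the function names are illustrative:

```python
import numpy as np

def fit_least_squares(X, T):
    """W_tilde = pinv(X_tilde) @ T, for 1-of-K targets T of shape (N, K).
    X: (N, D) inputs; a column of ones absorbs the bias."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.pinv(X_tilde) @ T      # shape (D+1, K)

def predict(x, W_tilde):
    y = W_tilde.T @ np.append(1.0, x)       # components sum to 1 under 1-of-K,
    return int(np.argmax(y))                # but they are not probabilities
```
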
Discriminant Function for Multi-Class

The discriminant function y(x) is therefore

    y(x) = W̃^T x̃ = T^T (X̃†)^T x̃,

where X̃ is given by the training data, and x̃ is the new input.
Interesting property: If for every t_n the same linear constraint a^T t_n + b = 0 holds, then the prediction y(x) also obeys the same constraint,

    a^T y(x) + b = 0.

For the 1-of-K coding scheme, the sum of all components in t_n is one, and therefore all components of y(x) will sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).

Deficiencies of the Least Squares Approach

Figure: Magenta curve: decision boundary for the least squares approach. Green curve: decision boundary for the logistic regression model described later.

Deficiencies of the Least Squares Approach

Figure (second example): Magenta curve: decision boundary for the least squares approach. Green curve: decision boundary for the logistic regression model described later.

Fisher's Linear Discriminant

View linear classification as dimensionality reduction:

    y(x) = w^T x

If y ≥ −w_0 then class C_1, otherwise C_2.
But there are many projections from a D-dimensional input space onto one dimension.
Projection always means loss of information.
For classification we want to preserve the class separation in one dimension.
Can we find a projection which maximally preserves the class separation?

Fisher's Linear Discriminant

Figure: Samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.

Fisher's Linear Discriminant - First Try

Given N_1 input data of class C_1 and N_2 input data of class C_2, calculate the centres of the two classes:

    m_1 = (1/N_1) Σ_{n∈C_1} x_n,    m_2 = (1/N_2) Σ_{n∈C_2} x_n

Choose w so as to maximise the projection of the class means onto w,

    m_2 − m_1 = w^T (m_2 − m_1)

Problem with non-uniform covariance: the projected classes can still overlap strongly.

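The "first try" direction in code (a sketch; it simply points along the difference of the class means and ignores the class covariances, which is exactly its weakness):

```python
import numpy as np

def mean_difference_direction(X1, X2):
    """First try: w proportional to m_2 - m_1.
    X1, X2: arrays of shape (N1, D) and (N2, D)."""
    return X2.mean(axis=0) - X1.mean(axis=0)
```
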
Fisher's Linear Discriminant

Measure also the within-class variance for each class,

    s_k^2 = Σ_{n∈C_k} (y_n − m_k)^2

where y_n = w^T x_n.
Maximise the Fisher criterion:

    J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2)

Fisher's Linear Discriminant

The Fisher criterion can be rewritten as

    J(w) = (w^T S_B w) / (w^T S_W w)

S_B is the between-class covariance:

    S_B = (m_2 − m_1)(m_2 − m_1)^T

S_W is the within-class covariance:

    S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T

Fisher's Linear Discriminant

The Fisher criterion

    J(w) = (w^T S_B w) / (w^T S_W w)

has a maximum for Fisher's linear discriminant,

    w ∝ S_W^{-1} (m_2 − m_1)

Fisher's linear discriminant is NOT a discriminant, but it can be used to construct one by choosing a threshold y_0 in the projection space.

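A sketch of the two-class Fisher direction (NumPy; a threshold y_0 must still be chosen separately, as the slide notes):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's direction w ∝ S_W^{-1} (m_2 - m_1) for two classes.
    X1, X2: arrays of shape (N1, D) and (N2, D)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W (unnormalised, as on the slide)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S_W, m2 - m1)
```
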
Fisher's Discriminant For Multi-Class

Assume that the dimensionality of the input space D is greater than the number of classes K.
Use D′ > 1 linear 'features' y_k = w_k^T x and write everything in vector form (no bias involved!):

    y = W^T x.

The within-class covariance is then the sum of the covariances for all K classes,

    S_W = Σ_{k=1}^{K} S_k

where

    S_k = Σ_{n∈C_k} (x_n − m_k)(x_n − m_k)^T,    m_k = (1/N_k) Σ_{n∈C_k} x_n

Fisher's Discriminant For Multi-Class

Between-class covariance:

    S_B = Σ_{k=1}^{K} N_k (m_k − m)(m_k − m)^T,

where m is the total mean of the input data,

    m = (1/N) Σ_{n=1}^{N} x_n.

One possible way to define a function of W which is large when the between-class covariance is large and the within-class covariance is small is

    J(W) = tr{ (W S_W W^T)^{-1} (W S_B W^T) }

The maximum of J(W) is determined by the D′ eigenvectors of S_W^{-1} S_B with the largest eigenvalues.

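A sketch of the multi-class construction (NumPy; the function name and the eigenvector handling are assumptions, including the scatter-style S_B weighted by class counts N_k as written above):

```python
import numpy as np

def fisher_multiclass(X, labels, num_features):
    """Rows of the returned matrix are the leading eigenvectors of
    S_W^{-1} S_B; at most K - 1 of them carry class information."""
    X, labels = np.asarray(X), np.asarray(labels)
    N, D = X.shape
    m = X.mean(axis=0)                        # total mean
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        X_k = X[labels == k]
        m_k = X_k.mean(axis=0)
        S_W += (X_k - m_k).T @ (X_k - m_k)    # within-class scatter S_k
        S_B += len(X_k) * np.outer(m_k - m, m_k - m)
    # S_W^{-1} S_B is generally not symmetric, so eigenpairs may be complex;
    # keep the real parts of the leading eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)[:num_features]
    return eigvecs[:, order].real.T           # shape (num_features, D)
```
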
Fisher's Discriminant For Multi-Class

How many linear 'features' can one find with this method?
S_B is of rank at most K − 1, because it is a sum of K rank-one matrices with one global constraint via the total mean m.
Therefore the projection onto the subspace spanned by S_B cannot yield more than K − 1 linear features.

The Perceptron Algorithm

Frank Rosenblatt (1928 – 1971)
"Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms" (Spartan Books, 1962)

The Perceptron Algorithm

The perceptron ("Mark 1") was the first computer that could learn new skills by trial and error.

The Perceptron Algorithm

Two-class model.
Create a feature vector φ(x) by a fixed nonlinear transformation of the input x.
Generalised linear model

    y(x) = f(w^T φ(x))

with φ(x) containing some bias element φ_0(x) = 1.
Nonlinear activation function:

    f(a) = +1 if a ≥ 0,  −1 if a < 0

Target coding for the perceptron:

    t = +1 if C_1,  −1 if C_2

The Perceptron Algorithm - Error Function

Idea: Minimise the total number of misclassified patterns.
Problem: As a function of w, this is piecewise constant, and therefore the gradient is zero almost everywhere.
Better idea: Using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(x_n) t_n > 0.
Perceptron criterion: Add the errors for all patterns belonging to the set of misclassified patterns M:

    E_P(w) = − Σ_{n∈M} w^T φ(x_n) t_n

Perceptron - Stochastic Gradient Descent

Perceptron criterion (with the notation φ_n = φ(x_n)):

    E_P(w) = − Σ_{n∈M} w^T φ_n t_n

One iteration at step τ:
1 Choose a misclassified training pair (x_n, t_n).
2 Update the weight vector w by

    w^{(τ+1)} = w^{(τ)} − η ∇E_P(w) = w^{(τ)} + η φ_n t_n

As y(x, w) does not depend on the norm of w, one can set η = 1:

    w^{(τ+1)} = w^{(τ)} + φ_n t_n

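The whole algorithm fits in a few lines. A sketch (NumPy; the epoch loop, names, and toy data are illustrative):

```python
import numpy as np

def perceptron_train(Phi, t, max_epochs=100):
    """Stochastic-gradient perceptron: w <- w + phi_n * t_n for each
    misclassified pattern. Phi: (N, M) feature matrix (first column 1
    for the bias element phi_0), t: targets in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        any_misclassified = False
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:       # violates w^T phi_n t_n > 0
                w = w + phi_n * t_n          # update with eta = 1
                any_misclassified = True
        if not any_misclassified:            # all patterns classified correctly
            break
    return w

# Toy linearly separable data: class +1 roughly above x_1 + x_2 = 1
Phi = np.array([[1, 0, 0], [1, 1, 1], [1, 0.2, 0.1], [1, 0.9, 0.8]])
t = np.array([-1, 1, -1, 1])
print(perceptron_train(Phi, t))
```
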
The Perceptron Algorithm - Update 1

Update of the perceptron weights from a misclassified pattern (green):

    w^{(τ+1)} = w^{(τ)} + φ_n t_n

Figure: Decision boundary before (left) and after (right) the update.

The Perceptron Algorithm - Update 2

Update of the perceptron weights from a misclassified pattern (green):

    w^{(τ+1)} = w^{(τ)} + φ_n t_n

Figure: Decision boundary before (left) and after (right) the update.

The Perceptron Algorithm - Convergence

Does the algorithm converge?
For a single update step,

    −w^{(τ+1)T} φ_n t_n = −w^{(τ)T} φ_n t_n − (φ_n t_n)^T φ_n t_n < −w^{(τ)T} φ_n t_n

because (φ_n t_n)^T φ_n t_n = ‖φ_n t_n‖^2 > 0.
BUT: contributions to the error from the other misclassified patterns might have increased.
AND: some correctly classified patterns might now be misclassified.
Perceptron Convergence Theorem: If the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.