Introduction to Statistical Machine Learning
Linear Classification

Christfried Webers
Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University
Canberra

February – June 2011

© 2011 Christfried Webers, NICTA, The Australian National University

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part VII
Linear Classification 1

Classification
Generalised Linear Model
Inference and Decision
Discriminant Functions
Fisher's Linear Discriminant
The Perceptron Algorithm

Classification

Goal: Given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K.
Divide the input space into different decision regions.

Figure: Petal length [cm] for a given sepal length [cm] for iris flowers (Iris Setosa, Iris Versicolor, Iris Virginica).

How to represent binary class labels?

Class labels are no longer real values as in regression, but a discrete set.
Two classes: t ∈ {0, 1} (t = 1 represents class C_1 and t = 0 represents class C_2).
Can interpret the value of t as the probability of class C_1, with only two values possible for the probability, 0 or 1.
Note: Other conventions to map classes into integers are possible; check the setup.

How to represent multi-class labels?

If there are more than two classes (K > 2), we call it a multi-class setup.
Often used: the 1-of-K coding scheme, in which t is a vector of length K with all values 0 except for t_j = 1, where j encodes membership in class C_j.
Example: Given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector

    t = (0, 1, 0, 0, 0)^T

Note: Other conventions to map multi-class labels into integers are possible; check the setup.

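A minimal sketch of the 1-of-K coding in NumPy (the helper name one_hot and the 0-based class indices are illustrative choices, not from the slides):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class indices (0-based) as 1-of-K target vectors."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# Membership in class C_2 (index 1) among K = 5 classes:
print(one_hot([1], 5))   # [[0. 1. 0. 0. 0.]]
```
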
Linear Model

Idea: Use again a linear model as in regression: y(x, w) is a linear function of the parameters w,

    y(x_n, w) = w^T φ(x_n)

But generally y(x_n, w) ∈ R.
Example: Which class is y(x, w) = 0.71623?

Generalised Linear Model

Apply a mapping f : R → Z to the linear model to get the discrete class labels.
Generalised Linear Model:

    y(x_n, w) = f(w^T φ(x_n))

Activation function: f(·)
Link function: f^{-1}(·)

Figure: Example of an activation function, f(z) = sign(z).

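A small sketch of this construction with the sign activation (NumPy; the function names and example values are illustrative, not from the slides):

```python
import numpy as np

def glm_predict(x, w, phi):
    """Generalised linear model: activation f(z) = sign(z)
    applied to a linear function of the parameters w."""
    return np.sign(w @ phi(x))

# Identity features with a bias component phi_0(x) = 1:
phi = lambda x: np.concatenate(([1.0], x))
w = np.array([-0.5, 1.0, 1.0])
print(glm_predict(np.array([0.2, 0.7]), w, phi))   # 1.0 -> class C_1
```
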
Three Models for Decision Problems

In increasing order of complexity:

Discriminant Functions
1 Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2 Also, infer the prior class probabilities p(C_k).
3 Use Bayes' theorem to find the posterior p(C_k | x).
4 Alternatively, model the joint distribution p(x, C_k) directly.
5 Use decision theory to assign each new x to one of the classes.

Two Classes

Definition
A discriminant is a function that maps from an input vector x to one of K classes, denoted by C_k.

Consider first two classes (K = 2).
Construct a linear function of the inputs x,

    y(x) = w^T x + w_0

such that x is assigned to class C_1 if y(x) ≥ 0, and to class C_2 otherwise.
w is the weight vector, w_0 the bias (sometimes −w_0 is called the threshold).

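A direct rendering of this decision rule as code (NumPy; the function name is illustrative):

```python
import numpy as np

def predict_two_class(x, w, w0):
    """Assign x to class C_1 if y(x) = w^T x + w_0 >= 0, else to C_2."""
    return 1 if w @ x + w0 >= 0 else 2

print(predict_two_class(np.array([1.0, 2.0]), np.array([1.0, -1.0]), 0.5))  # 1
```
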
Two Classes

The decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane in a D-dimensional input space (the decision surface).
w is orthogonal to any vector lying in the decision surface.
Proof: Assume x_A and x_B are two points lying in the decision surface. Then

    0 = y(x_A) − y(x_B) = w^T (x_A − x_B)

Two Classes

The normal distance from the origin to the decision surface is

    w^T x / ‖w‖ = −w_0 / ‖w‖

(for any point x on the decision surface, where w^T x + w_0 = 0).

Figure: Geometry of the linear discriminant: decision surface y = 0 separating regions R_1 (y > 0) and R_2 (y < 0), weight vector w, projection x_⊥ of a point x onto the surface, and the distances y(x)/‖w‖ and −w_0/‖w‖.

Two Classes

The value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface, r = y(x)/‖w‖. Writing x = x_⊥ + r w/‖w‖,

    y(x) = w^T x_⊥ + r (w^T w)/‖w‖ + w_0 = (w^T x_⊥ + w_0) + r ‖w‖ = r ‖w‖

since x_⊥ lies on the decision surface and therefore w^T x_⊥ + w_0 = 0.

Figure: (as before) decision surface geometry with w, x_⊥, y(x)/‖w‖ and −w_0/‖w‖.

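A quick numeric check of r = y(x)/‖w‖ (NumPy; the values are illustrative):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance of x from the surface w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([3.0, 4.0]), w, w0))   # (9 + 16 - 5)/5 = 4.0
```
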
Two Classes

More compact notation: add an extra dimension to the input space and set its value to x_0 = 1.
Also define w̃ = (w_0, w) and x̃ = (1, x), so that

    y(x) = w̃^T x̃

The decision surface is now a D-dimensional hyperplane in the (D + 1)-dimensional expanded input space.

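The augmented coordinates in code (a sketch; w_tilde and augment are illustrative names):

```python
import numpy as np

def augment(X):
    """Prepend x_0 = 1 to each input so the bias is absorbed into w."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# y(x) = w_tilde^T x_tilde, with w_tilde = (w0, w):
w_tilde = np.array([-5.0, 3.0, 4.0])
X = np.array([[3.0, 4.0]])
print(augment(X) @ w_tilde)   # [20.]
```
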
Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K − 1 one-versus-the-rest classifiers?

Figure: K − 1 one-versus-the-rest classifiers ("not C_1", "not C_2") leave an ambiguous region (marked ?) between R_1, R_2 and R_3.

Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K(K − 1)/2 one-versus-one classifiers?

Figure: K(K − 1)/2 one-versus-one classifiers (C_1 vs C_2, C_1 vs C_3, C_2 vs C_3) still leave an ambiguous region (marked ?) between R_1, R_2 and R_3.

Multi-Class

Number of classes K > 2.
Solution: Use K linear functions

    y_k(x) = w_k^T x + w_{k0},    k = 1, ..., K

Assign input x to class C_k if y_k(x) > y_j(x) for all j ≠ k.
The decision boundary between classes C_k and C_j is given by

    y_k(x) = y_j(x)

Figure: Decision regions R_i, R_j, R_k of the multi-class linear discriminant; for two points x_A and x_B in the same region, every point x̂ on the line between them lies in that region too (the regions are convex).

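The assignment rule is an argmax over the K linear functions. A sketch (NumPy; names illustrative):

```python
import numpy as np

def predict_multiclass(x, W, w0):
    """Assign x to the class C_k with the largest y_k(x) = w_k^T x + w_k0.
    W: (K, D) matrix of weight vectors, w0: (K,) vector of biases."""
    return int(np.argmax(W @ x + w0))   # 0-based class index
```
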
Least Squares for Classification

Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values.
Is this also possible for classification?
Given input data x belonging to one of K classes C_k.
Use the 1-of-K binary coding scheme.
Each class is described by its own linear model:

    y_k(x) = w_k^T x + w_{k0},    k = 1, ..., K

Least Squares for Classification

With the conventions

    w̃_k = (w_{k0}, w_k^T)^T ∈ R^{D+1}
    x̃ = (1, x^T)^T ∈ R^{D+1}
    W̃ = (w̃_1, ..., w̃_K) ∈ R^{(D+1)×K}

we get for the (vector-valued) discriminant function

    y(x) = W̃^T x̃ ∈ R^K.

For a new input x, the class is then given by the index of the largest value in the vector y(x).

Determine W̃

Given a training set {x_n, t_n}, n = 1, ..., N, where t_n is the class in the 1-of-K coding scheme.
Define a matrix T whose n-th row is t_n^T.
The sum-of-squares error can now be written as

    E_D(W̃) = 1/2 tr{ (X̃W̃ − T)^T (X̃W̃ − T) }

The minimum of E_D(W̃) is reached for

    W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T

where X̃† is the pseudo-inverse of X̃.

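A sketch of this closed-form fit using NumPy's pseudo-inverse (np.linalg.pinv); the function names are illustrative:

```python
import numpy as np

def fit_least_squares(X, T):
    """W_tilde = pinv(X_tilde) @ T, for 1-of-K targets T of shape (N, K).
    X: (N, D) inputs; a column of ones absorbs the bias."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.pinv(X_tilde) @ T      # shape (D+1, K)

def predict(x, W_tilde):
    y = W_tilde.T @ np.append(1.0, x)       # components sum to 1 under 1-of-K,
    return int(np.argmax(y))                # but they are not probabilities
```
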
Discriminant Function for Multi-Class

The discriminant function y(x) is therefore

    y(x) = W̃^T x̃ = T^T (X̃†)^T x̃,

where X̃ is given by the training data, and x̃ is the new input.
Interesting property: If for every t_n the same linear constraint a^T t_n + b = 0 holds, then the prediction y(x) also obeys the same constraint,

    a^T y(x) + b = 0.

For the 1-of-K coding scheme, the sum of all components in t_n is one, and therefore all components of y(x) will sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).

Deficiencies of the Least Squares Approach

Figure: Magenta curve: decision boundary for the least squares approach. Green curve: decision boundary for the logistic regression model described later.

Deficiencies of the Least Squares Approach

Figure (second example): Magenta curve: decision boundary for the least squares approach. Green curve: decision boundary for the logistic regression model described later.

Fisher's Linear Discriminant

View linear classification as dimensionality reduction:

    y(x) = w^T x

If y ≥ −w_0 then class C_1, otherwise C_2.
But there are many projections from a D-dimensional input space onto one dimension.
Projection always means loss of information.
For classification we want to preserve the class separation in one dimension.
Can we find a projection which maximally preserves the class separation?

Fisher's Linear Discriminant

Figure: Samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.

Fisher's Linear Discriminant - First Try

Given N_1 input data of class C_1 and N_2 input data of class C_2, calculate the centres of the two classes:

    m_1 = (1/N_1) Σ_{n∈C_1} x_n,    m_2 = (1/N_2) Σ_{n∈C_2} x_n

Choose w so as to maximise the projection of the class means onto w,

    m_2 − m_1 = w^T (m_2 − m_1)

Problem with non-uniform covariance: the projected classes can still overlap strongly.

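The "first try" direction in code (a sketch; it simply points along the difference of the class means and ignores the class covariances, which is exactly its weakness):

```python
import numpy as np

def mean_difference_direction(X1, X2):
    """First try: w proportional to m_2 - m_1.
    X1, X2: arrays of shape (N1, D) and (N2, D)."""
    return X2.mean(axis=0) - X1.mean(axis=0)
```
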
Fisher's Linear Discriminant

Measure also the within-class variance for each class,

    s_k^2 = Σ_{n∈C_k} (y_n − m_k)^2

where y_n = w^T x_n.
Maximise the Fisher criterion:

    J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2)

Fisher's Linear Discriminant

The Fisher criterion can be rewritten as

    J(w) = (w^T S_B w) / (w^T S_W w)

S_B is the between-class covariance:

    S_B = (m_2 − m_1)(m_2 − m_1)^T

S_W is the within-class covariance:

    S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T

Fisher's Linear Discriminant

The Fisher criterion

    J(w) = (w^T S_B w) / (w^T S_W w)

has a maximum for Fisher's linear discriminant,

    w ∝ S_W^{-1} (m_2 − m_1)

Fisher's linear discriminant is NOT a discriminant, but it can be used to construct one by choosing a threshold y_0 in the projection space.

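A sketch of the two-class Fisher direction (NumPy; a threshold y_0 must still be chosen separately, as the slide notes):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's direction w ∝ S_W^{-1} (m_2 - m_1) for two classes.
    X1, X2: arrays of shape (N1, D) and (N2, D)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W (unnormalised, as on the slide)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S_W, m2 - m1)
```
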
Fisher's Discriminant For Multi-Class

Assume that the dimensionality of the input space D is greater than the number of classes K.
Use D′ > 1 linear 'features' y_k = w_k^T x and write everything in vector form (no bias involved!):

    y = W^T x.

The within-class covariance is then the sum of the covariances for all K classes,

    S_W = Σ_{k=1}^{K} S_k

where

    S_k = Σ_{n∈C_k} (x_n − m_k)(x_n − m_k)^T,    m_k = (1/N_k) Σ_{n∈C_k} x_n

Fisher's Discriminant For Multi-Class

Between-class covariance:

    S_B = Σ_{k=1}^{K} N_k (m_k − m)(m_k − m)^T,

where m is the total mean of the input data,

    m = (1/N) Σ_{n=1}^{N} x_n.

One possible way to define a function of W which is large when the between-class covariance is large and the within-class covariance is small is

    J(W) = tr{ (W S_W W^T)^{-1} (W S_B W^T) }

The maximum of J(W) is determined by the D′ eigenvectors of S_W^{-1} S_B with the largest eigenvalues.

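A sketch of the multi-class construction (NumPy; the function name and the eigenvector handling are assumptions, including the scatter-style S_B weighted by class counts N_k as written above):

```python
import numpy as np

def fisher_multiclass(X, labels, num_features):
    """Rows of the returned matrix are the leading eigenvectors of
    S_W^{-1} S_B; at most K - 1 of them carry class information."""
    X, labels = np.asarray(X), np.asarray(labels)
    N, D = X.shape
    m = X.mean(axis=0)                        # total mean
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        X_k = X[labels == k]
        m_k = X_k.mean(axis=0)
        S_W += (X_k - m_k).T @ (X_k - m_k)    # within-class scatter S_k
        S_B += len(X_k) * np.outer(m_k - m, m_k - m)
    # S_W^{-1} S_B is generally not symmetric, so eigenpairs may be complex;
    # keep the real parts of the leading eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)[:num_features]
    return eigvecs[:, order].real.T           # shape (num_features, D)
```
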
Fisher's Discriminant For Multi-Class

How many linear 'features' can one find with this method?
S_B is of rank at most K − 1, because it is a sum of K rank-one matrices with one global constraint via the total mean m.
Therefore the projection onto the subspace spanned by S_B cannot yield more than K − 1 linear features.

The Perceptron Algorithm

Frank Rosenblatt (1928 – 1971)
"Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms" (Spartan Books, 1962)

The Perceptron Algorithm

The perceptron ("Mark 1") was the first computer that could learn new skills by trial and error.

The Perceptron Algorithm

Two-class model.
Create a feature vector φ(x) by a fixed nonlinear transformation of the input x.
Generalised linear model

    y(x) = f(w^T φ(x))

with φ(x) containing some bias element φ_0(x) = 1.
Nonlinear activation function:

    f(a) = +1 if a ≥ 0,  −1 if a < 0

Target coding for the perceptron:

    t = +1 if C_1,  −1 if C_2

The Perceptron Algorithm - Error Function

Idea: Minimise the total number of misclassified patterns.
Problem: As a function of w, this is piecewise constant, and therefore the gradient is zero almost everywhere.
Better idea: Using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(x_n) t_n > 0.
Perceptron criterion: Add the errors for all patterns belonging to the set of misclassified patterns M:

    E_P(w) = − Σ_{n∈M} w^T φ(x_n) t_n

Perceptron - Stochastic Gradient Descent

Perceptron criterion (with the notation φ_n = φ(x_n)):

    E_P(w) = − Σ_{n∈M} w^T φ_n t_n

One iteration at step τ:
1 Choose a misclassified training pair (x_n, t_n).
2 Update the weight vector w by

    w^{(τ+1)} = w^{(τ)} − η ∇E_P(w) = w^{(τ)} + η φ_n t_n

As y(x, w) does not depend on the norm of w, one can set η = 1:

    w^{(τ+1)} = w^{(τ)} + φ_n t_n

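The whole algorithm fits in a few lines. A sketch (NumPy; the epoch loop, names, and toy data are illustrative):

```python
import numpy as np

def perceptron_train(Phi, t, max_epochs=100):
    """Stochastic-gradient perceptron: w <- w + phi_n * t_n for each
    misclassified pattern. Phi: (N, M) feature matrix (first column 1
    for the bias element phi_0), t: targets in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        any_misclassified = False
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:       # violates w^T phi_n t_n > 0
                w = w + phi_n * t_n          # update with eta = 1
                any_misclassified = True
        if not any_misclassified:            # all patterns classified correctly
            break
    return w

# Toy linearly separable data: class +1 roughly above x_1 + x_2 = 1
Phi = np.array([[1, 0, 0], [1, 1, 1], [1, 0.2, 0.1], [1, 0.9, 0.8]])
t = np.array([-1, 1, -1, 1])
print(perceptron_train(Phi, t))
```
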
The Perceptron Algorithm - Update 1

Update of the perceptron weights from a misclassified pattern (green):

    w^{(τ+1)} = w^{(τ)} + φ_n t_n

Figure: Decision boundary before (left) and after (right) the update.

The Perceptron Algorithm - Update 2

Update of the perceptron weights from a misclassified pattern (green):

    w^{(τ+1)} = w^{(τ)} + φ_n t_n

Figure: Decision boundary before (left) and after (right) the update.

The Perceptron Algorithm - Convergence

Does the algorithm converge?
For a single update step,

    −w^{(τ+1)T} φ_n t_n = −w^{(τ)T} φ_n t_n − (φ_n t_n)^T φ_n t_n < −w^{(τ)T} φ_n t_n

because (φ_n t_n)^T φ_n t_n = ‖φ_n t_n‖^2 > 0.
BUT: contributions to the error from the other misclassified patterns might have increased.
AND: some correctly classified patterns might now be misclassified.
Perceptron Convergence Theorem: If the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.