Support Vector Machines are one of the main tools in the classical Machine Learning toolbox. This was one of the lectures of a full course I taught at the University of Moratuwa, Sri Lanka, in the second half of 2023.
Extra Lecture - Support Vector Machines (SVM), a lecture in the subject module Statistical & Machine Learning
1. DA 2111 – Statistical & Machine Learning
Lecture 7 – Support Vector Machines (SVM)
Maninda Edirisooriya
manindaw@uom.lk
2. Classification Problem (remember?)
• Say we have two X variables
• In Binary Classification our goal
is to classify data points into two
known classes, Positive or
Negative
• When we can separate the classes
with a linear decision boundary
we say the data is Linearly Separable
[Figure: Positive and Negative classes in the X1–X2 plane, separated by the decision boundary Y = β0 + β1*X1 + β2*X2 = 0; points with Y > 0 are classified Positive and points with Y < 0 are classified Negative]
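As a quick refresher, here is a minimal sketch of how such a linear decision boundary classifies points by the sign of Y = β0 + β1*X1 + β2*X2; the coefficient values below are purely hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical, hand-picked coefficients [beta0, beta1, beta2] of a decision boundary
beta = np.array([-1.0, 2.0, 0.5])

def classify(x1, x2):
    # Y > 0 -> Positive class, Y < 0 -> Negative class
    y = beta[0] + beta[1] * x1 + beta[2] * x2
    return "Positive" if y > 0 else "Negative"

print(classify(1.0, 1.0))    # Positive (Y = 1.5)
print(classify(0.0, 0.0))    # Negative (Y = -1.0)
```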
3. SVM Related Notation
• To make SVM related math easier we have to divide 𝛃 parameters into two types
• 𝛃0 as b or intercept
• 𝛃1, 𝛃2, … 𝛃n as W1 , W2 , … Wn or coefficients
• Say, W = [W1 , W2 , … Wn ]T
• Then, β = [β0, β1, β2, …, βn]T = [b, W]T, i.e. b stacked on top of W
4. SVM Classification
• Let’s consider a Linearly
Separable case (i.e. Hard SVM)
• SVM tries to find the hyperplane
(i.e. W and b) maximizing the
minimum distance from the
hyperplane to its nearest data
points
• The nearest data points (which lie at
exactly unit distance from the
hyperplane, i.e. where |WTX + b| = 1)
are called Support Vectors
[Figure: Positive and Negative classes in the X1–X2 plane, separated by the SVM decision boundary/hyperplane Y = b + W1*X1 + W2*X2 = WT*X + b = 0; points with Y > 1 are Positive, points with Y < -1 are Negative, and the Support Vectors are the points lying on the margins]
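A minimal scikit-learn sketch of this idea, assuming a hypothetical well-separated toy dataset; a very large C is used to approximate the hard-margin case, and the fitted model exposes the support vectors directly.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical, well-separated toy data; C is set very large to approximate a Hard SVM
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
model = SVC(kernel='linear', C=1e6).fit(X, y)

# The support vectors are the training points closest to the separating hyperplane
print("number of support vectors:", len(model.support_vectors_))
print(model.support_vectors_)
```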
5. Primal Problem
• The SVM Decision Boundary can be represented in a geometric sense and
optimized as follows to find W and b
• Minimize ½∥W∥² + C Σi ξi   (sum over i = 1, …, N)
• Subject to
1. Yi(WTXi+b) ≥ 𝟏 − 𝝃𝒊 for all i=1, 2, ..., N
2. 𝛏𝐢 ≥ 0 for all i=1, 2, ..., N
• This is the Geometric Problem or the Primal Problem of SVMs
• Here,
• ∥W∥ is the Euclidean Norm of the Weight Vector, W = [W1 , W2 , … Wn]T
• In order to allow misclassified data points (i.e. to generalize to a Soft SVM), ξi,
known as the Slack Variable, is subtracted from 1
• C is the Regularization Parameter that weights the (L1) penalty on the slack variables
• The goal is to minimize the objective function while ensuring that all data
points are correctly classified with a margin of at least 1 − ξi
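As a sanity check of this objective, the sketch below evaluates the primal objective for a fitted linear SVM on a hypothetical toy dataset, using the fact that at the optimum the slack is ξi = max(0, 1 − Yi(WTXi + b)).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping toy data, labels mapped to {-1, +1}
X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
y = np.where(y == 0, -1, 1)

C = 1.0
model = SVC(kernel='linear', C=C).fit(X, y)
w, b = model.coef_[0], model.intercept_[0]

margins = y * (X @ w + b)                      # Yi (W^T Xi + b)
slack = np.maximum(0.0, 1.0 - margins)         # xi_i = max(0, 1 - margin)
primal = 0.5 * w @ w + C * slack.sum()         # (1/2)||W||^2 + C * sum(xi_i)
print("primal objective value:", primal)
```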
6. Primal Problem
• By minimizing ∥W∥ we mathematically maximize the margin, i.e. the
distance between the solution hyperplane and the support vectors,
since that distance is 1/∥W∥
• The constraint Yi(WTXi+b) ≥ 1 tries to maintain at least a unit
distance between the hyperplane and its nearest vectors (Support
Vectors)
• In the constraint, subtracting the term ξi from 1 (on the right hand side)
allows Soft SVMs, where this minimum gap of 1 cannot be
maintained
• Regularization parameter C sets the tradeoff between maximizing the
margin and minimizing classification errors
• A smaller C emphasizes a wider margin and tolerates some misclassifications,
while a larger C results in a narrower margin and fewer misclassifications
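The effect of C can be seen empirically; here is a small sketch on a hypothetical noisy dataset (the C values are chosen only for illustration).

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping classes
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=1)

for C in (0.01, 100.0):
    model = SVC(kernel='linear', C=C).fit(X, y)
    # Smaller C -> wider margin, typically more support vectors and more tolerated errors
    print(f"C={C}: support vectors={len(model.support_vectors_)}, "
          f"training accuracy={model.score(X, y):.2f}")
```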
7. Effect of C
Source: https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel
8. Modelling Non-Linear Functions
• Though the parameters ξi and C
can address misclassifications, this
can still only model nearly linearly
separable classes
• Highly non-linear classes (e.g. ones
that need a circular decision boundary)
cannot be modelled with a linear
hyperplane
• But such data may be linearly
separable if they are represented in
a higher dimensional space
[Figure: Positive and Negative classes in the X1–X2 plane, separated by a circular decision boundary Y = β0 + β1*X1² + β2*X2²; points with Y > 0 are Positive and points with Y < 0 are Negative]
9. Separate in a Higher Dimension
Source: https://www.hackerearth.com/blog/developers/simple-tutorial-svm-parameter-tuning-python-r
10. Dual Problem
• It is difficult to apply that dimension-increasing transformation directly in the
Primal Problem
• But Primal Problem can be converted to an equivalent problem
known as the Dual Problem or the Functional Problem to find 𝜶𝒊s
(known as Lagrange Multipliers)
• Maximize Σi αi − ½ Σi Σj αi αj Yi Yj XiTXj   (sums over i, j = 1, …, N)
• Subject to
1. C ≥ 𝜶𝒊 ≥ 𝟎 for all i=1, 2, ..., N and
2. Σi αiYi = 0   (sum over i = 1, …, N)
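To make the dual concrete, here is a minimal sketch that solves it numerically for a tiny hypothetical dataset with an off-the-shelf optimizer (scipy's SLSQP); a real SVM library would use a specialized QP solver such as SMO instead.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny hypothetical dataset with labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, N = 10.0, len(y)

K = X @ X.T                         # linear kernel values Xi^T Xj
Q = np.outer(y, y) * K              # Q[i, j] = Yi Yj Xi^T Xj

def neg_dual(alpha):
    # Negative dual objective: minimizing this maximizes sum(alpha) - 1/2 alpha^T Q alpha
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(N),
               bounds=[(0.0, C)] * N,                               # 0 <= alpha_i <= C
               constraints={'type': 'eq', 'fun': lambda a: a @ y})  # sum_i alpha_i Yi = 0
print("alphas:", np.round(res.x, 4))
```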
11. Dual Problem
• With 𝜶𝒊 values we can,
• Find the Support Vectors and then
• Find the decision boundary (hyperplane)
• Note that here we have
• Only αi values to be found, which are fewer in number compared to the W and b
values
• The XiTXj term, which can be used for the Kernel Trick (discussed later) to
handle non-linear decision boundaries
• Once we have solved the Dual Problem, the solution to the equivalent
Primal Problem can also be found
12. Solution Function from Dual Problem
• First find the αi values that are greater than zero; they correspond to the Support Vectors
• Using these support vectors calculate the weight vector, W = ΣS αiYiXi
• Where ΣS denotes the sum over all the support vectors
• Calculate the bias b using any support vector: b = Yi − WTXi
• The solution function (Hypothesis Function) output is defined by
its sign, f(X) = sign(WTX + b), where sign is a function that returns the
sign of a value
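These steps can be checked against scikit-learn, whose fitted SVC stores the αiYi values for the support vectors in dual_coef_; the sketch below (on hypothetical data) reconstructs W and b and compares them with the library's own coef_ and intercept_.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical toy data with labels mapped to {-1, +1}
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)

model = SVC(kernel='linear', C=1.0).fit(X, y)

alpha_y = model.dual_coef_[0]                     # alpha_i * Yi for each support vector
w = alpha_y @ model.support_vectors_              # W = sum_S alpha_i Yi Xi

# b from a support vector lying exactly on the margin (alpha_i strictly below C)
i = np.flatnonzero(np.abs(alpha_y) < model.C)[0]
b = y[model.support_][i] - model.support_vectors_[i] @ w   # b = Yi - W^T Xi

print("W matches coef_ :", np.allclose(w, model.coef_[0]))
print("b vs intercept_ :", b, model.intercept_[0])
```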
13. Transform to Higher Dimensions
• The data points can be transformed to a higher dimensional space by
applying some mathematical technique
• E.g.: X1, X2 → X1, X2, X3 by defining X3 such that X3 = X1² + X2²
• Then, when the data becomes linearly separable in the new space with a higher
number of dimensions, an SVM can be applied for classification
• But this becomes highly computationally expensive when the number of
dimensions in the new space is very large (e.g. 10⁶ dimensions) or infinite
• But there is a way to modify the original function to get a similar effect,
without calculating the coordinates in the new high dimensional space
• That is possible by simply computing the similarity between data points as if
they were in the higher dimensional space, using a technique called the Kernel Trick
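A small sketch of this explicit lifting on a hypothetical "circles" dataset: adding the extra coordinate X3 = X1² + X2² makes the two rings linearly separable, so a linear SVM succeeds in the lifted space.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Hypothetical concentric-circle classes (not linearly separable in 2-D)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Lift to 3-D by adding X3 = X1^2 + X2^2
X3 = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X_lifted = np.hstack([X, X3])

print("accuracy in 2-D :", SVC(kernel='linear').fit(X, y).score(X, y))               # poor
print("accuracy in 3-D :", SVC(kernel='linear').fit(X_lifted, y).score(X_lifted, y)) # ~1.0
```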
14. Kernel Function
• Kernel Function is a function that measures the similarity/distance
between two data points in a vector space
• A Kernel Function should be Positive Definite as well
• Examples (when X and Y are vectors with same dimensions)
• Linear Kernel: K(X, Y) = XTY
• Polynomial Kernel: K(X, Y) = (XTY + r)ⁿ where r ≥ 0 and n ≥ 1
• Gaussian (Radial Basis Function) Kernel: K(X, Y) = exp(−∥X − Y∥² / (2σ²)) where σ > 0
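The three example kernels above can be written directly as small functions; a minimal sketch, with arbitrary default parameter values chosen only for illustration.

```python
import numpy as np

def linear_kernel(x, z):
    # K(X, Y) = X^T Y
    return x @ z

def polynomial_kernel(x, z, r=1.0, n=3):
    # K(X, Y) = (X^T Y + r)^n, with r >= 0 and n >= 1
    return (x @ z + r) ** n

def rbf_kernel(x, z, sigma=1.0):
    # K(X, Y) = exp(-||X - Y||^2 / (2 sigma^2)), with sigma > 0
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```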
15. Linear Kernel
• A Linear Kernel gives the inner product (or dot product) of two
vectors in the same vector space
• The Linear Kernel does not change the number of dimensions; it measures the
similarity between the two vectors in their original space
• Example, (Say X and Y are vectors where X = [X1, X2]T and Y = [Y1, Y2]T)
• K(X, Y) = XTY = X1Y1+ X2Y2
16. Polynomial Kernel
• A Polynomial Kernel implicitly maps the X variables to polynomial features
up to the degree of the kernel function
• When r = 0, K(X, Y) = (XTY + r)ⁿ becomes (XTY)ⁿ where
• When n=1 it becomes a Linear Kernel XTY having a linear decision boundary
• When n=2, it becomes a quadratic kernel and the decision boundary is a
quadratic surface
• When n=3, it becomes a cubic kernel and the decision boundary is a cubic
surface
• And so on …
• When r > 0, the constant r shifts the decision boundary (it mixes in lower-order polynomial terms)
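One way to see that the polynomial kernel corresponds to an implicit higher-dimensional map: for two-dimensional inputs with n = 2 and r = 0, the kernel value equals the dot product of the explicit quadratic features [X1², √2·X1X2, X2²]. A small check on hypothetical vectors:

```python
import numpy as np

def phi(v):
    # Explicit quadratic feature map whose inner product equals (X^T Y)^2 for 2-D inputs
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])

print((x @ z) ** 2)       # kernel computed directly in the original 2-D space
print(phi(x) @ phi(z))    # the same value via the explicit 3-D feature map
```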
17. Gaussian (Radial Basis Function) Kernel
• The RBF Kernel corresponds to an infinite dimensional feature space and can model highly non-linear
decision boundaries
• Can also be represented as K(X, Y) = exp(−γ∥X − Y∥²) where γ can tune the
bias-variance tradeoff
• Low 𝜸 relates to smoother hyperplane OR higher bias and lower variance
• High 𝜸 relates to wiggly hyperplane OR lower bias and higher variance
• Cross-validation can be used to find the best value for 𝜸
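A minimal cross-validation sketch for choosing γ (and C) with scikit-learn's GridSearchCV; the dataset and grid values below are hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical non-linear toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Search over a small, illustrative grid of C and gamma values
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```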
18. Kernel Trick
• In the Dual Problem, the objective Σi αi − ½ Σi Σj αi αj Yi Yj XiTXj, which
has to be maximized, contains a Kernel Function of the X variables with
themselves (i.e. XiTXj, which is a Linear Kernel) that can be replaced with
another Kernel Function
• This new Kernel Function can be a Polynomial Kernel or a Gaussian Kernel
(RBF Kernel) or any other valid Kernel Function
• The new Dual Problem objective to be maximized becomes,
Σi αi − ½ Σi Σj αi αj Yi Yj K(Xi, Xj)   (sums over i, j = 1, …, N)
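Because the data only ever enter the dual through K(Xi, Xj), an SVM can be trained directly from a precomputed Gram matrix; a minimal scikit-learn sketch on hypothetical circular data:

```python
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Hypothetical circular classes, not linearly separable in the original space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Replace Xi^T Xj in the dual with an RBF kernel by passing the Gram matrix K(Xi, Xj)
K = rbf_kernel(X, X, gamma=1.0)
model = SVC(kernel='precomputed').fit(K, y)
print("training accuracy:", model.score(K, y))
```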
19. SVM with Kernel
• The SVM with a Kernel will become,
f(X) = ΣS αiYiK(X, Xi) + b
• Where,
• K(X, Xi): Kernel function between X and each support vector Xi
• αi: Lagrange multipliers learnt during training
• ΣS: sum over all the support vectors
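This solution function can be verified against scikit-learn: dual_coef_ stores the learnt αiYi values, so ΣS αiYi K(X, Xi) + b should reproduce the library's decision_function. A small sketch on hypothetical data:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Hypothetical non-linear toy data
X, y = make_moons(n_samples=100, noise=0.15, random_state=0)
model = SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)

# f(X) = sum_S alpha_i Yi K(X, Xi) + b, summed over the stored support vectors
K = rbf_kernel(X, model.support_vectors_, gamma=0.5)    # K(X, Xi) for each support vector
f = K @ model.dual_coef_[0] + model.intercept_[0]

print(np.allclose(f, model.decision_function(X)))       # True: same decision values
```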
20. One Hour Homework
• Officially we have one more hour of work after the end of the lectures
• Therefore, for this week’s extra hour you have a homework
• The Support Vector Machine is an important tool in Machine Learning for dealing
with highly non-linear, relatively small datasets
• This lesson only explained the formulas used in SVMs and did not prove any of them, as that
involves heavy mathematical derivations. You can try them if you like
• Then search for real world applications of SVMs and understand when
they should be used as an ML tool
• Good Luck!