Hiba BELLAFKIH 2022-2023
Support Vector Machines
Supervised Machine Learning
Contents
01 Classification
02 Motivation
03 Linearly Separable Data Points: Hard Margin Classification
04 Linearly Separable Data Points: Soft Margin Classification
05 Non-Linearly Separable Data: Kernel Trick
06 SVM Regression
Classification
Binary Classification
The machine classifies each instance as exactly one of two classes, conventionally labeled +1 and -1.
Classification
Multiclass Classification (one vs. all)
The machine classifies each instance as exactly one of three or more classes. One binary classifier is trained per class, separating that class from all the others, so N classes require N classifiers.
Classification
Multiclass Classification (one vs. one)
We split a multiclass problem into one binary classification problem per pair of classes.
Classes: [Red, Blue, Green, Yellow]
❏ Red vs Blue
❏ Red vs Green
❏ Red vs Yellow
❏ Blue vs Green
❏ Blue vs Yellow
❏ Green vs Yellow
This yields N(N-1)/2 classifiers for N classes, as the short sketch below enumerates.
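A minimal sketch of this enumeration (pure Python, using the example classes above):

from itertools import combinations

classes = ["Red", "Blue", "Green", "Yellow"]

# One binary problem per unordered pair of classes.
pairs = list(combinations(classes, 2))
for a, b in pairs:
    print(f"{a} vs {b}")

n = len(classes)
assert len(pairs) == n * (n - 1) // 2  # 4 classes -> 6 classifiers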
Motivation
Classified training data set
(Example with well-separated data points)
Best separator to be chosen.
How do we choose the separator?
Motivation
We add a margin to each side of the separator.
Wanted result:
➔ The separator with the largest margin to the nearest data point.
➔ The lowest generalization error.
Generalization error: a measure of how accurately an algorithm is able to predict outcome values for previously unseen data.
Motivation
[Figure: data points in the plane of features X1 and X2, classified +1 or -1, with the decision boundary and the support vectors that define the margin.]
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
Our data form: a training set of points x_i with labels y_i ∈ {+1, -1}. The SVM problem is then the following constrained optimization:
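The equations on this slide are images in the source deck; the standard hard-margin formulation they correspond to is:

\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n.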
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
The margin is obtained by subtracting the distance from the origin to hyperplane H2 from the distance from the origin to hyperplane H1, which gives M = 2/‖w‖.
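Written out (a standard derivation, assuming H1: w⊺x + b = +1 and H2: w⊺x + b = -1 are the two margin hyperplanes):

d(H_1, O) = \frac{|1 - b|}{\lVert w\rVert}, \qquad
d(H_2, O) = \frac{|{-1} - b|}{\lVert w\rVert};

equivalently, the distance between the two parallel hyperplanes is

M = \frac{|1 - (-1)|}{\lVert w\rVert} = \frac{2}{\lVert w\rVert}.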
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
1. Formulation of the SVM problem
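The formulation itself is an image in the deck; the standard Lagrangian of the hard-margin problem, with one multiplier λ_i ≥ 0 per constraint, is:

L(w, b, \lambda) = \frac{1}{2}\lVert w\rVert^2 - \sum_{i=1}^{n} \lambda_i \left[\, y_i (w^\top x_i + b) - 1 \,\right], \qquad \lambda_i \ge 0.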
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
2. Finding the parameters with respect to w, b, and λ (the learning parameters).
The method of Lagrange multipliers finds the points where the gradient of a function is parallel to the gradients of its constraints, while the constraints themselves are satisfied.
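Setting the partial derivatives of the Lagrangian to zero (the standard stationarity conditions; the deck's own equations are images) gives:

\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \lambda_i\, y_i\, x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \lambda_i\, y_i = 0.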
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
3. Finding the values of the parameters that minimize ‖w‖.
Finding λ* => finding w*
Finding λ* => finding b*
Result: we switch to optimizing λ.
Solution? The dual optimization formulation.
When we move from the primal to the dual formulation, we switch from minimizing over (w, b) to maximizing the objective over λ.
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
1. Formulation and substitution of the values from the primal problem.
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
2. Simplify the objective after substitution.
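Substituting w = Σᵢ λᵢyᵢxᵢ and Σᵢ λᵢyᵢ = 0 into the Lagrangian yields the standard dual (the deck's derivation is an image; this is its usual end point):

\max_{\lambda}\ \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j\, y_i y_j\, x_i^\top x_j
\quad \text{subject to} \quad \lambda_i \ge 0, \quad \sum_{i=1}^{n} \lambda_i y_i = 0.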
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
3. Final optimization to get the value of λ.
The maximization above can be solved with the SMO (Sequential Minimal Optimization) algorithm.
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
Once we have the value of λ, we can get w from the equation below:
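The equation is an image in the deck; from the stationarity condition above it is:

w^* = \sum_{i=1}^{n} \lambda_i^*\, y_i\, x_i,

where only the support vectors have λᵢ* > 0, so the sum effectively runs over them.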
Linearly Separable Data Points
Hard Margin Classification
Mathematical Interpretation of Optimal Hyperplane
Using the values of w and λ, we then calculate b as follows:
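The deck's equation is again an image; the standard recovery uses any support vector x_s (one with λ_s* > 0), for which the margin constraint holds with equality:

y_s \left( {w^*}^\top x_s + b^* \right) = 1 \;\Rightarrow\; b^* = y_s - {w^*}^\top x_s,

since y_s ∈ {+1, -1} implies 1/y_s = y_s. In practice b* is often averaged over all support vectors for numerical stability.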
Linearly Separable Data Points
Soft Margin Classification
The reason behind it?
Hard margin ⇔ perfect separation ⇔ overfitting.
The goal is to allow the model to make a few mistakes while classifying the points.
Linearly Separable Data Points
Soft Margin Classification
Mathematical Interpretation of Optimal Hyperplane
Add a slack variable ξ (xi) for each data point as a penalty for violating the margin: ξ = 0 means the point is correctly classified and outside the margin, 0 < ξ ≤ 1 means it lies inside the margin but on the correct side, and ξ > 1 means it is misclassified.
Linearly Separable Data Points
Soft Margin Classification
Mathematical Interpretation of Optimal Hyperplane
The slack of every point should be as small as possible; the total slack is regularized by the hyperparameter C (a small sketch of the trade-off follows the note below).
❏ If C = 0, slack is not penalized at all, so the classifier can place the hyperplane anywhere and accept arbitrarily large misclassifications; the resulting decision boundary underfits.
❏ If C is infinitely high, even small slacks are heavily penalized and the classifier cannot afford to misclassify any point, so it overfits. Choosing C well is therefore important.
In machine learning, a hyperparameter is a parameter whose value is used to control the learning process.
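A minimal sketch of this trade-off, assuming scikit-learn's SVC on synthetic data (the dataset and parameter values are illustrative; note that scikit-learn requires C > 0, so the slide's C = 0 is best read as the limit of very small C):

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two overlapping blobs: not perfectly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 1000.0):
    # Small C: slack is cheap -> wide margin, risk of underfitting.
    # Large C: slack is expensive -> narrow margin, risk of overfitting.
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    print(f"C={C}: train={clf.score(X_train, y_train):.2f}, "
          f"test={clf.score(X_test, y_test):.2f}")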
Linearly Separable Data Points
Soft Margin Classification
Mathematical Interpretation of Optimal Hyperplane
The new formulation becomes:
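The formulation on this slide is an image; the standard soft-margin primal it refers to is:

\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.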
Linearly Separable Data Points
Soft Margin Classification
Mathematical Interpretation of Optimal Hyperplane
A primal, gradient-based optimization method uses gradient descent to update the parameters of the classifier. The optimization algorithm becomes:
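A minimal sketch of such a primal update, assuming the usual hinge-loss objective ½‖w‖² + C Σᵢ max(0, 1 - yᵢ(w⊺xᵢ + b)) and plain (sub)gradient descent; this is an illustration, not necessarily the deck's exact algorithm:

import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.001, epochs=1000):
    """Subgradient descent on the soft-margin primal objective."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                 # points violating the margin
        yv = y[viol][:, None]              # labels of violators, as a column
        # Subgradient of 1/2 ||w||^2 + C * sum of hinge losses:
        grad_w = w - C * (yv * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage with two separable points; labels must be +1 / -1.
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([1.0, -1.0])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # expected: [ 1. -1.]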
Non-Linearly Separable Data Points
Kernel Trick
SVM Visualization
[Figure: SVM visualization.]
Non-Linearly Separable Data Points
Kernel Trick
[Diagram: sizes of the objects involved.]
❏ Original data: N × F matrix (N points, F features)
❏ Original inner products: N² matrix
❏ Transformed data: N × f matrix (f = dimension of the transformed space, typically much larger than F)
❏ Transformed inner products: N² matrix
? The question the kernel trick answers: can we obtain the N² transformed inner products directly from the original data, without ever materializing the N × f transformed data?
Non-Linearly Separable Data Points
Kernel Trick
Mercer's Theorem
According to Mercer's theorem, if a function K(a, b) respects a few mathematical conditions called Mercer's conditions (e.g., K must be continuous and symmetric in its arguments so that K(a, b) = K(b, a), etc.), then there exists a function ϕ that maps a and b into another space (possibly with much higher dimensions) such that K(a, b) = ϕ(a)⊺ϕ(b). You can use K as a kernel because you know ϕ exists, even if you don't know what ϕ is.
Non-Linearly Separable Data Points
Kernel Trick
Mathematical Interpretation of Optimal Hyperplane
The optimization problem becomes:
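The slide's equation is an image; the usual kernelized dual replaces every inner product xᵢ⊺xⱼ with K(xᵢ, xⱼ) (with the soft margin, the multipliers are additionally box-constrained by C):

\max_{\lambda}\ \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j\, y_i y_j\, K(x_i, x_j)
\quad \text{subject to} \quad 0 \le \lambda_i \le C, \quad \sum_{i=1}^{n} \lambda_i y_i = 0.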
Non-Linearly Separable Data Points
Kernel Trick
Most commonly used kernels (their usual formulas are sketched below):
❏ Linear kernel
❏ Polynomial kernel function
❏ Gaussian function
❏ Gaussian radial basis function (RBF)
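The kernels appear as images in the deck; their standard forms, with hyperparameters γ, r, d, and σ, are:

\text{Linear:} \quad K(a, b) = a^\top b
\text{Polynomial:} \quad K(a, b) = (\gamma\, a^\top b + r)^d
\text{Gaussian:} \quad K(a, b) = \exp\!\left( -\frac{\lVert a - b \rVert^2}{2\sigma^2} \right)
\text{Gaussian RBF:} \quad K(a, b) = \exp\!\left( -\gamma\, \lVert a - b \rVert^2 \right)

The last two are the same kernel under the reparameterization γ = 1/(2σ²).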
SVM Regression
Concept
To use SVMs for regression instead of classification, the trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by a hyperparameter, ϵ.
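A minimal sketch of this idea, assuming scikit-learn's SVR on synthetic data (the dataset and values are illustrative):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# A larger epsilon widens the street, so more points fit inside it
# and fewer points end up as support vectors (on or outside the street).
for eps in (0.05, 0.5):
    reg = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(reg.support_)} support vectors")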