# A Simple Review on SVM


### A Simple Review on SVM

1. A Simple Review on SVM. Honglin Yu, Australian National University, NICTA. September 2, 2013.
2. Outline
   1. The Tutorial Routine: Overview; Linear SVC in the Separable Case: Largest Margin Classifier; Soft Margin; Solving SVM; Kernel Trick and Non-linear SVM
   2. Some Topics: Why the Name "Support Vectors"?; Why SVC Works Well: A Simple Example; Relation with Logistic Regression etc.
   3. Packages
3. Overview
   SVMs (Support Vector Machines) are supervised learning methods. They include methods for both classification and regression. In this talk, we focus on binary classification.
4. Symbols
   - training data: $(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}$
   - patterns: $x_i$, $i = 1, 2, \ldots, m$
   - pattern space: $\mathcal{X}$
   - targets: $y_i$, $i = 1, 2, \ldots, m$
   - features: $\mathbf{x}_i = \Phi(x_i)$
   - feature space: $\mathcal{H}$
   - feature mapping: $\Phi : \mathcal{X} \to \mathcal{H}$
5. Separable Case: Largest Margin Classifier
   Figure: Simplest Case
   "Separable" means: $\exists$ a line $w \cdot x + b = 0$ that correctly separates all the training data.
   "Margin": $d_+ + d_-$, where $d_\pm = \min_{y_i = \pm 1} \mathrm{dist}(x_i,\ w \cdot x + b = 0)$.
   In this case, the SVC just looks for a line maximizing the margin.
6. Separable Case: Largest Margin Classifier
   Another way of expressing "separable": $y_i (w \cdot x_i + b) > 0$.
   Because the training data are finite, $\exists\, \varepsilon > 0$ such that $y_i (w \cdot x_i + b) \ge \varepsilon$.
   This is equivalent to $y_i \left( \frac{w}{\varepsilon} \cdot x_i + \frac{b}{\varepsilon} \right) \ge 1$, and $w \cdot x + b = 0$ and $\frac{w}{\varepsilon} \cdot x + \frac{b}{\varepsilon} = 0$ are the same line.
   So we can directly write the constraints as $y_i (w \cdot x_i + b) \ge 1$. This removes the scaling redundancy in $w, b$.
7. We also want the separating plane to lie in the middle (which means $d_+ = d_-$). So the optimization problem can be formulated as
   $$\arg\max_{w,b} \left( 2 \min_{x_i} \frac{|w \cdot x_i + b|}{\|w\|} \right) \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ i = 1, 2, \ldots, N \tag{1}$$
   This is equivalent to:
   $$\arg\min_{w,b} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ i = 1, 2, \ldots, N \tag{2}$$
   But so far we have only confirmed that Eq. (2) is a necessary condition for finding the plane we want (correct and in the middle).
8. Largest Margin Classifier
   It can be proved that, when the data is separable, for the following problem
   $$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ i = 1, \ldots, m \tag{3}$$
   we have:
   1. When $\|w\|$ is minimized, the equality holds for some $x_i$.
   2. The equality holds for at least some $x_i, x_j$ where $y_i y_j < 0$.
   3. Based on 1) and 2), we can calculate that the margin is $\frac{2}{\|w\|}$, so the margin is maximized.
9. Proof of Previous Slide (Warning: My Proof)
   1. If $\exists c > 1$ such that $\forall x_i$, $y_i (w \cdot x_i + b) \ge c$, then $\frac{w}{c}$ and $\frac{b}{c}$ also satisfy the constraints and the norm is smaller.
   2. If not, assume that $\exists c > 1$ with
      $$y_i (w \cdot x_i + b) \ge 1 \text{ where } y_i = 1, \qquad y_i (w \cdot x_i + b) \ge c \text{ where } y_i = -1 \tag{4}$$
      Adding $\frac{c-1}{2}$ to each side where $y_i = 1$ and subtracting $\frac{c-1}{2}$ from each side where $y_i = -1$, we get
      $$y_i \left( w \cdot x_i + b + \frac{c-1}{2} \right) \ge \frac{c+1}{2} \tag{5}$$
      Because $\frac{c+1}{2} > 1$, similarly to 1), the $\|w\|$ here is not the smallest.
   3. Pick $x_1, x_2$ where the equality holds and $y_1 y_2 < 0$; the margin is just the distance between $x_1$ and the line $y_2 (w \cdot x + b) = 1$, which can easily be calculated as $\frac{2}{\|w\|}$.
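To make the margin claim above concrete, here is a minimal sketch (not part of the original slides; the toy dataset and the very large C are my own choices) that fits a linear SVC in the near-separable regime and compares $\frac{2}{\|w\|}$ with the geometric margin measured directly from the data:

```python
# Checks that the maximal margin equals 2/||w|| on a tiny separable dataset.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 3.0], [4.0, 3.5]])  # invented points
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # huge C approximates the hard-margin (separable) case
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin_formula = 2.0 / np.linalg.norm(w)

# geometric margin: distance of the closest point of each class to the plane w.x + b = 0
dist = np.abs(X @ w + b) / np.linalg.norm(w)
margin_geometric = dist[y == 1].min() + dist[y == -1].min()

print(margin_formula, margin_geometric)   # the two numbers should (nearly) agree
```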
10. Non-Separable Case
   Figure: Non-separable case: misclassified points exist.
11. Non-Separable Case
   The constraints $y_i (w \cdot x_i + b) \ge 1$, $i = 1, 2, \ldots, m$ cannot all be satisfied.
   Solution: add slack variables $\xi_i$ and reformulate the problem as
   $$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ i = 1, 2, \ldots, m, \quad \xi_i \ge 0 \tag{6}$$
   $C$ controls the trade-off between the margin ($\frac{1}{\|w\|}$) and the penalty ($\xi_i$).
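As a rough, hand-made illustration of this trade-off (the overlapping blobs and the specific C values below are invented for the example), one can watch how the margin, the total slack, and the number of support vectors change with C:

```python
# Small C tolerates larger slacks (wider margin, more support vectors); large C penalises them.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1.0, rng.randn(50, 2) + 1.0])  # two overlapping blobs
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # the xi_i of Eq. (6)
    print(f"C={C:>6}: margin={2 / np.linalg.norm(w):.2f}, "
          f"total slack={slack.sum():.1f}, #SV={clf.support_.size}")
```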
12. Solving SVM: Lagrangian Dual
   Constrained optimization → Lagrangian dual.
   Primal form:
   $$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ i = 1, 2, \ldots, m, \quad \xi_i \ge 0 \tag{7}$$
   The primal Lagrangian:
   $$L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left\{ y_i (w \cdot x_i + b) - 1 + \xi_i \right\} - \sum_i \mu_i \xi_i$$
   Because (7) is convex, the Karush-Kuhn-Tucker conditions hold.
13. Applying KKT Conditions
   Stationarity:
   $$\frac{\partial L}{\partial w} = 0 \;\to\; w = \sum_i \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\to\; \sum_i \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = 0 \;\to\; C - \alpha_i - \mu_i = 0,\ \forall i$$
   Primal feasibility: $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\forall i$
   Dual feasibility: $\alpha_i \ge 0$, $\mu_i \ge 0$
   Complementary slackness, $\forall i$: $\mu_i \xi_i = 0$ and $\alpha_i \left\{ y_i (w \cdot x_i + b) - 1 + \xi_i \right\} = 0$
   When $\alpha_i \ne 0$, the corresponding $x_i$ are called support vectors.
14. Dual Form
   Using the equations derived from the KKT conditions, eliminate $w, b, \xi_i, \mu_i$ from the primal form to get the dual form:
   $$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \tag{8}$$
   And the decision function is $\bar{y} = \mathrm{sign}\left( \sum_i \alpha_i y_i x_i^T x + b \right)$, where $b = y_k - w \cdot x_k$ for any $k$ with $0 < \alpha_k < C$.
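The dual solution is what libsvm-based packages expose: sklearn's SVC stores $\alpha_i y_i$ for the support vectors in `dual_coef_` and $b$ in `intercept_`. The sketch below (toy data of my own) rebuilds the decision function of Eq. (8) by hand and checks it against `decision_function`:

```python
# Rebuild sign(sum_i alpha_i y_i <x_i, x> + b) from the fitted dual coefficients.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(30, 2) - 2.0, rng.randn(30, 2) + 2.0])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_times_y = clf.dual_coef_[0]     # alpha_i * y_i, one entry per support vector
sv = clf.support_vectors_             # the x_i with alpha_i > 0
b = clf.intercept_[0]

X_test = rng.randn(5, 2)
manual = (alpha_times_y @ (sv @ X_test.T)) + b   # sum_i alpha_i y_i <x_i, x> + b
print(np.allclose(manual, clf.decision_function(X_test)))   # expected: True
```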
15. We Need a Nonlinear Classifier
   Figure: A case that a linear classifier cannot handle.
   Finding an appropriate form of curve is hard, but we can transform the data!
16. Mapping Training Data to Feature Space
   $\Phi(x) = (x, x^2)^T$
   Figure: Feature Mapping Helps Classification
   To solve a nonlinear classification problem, we can define some mapping $\Phi : \mathcal{X} \to \mathcal{H}$ and do linear classification in the feature space $\mathcal{H}$.
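A tiny numerical sketch of this picture (the 1-D dataset is invented for illustration): points labelled by whether $|x| > 1$ cannot be separated by a single threshold on $x$, but after the mapping $\Phi(x) = (x, x^2)$ a linear SVC should separate them perfectly:

```python
# Linear classification before and after the feature mapping Phi(x) = (x, x^2).
import numpy as np
from sklearn.svm import SVC

x = np.linspace(-2, 2, 41)
y = np.where(np.abs(x) > 1, 1, -1)        # inner segment vs. outer segments

# linear classifier directly on x: a threshold cannot get everything right
acc_raw = SVC(kernel="linear", C=100.0).fit(x.reshape(-1, 1), y).score(x.reshape(-1, 1), y)

# linear classifier on Phi(x) = (x, x^2): the classes become linearly separable
Phi = np.column_stack([x, x ** 2])
acc_mapped = SVC(kernel="linear", C=100.0).fit(Phi, y).score(Phi, y)

print(acc_raw, acc_mapped)                # mapped accuracy should be 1.0
```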
17. Recap the Dual Form: An Important Fact
   Dual form:
   $$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \tag{9}$$
   Decision function: $\bar{y} = \mathrm{sign}\left( \sum_i \alpha_i y_i x_i^T x + b \right)$
   To train an SVC, or to use it for prediction, we only need to know the inner products between the $x$'s!
   If we want to apply a linear SVC in $\mathcal{H}$, we do NOT need to know $\Phi(x)$; we ONLY need to know $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$. $k(x, x')$ is called the "kernel function".
18. Kernel Functions
   The input of a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is two patterns $x, x'$ in $\mathcal{X}$; the output is the canonical inner product between $\Phi(x)$ and $\Phi(x')$ in $\mathcal{H}$.
   By using $k(\cdot, \cdot)$, we can implicitly transform the data by some $\Phi(\cdot)$ (whose feature space is often infinite-dimensional).
   E.g. for scalar $x$, $k(x, x') = (x x' + 1)^2$ corresponds to $\Phi(x) = (x^2, \sqrt{2}\, x, 1)^T$.
   But not for every function $\mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can we find a corresponding $\Phi$: kernel functions must satisfy Mercer's condition.
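The polynomial example above is easy to verify numerically; the following quick check (sample values arbitrary) confirms that $(x x' + 1)^2$ equals $\langle \Phi(x), \Phi(x') \rangle$ for $\Phi(x) = (x^2, \sqrt{2}\, x, 1)^T$:

```python
# k(x, x') = (x x' + 1)^2 versus the explicit feature map Phi(x) = (x^2, sqrt(2) x, 1).
import numpy as np

def k(x, xp):
    return (x * xp + 1.0) ** 2

def phi(x):
    return np.array([x ** 2, np.sqrt(2.0) * x, 1.0])

for x, xp in [(0.5, -1.3), (2.0, 3.0), (-0.7, 0.0)]:
    print(k(x, xp), phi(x) @ phi(xp))   # the two columns should match
```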
19. Conditions on Kernel Functions
   Necessity: the kernel matrix $K = [k(x_i, x_j)]_{m \times m}$ must be positive semidefinite:
   $$t^T K t = \sum_{i,j} t_i t_j\, k(x_i, x_j) = \sum_{i,j} t_i t_j \langle \Phi(x_i), \Phi(x_j) \rangle = \left\langle \sum_i t_i \Phi(x_i), \sum_j t_j \Phi(x_j) \right\rangle = \left\| \sum_i t_i \Phi(x_i) \right\|^2 \ge 0$$
   Sufficiency in continuous form (Mercer's condition): for any symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is square integrable on $\mathcal{X} \times \mathcal{X}$, if it satisfies
   $$\int_{\mathcal{X} \times \mathcal{X}} k(x, x') f(x) f(x')\, dx\, dx' \ge 0 \quad \text{for all } f \in L^2(\mathcal{X}),$$
   then there exist functions $\phi_i : \mathcal{X} \to \mathbb{R}$ and numbers $\lambda_i \ge 0$ such that $k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$ for all $x, x'$ in $\mathcal{X}$.
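The necessity part can be observed directly: the Gram matrix of a valid kernel is positive semidefinite. A short sketch (random points and an RBF kernel with an arbitrary $\gamma$ of my choosing) checks this via the eigenvalues:

```python
# Build an RBF Gram matrix and verify that it is PSD (up to numerical round-off).
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(20, 3)
gamma = 0.5

# K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

eigvals = np.linalg.eigvalsh(K)        # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)         # expected: True

# t^T K t = || sum_i t_i Phi(x_i) ||^2 >= 0 for any coefficient vector t
t = rng.randn(20)
print(t @ K @ t >= -1e-10)
```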
20. Commonly Used Kernel Functions
   - Linear kernel: $k(x, x') = x^T x'$
   - RBF kernel: $k(x, x') = e^{-\gamma \|x - x'\|^2}$ (figure from Wikipedia, $\gamma = \frac{1}{2}$)
   - Polynomial kernel: $k(x, x') = (\gamma\, x^T x' + r)^d$ (figure from Wikipedia, $\gamma = 1$, $d = 2$)
   - etc.
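These kernels are available as ready-made functions in `sklearn.metrics.pairwise`; a brief sketch (input values arbitrary) evaluates each one:

```python
# Evaluate the linear, RBF, and polynomial kernels on a couple of toy points.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel, polynomial_kernel

X = np.array([[1.0, 2.0], [0.0, -1.0]])
Y = np.array([[0.5, 0.5]])

print(linear_kernel(X, Y))                                       # x^T x'
print(rbf_kernel(X, Y, gamma=0.5))                               # exp(-gamma ||x - x'||^2)
print(polynomial_kernel(X, Y, gamma=1.0, coef0=1.0, degree=2))   # (gamma x^T x' + r)^d
```

The same kernels can be selected in SVC via the `kernel`, `gamma`, `coef0`, and `degree` parameters.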
21. Mechanical Analogy
   Recall from the KKT conditions:
   $$\frac{\partial L}{\partial w} = 0 \;\to\; w = \sum_i \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\to\; \sum_i \alpha_i y_i = 0$$
   Imagine every support vector $x_i$ exerts a force $F_i = \alpha_i y_i \frac{w}{\|w\|}$ on the "separating plane + margin". Then
   $$\sum \text{Forces} = \sum_i \alpha_i y_i \frac{w}{\|w\|} = \frac{w}{\|w\|} \sum_i \alpha_i y_i = 0$$
   $$\sum \text{Torques} = \sum_i x_i \times \left( \alpha_i y_i \frac{w}{\|w\|} \right) = \left( \sum_i \alpha_i y_i x_i \right) \times \frac{w}{\|w\|} = w \times \frac{w}{\|w\|} = 0$$
   This is why the $x_i$ are called "support vectors".
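The force-balance identity $\sum_i \alpha_i y_i = 0$ can be checked on a fitted model: sklearn's `dual_coef_` holds $\alpha_i y_i$ for the support vectors, so its sum should vanish. A minimal sketch with invented data:

```python
# Verify sum_i alpha_i y_i = 0 on a fitted SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(3)
X = np.vstack([rng.randn(40, 2) - 1.5, rng.randn(40, 2) + 1.5])
y = np.array([-1] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(np.isclose(clf.dual_coef_.sum(), 0.0, atol=1e-6))   # expected: True
```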
22. Why SVC Works Well
   Let's first consider using linear regression to do classification; the decision function is $\bar{y} = \mathrm{sign}(w \cdot x + b)$.
   Figure: Feature Mapping Helps Classification
   In SVM, we only consider the points near the boundary.
23. Min-Loss Framework
   Primal form:
   $$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ i = 1, 2, \ldots, m, \quad \xi_i \ge 0 \tag{10}$$
   Rewritten in min-loss form:
   $$\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \max\left\{0,\ 1 - y_i (w \cdot x_i + b)\right\} \tag{11}$$
   The term $\max\{0, 1 - y_i (w \cdot x_i + b)\}$ is called the hinge loss.
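A small sketch of the min-loss view (the data and the perturbation below are my own): evaluate the hinge-loss objective of Eq. (11) at a fitted $(w, b)$ and confirm that perturbing the solution does not decrease it:

```python
# Evaluate the hinge-loss objective 0.5 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(7)
X = np.vstack([rng.randn(50, 2) - 1.0, rng.randn(50, 2) + 1.0])
y = np.array([-1] * 50 + [1] * 50)
C = 1.0

clf = SVC(kernel="linear", C=C).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
objective = 0.5 * np.dot(w, w) + C * hinge.sum()
print(objective)

# a clearly perturbed (w, b) should not give a smaller objective (up to solver tolerance)
w2, b2 = w + 0.1, b - 0.1
hinge2 = np.maximum(0.0, 1.0 - y * (X @ w2 + b2))
print(objective <= 0.5 * np.dot(w2, w2) + C * hinge2.sum())   # expected: True
```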
24. See C-SVM and LMC from a Unified Direction
   Rewriting the largest margin classifier (LMC):
   $$\min_{w} \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \infty \cdot \left( \mathrm{sign}\!\left(1 - y_i (w \cdot x_i + b)\right) + 1 \right) \tag{12}$$
   Regularised logistic regression ($y \in \{0, 1\}$, not $\{-1, 1\}$; $p_i = \frac{1}{1 + e^{-w \cdot x_i}}$):
   $$\min_{w} \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} -\left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right) \tag{13}$$
25. Relation with Logistic Regression etc.
   Figure: black: 0-1 loss; red: logistic loss ($-\log \frac{1}{1 + e^{-y_i\, w \cdot x_i}}$); blue: hinge loss; green: quadratic loss.
   The 0-1 loss and the hinge loss are not affected by correctly classified outliers.
   BTW, logistic regression can also be "kernelised".
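For reference, the four losses in the figure can be written as functions of the margin $m = y \cdot f(x)$ and compared directly; the grid of margin values below is arbitrary:

```python
# The four classification losses as functions of the margin m = y * (w.x + b).
import numpy as np

m = np.linspace(-2, 2, 9)

zero_one  = (m <= 0).astype(float)            # 0-1 loss
hinge     = np.maximum(0.0, 1.0 - m)          # hinge loss (SVM)
logistic  = np.log1p(np.exp(-m))              # -log(1 / (1 + e^{-m}))
quadratic = (1.0 - m) ** 2                    # quadratic loss

# correctly classified outliers (large positive m) leave the 0-1 and hinge
# losses at zero, while the quadratic loss grows again
for name, loss in [("0-1", zero_one), ("hinge", hinge),
                   ("logistic", logistic), ("quadratic", quadratic)]:
    print(f"{name:>9}: {np.round(loss, 2)}")
```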
26. Commonly Used Packages
   libsvm (and liblinear), svmlight, and sklearn (whose SVC is a Python wrapper of libsvm).
   Code example in sklearn:

       import numpy as np
       from sklearn.svm import SVC

       X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
       y = np.array([1, 1, 2, 2])

       clf = SVC()
       clf.fit(X, y)
       clf.predict([[-0.8, -1]])
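Beyond the minimal example above, a slightly fuller usage sketch (the dataset and hyper-parameter grids are placeholders, not recommendations) with an RBF kernel and cross-validated C and γ might look like this:

```python
# RBF-kernel SVC with C and gamma chosen by a small cross-validated grid search.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1.0]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```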
27. Things Not Covered
   - Algorithms (SMO, SGD)
   - Generalisation bounds and VC dimension
   - ν-SVM, one-class SVM etc.
   - SVR
   - etc.