Introduction to logistic regression

1. Introduction to Machine Learning Logistic Regression and More on Optimization Andres Mendez-Vazquez September 3, 2017 1 / 98

2. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 2 / 98

4. Images/cinvestav Assume the following Let Y1, Y2, ..., YN independent random variables Taking values in the set {0, 1} Now, you have a set of ﬁxed vectors x1, x2, ..., xN Mapped to a series of numbers by a weight vector w wT x1, wT x2, ..., wT xN 4 / 98

7. Images/cinvestav In our simplest from There is a suspected relation Between θi = P (Yi = 1) and wT xi Thus we have y = 1 wT x + e > 0 0 else Note: Where e is an error with a certain distribution!!! 5 / 98

8. Images/cinvestav In our simplest from There is a suspected relation Between θi = P (Yi = 1) and wT xi Thus we have y = 1 wT x + e > 0 0 else Note: Where e is an error with a certain distribution!!! 5 / 98

9. Images/cinvestav For Example, Graphically We have 6 / 98

10. Images/cinvestav It is better to user a logit version We have 7 / 98

11. Images/cinvestav Logit Distribution PDF with support z ∈ (−∞, ∞) ,µ location and s scale p (x|µ, s) = exp −z−µ s s 1 + exp −z−µ s 2 With a CDF P (Y < z) = z −∞ p (y|µ, s) dy = 1 1 + exp −z−µ s 8 / 98

12. Images/cinvestav Logit Distribution PDF with support z ∈ (−∞, ∞) ,µ location and s scale p (x|µ, s) = exp −z−µ s s 1 + exp −z−µ s 2 With a CDF P (Y < z) = z −∞ p (y|µ, s) dy = 1 1 + exp −z−µ s 8 / 98

14. Images/cinvestav In Bayesian Classiﬁcation Assignment of a pattern It is performed by using the posterior probabilities, P (ωi|x) And given K classes, we want K i=1 P (ωi|x) = 1 Such that each 0 ≤ P (ωi|x) ≤ 1 10 / 98

17. Images/cinvestav Observation This is a typical example of the discriminative approach Where the distribution of data is of no interest. 11 / 98

22. Images/cinvestav Further We have The model is speciﬁed in terms of K − 1 log-odds or logit transformations. And Although the model uses the last class as the denominator in the odds-ratios. The choice of denominator is arbitrary Because the estimates are equivariant under this choice. 14 / 98

28. Images/cinvestav Basically We have P (ωK|x) = 1 1 + K−1 l=1 exp wT l x Then P (ωi|x) 1 1+ K−1 l=1 exp{wT l x} = exp wT l x For i = 1, 2, ..., k − 1 P (ωi|x) = exp wT i x 1 + K−1 l=1 exp wT l x 16 / 98

31. Images/cinvestav Additionally For K P (ωK|x) = 1 1 + K−1 l=1 exp wT l x Easy to see They sum to one. 17 / 98

32. Images/cinvestav Additionally For K P (ωK|x) = 1 1 + K−1 l=1 exp wT l x Easy to see They sum to one. 17 / 98

33. Images/cinvestav A Note in Notation Given all these parameters, we summarized them Θ = {w1, w2, ..., wK−1} Therefore P (ωl|X = x) = pl (x|Θ) 18 / 98

34. Images/cinvestav A Note in Notation Given all these parameters, we summarized them Θ = {w1, w2, ..., wK−1} Therefore P (ωl|X = x) = pl (x|Θ) 18 / 98

36. Images/cinvestav In the two class case We have P1 (ω1|x) = exp wT x 1 + exp {wT x} P2 (ω2|x) = 1 1 + exp {wT x} A similar model P1 (ω1|x) = exp −wT x 1 + exp {−wT x} P2 (ω2|x) = 1 1 + exp {−wT x} 20 / 98

37. Images/cinvestav In the two class case We have P1 (ω1|x) = exp wT x 1 + exp {wT x} P2 (ω2|x) = 1 1 + exp {wT x} A similar model P1 (ω1|x) = exp −wT x 1 + exp {−wT x} P2 (ω2|x) = 1 1 + exp {−wT x} 20 / 98

39. Images/cinvestav We have the following split Using f (x) = wT x 22 / 98

40. Images/cinvestav Then, we have the mapping to We have Class 1 Class 2 23 / 98

42. Images/cinvestav A Classic application of Maximum Likelihood Given a sequence of smaples iid x1, x2, ..., xN We have the following pdf p (x1, x2, ..., xN |Θ) = N i=1 p (xi|Θ) 25 / 98

43. Images/cinvestav A Classic application of Maximum Likelihood Given a sequence of smaples iid x1, x2, ..., xN We have the following pdf p (x1, x2, ..., xN |Θ) = N i=1 p (xi|Θ) 25 / 98

44. Images/cinvestav P (G|X) Distribution P (G|X) completeley specify the conditional distribution We have a multinomial distribution which under the log-likelihood of N observations: L (Θ) = log p (x1, x2, ..., xN |Θ) = log N i=1 pgi (xi|θ) = N i=1 log pgi (xi|θ) Where gi represent the class that xi belongs. gi = 1 if xi ∈ Class 1 2 if xi ∈ Class 2 26 / 98

49. Images/cinvestav How do we integrate this into a Cost Function? Clearly, we have two distributions We need to represent the distributions into the functions pgi (xi|θ). Why not to have all the distributions into this function pgi (xi|θ) = K−1 l=1 p (xi|wl)I{xi∈ωl} It is easy with the two classes Given that we have a binary situation!!! 27 / 98

53. Images/cinvestav Given the two case We have then a Bernoulli distribution p1 (xi|w) =   exp wT x 1 + exp {wT x}   yi p2 (xi|w) = 1 1 + exp {wT x} 1−yi With yi = 1 if xi ∈ Class 1 yi = 0 if xi ∈ Class 2 29 / 98

54. Images/cinvestav Given the two case We have then a Bernoulli distribution p1 (xi|w) =   exp wT x 1 + exp {wT x}   yi p2 (xi|w) = 1 1 + exp {wT x} 1−yi With yi = 1 if xi ∈ Class 1 yi = 0 if xi ∈ Class 2 29 / 98

56. Images/cinvestav We have the following Cost Function L (w) = N i=1 {yi log p1 (xi|w) + (1 − yi) log (1 − p2 (xi|w))} After some reductions L (w) = N i=1 yiwT xi − log 1 + exp wT xi 31 / 98

57. Images/cinvestav We have the following Cost Function L (w) = N i=1 {yi log p1 (xi|w) + (1 − yi) log (1 − p2 (xi|w))} After some reductions L (w) = N i=1 yiwT xi − log 1 + exp wT xi 31 / 98

58. Images/cinvestav Now, we derive and set it to zero We have ∂L (w) ∂w = N i=1 xi yi − exp wT xi 1 + exp {wT xi} = 0 Which are d + 1 equations nonlinear N i=1 xi (yi − p (xi|w)) =                N i=1 1 × yi − exp{wT xi} 1+exp{wT xi} N i=1 xi 1 yi − exp{wT xi} 1+exp{wT xi} N i=1 xi 2 yi − exp{wT xi} 1+exp{wT xi} ... N i=1 xi d yi − exp{wT xi} 1+exp{wT xi}                = 0 It is know as a scoring function. 32 / 98

59. Images/cinvestav Now, we derive and set it to zero We have ∂L (w) ∂w = N i=1 xi yi − exp wT xi 1 + exp {wT xi} = 0 Which are d + 1 equations nonlinear N i=1 xi (yi − p (xi|w)) =                N i=1 1 × yi − exp{wT xi} 1+exp{wT xi} N i=1 xi 1 yi − exp{wT xi} 1+exp{wT xi} N i=1 xi 2 yi − exp{wT xi} 1+exp{wT xi} ... N i=1 xi d yi − exp{wT xi} 1+exp{wT xi}                = 0 It is know as a scoring function. 32 / 98

60. Images/cinvestav Finally In other words you 1 d + 1 nonlinear equations in w. 2 For example, from the ﬁrst equation: N i=1 yi = N i=1 p (xi|w) The expected number of class ones matches the observed number. And hence also class twos. 33 / 98

65. Images/cinvestav To solve the previous equations We use the Newton-Raphson Method It comes from the ﬁrst Taylor Approximation f (x + h) ≈ f (x) + hf (x) Thus we have for a root r of function f We have 1 Assume a good estimate of r, x0 2 Thus we have r = x0 + h 3 Or h = r − x0 35 / 98

70. Images/cinvestav We have then From Taylor 0 = f (r) = f (x0 + h) ≈ f (x0) + hf (x0) Thus, as long f (x0) is not close to 0 h ≈ − f (x0) f (x0) Thus r = x0 + h ≈ x0 − f (x0) f (x0) 36 / 98

73. Images/cinvestav We have our ﬁnal improving We have x1 ≈ x0 − f (x0) f (x0) 37 / 98

74. Images/cinvestav Then, on the scoring function For this, we need the Hessian of the function ∂2L (w) ∂w∂wT = − N i=1 xixT i   exp wT xi 1 + exp {wT xi}    1 − exp wT xi 1 + exp {wT xi}   Thus, we have at a starting point wold wnew = wold − ∂L (w) ∂w∂wT −1 ∂L (w) ∂w 38 / 98

75. Images/cinvestav Then, on the scoring function For this, we need the Hessian of the function ∂2L (w) ∂w∂wT = − N i=1 xixT i   exp wT xi 1 + exp {wT xi}    1 − exp wT xi 1 + exp {wT xi}   Thus, we have at a starting point wold wnew = wold − ∂L (w) ∂w∂wT −1 ∂L (w) ∂w 38 / 98

77. Images/cinvestav We can rewrite all as matrix notations Assume 1 Let y denotes the vector of yi 2 X is the data matrix N × (d + 1) 3 p the vector of ﬁtted probabilities with the ith element p xi|wold 4 W a N × N diagonal matrix of weights with the ith diagonal element p xi|wold 1 − p xi|wold 40 / 98

81. Images/cinvestav Then, we have For each updating term ∂L (w) ∂w = XT (y − p) ∂L (w) ∂w∂wT = −XT WX 41 / 98

82. Images/cinvestav Then, we have Then, the Newton Step wnew = wold + XT WX −1 XT (y − p) = Iwold + XT WX −1 XT I (y − p) = XT WX −1 XT WXwold + XT WX −1 XT WW−1 (y − p) = XT WX −1 XT W Xwold + W−1 (y − p) = XT WX −1 XT Wz 42 / 98

87. Images/cinvestav Then We have Re-expressed the Newton step as a weighted least squares step. With a the adjusted response as z = Xwold + W−1 (y − p) 43 / 98

88. Images/cinvestav Then We have Re-expressed the Newton step as a weighted least squares step. With a the adjusted response as z = Xwold + W−1 (y − p) 43 / 98

89. Images/cinvestav This New Algorithm It is know as Iteratively Re-weighted Least Squares or IRLS After all at each iteration, it solves A weighted Least Square Problem wnew ← arg min w (z − Xw)T W (z − Xw) 44 / 98

90. Images/cinvestav This New Algorithm It is know as Iteratively Re-weighted Least Squares or IRLS After all at each iteration, it solves A weighted Least Square Problem wnew ← arg min w (z − Xw)T W (z − Xw) 44 / 98

91. Images/cinvestav Observations Good Starting Point w = 0 However, convergence is never guaranteed!!! However Typically the algorithm does converge, since the log-likelihood is concave. But overshooting can occur. 45 / 98

92. Images/cinvestav Observations Good Starting Point w = 0 However, convergence is never guaranteed!!! However Typically the algorithm does converge, since the log-likelihood is concave. But overshooting can occur. 45 / 98

93. Images/cinvestav Observations L (θj) = log n j=1 p (xj|θj) Halving Solve the Problem Perfect!!! 46 / 98

94. Images/cinvestav Observations L (θj) = log n j=1 p (xj|θj) Halving Solve the Problem Perfect!!! 46 / 98

96. Images/cinvestav We can decompose the matrix Given A = XT WX and Y = XT Wz You have Ax = Y Thus you need to obtain x = A−1 Y 48 / 98

97. Images/cinvestav We can decompose the matrix Given A = XT WX and Y = XT Wz You have Ax = Y Thus you need to obtain x = A−1 Y 48 / 98

98. Images/cinvestav This can be seen as a system of linear equations As you can see We start with a set of linear equations with d + 1 unknowns: x1, x2, ..., xd+1    a11x1 + a12x2 + ... + a1d+1xd+1 = y1 a21x1 + a22x2 + ... + a2d+1xd+1 = y2 ... ... ad+11x1 + ad+12x2 + ... + ad+1d+1xn = yd+1 Thus A set of values for x1, x2, ..., xn that satisfy all of the equations simultaneously is said to be a solution to these equations. Note it is easier to solve the system Ax = Y How? 49 / 98

102. Images/cinvestav What is the Cholesky Decomposition? It is a method that factorize a matrix A ∈ Cd+1×d+1 is a positive deﬁnite Hermitian matrix Positive deﬁnite matrix xT Ax > 0 for all x ∈ Cd+1×d+1 Hermitian matrix A = AT 51 / 98

105. Images/cinvestav Therefore Cholesky decomposes A into lower or upper triangular matrix and their conjugate transpose A = LLT A = RT R In our particular case Given that XT WX ∈ Rd+1×d+1 with XT WX = XT WX T , we have that XT WX = LLT Thus, we can use the Cholensky decomposition 1 Cholesky decomposition is of order O d3 and requires 1 6d3 FLOP operations. 52 / 98

109. Images/cinvestav We have The matrices A = XT WX ∈ Rd+1×d+1 and X = A−1 AX = I From Cholensky RT RX = I If we deﬁne RX = B RT B = I 54 / 98

112. Images/cinvestav Now If B = RT −1 = L−1 for L = RT 1 We note that the inverse of the lower triangular matrix L is lower triangular. 2 The diagonal entries of L−1 are the reciprocal of diagonal entries of L      a1,1 0 · · · 0 a2,1 a2,2 0 ... ... ... ... 0 ad+1,1 ad+1,2 · · · ad+1,d+1           b1,1 0 · · · 0 b2,1 b2,2 0 ... ... ... ... 0 bd+1,1 bd+1,2 · · · bd+1,d+1      =      1 0 · · · 0 0 1 0 ... ... 0 ... 0 0 · · · 0 1      55 / 98

113. Images/cinvestav Now Now, we construct the following matrix S si,j =    1 li,i if i = j 0 otherwise 56 / 98

114. Images/cinvestav Now, we have The matrix S is the correct solution to upper diagonal element of the matrix B i.e. sij = bij for i ≤ j ≤ d + 1 Then, we use backward substitution to solve xi,j at equation Rxi = si Assuming: X = [x1, x2, ..., xd+1] S = [s1, s2, ..., sd+1] 57 / 98

115. Images/cinvestav Now, we have The matrix S is the correct solution to upper diagonal element of the matrix B i.e. sij = bij for i ≤ j ≤ d + 1 Then, we use backward substitution to solve xi,j at equation Rxi = si Assuming: X = [x1, x2, ..., xd+1] S = [s1, s2, ..., sd+1] 57 / 98

116. Images/cinvestav Back Substitution Back substitution Since R is upper-triangular, we can rewrite the system Rxi = si as r1,1x1,i + r1,2x2,i + ... + r1,d−1xd−1,i + r1,dxd,i + r1,d+1xd+1,i = s1,i r2,2x2 + ... + r2,d−1xd−1,i + r2,dxd,i + r2,d+1xd+1,i = s2,i ... rd−1,d−1xd−1,i + rd−1,dxd,i + rd−1,d+1xd+1,i = sd−1,i rd,dxd,i + rd,d+1xd,i = sd,i rd+1,d+1xd+1,i = sd+1,i 58 / 98

117. Images/cinvestav Then We solve only for xij such that We have i < j ≤ N (Upper triangle elements). xji = xij In our case the same value given that we live on the reals. 59 / 98

118. Images/cinvestav Then We solve only for xij such that We have i < j ≤ N (Upper triangle elements). xji = xij In our case the same value given that we live on the reals. 59 / 98

119. Images/cinvestav Complexity Equation solving requires 1 3(d + 1)3 multiply operations. The total number of multiply operations for matrix inverse Including Cholesky decomposition is 1 2 (d + 1)3 Therefore We have complexity O d3 !!! Per iteration!!! 60 / 98

123. Images/cinvestav However As Good Computer Scientists We want to obtain a better complexity as O n2 !!! We can obtain such improvements Using Quasi Newton Methods Let’s us to develop the solution For the most popular one Broyden-Fletcher-Goldfarb-Shanno (BFGS) method 62 / 98

127. Images/cinvestav The Second Order Approximation We have f (x) ≈ f (xk) + f (xk) (x − xk) + 1 2 (x − xk)T Hf (xk) (x − xk) We develop a new equation based in the previous idea by using x = xk + p m (xk, p) = f (xk) + f (xk) p + 1 2 pT Bkp Here Bk is an d + 1 × d + 1 symmetric positive deﬁnite matrix that will be updated through the entire process 64 / 98

130. Images/cinvestav Therefore We have m (xk, 0) = f (xk) ∂m (xk, 0) ∂p = f (xk) + Bk0 = f (xk) Therefore, the minimizer of this function can be used as a search direction pk = −B−1 k f (xk) We have the following iteration xk+1 = xk + αkpk Where pk is a descent direction. 65 / 98

133. Images/cinvestav Descent Direction Deﬁnition For a given point x ∈ Rd+1 a direction p is called a descent direction, if there exists α > 0, α ∈ R such that f (x + αp) < f (xk) Lemma Consider a point x ∈ Rd+1. Any direction p satisfying d, f (x) < 0 is a descent direction From this −B−1 k f (xk) T f (xk) = − f (xk)T B−1 k T f (xk) < 0 66 / 98

137. Images/cinvestav Where αk is chosen to satisfy the Wolfe condition f (xk + αkpk) ≤ f (xk) + c1αk f (xk)T pk f (xk + αkpk)T pk ≥ c2 f (xk)T pk with 0 < c1 < c2 < 1 68 / 98

138. Images/cinvestav Graphically f (xk + αkpk) ≤ f (xk) + c1αk f (xk)T pk (Armijo Condition) acceptable acceptable Note In practice, c1 is is chosen to be quite small, c1 = 10−4. 69 / 98

139. Images/cinvestav However, Using That Condition alone gets you too small α s to progress fast Therefore, f (xk + αkpk)T pk ≥ c2 f (xk)T pk This avoid small steps of αk Note the following The left hand side is the derivative dφ(αk) dαk i.e. the condition ensures that the slope of φ (αk) is greater than dφ(0) dαk 70 / 98

140. Images/cinvestav However, Using That Condition alone gets you too small α s to progress fast Therefore, f (xk + αkpk)T pk ≥ c2 f (xk)T pk This avoid small steps of αk Note the following The left hand side is the derivative dφ(αk) dαk i.e. the condition ensures that the slope of φ (αk) is greater than dφ(0) dαk 70 / 98

141. Images/cinvestav Meaning if the slope φ (αk) is strongly negative We can reduce f signiﬁcantly by moving further along the chosen direction. if the slope is only slightly negative or even positive We cannot expect much more decrease in f in this direction. It might be good to terminate the linear search. 71 / 98

142. Images/cinvestav Meaning if the slope φ (αk) is strongly negative We can reduce f signiﬁcantly by moving further along the chosen direction. if the slope is only slightly negative or even positive We cannot expect much more decrease in f in this direction. It might be good to terminate the linear search. 71 / 98

143. Images/cinvestav Graphically We have acceptable acceptable Minimal Tangent Slope 72 / 98

145. Images/cinvestav Now, Instead of computing Bk at each iteration Instead Davidon proposed to update it in a simple manner to account for the curvature measured during the most recent step. Imagine we generated xk+1 and wish to generate a new quadratic model m (xk+1, p) = f (xk+1) + f (xk+1) p + 1 2 pT Bk+1p What requirements should we impose on Bk+1? One reasonable requirement is that the gradient of m should match the gradient of the objective function f at the latest two iterates xk and xk+1. 74 / 98

148. Images/cinvestav Now Since m (xk+1, 0) is precisely f (xk+1) The second of these conditions is satisﬁed automatically. The ﬁrst condition can be written mathematically m (xk+1, −αkpk) = f (xk+1) − αkBk+1p = f (xk) We have αkBk+1p = f (xk+1) − f (xk) 75 / 98

151. Images/cinvestav Then, we can simplify the notation For this sk = xk+1 − xk yk = f (xk+1) − f (xk) We have the secant equation Bk+1sk = yk Given that Bk+1 is symmetric positive, the displacement sk and the change in gradient yk sT k Bk+1sk = sT k yk > 0 Making possible to Bk+1 to map sk into yk!!! 76 / 98

154. Images/cinvestav Interpretation If our update satisﬁes the secant condition Bk+1 (xk+1 − xk) = f (xk+1) − f (xk) This implies that for the second-order approximation at xk+1 of our target function f fapprox (xk+1) = f (xk) + f (xk)T (xk+1 − xk) + 1 2 (xk+1 − xk)T Bk+1 (xk+1 − xk) 77 / 98

155. Images/cinvestav Interpretation If our update satisﬁes the secant condition Bk+1 (xk+1 − xk) = f (xk+1) − f (xk) This implies that for the second-order approximation at xk+1 of our target function f fapprox (xk+1) = f (xk) + f (xk)T (xk+1 − xk) + 1 2 (xk+1 − xk)T Bk+1 (xk+1 − xk) 77 / 98

156. Images/cinvestav Then Getting the gradient over xk+1 − xk fapprox (xk+1) = f (xk) + Bk+1 (xk+1 − xk) = f (xk) + f (xk+1) − f (xk) = f (xk+1) 78 / 98

159. Images/cinvestav Now, We have two cases When f is strongly convex The inequality will be satisﬁed for any two points xk and xk+1 This condition will not always hold for non-convex functions We need to enforce sT k yk > 0 explicitly The previous condition hold on the The Wolfe or Strong Wolfe condition. 79 / 98

162. Images/cinvestav How? We have the second part, f (xk + αkpk)T pk ≥ c2 f (xk)T pk By remembering xk+1 = xk + αkpk. f (xk+1)T sk ≥ c2 f (xk)T sk ⇒ yT k sk = [ f (xk+1) − f (xk)]T sk Then yT k sk = f (xk+1)T sk − f (xk)T sk ≥ c2 f (xk)T [αkpk] − f (xk)T [αkpk] = (c2 − 1) αk f (xk)T pk 80 / 98

167. Images/cinvestav Then Given that c2 − 1 < 0 and αk > 0 (c2 − 1) αk f (xk)T pk = (c2 − 1) αk f (xk)T −B−1 k f (xk) > 0 Thus, the curvature condition holds!!! Then, the secant equation always has a solution under the curvature conditions. In fact, it admits an infinite number if solution, Given Since there are d(d+1) 2 degrees in a symmetric matrix. But under constrains you have 2d constraint imposed by the secant equation and positive definitives. Leaving d(d+1) 2 − 2d = d2+d−4d d = d2−3d d degrees of freedom... enough for infinitely many solutions. 81 / 98

170. Images/cinvestav Given that we want a unique solution We could solve the following problem min B B − Bk s.t. B = BT Bsk = yk Many matrix norms give rise to diﬀerent Quasi-Newton Methos We use the weighted Frobenius norm (Scale invariant), deﬁned as A W = W 1 2 AW 1 2 F Where A F = d+1 i=2 d+1 j=1 a2 ij 82 / 98

173. Images/cinvestav In our case, we choose We have the average Hessian Gk = 1 0 2 f (xk + ταkpk) dτ The property yk = Gkαkpk = Gksk It follows from Taylor’s Theorem!!! 83 / 98

174. Images/cinvestav In our case, we choose We have the average Hessian Gk = 1 0 2 f (xk + ταkpk) dτ The property yk = Gkαkpk = Gksk It follows from Taylor’s Theorem!!! 83 / 98

175. Images/cinvestav For the BFGS In the previous setup we will get Bk, then the inverse update of it Hk = B−1 k In BFGS we go directly for the inverse by setting up: min H H − Hk s.t. H = HT Hyk = sk A unique solution will be Hk+1 = I − ρkskyT k Hk I − ρkyksT k + ρsksT k (1) where ρk = 1 yT k sk 84 / 98

176. Images/cinvestav For the BFGS In the previous setup we will get Bk, then the inverse update of it Hk = B−1 k In BFGS we go directly for the inverse by setting up: min H H − Hk s.t. H = HT Hyk = sk A unique solution will be Hk+1 = I − ρkskyT k Hk I − ρkyksT k + ρsksT k (1) where ρk = 1 yT k sk 84 / 98

177. Images/cinvestav Problem There is no magic formula to find an initial H0 We can use specific information about the problem: For instance by setting it to the inverse of an approximate Hessian calculated by finite differences at x0 In our case, we have ∂L(w) ∂w∂wT , or in matrix format XT WX, we could get initial setup 85 / 98

181. Images/cinvestav Algorithm (BFGS Method) Quasi-Newton Algorithm Starting point x0, Convergence tolerance e, Inverse Hessian approximation H0 1 k ← 0 2 while f (xk+1) > e 3 Compute search direction pk = −Hk f (xk+1) 4 Set xk+1 = xk + αkpkwhere αk is obtained from a linear search (Under Wolfe conditions). 5 Deﬁne sk = xk+1 − xk and yk = f (xk+1) − f (xk) 6 Compute Hk+1 by means of (Eq. 1) 7 k ← k + 1 87 / 98

189. Images/cinvestav Complexity The Cost of update or inverse update is O d2 operations per iteration. For More Nocedal, Jorge & Wright, Stephen J. (1999). Numerical Optimization. Springer-Verlag. 88 / 98

190. Images/cinvestav Complexity The Cost of update or inverse update is O d2 operations per iteration. For More Nocedal, Jorge & Wright, Stephen J. (1999). Numerical Optimization. Springer-Verlag. 88 / 98

192. Images/cinvestav Given the following Because the likelihood is convex, 90 / 98

193. Images/cinvestav Caution Here, we change the labeling to yi = ±1 with p (yi = ±1|x, w) = σ ywT x = 1 1 + exp {−ywT x} Thus, we have the following log likelihood under regularization λ > 0 L (w) = − N i=1 log 1 + exp −yiwT xi − λ 2 wT w It is possible to get a Gradient Descent wl (w) = N i=1 1 − 1 1 + exp {−yiwT xi} yixi − λw 91 / 98

196. Images/cinvestav Danger Will Robinson!!! Gradient descent using resembles the Perceptron learning algorithm Problem!!! It will always converge for a suitable step size, regardless of whether the classes are separable!!! 92 / 98

198. Images/cinvestav Here We will simplify our work By stating the algorithm for coordinate ascent Then A more precise version will be given 94 / 98

199. Images/cinvestav Here We will simplify our work By stating the algorithm for coordinate ascent Then A more precise version will be given 94 / 98

200. Images/cinvestav Coordinate Ascent Algorithm Input Max, an initial w0 1 counter ← 0 2 while counter < Max 3 for i ← 1, ..., d 4 Randomly pick i 5 Compute a step size δ∗ by approximately maximize arg minδ f (x + δei) 6 xi ← xi + δ∗ Where ei = 0 · · · 0 1 ← i 0 · · · 0 T 95 / 98

201. Images/cinvestav Coordinate Ascent Algorithm Input Max, an initial w0 1 counter ← 0 2 while counter < Max 3 for i ← 1, ..., d 4 Randomly pick i 5 Compute a step size δ∗ by approximately maximize arg minδ f (x + δei) 6 xi ← xi + δ∗ Where ei = 0 · · · 0 1 ← i 0 · · · 0 T 95 / 98

202. Images/cinvestav In the case of Logistic Regression Thus, we can optimize each wk alternatively by a coordinate-wise Newton update wnew k = wold k + −λwold k + N i=1 1 − 1 1+exp{−yiwT xi} yixik λ + N i=1 x2 ik 1 1+exp{−yiwT xi} 1 − 1 1+exp{−yiwT xi} Complexity of this update Item Complexity N i=1 1 − 1 1+exp{−yiwT xi} yixik O (N) N i=1 x2 ik 1 1+exp{−yiwT xi} 1 − 1 1+exp{−yiwT xi} O (N) Total Complexity O (N) For all the dimensions O (Nd) 96 / 98

203. Images/cinvestav In the case of Logistic Regression Thus, we can optimize each wk alternatively by a coordinate-wise Newton update wnew k = wold k + −λwold k + N i=1 1 − 1 1+exp{−yiwT xi} yixik λ + N i=1 x2 ik 1 1+exp{−yiwT xi} 1 − 1 1+exp{−yiwT xi} Complexity of this update Item Complexity N i=1 1 − 1 1+exp{−yiwT xi} yixik O (N) N i=1 x2 ik 1 1+exp{−yiwT xi} 1 − 1 1+exp{−yiwT xi} O (N) Total Complexity O (N) For all the dimensions O (Nd) 96 / 98

205. Images/cinvestav We have the following Complexities per iteration Complexities Method Per Iteration Convergence Rate Cholesky Decomposition d3 2 = O d3 Quadratic Quasi-Newton BFGS O d2 Superlinearly Coordinate Ascent O (Nd) Not established 98 / 98

Introduction to logistic regression

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Introduction to logistic regression

Ähnlich wie Introduction to logistic regression (20)

Mehr von Andres Mendez-Vazquez

Mehr von Andres Mendez-Vazquez (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Introduction to logistic regression