Introduction to Machine Learning
Logistic Regression and More on Optimization
Andres Mendez-Vazquez
September 3, 2017
Outline

1 Logistic Regression
   Introduction
   Constraints
   The Initial Model
   The Two Class Case
   Graphic Interpretation
   Fitting The Model
   The Two Class Case
   The Final Log-Likelihood
   The Newton-Raphson Algorithm
   Matrix Notation

2 More on Optimization Methods
   Using Cholesky Decomposition
   Cholesky Decomposition
   The Proposed Method
   Quasi-Newton Method
   The Second Order Approximation
   The Wolfe Condition
   Proving BFGS
   The BFGS Algorithm
   A Neat Trick: Coordinate Ascent
   Coordinate Ascent Algorithm
   Conclusion
Assume the following

Let $Y_1, Y_2, \ldots, Y_N$ be independent random variables taking values in the set $\{0, 1\}$.

Now, you have a set of fixed vectors

$$x_1, x_2, \ldots, x_N$$

mapped to a series of numbers by a weight vector $w$:

$$w^T x_1, w^T x_2, \ldots, w^T x_N$$
In our simplest form

There is a suspected relation between $\theta_i = P(Y_i = 1)$ and $w^T x_i$. Thus we have

$$y = \begin{cases} 1 & w^T x + e > 0 \\ 0 & \text{otherwise} \end{cases}$$

Note: here $e$ is an error term with a certain distribution!!!
The Logistic (Logit) Distribution

PDF with support $z \in (-\infty, \infty)$, location $\mu$, and scale $s$:

$$p(z \mid \mu, s) = \frac{\exp\left\{-\frac{z-\mu}{s}\right\}}{s \left(1 + \exp\left\{-\frac{z-\mu}{s}\right\}\right)^2}$$

With CDF

$$P(Y < z) = \int_{-\infty}^{z} p(y \mid \mu, s)\, dy = \frac{1}{1 + \exp\left\{-\frac{z-\mu}{s}\right\}}$$
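To make this concrete, here is a quick numerical check (a minimal NumPy sketch of my own, not from the original lecture): the logistic CDF above is exactly the sigmoid, and differentiating it numerically recovers the PDF.

import numpy as np

def logistic_pdf(z, mu=0.0, s=1.0):
    # p(z | mu, s) = exp{-(z - mu)/s} / (s * (1 + exp{-(z - mu)/s})^2)
    u = np.exp(-(z - mu) / s)
    return u / (s * (1.0 + u) ** 2)

def logistic_cdf(z, mu=0.0, s=1.0):
    # P(Y < z) = 1 / (1 + exp{-(z - mu)/s}), i.e. the sigmoid
    return 1.0 / (1.0 + np.exp(-(z - mu) / s))

z = np.linspace(-6, 6, 1001)
num_pdf = np.gradient(logistic_cdf(z), z)  # finite-difference derivative of the CDF
assert np.allclose(num_pdf, logistic_pdf(z), atol=1e-3)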
Constraints
In Bayesian Classification

Assignment of a pattern is performed by using the posterior probabilities $P(\omega_i \mid x)$. Given $K$ classes, we want

$$\sum_{i=1}^{K} P(\omega_i \mid x) = 1$$

such that each

$$0 \leq P(\omega_i \mid x) \leq 1$$
The Initial Model
The Model

We have the following under the extended features:

$$\log \frac{P(\omega_1 \mid x)}{P(\omega_K \mid x)} = w_1^T x$$
$$\log \frac{P(\omega_2 \mid x)}{P(\omega_K \mid x)} = w_2^T x$$
$$\vdots$$
$$\log \frac{P(\omega_{K-1} \mid x)}{P(\omega_K \mid x)} = w_{K-1}^T x$$
22. Images/cinvestav
Further
We have
The model is specified in terms of K − 1 log-odds or logit transformations.
And
Although the model uses the last class as the denominator in the
odds-ratios.
The choice of denominator is arbitrary
Because the estimates are equivariant under this choice.
14 / 98
25. Images/cinvestav
It is possible to show that
We have that, for l = 1, 2, ..., K − 1
P (ωl|x)
P (ωK|x)
= exp wT
l x
Therefore
K−1
l=1
P (ωl|x)
P (ωK|x)
=
K−1
l=1
exp wT
l x
Thus
1 − P (ωK|x)
P (ωK|x)
=
K−1
l=1
exp wT
l x
15 / 98
28. Images/cinvestav
Basically
We have
P (ωK|x) =
1
1 + K−1
l=1 exp wT
l x
Then
P (ωi|x)
1
1+
K−1
l=1
exp{wT
l
x}
= exp wT
l x
For i = 1, 2, ..., k − 1
P (ωi|x) =
exp wT
i x
1 + K−1
l=1 exp wT
l x
16 / 98
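As a small sketch (my own NumPy illustration; the function name class_posteriors is mine, not from the lecture), these posteriors can be computed directly from the $K-1$ weight vectors, with the reference class $\omega_K$ getting an implicit score of zero:

import numpy as np

def class_posteriors(W, x):
    # W: (K-1, d+1) matrix with rows w_1, ..., w_{K-1}; x: extended feature vector.
    # Returns [P(w_1|x), ..., P(w_{K-1}|x), P(w_K|x)].
    scores = W @ x                                   # w_l^T x for l = 1..K-1
    denom = 1.0 + np.sum(np.exp(scores))
    return np.append(np.exp(scores), 1.0) / denom    # last entry is P(w_K|x)

W = np.array([[0.5, -1.0], [2.0, 0.3]])   # K = 3 classes, d+1 = 2
x = np.array([1.0, 0.7])                  # extended with the bias term 1
p = class_posteriors(W, x)
assert np.isclose(p.sum(), 1.0) and np.all(p >= 0)   # a valid distribution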
A Note on Notation

Given all these parameters, we summarize them as

$$\Theta = \{w_1, w_2, \ldots, w_{K-1}\}$$

Therefore

$$P(\omega_l \mid X = x) = p_l(x \mid \Theta)$$
In the two class case

We have

$$P_1(\omega_1 \mid x) = \frac{\exp\{w^T x\}}{1 + \exp\{w^T x\}}, \qquad P_2(\omega_2 \mid x) = \frac{1}{1 + \exp\{w^T x\}}$$

A similar model:

$$P_1(\omega_1 \mid x) = \frac{\exp\{-w^T x\}}{1 + \exp\{-w^T x\}}, \qquad P_2(\omega_2 \mid x) = \frac{1}{1 + \exp\{-w^T x\}}$$
Fitting The Model
A Classic Application of Maximum Likelihood

Given a sequence of i.i.d. samples

$$x_1, x_2, \ldots, x_N$$

we have the following pdf:

$$p(x_1, x_2, \ldots, x_N \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta)$$
44. Images/cinvestav
P (G|X) Distribution
P (G|X) completeley specify the conditional distribution
We have a multinomial distribution which under the log-likelihood of N
observations:
L (Θ) = log p (x1, x2, ..., xN |Θ) = log
N
i=1
pgi (xi|θ) =
N
i=1
log pgi (xi|θ)
Where
gi represent the class that xi belongs.
gi =
1 if xi ∈ Class 1
2 if xi ∈ Class 2
26 / 98
49. Images/cinvestav
How do we integrate this into a Cost Function?
Clearly, we have two distributions
We need to represent the distributions into the functions pgi (xi|θ).
Why not to have all the distributions into this function
pgi (xi|θ) =
K−1
l=1
p (xi|wl)I{xi∈ωl}
It is easy with the two classes
Given that we have a binary situation!!!
27 / 98
Given the two class case

We have then a Bernoulli distribution:

$$p_1(x_i \mid w) = \left(\frac{\exp\{w^T x_i\}}{1 + \exp\{w^T x_i\}}\right)^{y_i}, \qquad p_2(x_i \mid w) = \left(\frac{1}{1 + \exp\{w^T x_i\}}\right)^{1-y_i}$$

with $y_i = 1$ if $x_i \in$ Class 1 and $y_i = 0$ if $x_i \in$ Class 2.
The Final Log-Likelihood
We have the following

Cost function:

$$L(w) = \sum_{i=1}^{N} \left\{ y_i \log p_1(x_i \mid w) + (1 - y_i) \log p_2(x_i \mid w) \right\}$$

After some reductions:

$$L(w) = \sum_{i=1}^{N} \left[ y_i w^T x_i - \log\left(1 + \exp\{w^T x_i\}\right) \right]$$
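A quick NumPy check of the reduction (my own sketch, with randomly generated data; not code from the lecture), confirming the two forms of $L(w)$ agree:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # N = 100 samples, d+1 = 3 (bias included)
y = rng.integers(0, 2, size=100)       # labels in {0, 1}
w = rng.normal(size=3)

s = X @ w                              # the scores w^T x_i
p1 = np.exp(s) / (1.0 + np.exp(s))     # P(class 1 | x_i)

# Original form: sum_i y_i log p1 + (1 - y_i) log p2, with p2 = 1 - p1
L_a = np.sum(y * np.log(p1) + (1 - y) * np.log(1.0 - p1))
# Reduced form: sum_i y_i w^T x_i - log(1 + exp{w^T x_i})
L_b = np.sum(y * s - np.log1p(np.exp(s)))
assert np.isclose(L_a, L_b)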
58. Images/cinvestav
Now, we derive and set it to zero
We have
∂L (w)
∂w
=
N
i=1
xi yi −
exp wT
xi
1 + exp {wT xi}
= 0
Which are d + 1 equations nonlinear
N
i=1
xi (yi − p (xi|w)) =
N
i=1 1 × yi −
exp{wT
xi}
1+exp{wT xi}
N
i=1 xi
1 yi −
exp{wT
xi}
1+exp{wT xi}
N
i=1 xi
2 yi −
exp{wT
xi}
1+exp{wT xi}
...
N
i=1 xi
d yi −
exp{wT
xi}
1+exp{wT xi}
= 0
It is know as a scoring function.
32 / 98
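As a sanity check on the score function (again a sketch of my own, comparing the analytic gradient against finite differences on random data):

import numpy as np

def log_lik(w, X, y):
    s = X @ w
    return np.sum(y * s - np.log1p(np.exp(s)))

def score(w, X, y):
    # The analytic gradient: sum_i x_i (y_i - p(x_i | w))
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y - p)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)); y = rng.integers(0, 2, size=50)
w = rng.normal(size=4)

eps, g_num = 1e-6, np.zeros(4)
for j in range(4):                      # central finite differences, coordinate-wise
    e = np.zeros(4); e[j] = eps
    g_num[j] = (log_lik(w + e, X, y) - log_lik(w - e, X, y)) / (2 * eps)
assert np.allclose(score(w, X, y), g_num, atol=1e-4)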
60. Images/cinvestav
Finally
In other words you
1 d + 1 nonlinear equations in w.
2 For example, from the first equation:
N
i=1
yi =
N
i=1
p (xi|w)
The expected number of class ones matches the observed number.
And hence also class twos.
33 / 98
The Newton-Raphson Algorithm
To solve the previous equations

We use the Newton-Raphson method. It comes from the first-order Taylor approximation

$$f(x + h) \approx f(x) + h f'(x)$$

Thus, for a root $r$ of the function $f$:
1 Assume a good estimate $x_0$ of $r$.
2 Thus we have $r = x_0 + h$,
3 or $h = r - x_0$.
70. Images/cinvestav
We have then
From Taylor
0 = f (r) = f (x0 + h) ≈ f (x0) + hf (x0)
Thus, as long f (x0) is not close to 0
h ≈ −
f (x0)
f (x0)
Thus
r = x0 + h ≈ x0 −
f (x0)
f (x0)
36 / 98
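A minimal sketch of the scalar iteration (my own illustration, finding $\sqrt{2}$ as the root of $f(x) = x^2 - 2$):

import numpy as np

def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=50):
    # Iterate x <- x - f(x)/f'(x) until |f(x)| is small
    x = x0
    for _ in range(max_iter):
        if abs(f(x)) < tol:
            break
        x = x - f(x) / fprime(x)
    return x

root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
assert abs(root - np.sqrt(2.0)) < 1e-10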
Then, on the score function

For this, we need the Hessian of the function:

$$\frac{\partial^2 L(w)}{\partial w \, \partial w^T} = -\sum_{i=1}^{N} x_i x_i^T \, \frac{\exp\{w^T x_i\}}{1 + \exp\{w^T x_i\}} \left( 1 - \frac{\exp\{w^T x_i\}}{1 + \exp\{w^T x_i\}} \right)$$

Thus, starting from a point $w^{old}$:

$$w^{new} = w^{old} - \left( \frac{\partial^2 L(w)}{\partial w \, \partial w^T} \right)^{-1} \frac{\partial L(w)}{\partial w}$$
We can rewrite all of this in matrix notation

Assume:
1 $y$ denotes the vector of the $y_i$.
2 $X$ is the $N \times (d+1)$ data matrix.
3 $p$ is the vector of fitted probabilities, with $i$-th element $p(x_i \mid w^{old})$.
4 $W$ is an $N \times N$ diagonal matrix of weights, with $i$-th diagonal element $p(x_i \mid w^{old}) \left( 1 - p(x_i \mid w^{old}) \right)$.
This New Algorithm

It is known as Iteratively Re-weighted Least Squares, or IRLS. After all, at each iteration it solves a weighted least squares problem:

$$w^{new} \leftarrow \arg\min_{w} (z - Xw)^T W (z - Xw)$$

with the adjusted response $z = Xw^{old} + W^{-1}(y - p)$.
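Putting the pieces together, here is a compact IRLS sketch (my own NumPy illustration on synthetic data; irls_logistic is an assumed name, and the loop solves the normal equations directly rather than via Cholesky):

import numpy as np

def irls_logistic(X, y, n_iter=10):
    # Fit logistic regression by IRLS / Newton-Raphson (illustrative sketch)
    w = np.zeros(X.shape[1])                     # good starting point: w = 0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # fitted probabilities
        W = p * (1.0 - p)                        # diagonal of the weight matrix
        z = X @ w + (y - p) / W                  # adjusted response
        # Solve the weighted least squares problem (X^T W X) w = X^T W z
        A = X.T @ (W[:, None] * X)
        b = X.T @ (W * z)
        w = np.linalg.solve(A, b)                # can overshoot, as noted below
    return w

rng = np.random.default_rng(2)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]   # extended features
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ w_true)))).astype(float)
w_hat = irls_logistic(X, y)                          # lands near w_true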
Observations

A good starting point is $w = 0$. However, convergence is never guaranteed!!!

Typically the algorithm does converge, since the log-likelihood is concave, but overshooting can occur.
Using Cholesky Decomposition
We can decompose the matrix

Given $A = X^T W X$ and $Y = X^T W z$, you have

$$Ax = Y$$

Thus you need to obtain

$$x = A^{-1} Y$$
This can be seen as a system of linear equations

We start with a set of linear equations in the $d+1$ unknowns $x_1, x_2, \ldots, x_{d+1}$:

$$a_{1,1} x_1 + a_{1,2} x_2 + \cdots + a_{1,d+1} x_{d+1} = y_1$$
$$a_{2,1} x_1 + a_{2,2} x_2 + \cdots + a_{2,d+1} x_{d+1} = y_2$$
$$\vdots$$
$$a_{d+1,1} x_1 + a_{d+1,2} x_2 + \cdots + a_{d+1,d+1} x_{d+1} = y_{d+1}$$

A set of values for $x_1, x_2, \ldots, x_{d+1}$ that satisfies all of the equations simultaneously is said to be a solution to these equations. Note that it is easier to solve the system $Ax = Y$ directly. How?
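In practice this is exactly what SciPy's Cholesky solver does (a minimal sketch of my own, assuming a synthetic positive definite matrix standing in for $X^T W X$):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)     # symmetric positive definite, like X^T W X
Y = rng.normal(size=5)

c, low = cho_factor(A)          # A = L L^T (Cholesky factorization)
x = cho_solve((c, low), Y)      # forward + backward substitution, no explicit inverse
assert np.allclose(A @ x, Y)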
What is the Cholesky Decomposition?

It is a method that factorizes a positive definite Hermitian matrix $A \in \mathbb{C}^{(d+1) \times (d+1)}$.

Positive definite matrix:

$$x^H A x > 0 \text{ for all nonzero } x \in \mathbb{C}^{d+1}$$

Hermitian matrix (equal to its conjugate transpose):

$$A = A^H$$
105. Images/cinvestav
Therefore
Cholesky decomposes A into lower or upper triangular matrix and
their conjugate transpose
A = LLT
A = RT R
In our particular case
Given that XT WX ∈ Rd+1×d+1 with XT WX = XT WX
T
, we have
that
XT
WX = LLT
Thus, we can use the Cholensky decomposition
1 Cholesky decomposition is of order O d3 and requires 1
6d3 FLOP
operations.
52 / 98
The Proposed Method
Now, we have

The matrix $S$ holds the correct solution for the upper-triangular elements of the matrix $B$, i.e. $s_{ij} = b_{ij}$ for $i \leq j \leq d+1$. Then, we use backward substitution to solve for $x_{i,j}$ in the equation $Rx_i = s_i$, assuming

$$X = [x_1, x_2, \ldots, x_{d+1}], \qquad S = [s_1, s_2, \ldots, s_{d+1}]$$
Back Substitution

Since $R$ is upper-triangular, we can rewrite the system $Rx_i = s_i$ as

$$r_{1,1} x_{1,i} + r_{1,2} x_{2,i} + \cdots + r_{1,d-1} x_{d-1,i} + r_{1,d} x_{d,i} + r_{1,d+1} x_{d+1,i} = s_{1,i}$$
$$r_{2,2} x_{2,i} + \cdots + r_{2,d-1} x_{d-1,i} + r_{2,d} x_{d,i} + r_{2,d+1} x_{d+1,i} = s_{2,i}$$
$$\vdots$$
$$r_{d-1,d-1} x_{d-1,i} + r_{d-1,d} x_{d,i} + r_{d-1,d+1} x_{d+1,i} = s_{d-1,i}$$
$$r_{d,d} x_{d,i} + r_{d,d+1} x_{d+1,i} = s_{d,i}$$
$$r_{d+1,d+1} x_{d+1,i} = s_{d+1,i}$$
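Backward substitution itself is a few lines (my own sketch; the last equation gives $x_{d+1,i}$ directly, and each earlier row then has only one unknown left):

import numpy as np

def back_substitute(R, s):
    # Solve R x = s for upper-triangular R, last row first
    n = len(s)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (s[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
    return x

R = np.triu(np.random.default_rng(4).normal(size=(4, 4)) + 2 * np.eye(4))
s = np.array([1.0, 2.0, 3.0, 4.0])
x = back_substitute(R, s)
assert np.allclose(R @ x, s)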
Then

We solve only for the $x_{ij}$ with $i < j \leq d+1$ (the upper-triangle elements), and set $x_{ji} = x_{ij}$: the same value, given that we live on the reals.
Complexity

Equation solving requires $\frac{1}{3}(d+1)^3$ multiply operations. The total number of multiply operations for the matrix inverse, including the Cholesky decomposition, is $\frac{1}{2}(d+1)^3$.

Therefore, we have complexity $O(d^3)$, per iteration!!!
Quasi-Newton Method
However

As good computer scientists, we want to obtain a better complexity, such as $O(d^2)$!!!

We can obtain such improvements using Quasi-Newton methods. Let us develop the solution for the most popular one: the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method.
The Second Order Approximation

We have

$$f(x) \approx f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T H_f(x_k) (x - x_k)$$

We develop a new equation based on the previous idea by using $x = x_k + p$:

$$m(x_k, p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p$$

Here, $B_k$ is a $(d+1) \times (d+1)$ symmetric positive definite matrix that will be updated throughout the entire process.
Therefore

We have

$$m(x_k, 0) = f(x_k), \qquad \frac{\partial m(x_k, p)}{\partial p}\Big|_{p=0} = \nabla f(x_k) + B_k 0 = \nabla f(x_k)$$

Therefore, the minimizer of this model can be used as a search direction:

$$p_k = -B_k^{-1} \nabla f(x_k)$$

We have the following iteration, where $p_k$ is a descent direction:

$$x_{k+1} = x_k + \alpha_k p_k$$
Descent Direction

Definition. For a given point $x \in \mathbb{R}^{d+1}$, a direction $p$ is called a descent direction if there exists $\alpha > 0$, $\alpha \in \mathbb{R}$, such that

$$f(x + \alpha p) < f(x)$$

Lemma. Consider a point $x \in \mathbb{R}^{d+1}$. Any direction $p$ satisfying $p^T \nabla f(x) < 0$ is a descent direction.

From this, since $B_k^{-1}$ is positive definite:

$$\left( -B_k^{-1} \nabla f(x_k) \right)^T \nabla f(x_k) = -\nabla f(x_k)^T \left( B_k^{-1} \right)^T \nabla f(x_k) < 0$$
The Wolfe Condition
Here, $\alpha_k$ is chosen to satisfy the Wolfe conditions:

$$f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k$$
$$\nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f(x_k)^T p_k$$

with $0 < c_1 < c_2 < 1$.
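A small sketch checking both conditions on a quadratic (my own illustration; satisfies_wolfe is an assumed name, not from the lecture):

import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    # Check sufficient decrease (Armijo) and the curvature condition
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * grad(x) @ p
    curvature = grad(x + alpha * p) @ p >= c2 * grad(x) @ p
    return armijo and curvature

f = lambda x: x @ x            # f(x) = ||x||^2, with exact gradient
grad = lambda x: 2.0 * x
x = np.array([1.0, -2.0])
p = -grad(x)                   # steepest descent direction
ok = [a for a in np.linspace(0.01, 1.0, 100) if satisfies_wolfe(f, grad, x, p, a)]
assert ok                      # intermediate step sizes satisfy both conditions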
Graphically

$$f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k \quad \text{(Armijo Condition)}$$

[Figure: $\phi(\alpha)$ plotted against the Armijo line, with the acceptable ranges of $\alpha$ marked.]

Note: in practice, $c_1$ is chosen to be quite small, $c_1 = 10^{-4}$.
However, using that condition alone admits step sizes $\alpha$ too small to make fast progress

Therefore, we also require

$$\nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f(x_k)^T p_k$$

This avoids overly small steps $\alpha_k$. Note the following: writing $\phi(\alpha) = f(x_k + \alpha p_k)$, the left-hand side is the derivative $\frac{d\phi(\alpha_k)}{d\alpha}$, i.e. the condition ensures that the slope of $\phi$ at $\alpha_k$ is at least $c_2$ times the initial slope $\frac{d\phi(0)}{d\alpha}$.
Meaning

If the slope $\phi'(\alpha_k)$ is strongly negative, we can reduce $f$ significantly by moving further along the chosen direction.

If the slope is only slightly negative or even positive, we cannot expect much more decrease in $f$ in this direction, and it might be good to terminate the line search.
Proving BFGS
Now, instead of computing B_k at each iteration

Davidon proposed to update it in a simple manner, to account for the curvature measured during the most recent step. Imagine we have generated $x_{k+1}$ and wish to build a new quadratic model

$$m(x_{k+1}, p) = f(x_{k+1}) + \nabla f(x_{k+1})^T p + \frac{1}{2} p^T B_{k+1} p$$

What requirements should we impose on $B_{k+1}$? One reasonable requirement is that the gradient of $m$ should match the gradient of the objective function $f$ at the latest two iterates $x_k$ and $x_{k+1}$.
Now

Since $\nabla m(x_{k+1}, 0)$ is precisely $\nabla f(x_{k+1})$, the second of these conditions is satisfied automatically. The first condition can be written mathematically as

$$\nabla m(x_{k+1}, -\alpha_k p_k) = \nabla f(x_{k+1}) - \alpha_k B_{k+1} p_k = \nabla f(x_k)$$

We have

$$\alpha_k B_{k+1} p_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$
Then, we can simplify the notation

For this, define

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

We have the secant equation

$$B_{k+1} s_k = y_k$$

Given that $B_{k+1}$ is symmetric positive definite, the displacement $s_k$ and the change in gradient $y_k$ satisfy

$$s_k^T B_{k+1} s_k = s_k^T y_k > 0$$

making it possible for $B_{k+1}$ to map $s_k$ into $y_k$!!!
Interpretation

If our update satisfies the secant condition

$$B_{k+1} (x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)$$

this implies, for the second-order approximation at $x_{k+1}$ of our target function $f$,

$$f_{approx}(x_{k+1}) = f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{1}{2} (x_{k+1} - x_k)^T B_{k+1} (x_{k+1} - x_k)$$
Now, we have two cases

When $f$ is strongly convex, the inequality $s_k^T y_k > 0$ will be satisfied for any two points $x_k$ and $x_{k+1}$.

This condition will not always hold for non-convex functions: we need to enforce $s_k^T y_k > 0$ explicitly. It holds whenever the step satisfies the Wolfe or strong Wolfe conditions.
How?

Take the second Wolfe condition, $\nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f(x_k)^T p_k$, and recall $x_{k+1} = x_k + \alpha_k p_k$:

$$\nabla f(x_{k+1})^T s_k \geq c_2 \nabla f(x_k)^T s_k, \qquad y_k^T s_k = \left[ \nabla f(x_{k+1}) - \nabla f(x_k) \right]^T s_k$$

Then

$$y_k^T s_k = \nabla f(x_{k+1})^T s_k - \nabla f(x_k)^T s_k \geq c_2 \nabla f(x_k)^T [\alpha_k p_k] - \nabla f(x_k)^T [\alpha_k p_k] = (c_2 - 1) \alpha_k \nabla f(x_k)^T p_k$$
Then

Given that $c_2 - 1 < 0$, $\alpha_k > 0$, and $\nabla f(x_k)^T p_k < 0$:

$$(c_2 - 1) \alpha_k \nabla f(x_k)^T p_k = (c_2 - 1) \alpha_k \nabla f(x_k)^T \left( -B_k^{-1} \nabla f(x_k) \right) > 0$$

Thus, the curvature condition holds!!! Then, the secant equation always has a solution under the curvature condition. In fact, it admits an infinite number of solutions: there are $\frac{d(d+1)}{2}$ degrees of freedom in a symmetric matrix, while the secant equation and positive definiteness impose $2d$ constraints, leaving

$$\frac{d(d+1)}{2} - 2d = \frac{d^2 + d - 4d}{2} = \frac{d^2 - 3d}{2}$$

degrees of freedom... enough for infinitely many solutions.
Given that we want a unique solution

We could solve the following problem:

$$\min_{B} \|B - B_k\| \quad \text{s.t.} \quad B = B^T, \; B s_k = y_k$$

Different matrix norms give rise to different Quasi-Newton methods. We use the weighted Frobenius norm (scale invariant), defined as

$$\|A\|_W = \left\| W^{1/2} A W^{1/2} \right\|_F, \qquad \|A\|_F = \sqrt{\sum_{i=1}^{d+1} \sum_{j=1}^{d+1} a_{ij}^2}$$
In our case, we choose

the average Hessian

$$\bar{G}_k = \int_{0}^{1} \nabla^2 f(x_k + \tau \alpha_k p_k) \, d\tau$$

The property $y_k = \bar{G}_k \alpha_k p_k = \bar{G}_k s_k$ follows from Taylor's Theorem!!!
For the BFGS

In the previous setup we would get $B_k$ and then compute its inverse, $H_k = B_k^{-1}$. In BFGS we go directly for the inverse by setting up:

$$\min_{H} \|H - H_k\| \quad \text{s.t.} \quad H = H^T, \; H y_k = s_k$$

The unique solution is

$$H_{k+1} = \left( I - \rho_k s_k y_k^T \right) H_k \left( I - \rho_k y_k s_k^T \right) + \rho_k s_k s_k^T \tag{1}$$

where $\rho_k = \frac{1}{y_k^T s_k}$.
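The update of Eq. (1) is a few lines of NumPy (my own sketch; note that it satisfies the secant equation $H_{k+1} y_k = s_k$ by construction, which the assert checks):

import numpy as np

def bfgs_update(H, s, y):
    # Inverse-Hessian update of Eq. (1): H_{k+1} from H_k, s_k, y_k
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
           + rho * np.outer(s, s)

rng = np.random.default_rng(5)
H = np.eye(3)
s, y = rng.normal(size=3), rng.normal(size=3)
if y @ s > 0:                        # curvature condition s_k^T y_k > 0
    H_new = bfgs_update(H, s, y)
    assert np.allclose(H_new @ y, s) # the secant equation holds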
Problem

There is no magic formula to find an initial $H_0$. We can use specific information about the problem, for instance by setting it to the inverse of an approximate Hessian calculated by finite differences at $x_0$. In our case, we have the Hessian $\frac{\partial^2 L(w)}{\partial w \, \partial w^T}$ (in matrix form, $-X^T W X$), from which we could get an initial setup.
Algorithm (BFGS Method)

Input: starting point $x_0$, convergence tolerance $\epsilon$, inverse Hessian approximation $H_0$

1 $k \leftarrow 0$
2 while $\|\nabla f(x_k)\| > \epsilon$
3     compute the search direction $p_k = -H_k \nabla f(x_k)$
4     set $x_{k+1} = x_k + \alpha_k p_k$, where $\alpha_k$ is obtained from a line search (under the Wolfe conditions)
5     define $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
6     compute $H_{k+1}$ by means of Eq. (1)
7     $k \leftarrow k + 1$
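A compact runnable sketch of the whole loop (my own illustration on the Rosenbrock function; a simple backtracking Armijo search stands in for the full Wolfe line search the algorithm prescribes):

import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=500):
    x, H = x0.astype(float), np.eye(len(x0))
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        p = -H @ g                           # search direction
        alpha = 1.0
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * g @ p:
            alpha *= 0.5                     # backtrack until sufficient decrease
        x_new = x + alpha * p
        s, y = x_new - x, grad(x_new) - g
        if y @ s > 1e-12:                    # update only if curvature holds
            rho = 1.0 / (y @ s)
            I = np.eye(len(x0))
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x = x_new
    return x

# Minimize the Rosenbrock function; the minimum is at (1, 1)
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
x_star = bfgs(f, grad, np.array([-1.2, 1.0]))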
Complexity

The cost of the update (or inverse update) is $O(d^2)$ operations per iteration.

For more: Nocedal, Jorge & Wright, Stephen J. (1999). Numerical Optimization. Springer-Verlag.
A Neat Trick: Coordinate Ascent
Caution

Here, we change the labeling to $y_i = \pm 1$, with

$$p(y_i = \pm 1 \mid x, w) = \sigma\left( y_i w^T x \right) = \frac{1}{1 + \exp\{-y_i w^T x\}}$$

Thus, we have the following log-likelihood under regularization $\lambda > 0$:

$$L(w) = -\sum_{i=1}^{N} \log\left( 1 + \exp\{-y_i w^T x_i\} \right) - \frac{\lambda}{2} w^T w$$

It is possible to do gradient ascent on it:

$$\nabla_w L(w) = \sum_{i=1}^{N} \left( 1 - \frac{1}{1 + \exp\{-y_i w^T x_i\}} \right) y_i x_i - \lambda w$$
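A minimal gradient ascent sketch for this objective (my own illustration on synthetic data; the step size lr and iteration count are arbitrary choices, not from the lecture):

import numpy as np

def grad_ascent_logistic(X, y, lam=0.1, lr=0.01, n_iter=2000):
    # Maximize the regularized log-likelihood by plain gradient ascent
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margin = y * (X @ w)                     # y_i w^T x_i
        coeff = 1.0 - 1.0 / (1.0 + np.exp(-margin))
        grad = X.T @ (coeff * y) - lam * w       # the gradient above
        w += lr * grad                           # ascent step
    return w

rng = np.random.default_rng(6)
X = np.c_[np.ones(300), rng.normal(size=(300, 2))]
w_true = np.array([0.3, 1.5, -2.0])
y = np.where(rng.random(300) < 1 / (1 + np.exp(-(X @ w_true))), 1.0, -1.0)
w_hat = grad_ascent_logistic(X, y)   # regularized estimate, roughly aligned with w_true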
Danger Will Robinson!!!

Gradient ascent with this update resembles the Perceptron learning algorithm. However, unlike the Perceptron, it will always converge for a suitable step size, regardless of whether the classes are separable!!!
Coordinate Ascent

Algorithm

Input: Max, an initial $x_0$

1 counter $\leftarrow$ 0
2 while counter < Max
3     for $i \leftarrow 1, \ldots, d$ (or pick $i$ at random)
4         compute a step size $\delta^*$ by approximately solving $\delta^* = \arg\max_{\delta} f(x + \delta e_i)$
5         $x_i \leftarrow x_i + \delta^*$
6     counter $\leftarrow$ counter + 1

where $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T$, with the 1 in position $i$.
In the case of Logistic Regression

Thus, we can optimize each $w_k$ alternately by a coordinate-wise Newton update:

$$w_k^{new} = w_k^{old} + \frac{-\lambda w_k^{old} + \sum_{i=1}^{N} \left( 1 - \frac{1}{1 + \exp\{-y_i w^T x_i\}} \right) y_i x_{ik}}{\lambda + \sum_{i=1}^{N} x_{ik}^2 \, \frac{1}{1 + \exp\{-y_i w^T x_i\}} \left( 1 - \frac{1}{1 + \exp\{-y_i w^T x_i\}} \right)}$$

Complexity of this update:

Item | Complexity
$\sum_{i=1}^{N} \left( 1 - \frac{1}{1 + \exp\{-y_i w^T x_i\}} \right) y_i x_{ik}$ | $O(N)$
$\sum_{i=1}^{N} x_{ik}^2 \, \frac{1}{1 + \exp\{-y_i w^T x_i\}} \left( 1 - \frac{1}{1 + \exp\{-y_i w^T x_i\}} \right)$ | $O(N)$
Total complexity | $O(N)$
For all the dimensions | $O(Nd)$
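A sketch of this coordinate-wise update (my own illustration; the margins $y_i w^T x_i$ are cached and updated incrementally so each coordinate update really costs $O(N)$, as in the table above):

import numpy as np

def coord_ascent_logistic(X, y, lam=0.1, max_sweeps=50):
    N, d = X.shape
    w = np.zeros(d)
    m = np.zeros(N)                              # cached margins y_i w^T x_i
    for _ in range(max_sweeps):
        for k in range(d):                       # one sweep over the coordinates
            sig = 1.0 / (1.0 + np.exp(-m))       # 1 / (1 + exp{-y_i w^T x_i})
            num = -lam * w[k] + np.sum((1.0 - sig) * y * X[:, k])
            den = lam + np.sum(X[:, k]**2 * sig * (1.0 - sig))
            delta = num / den                    # coordinate-wise Newton step, O(N)
            w[k] += delta
            m += delta * y * X[:, k]             # refresh cached margins in O(N)
    return w

rng = np.random.default_rng(7)
X = np.c_[np.ones(300), rng.normal(size=(300, 2))]
w_true = np.array([0.3, 1.5, -2.0])
y = np.where(rng.random(300) < 1 / (1 + np.exp(-(X @ w_true))), 1.0, -1.0)
w_hat = coord_ascent_logistic(X, y)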
Conclusion
We have the following complexities per iteration

Method | Per Iteration | Convergence Rate
Cholesky Decomposition | $\frac{d^3}{2} = O(d^3)$ | Quadratic
Quasi-Newton BFGS | $O(d^2)$ | Superlinear
Coordinate Ascent | $O(Nd)$ | Not established