Introduction to Machine Learning
Logistic Regression and More on Optimization
Andres Mendez-Vazquez
September 3, 2017
Outline
1 Logistic Regression
   Introduction
   Constraints
   The Initial Model
   The Two Class Case
   Graphic Interpretation
   Fitting The Model
   The Two Class Case
   The Final Log-Likelihood
   The Newton-Raphson Algorithm
   Matrix Notation
2 More on Optimization Methods
   Using Cholesky Decomposition
   Cholesky Decomposition
   The Proposed Method
   Quasi-Newton Method
   The Second Order Approximation
   The Wolfe Condition
   Proving BFGS
   The BFGS Algorithm
   A Neat Trick: Coordinate Ascent
   Coordinate Ascent Algorithm
   Conclusion
Assume the following
Let Y_1, Y_2, ..., Y_N be independent random variables
Taking values in the set {0, 1}
Now, you have a set of fixed vectors
x_1, x_2, ..., x_N
Mapped to a series of numbers by a weight vector w
w^T x_1, w^T x_2, ..., w^T x_N
In our simplest form
There is a suspected relation
Between θ_i = P(Y_i = 1) and w^T x_i
Thus we have
y = 1 if w^T x + e > 0, and y = 0 otherwise
Note: Where e is an error with a certain distribution!!!
For Example, Graphically
We have:
[Figure omitted in this extraction]
It is better to use a logit version
We have:
[Figure: logistic (logit) response curve]
Logit Distribution
PDF with support z ∈ (−∞, ∞), µ location and s scale
p(z|µ, s) = exp{−(z − µ)/s} / ( s [1 + exp{−(z − µ)/s}]² )
With a CDF
P(Y < z) = \int_{−∞}^{z} p(y|µ, s) dy = 1 / (1 + exp{−(z − µ)/s})
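As a quick check of these formulas, the sketch below evaluates the logistic PDF and CDF with numpy and verifies numerically that integrating the PDF reproduces the CDF; the function names and the grid are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def logistic_pdf(z, mu=0.0, s=1.0):
    # p(z | mu, s) = exp(-(z - mu)/s) / (s * (1 + exp(-(z - mu)/s))**2)
    e = np.exp(-(z - mu) / s)
    return e / (s * (1.0 + e) ** 2)

def logistic_cdf(z, mu=0.0, s=1.0):
    # P(Y < z) = 1 / (1 + exp(-(z - mu)/s))
    return 1.0 / (1.0 + np.exp(-(z - mu) / s))

# Numerical check: integrating the PDF should (approximately) reproduce the CDF.
grid = np.linspace(-20, 5, 20001)
approx_cdf = np.trapz(logistic_pdf(grid), grid)
print(approx_cdf, logistic_cdf(5.0))
```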
In Bayesian Classification
Assignment of a pattern
It is performed by using the posterior probabilities, P(ω_i|x)
And given K classes, we want
\sum_{i=1}^{K} P(ω_i|x) = 1
Such that each
0 ≤ P(ω_i|x) ≤ 1
Observation
This is a typical example of the discriminative approach
Where the distribution of data is of no interest.
The Model
We have the following under the extended features
log [P(ω_1|x) / P(ω_K|x)] = w_1^T x
log [P(ω_2|x) / P(ω_K|x)] = w_2^T x
...
log [P(ω_{K−1}|x) / P(ω_K|x)] = w_{K−1}^T x
Further
We have
The model is specified in terms of K − 1 log-odds or logit transformations.
And
Although the model uses the last class as the denominator in the odds-ratios,
the choice of denominator is arbitrary,
because the estimates are equivariant under this choice.
It is possible to show that
We have that, for l = 1, 2, ..., K − 1
P(ω_l|x) / P(ω_K|x) = exp{w_l^T x}
Therefore
\sum_{l=1}^{K−1} P(ω_l|x) / P(ω_K|x) = \sum_{l=1}^{K−1} exp{w_l^T x}
Thus
[1 − P(ω_K|x)] / P(ω_K|x) = \sum_{l=1}^{K−1} exp{w_l^T x}
Basically
We have
P(ω_K|x) = 1 / (1 + \sum_{l=1}^{K−1} exp{w_l^T x})
Then
P(ω_i|x) / [ 1 / (1 + \sum_{l=1}^{K−1} exp{w_l^T x}) ] = exp{w_i^T x}
For i = 1, 2, ..., K − 1
P(ω_i|x) = exp{w_i^T x} / (1 + \sum_{l=1}^{K−1} exp{w_l^T x})
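The posterior formulas above translate directly into code. Below is a minimal numpy sketch (the names and test data are illustrative, not from the slides) that computes P(ω_i|x) for i = 1, ..., K−1 and P(ω_K|x) and checks that they sum to one.

```python
import numpy as np

def posteriors(W, x):
    """W: (K-1, d+1) weight matrix, one row w_l per non-reference class.
    x: (d+1,) extended feature vector. Returns the K class posteriors."""
    scores = np.exp(W @ x)                 # exp{w_l^T x}, l = 1..K-1
    denom = 1.0 + scores.sum()
    return np.concatenate([scores / denom, [1.0 / denom]])  # last entry is P(omega_K | x)

# Illustrative check with K = 3 classes and d = 2 features (plus a bias term).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x = np.array([0.5, -1.0, 1.0])
p = posteriors(W, x)
print(p, p.sum())   # the posteriors sum to 1
```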
Additionally
For K
P(ω_K|x) = 1 / (1 + \sum_{l=1}^{K−1} exp{w_l^T x})
Easy to see
They sum to one.
A Note in Notation
Given all these parameters, we summarize them as
Θ = {w_1, w_2, ..., w_{K−1}}
Therefore
P(ω_l|X = x) = p_l(x|Θ)
In the two class case
We have
P_1(ω_1|x) = exp{w^T x} / (1 + exp{w^T x})
P_2(ω_2|x) = 1 / (1 + exp{w^T x})
A similar model
P_1(ω_1|x) = exp{−w^T x} / (1 + exp{−w^T x})
P_2(ω_2|x) = 1 / (1 + exp{−w^T x})
We have the following split
Using f(x) = w^T x
[Figure: split of the input space induced by f(x) = w^T x]
Then, we have the mapping to
[Figure: the two regions mapped to Class 1 and Class 2]
A Classic Application of Maximum Likelihood
Given a sequence of iid samples
x_1, x_2, ..., x_N
We have the following pdf
p(x_1, x_2, ..., x_N|Θ) = \prod_{i=1}^{N} p(x_i|Θ)
P(G|X) Distribution
P(G|X) completely specifies the conditional distribution
We have a multinomial distribution which, under the log-likelihood of N observations, gives:
L(Θ) = log p(x_1, x_2, ..., x_N|Θ) = log \prod_{i=1}^{N} p_{g_i}(x_i|θ) = \sum_{i=1}^{N} log p_{g_i}(x_i|θ)
Where
g_i represents the class that x_i belongs to:
g_i = 1 if x_i ∈ Class 1, g_i = 2 if x_i ∈ Class 2
How do we integrate this into a Cost Function?
Clearly, we have two distributions
We need to represent the distributions through the functions p_{g_i}(x_i|θ).
Why not put all the distributions into this function
p_{g_i}(x_i|θ) = \prod_{l=1}^{K−1} p(x_i|w_l)^{I{x_i ∈ ω_l}}
It is easy with the two classes
Given that we have a binary situation!!!
Given the Two Class Case
We have then a Bernoulli distribution
p_1(x_i|w) = [ exp{w^T x_i} / (1 + exp{w^T x_i}) ]^{y_i}
p_2(x_i|w) = [ 1 / (1 + exp{w^T x_i}) ]^{1−y_i}
With
y_i = 1 if x_i ∈ Class 1
y_i = 0 if x_i ∈ Class 2
We have the following
Cost Function
L(w) = \sum_{i=1}^{N} { y_i log p_1(x_i|w) + (1 − y_i) log p_2(x_i|w) }
After some reductions
L(w) = \sum_{i=1}^{N} [ y_i w^T x_i − log(1 + exp{w^T x_i}) ]
Now, we derive and set it to zero
We have
∂L(w)/∂w = \sum_{i=1}^{N} x_i ( y_i − exp{w^T x_i} / (1 + exp{w^T x_i}) ) = 0
Which gives d + 1 nonlinear equations. Writing p(x_i|w) = exp{w^T x_i} / (1 + exp{w^T x_i}), component by component:
\sum_{i=1}^{N} x_i (y_i − p(x_i|w)) = [ \sum_{i=1}^{N} 1 · (y_i − p(x_i|w)) ; \sum_{i=1}^{N} x_{i1} (y_i − p(x_i|w)) ; \sum_{i=1}^{N} x_{i2} (y_i − p(x_i|w)) ; ... ; \sum_{i=1}^{N} x_{id} (y_i − p(x_i|w)) ]^T = 0
It is known as a scoring function.
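A small numpy sketch of the log-likelihood and of this score (gradient) vector may help make the formulas concrete; the synthetic data and function names below are illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(w, X, y):
    # L(w) = sum_i [ y_i w^T x_i - log(1 + exp{w^T x_i}) ]
    scores = X @ w
    return np.sum(y * scores - np.log1p(np.exp(scores)))

def score(w, X, y):
    # dL/dw = sum_i x_i (y_i - p(x_i|w)) = X^T (y - p)
    p = sigmoid(X @ w)
    return X.T @ (y - p)

# Synthetic data: N = 100 points, d = 2 features plus a leading column of ones.
rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = (rng.random(100) < sigmoid(X @ np.array([0.5, 1.0, -2.0]))).astype(float)
w = np.zeros(3)
print(log_likelihood(w, X, y), score(w, X, y))
```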
Finally
In other words you have
1. d + 1 nonlinear equations in w.
2. For example, from the first equation:
\sum_{i=1}^{N} y_i = \sum_{i=1}^{N} p(x_i|w)
The expected number of class ones matches the observed number,
and hence also the number of class twos.
To solve the previous equations
We use the Newton-Raphson Method
It comes from the first-order Taylor approximation
f(x + h) ≈ f(x) + h f'(x)
Thus, for a root r of the function f
We have
1. Assume a good estimate x_0 of r
2. Thus we have r = x_0 + h
3. Or h = r − x_0
We have then
From Taylor
0 = f(r) = f(x_0 + h) ≈ f(x_0) + h f'(x_0)
Thus, as long as f'(x_0) is not close to 0
h ≈ − f(x_0) / f'(x_0)
Thus
r = x_0 + h ≈ x_0 − f(x_0) / f'(x_0)
We have our final improvement
We have
x_1 ≈ x_0 − f(x_0) / f'(x_0)
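The one-dimensional iteration is short enough to sketch in full; the root-finding example below (solving cos(x) = x) is illustrative only.

```python
import math

def newton_raphson(f, fprime, x0, tol=1e-10, max_iter=50):
    """Iterate x_{k+1} = x_k - f(x_k) / f'(x_k) until |f(x_k)| < tol."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)   # the Newton-Raphson update
    return x

# Illustrative example: find the root of f(x) = cos(x) - x.
root = newton_raphson(lambda x: math.cos(x) - x, lambda x: -math.sin(x) - 1.0, x0=1.0)
print(root)   # approximately 0.739085
```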
Then, on the scoring function
For this, we need the Hessian of the function
∂²L(w)/∂w∂w^T = − \sum_{i=1}^{N} x_i x_i^T [ exp{w^T x_i} / (1 + exp{w^T x_i}) ] [ 1 − exp{w^T x_i} / (1 + exp{w^T x_i}) ]
Thus, at a starting point w^old we have
w^new = w^old − [ ∂²L(w)/∂w∂w^T ]^{−1} ∂L(w)/∂w
We can rewrite everything in matrix notation
Assume
1. Let y denote the vector of the y_i values
2. X is the N × (d + 1) data matrix
3. p is the vector of fitted probabilities with i-th element p(x_i|w^old)
4. W is an N × N diagonal matrix of weights with i-th diagonal element p(x_i|w^old)[1 − p(x_i|w^old)]
Then, we have
For each updating term
∂L(w)/∂w = X^T (y − p)
∂²L(w)/∂w∂w^T = − X^T W X
Then, we have
Then, the Newton step
w^new = w^old + (X^T W X)^{−1} X^T (y − p)
      = I w^old + (X^T W X)^{−1} X^T I (y − p)
      = (X^T W X)^{−1} X^T W X w^old + (X^T W X)^{−1} X^T W W^{−1} (y − p)
      = (X^T W X)^{−1} X^T W [ X w^old + W^{−1} (y − p) ]
      = (X^T W X)^{−1} X^T W z
Then
We have
re-expressed the Newton step as a weighted least squares step
with the adjusted response
z = X w^old + W^{−1} (y − p)
This New Algorithm
It is known as
Iteratively Re-weighted Least Squares or IRLS
After all, at each iteration it solves
a weighted least squares problem
w^new ← arg min_w (z − Xw)^T W (z − Xw)
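The full IRLS loop is compact; here is a minimal numpy sketch of it, with synthetic data, illustrative names, and a direct solve of the weighted normal equations (assumptions of this sketch, not prescriptions from the slides).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls(X, y, n_iter=25):
    """Fit two-class logistic regression by Iteratively Re-weighted Least Squares."""
    w = np.zeros(X.shape[1])            # good starting point: w = 0
    for _ in range(n_iter):
        p = sigmoid(X @ w)              # fitted probabilities
        W = p * (1.0 - p)               # diagonal of the weight matrix
        z = X @ w + (y - p) / W         # adjusted response z = X w + W^{-1}(y - p)
        XtW = X.T * W                   # X^T W, exploiting that W is diagonal
        w = np.linalg.solve(XtW @ X, XtW @ z)   # w = (X^T W X)^{-1} X^T W z
    return w

# Illustrative run on synthetic data.
rng = np.random.default_rng(2)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = (rng.random(200) < sigmoid(X @ np.array([-0.5, 2.0, 1.0]))).astype(float)
print(irls(X, y))
```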
Observations
Good starting point: w = 0
However, convergence is never guaranteed!!!
However
Typically the algorithm does converge, since the log-likelihood is concave.
But overshooting can occur.
Observations
L(θ) = log \prod_{j=1}^{n} p(x_j|θ_j)
[Figure: when the Newton step overshoots, step halving solves the problem]
Perfect!!!
We can decompose the matrix
Given A = X^T W X and Y = X^T W z
You have
A x = Y
Thus you need to obtain
x = A^{−1} Y
This can be seen as a system of linear equations
As you can see
We start with a set of linear equations with d + 1 unknowns x_1, x_2, ..., x_{d+1}:
a_{1,1} x_1 + a_{1,2} x_2 + ... + a_{1,d+1} x_{d+1} = y_1
a_{2,1} x_1 + a_{2,2} x_2 + ... + a_{2,d+1} x_{d+1} = y_2
...
a_{d+1,1} x_1 + a_{d+1,2} x_2 + ... + a_{d+1,d+1} x_{d+1} = y_{d+1}
Thus
A set of values for x_1, x_2, ..., x_{d+1} that satisfies all of the equations simultaneously is said to be a solution to these equations.
Note it is easier to solve the system A x = Y.
How?
What is the Cholesky Decomposition?
It is a method that factorizes a matrix
A ∈ C^{(d+1)×(d+1)} that is a positive definite Hermitian matrix
Positive definite matrix
x^H A x > 0 for all nonzero x ∈ C^{d+1}
Hermitian matrix
A = A^H (in the real case, A = A^T)
Therefore
Cholesky decomposes A into a lower or upper triangular matrix and its conjugate transpose
A = L L^T
A = R^T R
In our particular case
Given that X^T W X ∈ R^{(d+1)×(d+1)} with X^T W X = (X^T W X)^T, we have that
X^T W X = L L^T
Thus, we can use the Cholesky decomposition
1. The Cholesky decomposition is of order O(d³) and requires (1/6) d³ FLOPs.
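In practice the weighted least squares system of the IRLS step can be solved through exactly this factorization. Below is a minimal sketch using numpy and scipy's Cholesky solver; the helper name and data shapes are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_step_cholesky(X, W_diag, z):
    """Solve (X^T W X) w = X^T W z via the Cholesky factorization."""
    XtW = X.T * W_diag            # X^T W, exploiting that W is diagonal
    A = XtW @ X                   # A = X^T W X, symmetric positive definite
    b = XtW @ z                   # right-hand side X^T W z
    c, low = cho_factor(A)        # factor A = L L^T
    return cho_solve((c, low), b) # forward then backward substitution

# Illustrative usage with random positive weights.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
W_diag = rng.uniform(0.1, 0.3, size=50)
z = rng.normal(size=50)
print(newton_step_cholesky(X, W_diag, z))
```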
We have
The matrices A = X^T W X ∈ R^{(d+1)×(d+1)} and its inverse X = A^{−1}, so that
A X = I
From Cholesky
R^T R X = I
If we define R X = B
R^T B = I
Now
If B = (R^T)^{−1} = L^{−1} for L = R^T
1. We note that the inverse of the lower triangular matrix L is lower triangular.
2. The diagonal entries of L^{−1} are the reciprocals of the diagonal entries of L:
[ a_{1,1}     0       ...   0         ]   [ b_{1,1}     0       ...   0         ]   [ 1  0  ...  0 ]
[ a_{2,1}   a_{2,2}   ...   0         ] · [ b_{2,1}   b_{2,2}   ...   0         ] = [ 0  1  ...  0 ]
[   ...       ...     ...   0         ]   [   ...       ...     ...   0         ]   [ ...   ...    ]
[ a_{d+1,1} a_{d+1,2} ... a_{d+1,d+1} ]   [ b_{d+1,1} b_{d+1,2} ... b_{d+1,d+1} ]   [ 0  ... 0   1 ]
Now
We construct the following matrix S
s_{i,j} = 1/l_{i,i} if i = j, and 0 otherwise
Now, we have
The matrix S gives the correct upper-triangular elements of the matrix B,
i.e. s_{ij} = b_{ij} for i ≤ j ≤ d + 1.
Then, we use backward substitution to solve for x_{i,j} in the equations R x_i = s_i
Assuming:
X = [x_1, x_2, ..., x_{d+1}]
S = [s_1, s_2, ..., s_{d+1}]
Back Substitution
Since R is upper-triangular, we can rewrite the system R x_i = s_i as
r_{1,1} x_{1,i} + r_{1,2} x_{2,i} + ... + r_{1,d−1} x_{d−1,i} + r_{1,d} x_{d,i} + r_{1,d+1} x_{d+1,i} = s_{1,i}
r_{2,2} x_{2,i} + ... + r_{2,d−1} x_{d−1,i} + r_{2,d} x_{d,i} + r_{2,d+1} x_{d+1,i} = s_{2,i}
...
r_{d−1,d−1} x_{d−1,i} + r_{d−1,d} x_{d,i} + r_{d−1,d+1} x_{d+1,i} = s_{d−1,i}
r_{d,d} x_{d,i} + r_{d,d+1} x_{d+1,i} = s_{d,i}
r_{d+1,d+1} x_{d+1,i} = s_{d+1,i}
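A minimal back-substitution routine for an upper-triangular system, as used above; a plain Python/numpy sketch with illustrative names and data.

```python
import numpy as np

def back_substitution(R, s):
    """Solve R x = s for upper-triangular R by working from the last row up."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # x_i = (s_i - sum_{j > i} r_{ij} x_j) / r_{ii}
        x[i] = (s[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

# Illustrative check against numpy's general solver.
R = np.triu(np.random.default_rng(4).normal(size=(5, 5))) + 5 * np.eye(5)
s = np.arange(1.0, 6.0)
print(np.allclose(back_substitution(R, s), np.linalg.solve(R, s)))   # True
```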
Then
We solve only for the x_{ij} such that
i < j ≤ d + 1 (upper triangle elements).
x_{ji} = x_{ij}
In our case they take the same value, given that we work over the reals.
Complexity
Equation solving requires
(1/3)(d + 1)³ multiply operations.
The total number of multiply operations for the matrix inverse,
including the Cholesky decomposition, is (1/2)(d + 1)³
Therefore
We have complexity O(d³)!!! Per iteration!!!
However
As Good Computer Scientists
We want to obtain a better complexity, such as O(d²)!!!
We can obtain such improvements
Using Quasi-Newton Methods
Let us develop the solution
For the most popular one
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method
The Second Order Approximation
We have
f(x) ≈ f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2)(x − x_k)^T H_f(x_k)(x − x_k)
We develop a new equation based on the previous idea by using x = x_k + p
m(x_k, p) = f(x_k) + ∇f(x_k)^T p + (1/2) p^T B_k p
Here
B_k is a (d + 1) × (d + 1) symmetric positive definite matrix that will be updated through the entire process
Therefore
We have
m(x_k, 0) = f(x_k)
∂m(x_k, 0)/∂p = ∇f(x_k) + B_k 0 = ∇f(x_k)
Therefore, the minimizer of this model can be used as a search direction
p_k = −B_k^{−1} ∇f(x_k)
We have the following iteration
x_{k+1} = x_k + α_k p_k
Where p_k is a descent direction.
Descent Direction
Definition
For a given point x ∈ R^{d+1}, a direction p is called a descent direction if there exists α > 0, α ∈ R, such that
f(x + αp) < f(x)
Lemma
Consider a point x ∈ R^{d+1}. Any direction p satisfying p^T ∇f(x) < 0 is a descent direction.
From this
(−B_k^{−1} ∇f(x_k))^T ∇f(x_k) = −∇f(x_k)^T (B_k^{−1})^T ∇f(x_k) < 0
Where
α_k is chosen to satisfy the Wolfe conditions
f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f(x_k)^T p_k
∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f(x_k)^T p_k
with 0 < c_1 < c_2 < 1
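A simple (not production-grade) line search that checks both conditions by backtracking is sketched below; the constants, the halving strategy, and the example problem are illustrative assumptions rather than the exact procedure of the slides.

```python
import numpy as np

def wolfe_line_search(f, grad, x, p, alpha0=1.0, c1=1e-4, c2=0.9, max_iter=50):
    """Shrink alpha until both the sufficient-decrease (Armijo) and
    curvature conditions hold. Returns the accepted step length."""
    f0 = f(x)
    slope0 = grad(x) @ p                 # directional derivative at alpha = 0 (negative)
    alpha = alpha0
    for _ in range(max_iter):
        armijo = f(x + alpha * p) <= f0 + c1 * alpha * slope0
        curvature = grad(x + alpha * p) @ p >= c2 * slope0
        if armijo and curvature:
            return alpha
        alpha *= 0.5                     # backtrack
    return alpha

# Illustrative use on f(x) = 0.5 ||x||^2 with the steepest-descent direction.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([3.0, -1.0])
print(wolfe_line_search(f, grad, x, p=-grad(x)))
```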
Graphically
f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f(x_k)^T p_k (Armijo Condition)
[Figure: acceptable step-length intervals under the Armijo condition]
Note: In practice, c_1 is chosen to be quite small, e.g. c_1 = 10^{−4}.
However, using that condition alone gives α's that are too small to progress fast
Therefore, we also require ∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f(x_k)^T p_k
This avoids small steps α_k
Note the following
The left-hand side is the derivative dφ(α_k)/dα_k, i.e. the condition ensures that the slope of φ at α_k is at least c_2 times the initial slope dφ(0)/dα.
Meaning
If the slope φ'(α_k) is strongly negative
We can reduce f significantly by moving further along the chosen direction.
If the slope is only slightly negative or even positive
We cannot expect much more decrease in f in this direction,
so it might be good to terminate the line search.
Graphically
We have
[Figure: acceptable intervals under the curvature condition, marked by the minimal tangent slope]
Now, instead of computing B_k at each iteration
Instead
Davidon proposed to update it in a simple manner to account for the curvature measured during the most recent step.
Imagine we generated x_{k+1} and wish to generate a new quadratic model
m(x_{k+1}, p) = f(x_{k+1}) + ∇f(x_{k+1})^T p + (1/2) p^T B_{k+1} p
What requirements should we impose on B_{k+1}?
One reasonable requirement is that the gradient of m should match the gradient of the objective function f at the latest two iterates x_k and x_{k+1}.
Now
Since ∇m(x_{k+1}, 0) is precisely ∇f(x_{k+1})
The second of these conditions is satisfied automatically.
The first condition can be written mathematically as
∇m(x_{k+1}, −α_k p_k) = ∇f(x_{k+1}) − α_k B_{k+1} p_k = ∇f(x_k)
We have
α_k B_{k+1} p_k = ∇f(x_{k+1}) − ∇f(x_k)
Then, we can simplify the notation
For this
s_k = x_{k+1} − x_k
y_k = ∇f(x_{k+1}) − ∇f(x_k)
We have the secant equation
B_{k+1} s_k = y_k
Given that B_{k+1} is symmetric positive definite, the displacement s_k and the change in gradient y_k satisfy
s_k^T B_{k+1} s_k = s_k^T y_k > 0
Making it possible for B_{k+1} to map s_k into y_k!!!
Interpretation
If our update satisfies the secant condition
B_{k+1}(x_{k+1} − x_k) = ∇f(x_{k+1}) − ∇f(x_k)
This implies that for the second-order approximation at x_{k+1} of our target function f
f_approx(x_{k+1}) = f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (1/2)(x_{k+1} − x_k)^T B_{k+1}(x_{k+1} − x_k)
Then
Taking the gradient with respect to x_{k+1} − x_k
∇f_approx(x_{k+1}) = ∇f(x_k) + B_{k+1}(x_{k+1} − x_k)
                   = ∇f(x_k) + ∇f(x_{k+1}) − ∇f(x_k)
                   = ∇f(x_{k+1})
Now, we have two cases
When f is strongly convex
The inequality will be satisfied for any two points x_k and x_{k+1}.
This condition will not always hold for non-convex functions
We need to enforce s_k^T y_k > 0 explicitly.
The previous condition holds under
The Wolfe or Strong Wolfe conditions.
How?
We have the second part, ∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f(x_k)^T p_k
By remembering x_{k+1} = x_k + α_k p_k:
∇f(x_{k+1})^T s_k ≥ c_2 ∇f(x_k)^T s_k, and y_k^T s_k = [∇f(x_{k+1}) − ∇f(x_k)]^T s_k
Then
y_k^T s_k = ∇f(x_{k+1})^T s_k − ∇f(x_k)^T s_k
          ≥ c_2 ∇f(x_k)^T [α_k p_k] − ∇f(x_k)^T [α_k p_k]
          = (c_2 − 1) α_k ∇f(x_k)^T p_k
Then
Given that c_2 − 1 < 0 and α_k > 0
(c_2 − 1) α_k ∇f(x_k)^T p_k = (c_2 − 1) α_k ∇f(x_k)^T (−B_k^{−1} ∇f(x_k)) > 0
Thus, the curvature condition holds!!!
Then, the secant equation always has a solution under the curvature condition.
In fact, it admits an infinite number of solutions.
Given
Since there are d(d+1)/2 degrees of freedom in a symmetric matrix,
but the secant equation and positive definiteness impose about 2d constraints,
Leaving d(d+1)/2 − 2d = (d² + d − 4d)/2 = (d² − 3d)/2 degrees of freedom... enough for infinitely many solutions.
Given that we want a unique solution
We could solve the following problem
min_B ||B − B_k||
s.t. B = B^T, B s_k = y_k
Many matrix norms give rise to different Quasi-Newton methods
We use the weighted Frobenius norm (scale invariant), defined as
||A||_W = || W^{1/2} A W^{1/2} ||_F
Where
||A||_F = ( \sum_{i=1}^{d+1} \sum_{j=1}^{d+1} a_{ij}² )^{1/2}
In our case, we choose
We have the average Hessian
G_k = \int_0^1 ∇²f(x_k + τ α_k p_k) dτ
The property y_k = G_k α_k p_k = G_k s_k
It follows from Taylor's Theorem!!!
For the BFGS
In the previous setup we would get B_k, and then its inverse update
H_k = B_k^{−1}
In BFGS we go directly for the inverse by setting up:
min_H ||H − H_k||
s.t. H = H^T, H y_k = s_k
A unique solution will be
H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T   (1)
where ρ_k = 1 / (y_k^T s_k)
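Equation (1) is a rank-two update that can be coded in a few lines of numpy; the sketch below assumes the vectors are one-dimensional arrays, and all names are illustrative.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Apply Eq. (1): H_{k+1} = (I - rho s y^T) H (I - rho y s^T) + rho s s^T."""
    rho = 1.0 / (y @ s)                       # requires the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Illustrative check: after the update, the secant equation H_{k+1} y = s holds.
rng = np.random.default_rng(5)
H = np.eye(3)
s, y = rng.normal(size=3), rng.normal(size=3)
if y @ s <= 0:
    y = -y                                    # force y^T s > 0 for this illustration
H_new = bfgs_inverse_update(H, s, y)
print(np.allclose(H_new @ y, s))              # True
```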
Problem
There is no magic formula to find an initial H_0
We can use specific information about the problem:
For instance, by setting it to the inverse of an approximate Hessian calculated by finite differences at x_0.
In our case, we have ∂²L(w)/∂w∂w^T, or in matrix form X^T W X, from which we could get an initial setup.
Algorithm (BFGS Method)
Quasi-Newton Algorithm
Input: starting point x_0, convergence tolerance ε, inverse Hessian approximation H_0
1 k ← 0
2 while ||∇f(x_k)|| > ε
3   Compute search direction p_k = −H_k ∇f(x_k)
4   Set x_{k+1} = x_k + α_k p_k, where α_k is obtained from a line search (under the Wolfe conditions).
5   Define s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k)
6   Compute H_{k+1} by means of (Eq. 1)
7   k ← k + 1
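For the logistic regression problem itself, one would typically not hand-roll the loop above but call an existing BFGS implementation on the negative log-likelihood; the sketch below uses scipy.optimize.minimize as an illustration (function names and data are assumptions, not from the slides).

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def neg_log_likelihood(w, X, y):
    # -L(w) = -sum_i [ y_i w^T x_i - log(1 + exp{w^T x_i}) ]
    scores = X @ w
    return -np.sum(y * scores - np.log1p(np.exp(scores)))

def neg_score(w, X, y):
    return -(X.T @ (y - sigmoid(X @ w)))

rng = np.random.default_rng(6)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
y = (rng.random(300) < sigmoid(X @ np.array([0.3, -1.5, 2.0]))).astype(float)

result = minimize(neg_log_likelihood, np.zeros(3), args=(X, y),
                  jac=neg_score, method="BFGS")
print(result.x)   # estimated weights
```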
Complexity
The cost of the update, or the inverse update, is O(d²) operations per iteration.
For more
Nocedal, Jorge & Wright, Stephen J. (1999). Numerical Optimization. Springer-Verlag.
Given the following
Because the likelihood is convex,
[Figure: illustration of the likelihood surface]
Caution
Here, we change the labeling to y_i = ±1 with
p(y_i = ±1|x, w) = σ(y w^T x) = 1 / (1 + exp{−y w^T x})
Thus, we have the following log likelihood under regularization λ > 0
L(w) = − \sum_{i=1}^{N} log(1 + exp{−y_i w^T x_i}) − (λ/2) w^T w
It is possible to use Gradient Descent, with
∇_w L(w) = \sum_{i=1}^{N} [ 1 − 1/(1 + exp{−y_i w^T x_i}) ] y_i x_i − λw
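The sketch below evaluates this regularized log-likelihood and its gradient with numpy for ±1 labels, and performs a few plain gradient steps on it; the step size and data are illustrative assumptions.

```python
import numpy as np

def reg_log_likelihood(w, X, y, lam):
    # L(w) = -sum_i log(1 + exp{-y_i w^T x_i}) - (lam/2) w^T w, with y_i in {-1, +1}
    margins = y * (X @ w)
    return -np.sum(np.log1p(np.exp(-margins))) - 0.5 * lam * (w @ w)

def reg_gradient(w, X, y, lam):
    # grad L(w) = sum_i (1 - 1/(1 + exp{-y_i w^T x_i})) y_i x_i - lam w
    margins = y * (X @ w)
    coeff = 1.0 - 1.0 / (1.0 + np.exp(-margins))
    return X.T @ (coeff * y) - lam * w

rng = np.random.default_rng(7)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]) + 0.3 * rng.normal(size=200))
w, lam, step = np.zeros(3), 0.1, 0.01
for _ in range(500):
    w += step * reg_gradient(w, X, y, lam)   # ascent on L(w)
print(w, reg_log_likelihood(w, X, y, lam))
```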
Danger Will Robinson!!!
The resulting gradient descent update resembles the Perceptron learning algorithm
However, unlike the Perceptron, it will always converge for a suitable step size, regardless of
whether the classes are separable!!!
92 / 98
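Under this regularized objective, the update alluded to here is just a fixed-step move along ∇w L(w). A minimal, self-contained sketch (hypothetical name gradient_ascent; it maximizes L(w), i.e., descends the negative log-likelihood):

import numpy as np

def gradient_ascent(X, y, lam, step=0.1, n_iter=1000):
    # Fixed-step ascent on the penalized log-likelihood, with y_i in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = y * (X @ w)
        coeff = 1.0 - 1.0 / (1.0 + np.exp(-margins))   # 1 - sigma(y_i w^T x_i)
        grad = X.T @ (coeff * y) - lam * w              # gradient from the previous slide
        w = w + step * grad
    return w

Because λ > 0 makes the objective strongly concave, this loop converges for a small enough step size even when the classes are linearly separable.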
Images/cinvestav
Outline
1 Logistic Regression
Introduction
Constraints
The Initial Model
The Two Case Class
Graphic Interpretation
Fitting The Model
The Two Class Case
The Final Log-Likelihood
The Newton-Raphson Algorithm
Matrix Notation
2 More on Optimization Methods
Using Cholesky Decomposition
Cholesky Decomposition
The Proposed Method
Quasi-Newton Method
The Second Order Approximation
The Wolfe Condition
Proving BFGS
The BFGS Algorithm
A Neat Trick: Coordinate Ascent
Coordinate Ascent Algorithm
Conclusion
93 / 98
Images/cinvestav
Here
We will simplify our work
By stating the algorithm for coordinate ascent
Then
A more precise version will be given
94 / 98
Images/cinvestav
Coordinate Ascent
Algorithm
Input: Max, an initial point x0
1 counter ← 0
2 while counter < Max
3 for t ← 1, ..., d
4 Randomly pick a coordinate i
5 Compute a step size δ∗ by approximately solving δ∗ = arg maxδ f (x + δ ei)
6 xi ← xi + δ∗
7 counter ← counter + 1
Where
ei = (0, ..., 0, 1, 0, ..., 0)^T, with the 1 in position i
95 / 98
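As a rough illustration of steps 3-6 (a hypothetical sketch, not the deck's code), the approximate one-dimensional maximization in step 5 can be delegated to scipy.optimize.minimize_scalar applied to −f:

import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_ascent(f, x0, max_sweeps=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    d = x.size
    for _ in range(max_sweeps):            # "while counter < Max"
        for _ in range(d):                 # one sweep of d coordinate updates
            i = rng.integers(d)            # randomly pick a coordinate i
            e_i = np.zeros(d)
            e_i[i] = 1.0
            # delta* ~ argmax_delta f(x + delta * e_i), obtained by minimizing -f
            res = minimize_scalar(lambda delta: -f(x + delta * e_i))
            x[i] += res.x
    return x

For a concave f, each one-dimensional subproblem is itself concave, so every accepted step can only increase the objective (up to the accuracy of the 1-D search).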
Images/cinvestav
In the case of Logistic Regression
Thus, we can optimize each w_k alternately by a coordinate-wise Newton update. Writing σi = 1/(1 + exp{−yi w^T xi}),
w_k^new = w_k^old + [ −λ w_k^old + Σ_{i=1}^{N} (1 − σi) yi xik ] / [ λ + Σ_{i=1}^{N} xik² σi (1 − σi) ]
Complexity of this update
Item | Complexity
Σ_{i=1}^{N} (1 − σi) yi xik | O(N)
Σ_{i=1}^{N} xik² σi (1 − σi) | O(N)
Total Complexity | O(N)
For all the dimensions | O(Nd)
96 / 98
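The update above maps almost line by line onto code. The following sketch (hypothetical name coordinate_newton_step; it assumes X is N×d, y ∈ {−1,+1}^N, and λ > 0) performs one such step on coordinate k; each call scans the data once, matching the O(N) per-coordinate and O(Nd) per-sweep costs in the table.

import numpy as np

def coordinate_newton_step(w, X, y, lam, k):
    # sigma_i = 1 / (1 + exp(-y_i w^T x_i))
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))
    # Numerator: -lam * w_k + sum_i (1 - sigma_i) * y_i * x_ik      -- O(N)
    num = -lam * w[k] + np.sum((1.0 - sigma) * y * X[:, k])
    # Denominator: lam + sum_i x_ik^2 * sigma_i * (1 - sigma_i)     -- O(N)
    den = lam + np.sum(X[:, k] ** 2 * sigma * (1.0 - sigma))
    w_new = w.copy()
    w_new[k] = w[k] + num / den
    return w_new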
Images/cinvestav
Outline
1 Logistic Regression
Introduction
Constraints
The Initial Model
The Two Case Class
Graphic Interpretation
Fitting The Model
The Two Class Case
The Final Log-Likelihood
The Newton-Raphson Algorithm
Matrix Notation
2 More on Optimization Methods
Using Cholesky Decomposition
Cholesky Decomposition
The Proposed Method
Quasi-Newton Method
The Second Order Approximation
The Wolfe Condition
Proving BFGS
The BFGS Algorithm
A Neat Trick: Coordinate Ascent
Coordinate Ascent Algorithm
Conclusion
97 / 98
Images/cinvestav
We have the following complexities per iteration
Complexities
Method | Per Iteration | Convergence Rate
Cholesky Decomposition | d³/2 = O(d³) | Quadratic
Quasi-Newton BFGS | O(d²) | Superlinear
Coordinate Ascent | O(Nd) | Not established
98 / 98
Weitere ähnliche Inhalte

Was ist angesagt?

23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature GenerationAndres Mendez-Vazquez
 
28 Machine Learning Unsupervised Hierarchical Clustering
28 Machine Learning Unsupervised Hierarchical Clustering28 Machine Learning Unsupervised Hierarchical Clustering
28 Machine Learning Unsupervised Hierarchical ClusteringAndres Mendez-Vazquez
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster ValidityAndres Mendez-Vazquez
 
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...Andres Mendez-Vazquez
 
Markov chain monte_carlo_methods_for_machine_learning
Markov chain monte_carlo_methods_for_machine_learningMarkov chain monte_carlo_methods_for_machine_learning
Markov chain monte_carlo_methods_for_machine_learningAndres Mendez-Vazquez
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi DivergenceMasahiro Suzuki
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector MachinesEdgar Marca
 
27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure PropertiesAndres Mendez-Vazquez
 
16 Machine Learning Universal Approximation Multilayer Perceptron
16 Machine Learning Universal Approximation Multilayer Perceptron16 Machine Learning Universal Approximation Multilayer Perceptron
16 Machine Learning Universal Approximation Multilayer PerceptronAndres Mendez-Vazquez
 
Artificial Intelligence 06.2 More on Causality Bayesian Networks
Artificial Intelligence 06.2 More on  Causality Bayesian NetworksArtificial Intelligence 06.2 More on  Causality Bayesian Networks
Artificial Intelligence 06.2 More on Causality Bayesian NetworksAndres Mendez-Vazquez
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsChristian Robert
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural NetworkMasahiro Suzuki
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel LearningMasahiro Suzuki
 
Important Cuts
Important CutsImportant Cuts
Important CutsASPAK2014
 
MLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackMLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackarogozhnikov
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3arogozhnikov
 
Treewidth and Applications
Treewidth and ApplicationsTreewidth and Applications
Treewidth and ApplicationsASPAK2014
 

Was ist angesagt? (20)

23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature Generation
 
28 Machine Learning Unsupervised Hierarchical Clustering
28 Machine Learning Unsupervised Hierarchical Clustering28 Machine Learning Unsupervised Hierarchical Clustering
28 Machine Learning Unsupervised Hierarchical Clustering
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
 
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
 
Markov chain monte_carlo_methods_for_machine_learning
Markov chain monte_carlo_methods_for_machine_learningMarkov chain monte_carlo_methods_for_machine_learning
Markov chain monte_carlo_methods_for_machine_learning
 
06 Machine Learning - Naive Bayes
06 Machine Learning - Naive Bayes06 Machine Learning - Naive Bayes
06 Machine Learning - Naive Bayes
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector Machines
 
27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties
 
16 Machine Learning Universal Approximation Multilayer Perceptron
16 Machine Learning Universal Approximation Multilayer Perceptron16 Machine Learning Universal Approximation Multilayer Perceptron
16 Machine Learning Universal Approximation Multilayer Perceptron
 
Artificial Intelligence 06.2 More on Causality Bayesian Networks
Artificial Intelligence 06.2 More on  Causality Bayesian NetworksArtificial Intelligence 06.2 More on  Causality Bayesian Networks
Artificial Intelligence 06.2 More on Causality Bayesian Networks
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithms
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning
 
Important Cuts
Important CutsImportant Cuts
Important Cuts
 
MLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackMLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic track
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Treewidth and Applications
Treewidth and ApplicationsTreewidth and Applications
Treewidth and Applications
 
Richard Everitt's slides
Richard Everitt's slidesRichard Everitt's slides
Richard Everitt's slides
 

Ähnlich wie Introduction to logistic regression

Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking componentsChristian Robert
 
A nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formulaA nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formulaAlexander Litvinenko
 
Cheatsheet supervised-learning
Cheatsheet supervised-learningCheatsheet supervised-learning
Cheatsheet supervised-learningSteve Nouri
 
Mit2 092 f09_lec23
Mit2 092 f09_lec23Mit2 092 f09_lec23
Mit2 092 f09_lec23Rahman Hakim
 
MCMC and likelihood-free methods
MCMC and likelihood-free methodsMCMC and likelihood-free methods
MCMC and likelihood-free methodsChristian Robert
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsFrank Nielsen
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackarogozhnikov
 
12 Machine Learning Supervised Hidden Markov Chains
12 Machine Learning  Supervised Hidden Markov Chains12 Machine Learning  Supervised Hidden Markov Chains
12 Machine Learning Supervised Hidden Markov ChainsAndres Mendez-Vazquez
 
Distributed solution of stochastic optimal control problem on GPUs
Distributed solution of stochastic optimal control problem on GPUsDistributed solution of stochastic optimal control problem on GPUs
Distributed solution of stochastic optimal control problem on GPUsPantelis Sopasakis
 
System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...
System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...
System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...Cemal Ardil
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixturesChristian Robert
 
Loss Calibrated Variational Inference
Loss Calibrated Variational InferenceLoss Calibrated Variational Inference
Loss Calibrated Variational InferenceTomasz Kusmierczyk
 
Variational Bayes: A Gentle Introduction
Variational Bayes: A Gentle IntroductionVariational Bayes: A Gentle Introduction
Variational Bayes: A Gentle IntroductionFlavio Morelli
 
Linear models for classification
Linear models for classificationLinear models for classification
Linear models for classificationSung Yub Kim
 
Probability cheatsheet
Probability cheatsheetProbability cheatsheet
Probability cheatsheetSuvrat Mishra
 

Ähnlich wie Introduction to logistic regression (20)

Recursive Compressed Sensing
Recursive Compressed SensingRecursive Compressed Sensing
Recursive Compressed Sensing
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
A nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formulaA nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formula
 
Cheatsheet supervised-learning
Cheatsheet supervised-learningCheatsheet supervised-learning
Cheatsheet supervised-learning
 
Mit2 092 f09_lec23
Mit2 092 f09_lec23Mit2 092 f09_lec23
Mit2 092 f09_lec23
 
MCMC and likelihood-free methods
MCMC and likelihood-free methodsMCMC and likelihood-free methods
MCMC and likelihood-free methods
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributions
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic track
 
12 Machine Learning Supervised Hidden Markov Chains
12 Machine Learning  Supervised Hidden Markov Chains12 Machine Learning  Supervised Hidden Markov Chains
12 Machine Learning Supervised Hidden Markov Chains
 
Distributed solution of stochastic optimal control problem on GPUs
Distributed solution of stochastic optimal control problem on GPUsDistributed solution of stochastic optimal control problem on GPUs
Distributed solution of stochastic optimal control problem on GPUs
 
System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...
System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...
System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
 
The Gaussian Hardy-Littlewood Maximal Function
The Gaussian Hardy-Littlewood Maximal FunctionThe Gaussian Hardy-Littlewood Maximal Function
The Gaussian Hardy-Littlewood Maximal Function
 
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
QMC: Transition Workshop - Applying Quasi-Monte Carlo Methods to a Stochastic...
 
Particle filter
Particle filterParticle filter
Particle filter
 
Pairing scott
Pairing scottPairing scott
Pairing scott
 
Loss Calibrated Variational Inference
Loss Calibrated Variational InferenceLoss Calibrated Variational Inference
Loss Calibrated Variational Inference
 
Variational Bayes: A Gentle Introduction
Variational Bayes: A Gentle IntroductionVariational Bayes: A Gentle Introduction
Variational Bayes: A Gentle Introduction
 
Linear models for classification
Linear models for classificationLinear models for classification
Linear models for classification
 
Probability cheatsheet
Probability cheatsheetProbability cheatsheet
Probability cheatsheet
 

Mehr von Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusAndres Mendez-Vazquez
 
Introduction Machine Learning Syllabus
Introduction Machine Learning SyllabusIntroduction Machine Learning Syllabus
Introduction Machine Learning SyllabusAndres Mendez-Vazquez
 

Mehr von Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 
Introduction Machine Learning Syllabus
Introduction Machine Learning SyllabusIntroduction Machine Learning Syllabus
Introduction Machine Learning Syllabus
 

Kürzlich hochgeladen

Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptMsecMca
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfrs7054576148
 

Kürzlich hochgeladen (20)

Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 

Introduction to logistic regression

  • 1. Introduction to Machine Learning Logistic Regression and More on Optimization Andres Mendez-Vazquez September 3, 2017 1 / 98
  • 2. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 2 / 98
  • 3. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 3 / 98
  • 4. Images/cinvestav Assume the following Let Y1, Y2, ..., YN independent random variables Taking values in the set {0, 1} Now, you have a set of fixed vectors x1, x2, ..., xN Mapped to a series of numbers by a weight vector w wT x1, wT x2, ..., wT xN 4 / 98
  • 5. Images/cinvestav Assume the following Let Y1, Y2, ..., YN independent random variables Taking values in the set {0, 1} Now, you have a set of fixed vectors x1, x2, ..., xN Mapped to a series of numbers by a weight vector w wT x1, wT x2, ..., wT xN 4 / 98
  • 6. Images/cinvestav Assume the following Let Y1, Y2, ..., YN independent random variables Taking values in the set {0, 1} Now, you have a set of fixed vectors x1, x2, ..., xN Mapped to a series of numbers by a weight vector w wT x1, wT x2, ..., wT xN 4 / 98
  • 7. Images/cinvestav In our simplest from There is a suspected relation Between θi = P (Yi = 1) and wT xi Thus we have y = 1 wT x + e > 0 0 else Note: Where e is an error with a certain distribution!!! 5 / 98
  • 8. Images/cinvestav In our simplest from There is a suspected relation Between θi = P (Yi = 1) and wT xi Thus we have y = 1 wT x + e > 0 0 else Note: Where e is an error with a certain distribution!!! 5 / 98
  • 10. Images/cinvestav It is better to user a logit version We have 7 / 98
  • 11. Images/cinvestav Logit Distribution PDF with support z ∈ (−∞, ∞) ,µ location and s scale p (x|µ, s) = exp −z−µ s s 1 + exp −z−µ s 2 With a CDF P (Y < z) = z −∞ p (y|µ, s) dy = 1 1 + exp −z−µ s 8 / 98
  • 12. Images/cinvestav Logit Distribution PDF with support z ∈ (−∞, ∞) ,µ location and s scale p (x|µ, s) = exp −z−µ s s 1 + exp −z−µ s 2 With a CDF P (Y < z) = z −∞ p (y|µ, s) dy = 1 1 + exp −z−µ s 8 / 98
  • 13. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 9 / 98
  • 14. Images/cinvestav In Bayesian Classification Assignment of a pattern It is performed by using the posterior probabilities, P (ωi|x) And given K classes, we want K i=1 P (ωi|x) = 1 Such that each 0 ≤ P (ωi|x) ≤ 1 10 / 98
  • 15. Images/cinvestav In Bayesian Classification Assignment of a pattern It is performed by using the posterior probabilities, P (ωi|x) And given K classes, we want K i=1 P (ωi|x) = 1 Such that each 0 ≤ P (ωi|x) ≤ 1 10 / 98
  • 16. Images/cinvestav In Bayesian Classification Assignment of a pattern It is performed by using the posterior probabilities, P (ωi|x) And given K classes, we want K i=1 P (ωi|x) = 1 Such that each 0 ≤ P (ωi|x) ≤ 1 10 / 98
  • 17. Images/cinvestav Observation This is a typical example of the discriminative approach Where the distribution of data is of no interest. 11 / 98
  • 18. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 12 / 98
  • 19. Images/cinvestav The Model We have the following under the extended features log P (ω1|x) P (ωK|x) = wT 1 x log P (ω2|x) P (ωK|x) = wT 2 x ... ... log P (ωK−1|x) P (ωK|x) = wT K−1x 13 / 98
  • 20. Images/cinvestav The Model We have the following under the extended features log P (ω1|x) P (ωK|x) = wT 1 x log P (ω2|x) P (ωK|x) = wT 2 x ... ... log P (ωK−1|x) P (ωK|x) = wT K−1x 13 / 98
  • 21. Images/cinvestav The Model We have the following under the extended features log P (ω1|x) P (ωK|x) = wT 1 x log P (ω2|x) P (ωK|x) = wT 2 x ... ... log P (ωK−1|x) P (ωK|x) = wT K−1x 13 / 98
  • 22. Images/cinvestav Further We have The model is specified in terms of K − 1 log-odds or logit transformations. And Although the model uses the last class as the denominator in the odds-ratios. The choice of denominator is arbitrary Because the estimates are equivariant under this choice. 14 / 98
  • 23. Images/cinvestav Further We have The model is specified in terms of K − 1 log-odds or logit transformations. And Although the model uses the last class as the denominator in the odds-ratios. The choice of denominator is arbitrary Because the estimates are equivariant under this choice. 14 / 98
  • 24. Images/cinvestav Further We have The model is specified in terms of K − 1 log-odds or logit transformations. And Although the model uses the last class as the denominator in the odds-ratios. The choice of denominator is arbitrary Because the estimates are equivariant under this choice. 14 / 98
  • 25. Images/cinvestav It is possible to show that We have that, for l = 1, 2, ..., K − 1 P (ωl|x) P (ωK|x) = exp wT l x Therefore K−1 l=1 P (ωl|x) P (ωK|x) = K−1 l=1 exp wT l x Thus 1 − P (ωK|x) P (ωK|x) = K−1 l=1 exp wT l x 15 / 98
  • 26. Images/cinvestav It is possible to show that We have that, for l = 1, 2, ..., K − 1 P (ωl|x) P (ωK|x) = exp wT l x Therefore K−1 l=1 P (ωl|x) P (ωK|x) = K−1 l=1 exp wT l x Thus 1 − P (ωK|x) P (ωK|x) = K−1 l=1 exp wT l x 15 / 98
  • 27. Images/cinvestav It is possible to show that We have that, for l = 1, 2, ..., K − 1 P (ωl|x) P (ωK|x) = exp wT l x Therefore K−1 l=1 P (ωl|x) P (ωK|x) = K−1 l=1 exp wT l x Thus 1 − P (ωK|x) P (ωK|x) = K−1 l=1 exp wT l x 15 / 98
  • 28. Images/cinvestav Basically We have P (ωK|x) = 1 1 + K−1 l=1 exp wT l x Then P (ωi|x) 1 1+ K−1 l=1 exp{wT l x} = exp wT l x For i = 1, 2, ..., k − 1 P (ωi|x) = exp wT i x 1 + K−1 l=1 exp wT l x 16 / 98
  • 29. Images/cinvestav Basically We have P (ωK|x) = 1 1 + K−1 l=1 exp wT l x Then P (ωi|x) 1 1+ K−1 l=1 exp{wT l x} = exp wT l x For i = 1, 2, ..., k − 1 P (ωi|x) = exp wT i x 1 + K−1 l=1 exp wT l x 16 / 98
  • 30. Images/cinvestav Basically We have P (ωK|x) = 1 1 + K−1 l=1 exp wT l x Then P (ωi|x) 1 1+ K−1 l=1 exp{wT l x} = exp wT l x For i = 1, 2, ..., k − 1 P (ωi|x) = exp wT i x 1 + K−1 l=1 exp wT l x 16 / 98
  • 31. Images/cinvestav Additionally For K P (ωK|x) = 1 1 + K−1 l=1 exp wT l x Easy to see They sum to one. 17 / 98
  • 32. Images/cinvestav Additionally For K P (ωK|x) = 1 1 + K−1 l=1 exp wT l x Easy to see They sum to one. 17 / 98
  • 33. Images/cinvestav A Note in Notation Given all these parameters, we summarized them Θ = {w1, w2, ..., wK−1} Therefore P (ωl|X = x) = pl (x|Θ) 18 / 98
  • 34. Images/cinvestav A Note in Notation Given all these parameters, we summarized them Θ = {w1, w2, ..., wK−1} Therefore P (ωl|X = x) = pl (x|Θ) 18 / 98
  • 35. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 19 / 98
  • 36. Images/cinvestav In the two class case We have P1 (ω1|x) = exp wT x 1 + exp {wT x} P2 (ω2|x) = 1 1 + exp {wT x} A similar model P1 (ω1|x) = exp −wT x 1 + exp {−wT x} P2 (ω2|x) = 1 1 + exp {−wT x} 20 / 98
  • 37. Images/cinvestav In the two class case We have P1 (ω1|x) = exp wT x 1 + exp {wT x} P2 (ω2|x) = 1 1 + exp {wT x} A similar model P1 (ω1|x) = exp −wT x 1 + exp {−wT x} P2 (ω2|x) = 1 1 + exp {−wT x} 20 / 98
  • 38. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 21 / 98
  • 39. Images/cinvestav We have the following split Using f (x) = wT x 22 / 98
  • 40. Images/cinvestav Then, we have the mapping to We have Class 1 Class 2 23 / 98
  • 41. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 24 / 98
  • 42. Images/cinvestav A Classic application of Maximum Likelihood Given a sequence of smaples iid x1, x2, ..., xN We have the following pdf p (x1, x2, ..., xN |Θ) = N i=1 p (xi|Θ) 25 / 98
  • 43. Images/cinvestav A Classic application of Maximum Likelihood Given a sequence of smaples iid x1, x2, ..., xN We have the following pdf p (x1, x2, ..., xN |Θ) = N i=1 p (xi|Θ) 25 / 98
  • 44. Images/cinvestav P (G|X) Distribution P (G|X) completeley specify the conditional distribution We have a multinomial distribution which under the log-likelihood of N observations: L (Θ) = log p (x1, x2, ..., xN |Θ) = log N i=1 pgi (xi|θ) = N i=1 log pgi (xi|θ) Where gi represent the class that xi belongs. gi = 1 if xi ∈ Class 1 2 if xi ∈ Class 2 26 / 98
  • 45. Images/cinvestav P (G|X) Distribution P (G|X) completeley specify the conditional distribution We have a multinomial distribution which under the log-likelihood of N observations: L (Θ) = log p (x1, x2, ..., xN |Θ) = log N i=1 pgi (xi|θ) = N i=1 log pgi (xi|θ) Where gi represent the class that xi belongs. gi = 1 if xi ∈ Class 1 2 if xi ∈ Class 2 26 / 98
  • 46. Images/cinvestav P (G|X) Distribution P (G|X) completeley specify the conditional distribution We have a multinomial distribution which under the log-likelihood of N observations: L (Θ) = log p (x1, x2, ..., xN |Θ) = log N i=1 pgi (xi|θ) = N i=1 log pgi (xi|θ) Where gi represent the class that xi belongs. gi = 1 if xi ∈ Class 1 2 if xi ∈ Class 2 26 / 98
  • 47. Images/cinvestav P (G|X) Distribution P (G|X) completeley specify the conditional distribution We have a multinomial distribution which under the log-likelihood of N observations: L (Θ) = log p (x1, x2, ..., xN |Θ) = log N i=1 pgi (xi|θ) = N i=1 log pgi (xi|θ) Where gi represent the class that xi belongs. gi = 1 if xi ∈ Class 1 2 if xi ∈ Class 2 26 / 98
  • 48. Images/cinvestav P (G|X) Distribution P (G|X) completeley specify the conditional distribution We have a multinomial distribution which under the log-likelihood of N observations: L (Θ) = log p (x1, x2, ..., xN |Θ) = log N i=1 pgi (xi|θ) = N i=1 log pgi (xi|θ) Where gi represent the class that xi belongs. gi = 1 if xi ∈ Class 1 2 if xi ∈ Class 2 26 / 98
  • 49. Images/cinvestav How do we integrate this into a Cost Function? Clearly, we have two distributions We need to represent the distributions into the functions pgi (xi|θ). Why not to have all the distributions into this function pgi (xi|θ) = K−1 l=1 p (xi|wl)I{xi∈ωl} It is easy with the two classes Given that we have a binary situation!!! 27 / 98
  • 50. Images/cinvestav How do we integrate this into a Cost Function? Clearly, we have two distributions We need to represent the distributions into the functions pgi (xi|θ). Why not to have all the distributions into this function pgi (xi|θ) = K−1 l=1 p (xi|wl)I{xi∈ωl} It is easy with the two classes Given that we have a binary situation!!! 27 / 98
  • 51. Images/cinvestav How do we integrate this into a Cost Function? Clearly, we have two distributions We need to represent the distributions into the functions pgi (xi|θ). Why not to have all the distributions into this function pgi (xi|θ) = K−1 l=1 p (xi|wl)I{xi∈ωl} It is easy with the two classes Given that we have a binary situation!!! 27 / 98
  • 52. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 28 / 98
  • 53. Images/cinvestav Given the two case We have then a Bernoulli distribution p1 (xi|w) =   exp wT x 1 + exp {wT x}   yi p2 (xi|w) = 1 1 + exp {wT x} 1−yi With yi = 1 if xi ∈ Class 1 yi = 0 if xi ∈ Class 2 29 / 98
  • 54. Images/cinvestav Given the two case We have then a Bernoulli distribution p1 (xi|w) =   exp wT x 1 + exp {wT x}   yi p2 (xi|w) = 1 1 + exp {wT x} 1−yi With yi = 1 if xi ∈ Class 1 yi = 0 if xi ∈ Class 2 29 / 98
  • 55. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 30 / 98
  • 56. Images/cinvestav We have the following Cost Function L (w) = N i=1 {yi log p1 (xi|w) + (1 − yi) log (1 − p2 (xi|w))} After some reductions L (w) = N i=1 yiwT xi − log 1 + exp wT xi 31 / 98
  • 57. Images/cinvestav We have the following Cost Function L (w) = N i=1 {yi log p1 (xi|w) + (1 − yi) log (1 − p2 (xi|w))} After some reductions L (w) = N i=1 yiwT xi − log 1 + exp wT xi 31 / 98
  • 58. Images/cinvestav Now, we derive and set it to zero We have ∂L (w) ∂w = N i=1 xi yi − exp wT xi 1 + exp {wT xi} = 0 Which are d + 1 equations nonlinear N i=1 xi (yi − p (xi|w)) =                N i=1 1 × yi − exp{wT xi} 1+exp{wT xi} N i=1 xi 1 yi − exp{wT xi} 1+exp{wT xi} N i=1 xi 2 yi − exp{wT xi} 1+exp{wT xi} ... N i=1 xi d yi − exp{wT xi} 1+exp{wT xi}                = 0 It is know as a scoring function. 32 / 98
  • 59. Images/cinvestav Now, we derive and set it to zero We have ∂L (w) ∂w = N i=1 xi yi − exp wT xi 1 + exp {wT xi} = 0 Which are d + 1 equations nonlinear N i=1 xi (yi − p (xi|w)) =                N i=1 1 × yi − exp{wT xi} 1+exp{wT xi} N i=1 xi 1 yi − exp{wT xi} 1+exp{wT xi} N i=1 xi 2 yi − exp{wT xi} 1+exp{wT xi} ... N i=1 xi d yi − exp{wT xi} 1+exp{wT xi}                = 0 It is know as a scoring function. 32 / 98
  • 60. Images/cinvestav Finally In other words you 1 d + 1 nonlinear equations in w. 2 For example, from the first equation: N i=1 yi = N i=1 p (xi|w) The expected number of class ones matches the observed number. And hence also class twos. 33 / 98
  • 61. Images/cinvestav Finally In other words you 1 d + 1 nonlinear equations in w. 2 For example, from the first equation: N i=1 yi = N i=1 p (xi|w) The expected number of class ones matches the observed number. And hence also class twos. 33 / 98
  • 62. Images/cinvestav Finally In other words you 1 d + 1 nonlinear equations in w. 2 For example, from the first equation: N i=1 yi = N i=1 p (xi|w) The expected number of class ones matches the observed number. And hence also class twos. 33 / 98
  • 63. Images/cinvestav Finally In other words you 1 d + 1 nonlinear equations in w. 2 For example, from the first equation: N i=1 yi = N i=1 p (xi|w) The expected number of class ones matches the observed number. And hence also class twos. 33 / 98
  • 64. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 34 / 98
  • 65. Images/cinvestav To solve the previous equations We use the Newton-Raphson Method It comes from the first Taylor Approximation f (x + h) ≈ f (x) + hf (x) Thus we have for a root r of function f We have 1 Assume a good estimate of r, x0 2 Thus we have r = x0 + h 3 Or h = r − x0 35 / 98
  • 66. Images/cinvestav To solve the previous equations We use the Newton-Raphson Method It comes from the first Taylor Approximation f (x + h) ≈ f (x) + hf (x) Thus we have for a root r of function f We have 1 Assume a good estimate of r, x0 2 Thus we have r = x0 + h 3 Or h = r − x0 35 / 98
  • 67. Images/cinvestav To solve the previous equations We use the Newton-Raphson Method It comes from the first Taylor Approximation f (x + h) ≈ f (x) + hf (x) Thus we have for a root r of function f We have 1 Assume a good estimate of r, x0 2 Thus we have r = x0 + h 3 Or h = r − x0 35 / 98
  • 68. Images/cinvestav To solve the previous equations We use the Newton-Raphson Method It comes from the first Taylor Approximation f (x + h) ≈ f (x) + hf (x) Thus we have for a root r of function f We have 1 Assume a good estimate of r, x0 2 Thus we have r = x0 + h 3 Or h = r − x0 35 / 98
  • 69. Images/cinvestav To solve the previous equations We use the Newton-Raphson Method It comes from the first Taylor Approximation f (x + h) ≈ f (x) + hf (x) Thus we have for a root r of function f We have 1 Assume a good estimate of r, x0 2 Thus we have r = x0 + h 3 Or h = r − x0 35 / 98
  • 70. Images/cinvestav We have then From Taylor 0 = f (r) = f (x0 + h) ≈ f (x0) + hf (x0) Thus, as long f (x0) is not close to 0 h ≈ − f (x0) f (x0) Thus r = x0 + h ≈ x0 − f (x0) f (x0) 36 / 98
  • 71. Images/cinvestav We have then From Taylor 0 = f (r) = f (x0 + h) ≈ f (x0) + hf (x0) Thus, as long f (x0) is not close to 0 h ≈ − f (x0) f (x0) Thus r = x0 + h ≈ x0 − f (x0) f (x0) 36 / 98
  • 72. Images/cinvestav We have then From Taylor 0 = f (r) = f (x0 + h) ≈ f (x0) + hf (x0) Thus, as long f (x0) is not close to 0 h ≈ − f (x0) f (x0) Thus r = x0 + h ≈ x0 − f (x0) f (x0) 36 / 98
  • 73. Images/cinvestav We have our final improving We have x1 ≈ x0 − f (x0) f (x0) 37 / 98
  • 74. Images/cinvestav Then, on the scoring function For this, we need the Hessian of the function ∂2L (w) ∂w∂wT = − N i=1 xixT i   exp wT xi 1 + exp {wT xi}    1 − exp wT xi 1 + exp {wT xi}   Thus, we have at a starting point wold wnew = wold − ∂L (w) ∂w∂wT −1 ∂L (w) ∂w 38 / 98
  • 75. Images/cinvestav Then, on the scoring function For this, we need the Hessian of the function ∂2L (w) ∂w∂wT = − N i=1 xixT i   exp wT xi 1 + exp {wT xi}    1 − exp wT xi 1 + exp {wT xi}   Thus, we have at a starting point wold wnew = wold − ∂L (w) ∂w∂wT −1 ∂L (w) ∂w 38 / 98
  • 76. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 39 / 98
  • 77. Images/cinvestav We can rewrite all as matrix notations Assume 1 Let y denotes the vector of yi 2 X is the data matrix N × (d + 1) 3 p the vector of fitted probabilities with the ith element p xi|wold 4 W a N × N diagonal matrix of weights with the ith diagonal element p xi|wold 1 − p xi|wold 40 / 98
  • 78. Images/cinvestav We can rewrite all as matrix notations Assume 1 Let y denotes the vector of yi 2 X is the data matrix N × (d + 1) 3 p the vector of fitted probabilities with the ith element p xi|wold 4 W a N × N diagonal matrix of weights with the ith diagonal element p xi|wold 1 − p xi|wold 40 / 98
  • 79. Images/cinvestav We can rewrite all as matrix notations Assume 1 Let y denotes the vector of yi 2 X is the data matrix N × (d + 1) 3 p the vector of fitted probabilities with the ith element p xi|wold 4 W a N × N diagonal matrix of weights with the ith diagonal element p xi|wold 1 − p xi|wold 40 / 98
  • 80. Images/cinvestav We can rewrite all as matrix notations Assume 1 Let y denotes the vector of yi 2 X is the data matrix N × (d + 1) 3 p the vector of fitted probabilities with the ith element p xi|wold 4 W a N × N diagonal matrix of weights with the ith diagonal element p xi|wold 1 − p xi|wold 40 / 98
  • 81. Images/cinvestav Then, we have For each updating term ∂L (w) ∂w = XT (y − p) ∂L (w) ∂w∂wT = −XT WX 41 / 98
  • 82. Images/cinvestav Then, we have Then, the Newton Step wnew = wold + XT WX −1 XT (y − p) = Iwold + XT WX −1 XT I (y − p) = XT WX −1 XT WXwold + XT WX −1 XT WW−1 (y − p) = XT WX −1 XT W Xwold + W−1 (y − p) = XT WX −1 XT Wz 42 / 98
  • 83. Images/cinvestav Then, we have Then, the Newton Step wnew = wold + XT WX −1 XT (y − p) = Iwold + XT WX −1 XT I (y − p) = XT WX −1 XT WXwold + XT WX −1 XT WW−1 (y − p) = XT WX −1 XT W Xwold + W−1 (y − p) = XT WX −1 XT Wz 42 / 98
  • 84. Images/cinvestav Then, we have Then, the Newton Step wnew = wold + XT WX −1 XT (y − p) = Iwold + XT WX −1 XT I (y − p) = XT WX −1 XT WXwold + XT WX −1 XT WW−1 (y − p) = XT WX −1 XT W Xwold + W−1 (y − p) = XT WX −1 XT Wz 42 / 98
  • 85. Images/cinvestav Then, we have Then, the Newton Step wnew = wold + XT WX −1 XT (y − p) = Iwold + XT WX −1 XT I (y − p) = XT WX −1 XT WXwold + XT WX −1 XT WW−1 (y − p) = XT WX −1 XT W Xwold + W−1 (y − p) = XT WX −1 XT Wz 42 / 98
  • 86. Images/cinvestav Then, we have Then, the Newton Step wnew = wold + XT WX −1 XT (y − p) = Iwold + XT WX −1 XT I (y − p) = XT WX −1 XT WXwold + XT WX −1 XT WW−1 (y − p) = XT WX −1 XT W Xwold + W−1 (y − p) = XT WX −1 XT Wz 42 / 98
  • 87. Images/cinvestav Then We have Re-expressed the Newton step as a weighted least squares step. With a the adjusted response as z = Xwold + W−1 (y − p) 43 / 98
  • 88. Images/cinvestav Then We have Re-expressed the Newton step as a weighted least squares step. With a the adjusted response as z = Xwold + W−1 (y − p) 43 / 98
  • 89. Images/cinvestav This New Algorithm It is know as Iteratively Re-weighted Least Squares or IRLS After all at each iteration, it solves A weighted Least Square Problem wnew ← arg min w (z − Xw)T W (z − Xw) 44 / 98
  • 90. Images/cinvestav This New Algorithm It is know as Iteratively Re-weighted Least Squares or IRLS After all at each iteration, it solves A weighted Least Square Problem wnew ← arg min w (z − Xw)T W (z − Xw) 44 / 98
  • 91. Images/cinvestav Observations Good Starting Point w = 0 However, convergence is never guaranteed!!! However Typically the algorithm does converge, since the log-likelihood is concave. But overshooting can occur. 45 / 98
  • 92. Images/cinvestav Observations Good Starting Point w = 0 However, convergence is never guaranteed!!! However Typically the algorithm does converge, since the log-likelihood is concave. But overshooting can occur. 45 / 98
  • 93. Images/cinvestav Observations L (θj) = log n j=1 p (xj|θj) Halving Solve the Problem Perfect!!! 46 / 98
  • 94. Images/cinvestav Observations L (θj) = log n j=1 p (xj|θj) Halving Solve the Problem Perfect!!! 46 / 98
  • 95. Images/cinvestav Outline 1 Logistic Regression Introduction Constraints The Initial Model The Two Case Class Graphic Interpretation Fitting The Model The Two Class Case The Final Log-Likelihood The Newton-Raphson Algorithm Matrix Notation 2 More on Optimization Methods Using Cholesky Decomposition Cholesky Decomposition The Proposed Method Quasi-Newton Method The Second Order Approximation The Wolfe Condition Proving BFGS The BFGS Algorithm A Neat Trick: Coordinate Ascent Coordinate Ascent Algorithm Conclusion 47 / 98
  • 96. Images/cinvestav We can decompose the matrix Given A = XT WX and Y = XT Wz You have Ax = Y Thus you need to obtain x = A−1 Y 48 / 98
• 98. This can be seen as a system of linear equations: we start with d + 1 equations in the d + 1 unknowns x_1, x_2, ..., x_{d+1}:
  a_{1,1} x_1 + a_{1,2} x_2 + ... + a_{1,d+1} x_{d+1} = y_1
  a_{2,1} x_1 + a_{2,2} x_2 + ... + a_{2,d+1} x_{d+1} = y_2
  ...
  a_{d+1,1} x_1 + a_{d+1,2} x_2 + ... + a_{d+1,d+1} x_{d+1} = y_{d+1}
  A set of values x_1, x_2, ..., x_{d+1} that satisfies all of the equations simultaneously is said to be a solution of the system. Note that it is easier to solve the system A x = Y directly — how?
• 102. What is the Cholesky decomposition? It is a method that factorizes a matrix A ∈ C^{(d+1)×(d+1)} that is positive definite and Hermitian.
  Positive definite: x^H A x > 0 for all nonzero x ∈ C^{d+1}.
  Hermitian: A = A^H (the conjugate transpose); for real matrices this is simply A = A^T.
• 105. Therefore, Cholesky decomposes A into a lower (or upper) triangular matrix and its conjugate transpose:
  A = L L^T, equivalently A = R^T R.
  In our particular case, X^T W X ∈ R^{(d+1)×(d+1)} with X^T W X = (X^T W X)^T, so X^T W X = L L^T and we can use the Cholesky decomposition.
  The Cholesky decomposition is of order O(d^3), requiring about (1/6) d^3 multiply operations. A sketch of this approach is given below.
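As a sketch of how this is used in practice, the snippet below (illustrative names and random stand-in data, not from the slides) solves the IRLS system (X^T W X) w = X^T W z with a Cholesky factorization instead of forming the inverse explicitly.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])   # N x (d+1) design matrix
p = rng.uniform(0.1, 0.9, size=20)                            # stand-in fitted probabilities
w_diag = p * (1.0 - p)                                        # diagonal of W
z = rng.normal(size=20)                                       # stand-in adjusted response

A = X.T @ (w_diag[:, None] * X)     # X^T W X, symmetric positive definite
b = X.T @ (w_diag * z)              # X^T W z

c, low = cho_factor(A)              # A = L L^T, O(d^3) work
w_new = cho_solve((c, low), b)      # two triangular solves, O(d^2) each

assert np.allclose(w_new, np.linalg.solve(A, b))
```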
• 109. We have the matrices A = X^T W X ∈ R^{(d+1)×(d+1)} and its unknown inverse X = A^{-1} (here X denotes the inverse, not the data matrix), so A X = I.
  From the Cholesky factorization, R^T R X = I.
  If we define R X = B, then R^T B = I.
• 112. Now, if B = (R^T)^{-1} = L^{-1} with L = R^T, note that:
  1. The inverse of the lower triangular matrix L is lower triangular.
  2. The diagonal entries of L^{-1} are the reciprocals of the diagonal entries of L,
  since the product of L (entries a_{i,j}) and L^{-1} (entries b_{i,j}), both lower triangular, must equal the identity matrix I.
• 113. Now we construct the following matrix S:
  s_{i,j} = 1 / l_{i,i}  if i = j,  and 0 otherwise.
• 114. The matrix S gives the correct right-hand sides for the upper-triangular elements of B, i.e. s_{ij} = b_{ij} for i ≤ j ≤ d + 1. We then use backward substitution to solve for x_{i,j} from the equations R x_i = s_i, writing the columns as X = [x_1, x_2, ..., x_{d+1}] and S = [s_1, s_2, ..., s_{d+1}].
• 116. Back substitution: since R is upper triangular, we can rewrite the system R x_i = s_i as
  r_{1,1} x_{1,i} + r_{1,2} x_{2,i} + ... + r_{1,d} x_{d,i} + r_{1,d+1} x_{d+1,i} = s_{1,i}
  r_{2,2} x_{2,i} + ... + r_{2,d} x_{d,i} + r_{2,d+1} x_{d+1,i} = s_{2,i}
  ...
  r_{d-1,d-1} x_{d-1,i} + r_{d-1,d} x_{d,i} + r_{d-1,d+1} x_{d+1,i} = s_{d-1,i}
  r_{d,d} x_{d,i} + r_{d,d+1} x_{d+1,i} = s_{d,i}
  r_{d+1,d+1} x_{d+1,i} = s_{d+1,i}
• 117. Then we solve only for the x_{ij} with i < j ≤ d + 1 (the upper-triangle elements); the remaining entries follow from symmetry, x_{ji} = x_{ij}, which is the same value since we work over the reals.
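A small sketch of the back-substitution step described above (illustrative names and values; in practice scipy.linalg.solve_triangular would be used instead).

```python
import numpy as np

def back_substitution(R, s):
    """Solve R x = s for upper-triangular R, last equation first."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (s[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

R = np.array([[2.0, 1.0, 0.5],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, 4.0]])
s = np.array([1.0, 2.0, 8.0])
x = back_substitution(R, s)
assert np.allclose(R @ x, s)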
• 119. Complexity: the equation solving requires (1/3)(d + 1)^3 multiply operations; the total number of multiply operations for the matrix inverse, including the Cholesky decomposition, is (1/2)(d + 1)^3. Therefore the complexity is O(d^3) per iteration.
• 123. However, as good computer scientists, we want a better complexity, O(d^2). We can obtain such an improvement using quasi-Newton methods; let us develop the solution for the most popular one, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method.
• 127. The second-order approximation: we have
  f(x) ≈ f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2)(x − x_k)^T H_f(x_k)(x − x_k)
  Based on this idea, and using x = x_k + p, we build the model
  m(x_k, p) = f(x_k) + ∇f(x_k)^T p + (1/2) p^T B_k p
  Here B_k is a (d + 1) × (d + 1) symmetric positive definite matrix that will be updated throughout the process.
• 130. Therefore we have
  m(x_k, 0) = f(x_k),  ∂m(x_k, p)/∂p |_{p=0} = ∇f(x_k) + B_k 0 = ∇f(x_k)
  Therefore, the minimizer of this model can be used as a search direction,
  p_k = −B_k^{-1} ∇f(x_k)
  giving the iteration x_{k+1} = x_k + α_k p_k, where p_k is a descent direction.
• 133. Descent direction
  Definition: for a given point x ∈ R^{d+1}, a direction p is called a descent direction if there exists α > 0, α ∈ R, such that f(x + αp) < f(x).
  Lemma: consider a point x ∈ R^{d+1}; any direction p satisfying ⟨p, ∇f(x)⟩ < 0 is a descent direction.
  From this, (−B_k^{-1} ∇f(x_k))^T ∇f(x_k) = −∇f(x_k)^T (B_k^{-1})^T ∇f(x_k) < 0; a small numeric check is given below.
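A quick numeric check of the lemma (illustrative values): with B symmetric positive definite, p = −B^{-1} ∇f(x) satisfies ⟨p, ∇f(x)⟩ < 0.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(3, 3))
B = M @ M.T + 3.0 * np.eye(3)      # symmetric positive definite
g = rng.normal(size=3)             # stands in for grad f(x)

p = -np.linalg.solve(B, g)         # p = -B^{-1} g
print(g @ p)                       # negative, so p is a descent direction
```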
• 137. The Wolfe conditions: α_k is chosen to satisfy
  f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f(x_k)^T p_k
  ∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f(x_k)^T p_k
  with 0 < c_1 < c_2 < 1.
• 138. Graphically, the first condition, f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f(x_k)^T p_k (the Armijo condition), delimits the acceptable step lengths [figure omitted]. Note: in practice, c_1 is chosen to be quite small, e.g. c_1 = 10^{-4}.
• 139. However, using that condition alone can accept steps α that are too small to make fast progress. Therefore we also require
  ∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f(x_k)^T p_k
  which rules out excessively small α_k. Note that, writing φ(α) = f(x_k + α p_k), the left-hand side is the derivative dφ(α_k)/dα, i.e. the condition ensures that the slope of φ at α_k is greater than c_2 times the initial slope dφ(0)/dα.
• 141. Meaning: if the slope φ'(α_k) is strongly negative, we can reduce f significantly by moving further along the chosen direction; if the slope is only slightly negative or even positive, we cannot expect much more decrease in f in this direction, and it may be good to terminate the line search.
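The sketch below (the test function and constants are illustrative assumptions) checks both Wolfe conditions for a few candidate step lengths along the steepest-descent direction.

```python
import numpy as np

def f(x):
    return 0.5 * x @ x + x[0]                      # simple convex test function

def grad_f(x):
    g = x.copy()
    g[0] += 1.0
    return g

def wolfe_ok(x, p, alpha, c1=1e-4, c2=0.9):
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * grad_f(x) @ p   # sufficient decrease
    curvature = grad_f(x + alpha * p) @ p >= c2 * grad_f(x) @ p      # curvature condition
    return armijo and curvature

x = np.array([2.0, -1.0])
p = -grad_f(x)                                      # steepest-descent direction
for alpha in [1.0, 0.5, 0.25, 0.1]:
    print(alpha, wolfe_ok(x, p, alpha))
```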
• 145. Now, instead of computing B_k from scratch at each iteration, Davidon proposed to update it in a simple manner to account for the curvature measured during the most recent step. Imagine we have generated x_{k+1} and wish to build a new quadratic model
  m(x_{k+1}, p) = f(x_{k+1}) + ∇f(x_{k+1})^T p + (1/2) p^T B_{k+1} p
  What requirements should we impose on B_{k+1}? One reasonable requirement is that the gradient of m should match the gradient of the objective function f at the two latest iterates, x_k and x_{k+1}.
• 148. Since ∇m(x_{k+1}, 0) is precisely ∇f(x_{k+1}), the second of these conditions is satisfied automatically. The first condition can be written as
  ∇m(x_{k+1}, −α_k p_k) = ∇f(x_{k+1}) − α_k B_{k+1} p_k = ∇f(x_k)
  that is,
  α_k B_{k+1} p_k = ∇f(x_{k+1}) − ∇f(x_k)
• 151. We can simplify the notation by defining
  s_k = x_{k+1} − x_k,   y_k = ∇f(x_{k+1}) − ∇f(x_k)
  which gives the secant equation
  B_{k+1} s_k = y_k
  Given that B_{k+1} is symmetric positive definite, the displacement s_k and the change in gradient y_k must satisfy the curvature condition
  s_k^T B_{k+1} s_k = s_k^T y_k > 0
  making it possible for B_{k+1} to map s_k into y_k.
• 154. Interpretation: if our update satisfies the secant condition
  B_{k+1} (x_{k+1} − x_k) = ∇f(x_{k+1}) − ∇f(x_k)
  then consider the second-order approximation of our target function f, with Hessian approximation B_{k+1}, evaluated at x_{k+1}:
  f_approx(x_{k+1}) = f(x_k) + ∇f(x_k)^T (x_{k+1} − x_k) + (1/2)(x_{k+1} − x_k)^T B_{k+1} (x_{k+1} − x_k)
• 156. Taking the gradient of this approximation with respect to x_{k+1} − x_k:
  ∇f_approx(x_{k+1}) = ∇f(x_k) + B_{k+1}(x_{k+1} − x_k)
                     = ∇f(x_k) + ∇f(x_{k+1}) − ∇f(x_k)
                     = ∇f(x_{k+1})
• 159. Now we have two cases. When f is strongly convex, the inequality s_k^T y_k > 0 is satisfied for any two points x_k and x_{k+1}. The condition will not always hold for non-convex functions, so we need to enforce s_k^T y_k > 0 explicitly; this is exactly what the Wolfe (or strong Wolfe) conditions give us.
• 162. How? From the second Wolfe condition, ∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f(x_k)^T p_k, and remembering that x_{k+1} = x_k + α_k p_k, we get
  ∇f(x_{k+1})^T s_k ≥ c_2 ∇f(x_k)^T s_k,   with  y_k^T s_k = [∇f(x_{k+1}) − ∇f(x_k)]^T s_k
  Then
  y_k^T s_k = ∇f(x_{k+1})^T s_k − ∇f(x_k)^T s_k ≥ c_2 ∇f(x_k)^T [α_k p_k] − ∇f(x_k)^T [α_k p_k] = (c_2 − 1) α_k ∇f(x_k)^T p_k
• 167. Given that c_2 − 1 < 0 and α_k > 0,
  (c_2 − 1) α_k ∇f(x_k)^T p_k = (c_2 − 1) α_k ∇f(x_k)^T (−B_k^{-1} ∇f(x_k)) > 0
  Thus the curvature condition holds, and the secant equation always has a solution. In fact it admits an infinite number of solutions: a symmetric matrix has d(d+1)/2 degrees of freedom, while the secant equation and positive definiteness impose 2d constraints, leaving d(d+1)/2 − 2d = (d^2 − 3d)/2 degrees of freedom — enough for infinitely many solutions.
• 170. Given that we want a unique solution, we could solve the following problem:
  min_B ||B − B_k||   s.t.  B = B^T,  B s_k = y_k
  Different matrix norms give rise to different quasi-Newton methods. We use the weighted Frobenius norm (which is scale invariant), defined as
  ||A||_W = ||W^{1/2} A W^{1/2}||_F,   where ||A||_F^2 = Σ_{i=1}^{d+1} Σ_{j=1}^{d+1} a_{ij}^2
• 173. In our case, we choose W to be the average Hessian
  G_k = ∫_0^1 ∇²f(x_k + τ α_k p_k) dτ
  which has the property y_k = G_k α_k p_k = G_k s_k; this follows from Taylor's theorem.
• 175. For BFGS: the previous setup would give us B_k, after which we would still need its inverse H_k = B_k^{-1}. In BFGS we go directly for the inverse by solving
  min_H ||H − H_k||   s.t.  H = H^T,  H y_k = s_k
  The unique solution is
  H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T     (1)
  where ρ_k = 1 / (y_k^T s_k).
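A sketch of the inverse update in Eq. (1) (illustrative names and values); the assertion checks the secant equation H_{k+1} y_k = s_k.

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS inverse-Hessian update, Eq. (1)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

H = np.eye(3)
s = np.array([0.5, -0.2, 0.1])
y = np.array([0.4, -0.1, 0.3])      # chosen so that s^T y > 0
H_new = bfgs_update(H, s, y)
assert np.allclose(H_new @ y, s)    # secant equation holds
```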
• 177. Problem: there is no magic formula for finding an initial H_0. We can use specific information about the problem, for instance by setting H_0 to the inverse of an approximate Hessian computed by finite differences at x_0. In our case we already have ∂²L(w)/(∂w ∂w^T), i.e. X^T W X in matrix form (up to sign), from which we could get the initial setup.
• 181. Algorithm (BFGS method)
  Input: starting point x_0, convergence tolerance ε, inverse Hessian approximation H_0.
  1. k ← 0
  2. while ||∇f(x_k)|| > ε:
  3.   compute the search direction p_k = −H_k ∇f(x_k)
  4.   set x_{k+1} = x_k + α_k p_k, where α_k is obtained from a line search (under the Wolfe conditions)
  5.   define s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k)
  6.   compute H_{k+1} by means of (Eq. 1)
  7.   k ← k + 1
  A self-contained sketch of this loop is given below.
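Below is a self-contained sketch of this loop on a toy quadratic. The test function, tolerances, and the simple backtracking (Armijo) search are all assumptions for illustration; a full Wolfe line search would be used in practice, as the slides describe.

```python
import numpy as np

def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])

def backtracking(x, p, c1=1e-4):
    """Simple Armijo backtracking (a stand-in for a proper Wolfe search)."""
    alpha = 1.0
    while f(x + alpha * p) > f(x) + c1 * alpha * grad_f(x) @ p:
        alpha *= 0.5
    return alpha

def bfgs(x0, tol=1e-8, max_iter=100):
    x, H = x0.astype(float), np.eye(len(x0))
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:                # stopping test on the gradient norm
            break
        p = -H @ g                                  # search direction p_k = -H_k grad f(x_k)
        alpha = backtracking(x, p)
        x_new = x + alpha * p
        s, y = x_new - x, grad_f(x_new) - g
        if y @ s > 1e-12:                           # only update when the curvature condition holds
            rho = 1.0 / (y @ s)
            I = np.eye(len(x))
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x = x_new
    return x

print(bfgs(np.array([5.0, 5.0])))                   # approaches the minimizer (1, -2)
```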
• 189. Complexity: the cost of the update (or inverse update) is O(d^2) operations per iteration. For more, see Nocedal, Jorge & Wright, Stephen J. (1999). Numerical Optimization. Springer-Verlag.
• 192. Given the following: because the likelihood is convex [remainder of slide is a figure, omitted].
• 193. Caution: here we change the labeling to y_i = ±1, with
  p(y_i = ±1 | x, w) = σ(y w^T x) = 1 / (1 + exp{−y w^T x})
  Thus we have the following log-likelihood under regularization, with λ > 0:
  L(w) = −Σ_{i=1}^N log(1 + exp{−y_i w^T x_i}) − (λ/2) w^T w
  It is possible to use a simple gradient method, with
  ∇_w L(w) = Σ_{i=1}^N (1 − 1/(1 + exp{−y_i w^T x_i})) y_i x_i − λ w
  A small sketch of this objective and gradient is given below.
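A NumPy sketch of this objective and gradient (the toy data, names, λ, and step size are illustrative assumptions), run with plain gradient ascent on the regularized log-likelihood:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y, lam):
    margins = y * (X @ w)                                   # y_i w^T x_i
    return -np.sum(np.log1p(np.exp(-margins))) - 0.5 * lam * (w @ w)

def gradient(w, X, y, lam):
    margins = y * (X @ w)
    return X.T @ ((1.0 - sigmoid(margins)) * y) - lam * w

rng = np.random.default_rng(3)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])
y = np.where(X[:, 1] + 0.3 * rng.normal(size=30) > 0, 1.0, -1.0)   # labels in {-1, +1}
w, lam, step = np.zeros(3), 0.1, 0.05
for _ in range(500):
    w = w + step * gradient(w, X, y, lam)                   # ascend the regularized log-likelihood
print(w, log_likelihood(w, X, y, lam))
```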
• 196. Danger, Will Robinson!!! Gradient descent on this objective resembles the perceptron learning algorithm, with one important difference: it will always converge for a suitable step size, regardless of whether the classes are separable.
• 198. Here we will simplify our work by first stating the coordinate ascent algorithm; a more precise version will be given afterwards.
• 200. Coordinate Ascent Algorithm
  Input: Max (number of iterations), an initial w_0.
  1. counter ← 0
  2. while counter < Max:
  3.   for i ← 1, ..., d (or randomly pick a coordinate i):
  4.     compute a step size δ* by approximately solving δ* = arg max_δ f(x + δ e_i)
  5.     x_i ← x_i + δ*
  6.   counter ← counter + 1
  where e_i = (0, ..., 0, 1, 0, ..., 0)^T with the 1 in position i.
  A toy sketch of this loop is shown below.
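A toy sketch of this loop (the concave objective, the use of SciPy's 1-D minimizer for the inner step, and the iteration counts are all illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(x):
    """A concave toy objective with a unique maximum."""
    t = np.array([1.0, -2.0, 3.0])
    return -np.sum((x - t) ** 2) - 0.5 * (x[0] - x[1]) ** 2

x = np.zeros(3)
for _ in range(20):                                       # outer sweeps
    for i in range(len(x)):                               # one coordinate at a time
        e = np.zeros(len(x))
        e[i] = 1.0
        step = minimize_scalar(lambda d: -f(x + d * e)).x # 1-D maximization along e_i
        x = x + step * e
print(x, f(x))
```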
• 202. In the case of logistic regression, we can optimize each w_k alternately by a coordinate-wise Newton update:
  w_k^new = w_k^old + [ −λ w_k^old + Σ_{i=1}^N (1 − σ(y_i w^T x_i)) y_i x_{ik} ] / [ λ + Σ_{i=1}^N x_{ik}^2 σ(y_i w^T x_i)(1 − σ(y_i w^T x_i)) ]
  Complexity of this update:
  - numerator sum Σ_i (1 − σ(y_i w^T x_i)) y_i x_{ik}: O(N)
  - denominator sum Σ_i x_{ik}^2 σ(y_i w^T x_i)(1 − σ(y_i w^T x_i)): O(N)
  - total per coordinate: O(N); for all d dimensions: O(Nd)
  A sketch of one full sweep is given below.
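A sketch of one full sweep of this coordinate-wise Newton update (the toy data and λ are illustrative assumptions); caching the margins y_i w^T x_i keeps each coordinate update at O(N):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_newton_sweep(w, X, y, lam):
    """One sweep over all coordinates; each update costs O(N) thanks to the cached margins."""
    m = y * (X @ w)                                  # cache y_i w^T x_i
    for k in range(len(w)):
        s = sigmoid(m)
        num = -lam * w[k] + np.sum((1.0 - s) * y * X[:, k])
        den = lam + np.sum(X[:, k] ** 2 * s * (1.0 - s))
        delta = num / den
        w[k] += delta
        m += y * X[:, k] * delta                     # incremental O(N) margin update
    return w

rng = np.random.default_rng(4)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = np.where(X @ np.array([0.2, 1.0, -1.0]) + 0.3 * rng.normal(size=50) > 0, 1.0, -1.0)
w = np.zeros(3)
for _ in range(30):
    w = coordinate_newton_sweep(w, X, y, lam=0.5)
print(w)
```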
• 205. Conclusion: we have the following complexities per iteration.
  Method                  | Per iteration      | Convergence rate
  Cholesky decomposition  | (d^3)/2 = O(d^3)   | quadratic
  Quasi-Newton BFGS       | O(d^2)             | superlinear
  Coordinate ascent       | O(Nd)              | not established