Sebastian Bernasek
7-14-2015
Intro to Optimization: Part 1
1
What is optimization?
Identify variable values that minimize or maximize
some objective while satisfying constraints
objective
variables
constraints
minimize f(x)
where x = {x1,x2,..xn}
s.t. Ax < b
2
What for?
Finance
• maximize profit, minimize risk
• constraints: budgets, regulations
Engineering
• maximize IRR, minimize emissions
• constraints: resources, safety
Data modeling
3
Given a proposed model:
y(x) = θ1 sin(θ2 x)
which parameters (θi) best describe the data?
Data modeling
4
Which parameters (θi) best describe the data?
We must quantify goodness-of-fit
Data modeling
5
A good model will have minimal residual error
Goodness-of-fit metrics
e_i = Y_i − y(X_i)
where (X_i, Y_i) are the data and y(X_i) is the model prediction, e.g. y(X_i) = θ1 sin(θ2 X_i)
6
Goodness-of-fit metrics
▪ Least Squares: all data equally important
  SSE = Σ_{i=1..N} e_i² = Σ_{i=1..N} (Y_i − y(X_i))²
▪ Weighted Least Squares: gives greater importance to more precise data
  WSSE = Σ_{i=1..N} e_i²/σ_i² = Σ_{i=1..N} (Y_i − y(X_i))²/σ_i²
We seek to minimize SSE and WSSE
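As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of both metrics for the sinusoidal model above; the arrays X, Y, and sigma are hypothetical stand-ins for the data and per-point uncertainties.

```python
import numpy as np

def model(theta, x):
    """Proposed model y(x) = theta1 * sin(theta2 * x)."""
    return theta[0] * np.sin(theta[1] * x)

def sse(theta, X, Y):
    """Sum of squared errors: all data weighted equally."""
    residuals = Y - model(theta, X)
    return np.sum(residuals ** 2)

def wsse(theta, X, Y, sigma):
    """Weighted SSE: residuals scaled by each point's uncertainty."""
    residuals = Y - model(theta, X)
    return np.sum((residuals / sigma) ** 2)

# Hypothetical synthetic data (illustrative values only)
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 5 * np.sin(X) + rng.normal(0, 0.5, X.size)
sigma = np.full(X.size, 0.5)
print(sse([5, 1], X, Y), wsse([5, 1], X, Y, sigma))
```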
7
Goodness-of-fit metrics
▪ Log likelihood
Define the likelihood L(θ|Y) = p(Y|θ) as the likelihood that θ are the true parameters given the observed data
8
Goodness-of-fit metrics
▪ Log likelihood
Given p(Y_i|θ) for each observation and assuming the Y_i are iid, we can compute p(Y|θ); the log transform is for convenience:
L(θ|Y) = p(Y|θ) = Π_{i=1..N} p(Y_i|θ)
ln L(θ|Y) = Σ_{i=1..N} ln p(Y_i|θ)
We seek to maximize ln L(θ | Y)
9
Goodness-of-fit metrics
▪ Log likelihood
So what is p(Y_i|θ)?
Assume each residual is drawn from a distribution. For example, assume the e_i are Gaussian distributed with mean μ = ⟨e_i⟩ = 0, so that e_i − ⟨e_i⟩ = e_i:
p(Y_i|θ) = (2πσ_i²)^(−1/2) exp(−(Y_i − y(X_i))²/(2σ_i²))
10
Goodness-of-fit metrics
▪ Log likelihood
ln L(θ|Y) = Σ_{i=1..N} ln p(Y_i|θ)
ln L(θ|Y) = Σ_{i=1..N} ln[ (2πσ_i²)^(−1/2) exp(−(Y_i − y(X_i))²/(2σ_i²)) ]
ln L(θ|Y) = Σ_{i=1..N} ln[(2πσ_i²)^(−1/2)] − Σ_{i=1..N} (Y_i − y(X_i))²/(2σ_i²)
ln L(θ|Y) = Σ_{i=1..N} ln[(2πσ_i²)^(−1/2)] − Σ_{i=1..N} (Y_i − θ1 sin(θ2 X_i))²/(2σ_i²)
maximize ln L(θ | Y)
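A short sketch of the same Gaussian log-likelihood, again with hypothetical X, Y, and sigma arrays; in practice one usually maximizes ln L by minimizing its negative.

```python
import numpy as np

def neg_log_likelihood(theta, X, Y, sigma):
    """Negative Gaussian log-likelihood; minimizing this maximizes ln L(theta | Y)."""
    residuals = Y - theta[0] * np.sin(theta[1] * X)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - residuals**2 / (2 * sigma**2))

# Hypothetical data: 50 noisy samples of 5*sin(x)
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
sigma = np.full(X.size, 0.5)
Y = 5 * np.sin(X) + rng.normal(0, sigma)
print(neg_log_likelihood([5.0, 1.0], X, Y, sigma))
```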
11
Goodness-of-fit metrics
▪ Least Squares
• simple and straightforward to implement
• requires large N for high accuracy
▪ Weighted Least Squares
• accounts for variability in the precision of the observations
• converges to least squares for high N
▪ Log Likelihood
• requires an assumption for the residuals' PDF
12
Given a proposed model:
y(x) = θ1 sin(θ2 x)
which parameters (θi) best describe the data?
Data modeling
objective
variables
constraints
minimize SSE(θ)
where θ = {θ1,θ2,..θn}
s.t. Aθ < b
13
Given a proposed model:
y(x) = θ1 sin(θ2 x)
which parameters (θi) best describe the data?
Data modeling
optimum
variables
minimum
θ = {5,1}
SSE(θ) = 277
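A hedged end-to-end sketch of this fit using scipy.optimize.minimize; the data below are synthetic stand-ins, so the recovered θ and SSE value will not exactly match the slide's θ = {5,1}, SSE(θ) = 277.

```python
import numpy as np
from scipy.optimize import minimize

def sse(theta, X, Y):
    """Objective: sum of squared residuals for y(x) = theta1*sin(theta2*x)."""
    return np.sum((Y - theta[0] * np.sin(theta[1] * X)) ** 2)

# Synthetic stand-in data (assumed, not the original dataset)
rng = np.random.default_rng(1)
X = np.linspace(0, 10, 100)
Y = 5 * np.sin(X) + rng.normal(0, 1.5, X.size)

result = minimize(sse, x0=[4.0, 0.9], args=(X, Y), method="Nelder-Mead")
print("theta =", result.x, "  SSE =", result.fun)
```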
14
minimize f(x)
where x = {x1,x2,..xn}
s.t. Ax < b
So how do we optimize?
15
Types of problems
There are many classes of optimization problems
1. constrained vs unconstrained
2. static vs dynamic
3. continuous vs discrete variables
4. deterministic vs stochastic variables
5. single vs multiple objective functions
16
Types of algorithms
There are many more classes of algorithms that
attempt to solve these problems
NEOS, UW
17
Types of algorithms
There are many more classes of algorithms that
attempt to solve these problems
NEOS, UW
18
Unconstrained Optimization
Here we classify algorithms by the derivative information utilized.
▪ Zero-Order Methods (function calls only)
• Nelder-Mead Simplex (direct search; scipy.optimize.fmin)
• Powell Conjugate Directions
▪ First-Order Methods
• Steepest Descent
• Nonlinear Conjugate Gradients
• Broyden-Fletcher-Goldfarb-Shanno Algorithm (BFGS)
▪ Second-Order Methods
• Newton's Method
• Newton Conjugate Gradient
19
Unconstrained Optimization in 1-D
▪ All but the simplex and Newton methods call one-dimensional line searches as a subroutine
▪ Common option:
• Bisection methods (e.g. Golden Search): linear convergence, but robust
General Iterative Scheme: x_{n+1} = x_n + α d_n, where α = step size and d_n = search direction
Golden ratio: R = a/b, satisfying R² + R − 1 = 0
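A minimal golden-section line search sketch (my own illustration of the bisection-style option named above, not code from the slides); it re-evaluates both interior points each pass for clarity, at the cost of one extra function call per iteration.

```python
import math

def golden_section_minimize(f, a, b, tol=1e-6):
    """Bracketed 1-D minimization; the interval shrinks by the golden ratio each step (linear convergence)."""
    R = (math.sqrt(5) - 1) / 2            # golden ratio conjugate; satisfies R**2 + R - 1 = 0
    c, d = b - R * (b - a), a + R * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):                   # minimum is bracketed by [a, d]
            b, d = d, c
            c = b - R * (b - a)
        else:                             # minimum is bracketed by [c, b]
            a, c = c, d
            d = a + R * (b - a)
    return 0.5 * (a + b)

print(golden_section_minimize(lambda x: (x - 2.0) ** 2, 0.0, 5.0))   # ~2.0
```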
20
Unconstrained Optimization in 1-D
1-D optimization amounts to root finding on the derivative: we want f'(x_{n+1}) = 0
▪ Calculus-based option:
• Newton-Raphson: x_{n+1} = x_n − f'(x_n)/f''(x_n)
Derivation: let g(x) = f'(x). Linearizing, g(x_{n+1}) = g(x_n) + g'(x_n)(x_{n+1} − x_n); setting g(x_{n+1}) = 0 gives x_{n+1} = x_n − g(x_n)/g'(x_n)
Can use explicit derivatives or a numerical approximation, e.g. f''(x_n) ≈ (f'(x_n) − f'(x_{n−1}))/(x_n − x_{n−1})
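A small sketch of this 1-D Newton-Raphson update using the secant approximation of f'' shown above (my own illustration; the quadratic test function is hypothetical).

```python
def newton_raphson_1d(f_prime, x_prev, x_curr, tol=1e-8, max_iter=100):
    """Drive f'(x) to zero; f''(x_n) is approximated by a secant through the last two iterates."""
    for _ in range(max_iter):
        f2 = (f_prime(x_curr) - f_prime(x_prev)) / (x_curr - x_prev)   # f''(x_n) approximation
        x_prev, x_curr = x_curr, x_curr - f_prime(x_curr) / f2         # x_{n+1} = x_n - f'(x_n)/f''(x_n)
        if abs(x_curr - x_prev) < tol:
            break
    return x_curr

# f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3); the minimum is at x = 3
print(newton_raphson_1d(lambda x: 2 * (x - 3), x_prev=0.0, x_curr=1.0))
```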
21
Newton-Raphson
▪ Move to the minimum of a local quadratic fit at each point
▪ Can achieve quadratic convergence for twice-differentiable functions
COS 323 Course Notes, Princeton U.
22
Newton’s Method in N-Dimensions
1. Construct a locally quadratic model m_n via Taylor expansion about x_n, for points p_n = x − x_n near x_n:
   m_n(p_n) = f(x_n) + p_nᵀ ∇f(x_n) + (1/2) p_nᵀ H(x_n) p_n
2. At each step we want to move toward the minimum of this model, where ∇m_n(p_n) = 0
   Differentiating: ∇m_n(p_n) = ∇f(x_n) + H(x_n) p_n
   Solving: p_n = −H⁻¹(x_n) ∇f(x_n)
23
Newton’s Method in N-Dimensions
3. The minimum of the local second-order model lies in the search direction p_n = −H⁻¹(x_n) ∇f(x_n).
   Determine the optimal step size α by 1-D optimization.
   General iterative scheme: x_{n+1} = x_n + α d_n, where α = step size and d_n = search direction
   α = argmin_α f(x_n + α p_n) = argmin_α f(x_n − α H⁻¹(x_n) ∇f(x_n))
   (computed via golden search, Newton's method, Brent's method, the Nelder-Mead simplex, etc.)
24
Newton’s Method in N-Dimensions
4. Take the step: x_{n+1} = x_n + α p_n
5. Check termination criteria and return to step 1
Possible criteria:
• Maximum iterations reached
• Change in objective function below threshold
• Change in local gradient below threshold
• Change in local Hessian below threshold
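Putting steps 1-5 together, a minimal sketch of Newton's method in N dimensions with a crude backtracking choice of α (my own illustration; grad and hess are user-supplied callables, and the quadratic test problem is hypothetical).

```python
import numpy as np

def newton_nd(f, grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton iteration: solve H(x_n) p_n = -grad f(x_n), then step x_{n+1} = x_n + alpha * p_n."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:                  # termination: gradient essentially zero
            break
        p = -np.linalg.solve(hess(x), g)             # search direction p_n
        alpha = 1.0
        while f(x + alpha * p) > f(x) and alpha > 1e-10:
            alpha *= 0.5                             # crude stand-in for a proper 1-D line search
        x = x + alpha * p
    return x

# Hypothetical test problem: f(x, y) = (x - 1)^2 + 10 (y + 2)^2
f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_nd(f, grad, hess, [5.0, 5.0]))          # ~[1, -2]
```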
25
Newton's Method in N-Dimensions
▪ How do we compute the Hessian needed for p_n = −H⁻¹(x_n) ∇f(x_n), i.e. to solve H(x_n) p_n = −∇f(x_n)?
Newton's Method
• Define the H(x_n) expressions analytically
• Invert H and multiply
• Accurate
• Costly for high N
• Requires 2nd derivatives
BFGS Algorithm (quasi-Newton method)
• Numerically approximates H⁻¹(x_n) and simply multiplies matrices
• Avoids solving the linear system
• Only requires 1st derivatives
• Built on low-rank update formulas (derivation not covered here)
scipy.optimize.fmin_bfgs
26
Gradient Descent
▪ Newton/BFGS make use of the local Hessian
▪ Alternatively, we could use only the gradient
1. Pick a starting point, x_0
2. Evaluate the local gradient ∇f(x_n)
3. Perform a line search along the gradient: α = argmin_α f(x_n − α ∇f(x_n))
4. Move directly along the gradient: x_{n+1} = x_n − α ∇f(x_n)
5. Check convergence criteria and return to step 2
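A minimal sketch of these five steps, using scipy's scalar minimizer for the line search (my own illustration with a hypothetical quadratic objective).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad, x0, tol=1e-8, max_iter=500):
    """Steepest descent with an exact 1-D line search along -grad f(x_n) at each iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = minimize_scalar(lambda a: f(x - a * g)).x   # alpha = argmin_a f(x_n - a grad f(x_n))
        x = x - alpha * g
    return x

f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2        # hypothetical objective
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
print(gradient_descent(f, grad, [5.0, 5.0]))                 # ~[1, -2]
```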
27
Gradient Descent
▪ Function must be differentiable
▪ Subsequent steps are always perpendicular to one another
▪ Can get caught zig-zagging in narrow valleys
x_{n+1} = x_n − α ∇f(x_n), with α = argmin_α f(x_n − α ∇f(x_n))
28
Conjugate Gradient Method
▪ Avoids reversing previous iterations by ensuring that each step is conjugate to all previous steps, creating a linearly independent set of basis vectors
1. Pick a starting point and evaluate the local gradient
2. The first step follows gradient descent:
   x_1 = x_0 − α ∇f(x_0), with α = argmin_α f(x_0 − α ∇f(x_0))
3. Compute the weight for the previous step, β_n (Polak-Ribière version):
   β_n = Δx_nᵀ(Δx_n − Δx_{n−1}) / (Δx_{n−1}ᵀ Δx_{n−1})
   where Δx_n = −∇f(x_n) is the steepest-descent direction
29
Conjugate Gradient Method
▪ Creates a set of linearly independent directions s_i that span the parameter space
4. Compute the search direction: s_n = −∇f(x_n) + β_n s_{n−1}
5. Move to the optimal point along s_n: x_{n+1} = x_n + α s_n, with α = argmin_α f(x_n + α s_n)
6. Check convergence criteria and return to step 3
*Note that setting β_i = 0 recovers the gradient descent algorithm
30
Conjugate Gradient Method
▪ For properly conditioned (quadratic) problems, guaranteed to converge in at most N iterations
▪ Very commonly used: scipy.optimize.fmin_cg
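A brief usage sketch of the scipy interface named above; the objective and its gradient here are hypothetical.

```python
import numpy as np
from scipy.optimize import fmin_cg

f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2            # hypothetical objective
fprime = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])  # its analytic gradient

x_opt = fmin_cg(f, x0=np.array([5.0, 5.0]), fprime=fprime, disp=False)
print(x_opt)   # ~[1, -2]
```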
31
Powell’s Conjugate Directions
▪ Performs N line searches along N basis vectors in order to determine an optimal search direction
▪ Preserves the minimization achieved by previous steps by retaining the basis vector set between iterations
1. Pick a starting point x_0 and a set of basis vectors u_i (the coordinate directions are the usual convention)
2. Determine the optimal step size α_i along each vector
3. Let the search vector be the linear combination of the basis vectors: s = Σ_i α_i u_i
32
Powell’s Conjugate Directions
4. Move along the search vector: perform a line search along s and move to its minimum
5. Add s to the basis and drop the oldest basis vector (u_i ← u_{i+1}, u_N ← s)
6. Check the convergence criteria and return to step 2
Problem: the algorithm tends toward a linearly dependent basis set
Solutions: 1. Reset to an orthogonal basis every N iterations
2. At step 5, replace the basis vector corresponding to the largest change in f(x)
33
Powell’s Conjugate Directions
Advantages
▪ No derivatives required; only uses function calls
▪ Quadratic convergence
Accessible via scipy.optimize.fmin_powell
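A brief usage sketch of that interface on a hypothetical two-variable objective; note that only function values are supplied.

```python
import numpy as np
from scipy.optimize import fmin_powell

f = lambda v: np.sin(v[0]) + np.cos(v[1])                 # hypothetical objective; no derivatives needed
x_opt = fmin_powell(f, x0=np.array([1.0, 1.0]), disp=False)
print(x_opt, f(x_opt))                                    # f should approach -2
```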
34
Nelder-Mead Simplex Algorithm
Direct search algorithm
Default method: scipy.optimize.fmin
Method consists of a simplex crawling
around the parameter space until it finds
and brackets a local minimum.
35
Nelder-Mead Simplex Algorithm
Simplex: convex hull of N+1 vertices in N-space.
2D: a triangle 3D: a tetrahedron
36
Nelder-Mead Simplex Algorithm
1. Pick a starting point and define a simplex around it with
N+1 vertices xi
2. Evaluate f(xi) at each vertex and rank order the vertices
such that x1 is the best and xN+1 is the worst
3. Evaluate the centroid of the best N vertices: x̄ = (1/N) Σ_{i=1..N} x_i
37
Nelder-Mead Simplex Algorithm
4. Reflection: let x_r = x̄ + α(x̄ − x_{N+1}), where x_{N+1} is the worst point (highest function value)
   If f(x_1) ≤ f(x_r) < f(x_N), replace x_{N+1} with x_r
COS 323 Course Notes, Princeton U.
38
Nelder-Mead Simplex Algorithm
5. Expansion:
   If the reflection produced the best point so far, try: x_e = x̄ + β(x̄ − x_{N+1})
   If f(x_e) < f(x_r), then replace x_{N+1} with x_e; if not, replace x_{N+1} with x_r
COS 323 Course Notes, Princeton U.
39
Nelder-Mead Simplex Algorithm
6. Contraction:
   If the reflected point is still the worst, then try contraction: x_c = x̄ − γ(x_r − x̄)
COS 323 Course Notes, Princeton U.
40
Nelder-Mead Simplex Algorithm
7. Shrinkage:
   If contraction fails, scale all vertices toward the best vertex: x_i = x_1 + δ(x_i − x_1) for 2 ≤ i ≤ N+1
COS 323 Course Notes, Princeton U.
41
Nelder-Mead Simplex Algorithm
▪ Advantages:
• Doesn’t require any derivatives
• Few function calls at each iteration
• Works with rough surfaces
▪ Disadvantages:
• Can require many iterations
• Does not always converge, and general convergence guarantees are lacking
• Inefficient in very high N
42
Nelder-Mead Simplex Algorithm
Parameter Required Typical
α > 0 1
β > 1 2
γ 0 < γ < 1 0.5
δ 0 < δ < 1 0.5
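As a usage sketch (not from the slides), the same algorithm is also exposed through scipy.optimize.minimize; to my knowledge scipy's implementation fixes the reflection/expansion/contraction/shrink coefficients at the typical values above and exposes only tolerances and iteration limits as options.

```python
import numpy as np
from scipy.optimize import minimize

f = lambda v: np.sin(v[0]) + np.cos(v[1])   # hypothetical objective

result = minimize(f, x0=[1.0, 1.0], method="Nelder-Mead",
                  options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 500})
print(result.x, result.fun, result.nit, result.nfev)
```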
43
Algorithm Comparison
algorithm            min f(x)   iterations   f(x) evals   f'(x) evals
powell                  -2           2            43            0
conjugate gradient      -2           4            40           10
gradient descent        -2           3            32            8
bfgs                    -2           6            48           12
simplex                 -2          45            87            0
f(x, y) = sin(x) + cos(y)
44
Algorithm Comparison
f(x, y) = sin(xy) + cos(y)
▪ Simplex & Powell seem to follow valleys similarly, with a more “local” focus
▪ BFGS/CG readily transcend valleys
45
2-D Rosenbrock Function
algorithm            min f(x)    iterations   f(x) evals   f'(x) evals
powell               3.8 E-28        25           719            0
conjugate gradient   9.5 E-08        33           368           89
gradient descent     1.1 E+01       400          1712          428
bfgs                 1.8 E-11        47           284           71
simplex              5.6 E-10       106           201            0
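A hedged driver sketch for reproducing this kind of comparison with scipy.optimize.minimize; plain gradient descent is not among scipy's built-in methods, and exact iteration/evaluation counts depend on the scipy version, starting point, and tolerances, so the numbers will not necessarily match the table above.

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """2-D Rosenbrock function; minimum f = 0 at (1, 1)."""
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def rosenbrock_grad(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                     200 * (x[1] - x[0] ** 2)])

x0 = np.array([-1.2, 1.0])
for method in ["Powell", "CG", "BFGS", "Nelder-Mead"]:
    jac = rosenbrock_grad if method in ("CG", "BFGS") else None   # gradient only where it is used
    res = minimize(rosenbrock, x0, method=method, jac=jac)
    print(f"{method:12s} min f = {res.fun:.2e}   f(x) evals = {res.nfev}")
```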