4. Unconstrained minimization problems
•Recall: Constrained minimization problems
•From Lecture 1, the formation of a general constrained convex optimization problem is as follows
•min푓푥푠.푡.푥∈χ
•Where 푓:χ→Ris convex and smooth
•From Lecture 1, the formation of an unconstrained optimization problem is as follows
•min푓푥
•Where 푓:푅푛→푅is convex and smooth
•In this problem, the necessary and sufficient condition for optimal solution x0 is
•훻푓푥=0푎푡푥=푥0
4
5. Unconstrained minimization problems
•Minimize f(x)
•When f is differentiable and convex, a necessary and sufficient condition for a point 푥∗to be optimal is훻푓푥∗=0
•Minimize f(x) is the same as fining solution of 훻푓푥∗=0
•Min f(x): Analytically solving the optimality equation
•훻푓푥∗=0: Usually be solved by an iterative algorithm
5
6. Description of Gradient Descent Method
•The idea relies on the fact that −훻푓(푥(푘))is a descent direction
•푥(푘+1)=푥(푘)−η푘훻푓(푥(푘))푤푖푡ℎ푓푥푘+1<푓(푥푘)
•Δ푥(푘)is the step, or search direction
•η푘is the step size, or step length
•Too small η푘will cause slow convergence
•Too large η푘could cause overshoot the minima and diverge
6
7. Description of Gradient Descent Method
•Algorithm (Gradient Descent Method)
•given a starting point 푥∈푑표푚푓
•repeat
1.Δ푥≔−훻푓푥
2.Line search: Choose step size ηvia exact or backtracking line search
3.Update 푥≔푥+ηΔ푥
•untilstopping criterion is satisfied
•Stopping criterion usually 훻푓(푥)2≤휖
•Very simple, but often very slow; rarely used in practice
7
8. Pros and Cons
•Pros
•Can be applied to every dimension and space (even possible to infinite dimension)
•Easy to implement
•Cons
•Local optima problem
•Relatively slow close to minimum
•For non-differentiable functions, gradient methods are ill-defined
8
11. Usage of Gradient Descent Method
•Linear Regression
•Find minimum loss function to choose best hypothesis
11
Example of Loss function:
푑푎푡푎푝푟푒푑푖푐푡−푑푎푡푎표푏푠푒푟푣푒푑 2
Find the hypothesis (function) which minimize the loss function
12. Usage of Gradient Descent Method
•Neural Network
•Back propagation
•SVM (Support Vector Machine)
•Graphical models
•Least Mean Squared Filter
…and many other applications!
12
13. Questions
•Does Gradient Descent Method always converge?
•If not, what is condition for convergence?
•How can make Gradient Descent Method faster?
•What is proper value for step size η푘
13
15. L-Lipschitzfunction
•Definition
•A function 푓:푅푛→푅is called L-Lipschitzif and only if 훻푓푥−훻푓푦2≤퐿푥−푦2,∀푥,푦∈푅푛
•We denote this condition by 푓∈퐶퐿, where 퐶퐿is class of L-Lipschitzfunctions
15
17. Strong Convexity and implications
•Definition
•If there exist a constant m > 0 such that 훻2푓≻=푚퐼푓표푟∀푥∈푆, then the function f(x) is strongly convex function on S
17
18. Strong Convexity and implications
•Lemma 4.3
•If f is strongly convex on S, we have the following inequality:
•푓푦≥푓푥+<훻푓푥,푦−푥>+ 푚 2 푦−푥2푓표푟∀푥,푦∈푆
•Proof
18
( )
useful as stopping criterion (if you know m)
20. Upper Bound of 훻2푓(푥)
•Lemma 4.3 implies that the sublevel sets contained in S are bounded, so in particular, S is bounded. Therefore the maximum eigenvalue of 훻2푓푥is bounded above on S
•ThereexistsaconstantMsuchthat훻2푓푥=≺푀퐼푓표푟∀푥∈푆
•Lemma 4.4
•퐹표푟푎푛푦푥,푦∈푆,푖푓훻2푓푥=≺푀퐼푓표푟푎푙푙푥∈푆푡ℎ푒푛 푓푦≤푓푥+<훻푓푥,푦−푥>+ 푀 2 푦−푥2
20
21. Condition Number
•From Lemma 4.3 and 4.4 we have
푚퐼=≺훻2푓푥=≺푀퐼푓표푟∀푥∈푆,푚>0,푀>0
•The ratio k=M/m is thus an upper bound on the condition number of the matrix 훻2푓푥
•When the ratio is close to 1, we call it well-conditioned
•When the ratio is much larger than 1, we call it ill-conditioned
•When the ratio is exactly 1, it is the best case that only one step will lead to the optimal solution (there is no wrong direction)
21
22. Condition Number
•Theorem 4.5
•Gradient descent for a strongly convex function f and step sizeη= 1 푀 will converge as
•푓푥∗−푓∗≤푐푘푓푥0−푓∗,푤ℎ푒푟푒푐≤1− 푚 푀
•Rate of convergence c is known as linear convergence
•Since we usually do not know the value of M, we do line search
•For exact line search, 푐=1− 푚 푀
•For backtracking line search, 푐=1−min2푚훼,2훽훼푚 푀 <1
22
23. Methods & Examples
Exact Line Search, Backtracking Line Search, Coordinate Descent Method, Steepest Descent Method
23
24. Exact Line Search
• The optimal line search method in which η is chosen to
minimize f along the ray 푥 − η훻푓 푥 , as shown in below
• Exact line search is used when the cost of minimization
problem with one variable is low compared to the cost of
computing the search direction itself.
• It is not very practical
24
25. Exact Line Search
•Convergence Analysis
•
•푓푥푘−푓∗decreases by at least a constant factor in every iteration
•Converging to 0 geometric fast. (linear convergence)
25
1
26. Backtracking Line Search
•It depends on two constants 훼,훽푤푖푡ℎ0<훼<0.5,0<훽<1
•It starts with unit step size and then reduces it by the factor 훽until the stopping condition
푓(푥−η훻푓(푥))≤푓(푥)−훼η훻푓푥2
•Since −훻푓푥is a descent direction and −훻푓푥2<0, so for small enough step size η, we have
푓푥−η훻푓푥≈푓푥−η훻푓푥2<푓푥−훼η훻푓푥2
•It shows that the backtracking line search eventually terminates
•훼is typically chosen between 0.01 and 0.3
•훽is often chosen to be between 0.1 and 0.8
26
32. Coordinate Descent Method
• Coordinate descent belongs to the class of several non
derivative methods used for minimizing differentiable
functions.
• Here, cost is minimized in one coordinate direction in each
iteration.
32
33. Coordinate Descent Method
•Pros
•It is well suited for parallel computation
•Cons
•May not reach the local minimum even for convex function
33
35. Coordinate Descent Method
•Method of selecting the coordinate for next iteration
•Cyclic Coordinate Descent
•Greedy Coordinate Descent
•(Uniform) Random Coordinate Descent
35
36. Steepest Descent Method
•The gradient descent method takes many iterations
•Steepest Descent Method aims at choosing the best direction at each iteration
•Normalized steepest descent direction
•Δ푥푛푠푑=푎푟푔푚푖푛훻푓푥푇푣푣=1}
•Interpretation: for small 푣,푓푥+푣≈푓푥+훻푓푥푇푣direction Δ푥푛푠푑 is unit-norm step with most negative directional derivative
•Iteratively, the algorithm follows the following steps
•Calculate direction of descent Δ푥푛푠푑
•Calculate step size, t
•푥+=푥+푡Δ푥푛푠푑
36
37. Steepest Descent for various norms
•The choice of norm used the steepest descent direction can be have dramatic effect on converge rate
•푙2norm
•The steepest descent direction is as follows
•Δ푥푛푠푑= −훻푓(푥) 훻푓(푥)2
•푙1norm
•For 푥1= 푖푥푖,a descent direction is as follows,
•Δ푥푛푑푠=−푠푖푔푛 휕푓푥 휕푥푖 ∗푒푖 ∗
•푖∗=푎푟푔min 푖 휕푓 휕푥푖
•푙∞norm
•For 푥∞=argmin 푖 푥푖,a descent direction is as follows
•Δ푥푛푑푠=−푠푖푔푛−훻푓푥
37
40. Steepest Descent Convergence Rate
•Fact: Any norm can be bounded by ∙2, i.e., ∃훾, 훾∈(0,1] such that, 푥≥훾푥2푎푛푑푥∗≥훾푥2
•Theorem 5.5
•If f is strongly convex with respect to m and M, and ∙2has 훾, 훾as above then steepest decent with backtracking line search has linear convergence with rate
•푐=1−2푚훼 훾2min1, 훽훾 푀
•Proof: Will be proved in the lecture 6
40
42. Summary
•Unconstrained Convex Optimization Problem
•Gradient Descent Method
•Step Size Trade-off between safety and speed
•Convergence Conditions
•L-LipschtizFunction
•Strong Convexity
•Condition Number
42
43. Summary
•Exact Line Search
•Backtracking Line Search
•Coordinate Descent Method
•Good for parallel computation but not always converge
•Steepest Descent Method
•The choice of norm is important
43