Coordinate Descent method
1. Coordinate descent method
2013.11.21
Sanghyuk Chun
Much of the content is from:
Large Scale Optimization Lecture 5 by Caramanis & Sanghavi at UT Austin
Optimization Lecture 25 by Geoff Gordon and Ryan Tibshirani at CMU
Convex Optimization Lecture 20 by Suvrit Sra at UC Berkeley
3. Overview of Coordinate descent method
•Idea
•Recall: unconstrained minimization problem
•From Lecture 1, the formulation of an unconstrained optimization problem is as follows:
•$\min_x f(x)$
•where $f: \mathbb{R}^n \to \mathbb{R}$ is convex and smooth
•In this problem, the necessary and sufficient condition for an optimal solution $x^0$ is
•$\nabla f(x) = 0$ at $x = x^0$
•$\nabla f(x) = \frac{\partial f}{\partial x_1} e_1 + \cdots + \frac{\partial f}{\partial x_n} e_n = 0$
•Thus, in this situation, $\frac{\partial f}{\partial x_1} = \cdots = \frac{\partial f}{\partial x_n} = 0$
•What if we minimize along each coordinate (basis) direction separately?
4. Overview of Coordinate descent method
•Description
•Let $e_1, e_2, \dots, e_n$ be the standard basis of the domain of $f$
•Given $x^{(k)}$, the $i$-th coordinate of $x^{(k+1)}$ is given by
•$x_i^{(k+1)} \leftarrow \arg\min_{y \in \mathbb{R}} f(x_1^{(k+1)}, \dots, x_{i-1}^{(k+1)}, y, x_{i+1}^{(k)}, \dots, x_n^{(k)})$
•In an actual implementation, $x_i^{(k+1)}$ simply overwrites the value of $x_i^{(k)}$
•Algorithm
•Initialize with a guess $x = (x_1, x_2, \dots, x_n)^T$
•repeat
   for all $j$ in $1, 2, \dots, n$ do
     $x_j \leftarrow \arg\min_{x_j} f(x)$
   end for
until convergence
•(a runnable sketch of this loop follows this slide)
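•A minimal MATLAB sketch of the loop above (my own illustration, not code from the slides): the names coord_descent and setCoord and the fixed search interval are assumptions, and each one-dimensional argmin is solved numerically with fminbnd, which presumes the coordinate-wise minimizer lies inside [-bound, bound].

% Minimal sketch of cyclic coordinate descent (illustration only).
function x = coord_descent(f, x0, tol, maxIter)
    x = x0;
    bound = 10;                              % assumed 1-D search interval
    for k = 1:maxIter
        xPrev = x;
        for j = 1:numel(x)                   % cyclic order: 1, 2, ..., n
            g = @(y) f(setCoord(x, j, y));   % f as a function of x_j alone
            x(j) = fminbnd(g, -bound, bound);% exact 1-D minimization
        end
        if norm(x - xPrev) < tol, break; end % stop when a full cycle stalls
    end
end

function v = setCoord(v, j, y)               % helper: set the j-th coordinate
    v(j) = y;
end

•For example, coord_descent(@(x) x(1)^2 + x(2)^2 + x(1)*x(2), [1; -2], 1e-8, 100) returns a point near the minimizer $(0, 0)$.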
5. Overview of Coordinate descent method
•Start with some initial guess $x^{(0)}$, and repeat for $k = 1, 2, 3, \dots$
•$x_1^{(k)} \in \arg\min_{x_1} f(x_1, x_2^{(k-1)}, x_3^{(k-1)}, \dots, x_n^{(k-1)})$
•$x_2^{(k)} \in \arg\min_{x_2} f(x_1^{(k)}, x_2, x_3^{(k-1)}, \dots, x_n^{(k-1)})$
•$x_3^{(k)} \in \arg\min_{x_3} f(x_1^{(k)}, x_2^{(k)}, x_3, \dots, x_n^{(k-1)})$
…
•$x_n^{(k)} \in \arg\min_{x_n} f(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, \dots, x_n)$
•In each iteration, the update moves along one coordinate basis direction at a time (a worked example follows this slide)
•cf. Gradient Descent Method
•In each iteration (step), it moves along the full gradient direction $-\nabla f = -\left(\frac{\partial f}{\partial x_1} e_1 + \cdots + \frac{\partial f}{\partial x_n} e_n\right)$
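•A worked instance of the sweep above (my own example, not from the slides): for $f(x_1, x_2) = x_1^2 + x_2^2 + x_1 x_2$, each argmin has a closed form, since $\partial f / \partial x_1 = 2x_1 + x_2 = 0$ gives $x_1 = -x_2/2$, and symmetrically $x_2 = -x_1/2$.

% Closed-form coordinate updates for f(x1,x2) = x1^2 + x2^2 + x1*x2.
x = [1; -2];              % initial guess x^(0)
for k = 1:20
    x(1) = -x(2) / 2;     % x1^(k) uses the old x2^(k-1)
    x(2) = -x(1) / 2;     % x2^(k) uses the freshly updated x1^(k)
end
disp(x)                   % approaches the minimizer (0, 0)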
6. Properties of Coordinate Descent
•Note:
•The order of cycling through the coordinates is arbitrary; any permutation of $\{1, 2, \dots, n\}$ can be used:
•Cyclic order: $1, 2, \dots, n, 1, \dots$, repeat
•Almost cyclic: each coordinate $i \in \{1, \dots, n\}$ is picked at least once every $B$ successive iterations ($B \ge n$)
•Double sweep: $1, 2, \dots, n, n-1, \dots, 2, 1$, repeat
•Cyclic with permutation: a new random order each cycle
•Random sampling: pick a random index at each iteration (index rules are sketched after this slide)
•Individual coordinates can everywhere be replaced by blocks of coordinates (Block Coordinate Descent Method)
•The “one-at-a-time” update scheme is critical; an “all-at-once” scheme does not necessarily converge
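•The orderings above written as MATLAB index rules (a sketch; all variable names are mine):

% Index-selection rules for the orderings listed above (sketch).
n = 4;
cyclic   = @(k) mod(k - 1, n) + 1;   % cyclic: 1,2,...,n,1,...
dsweep   = [1:n, n-1:-1:1];          % double sweep: 1,...,n,n-1,...,1, repeat
permuted = randperm(n);              % cyclic with permutation: reshuffle per cycle
sampled  = @() randi(n);             % random sampling: fresh index each step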
7. Properties of Coordinate Descent
•Advantages
•A parallel algorithm is possible
•No step size tuning
•Each iteration is usually cheap (a single-variable optimization)
•No extra storage vectors needed
•(Usually) no other pesky parameters that must be tuned
•Works well for large-scale problems
•Very useful in cases where the actual gradient of $f$ is not known
•Easy to implement
•Disadvantages
•Tricky if the single-variable optimization is hard
•Convergence theory can be complicated
•Can be slower near the optimum than more sophisticated methods
•The non-smooth case is more tricky
8. Convergence of Coordinate descent
•Recall: $x_i^{(k+1)} \leftarrow \arg\min_{y \in \mathbb{R}} f(x_1^{(k+1)}, \dots, x_{i-1}^{(k+1)}, y, x_{i+1}^{(k)}, \dots, x_n^{(k)})$
•Thus, one begins with an initial guess $x^{(0)}$ for a local minimum of $F$ and iteratively obtains a sequence $x^{(0)}, x^{(1)}, x^{(2)}, \dots$
•By doing line search in each iteration, we automatically have
•$F(x^{(0)}) \ge F(x^{(1)}) \ge F(x^{(2)}) \ge \cdots$
•It can be shown that this sequence has convergence properties similar to those of steepest descent
•If one full cycle of line searches along the coordinate directions yields no improvement, a stationary point has been reached (a numerical check of the monotone decrease follows this slide)
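•A quick numerical check of the monotone decrease above (my own snippet, reusing the closed-form updates from the slide-5 example):

% Exact per-coordinate minimization means the objective never increases.
F = @(x) x(1)^2 + x(2)^2 + x(1)*x(2);
x = [1; -2];
vals = F(x);                       % record F(x^(0)), F(x^(1)), ...
for k = 1:10
    x(1) = -x(2) / 2;              % closed-form coordinate argmins
    x(2) = -x(1) / 2;
    vals(end+1) = F(x);            %#ok<AGROW>
end
assert(all(diff(vals) <= 1e-12))   % F(x^(0)) >= F(x^(1)) >= ...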
9. Convergence Analysis
•For continuously differentiable cost functions, coordinate descent can be shown to generate sequences whose limit points are stationary points
•Lemma 5.4
•Proof
•In the Caramanis lecture notes
•Idea: show that $\lim_{j \to \infty} \left(x_1^{(k_j+1)} - x_1^{(k_j)}\right) = 0$ using $\lim_{j \to \infty} \left(z_1^{(k_j)} - x_1^{(k_j)}\right) = 0$, where $z_i^{(k)} = (x_1^{(k+1)}, \dots, x_i^{(k+1)}, x_{i+1}^{(k)}, \dots, x_n^{(k)})$
10. Convergence Analysis
•Question
•Given a convex, differentiable $f: \mathbb{R}^n \to \mathbb{R}$, if we are at a point $x$ such that $f(x)$ is minimized along each coordinate axis, have we found a global minimizer?
•i.e., does $f(x + d \cdot e_i) \ge f(x)$ for all $d, i$ imply $f(x) = \min_z f(z)$?
•Here, $e_i = (0, \dots, 1, \dots, 0) \in \mathbb{R}^n$ is the $i$-th standard basis vector
•Answer
•Yes
•Proof
•$\nabla f(x) = \frac{\partial f}{\partial x_1} e_1 + \cdots + \frac{\partial f}{\partial x_n} e_n = 0$ (spelled out below)
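•Spelling out the last step (the standard first-order argument for convex functions; this expansion is mine): coordinate-wise optimality of the smooth $f$ forces every partial derivative, and hence $\nabla f(x)$, to vanish, so

\[
f(z) \;\ge\; f(x) + \nabla f(x)^T (z - x) \;=\; f(x) \qquad \text{for all } z \in \mathbb{R}^n ,
\]

i.e., $x$ is a global minimizer.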
12. Convergence Analysis
•Question
•Same again, but now $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$
•where $g$ is convex and differentiable and each $h_i$ is convex
•Here, the non-smooth part is called separable
•Answer
•Yes
•Proof: for any 푦
•$f(y) - f(x) \ge \nabla g(x)^T (y - x) + \sum_{i=1}^n \left[h_i(y_i) - h_i(x_i)\right] = \sum_{i=1}^n \underbrace{\left[\nabla_i g(x)(y_i - x_i) + h_i(y_i) - h_i(x_i)\right]}_{\ge 0} \ge 0$
•(a concrete instance is sketched after this slide)
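•A concrete instance (my own example; the slides do not name one): the lasso objective $\frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$ has exactly this $g$ + separable $h$ form, and each coordinate argmin is a closed-form soft-thresholding step. All names in this MATLAB sketch are assumptions.

% Coordinate descent for the lasso: 0.5*||A*x - b||^2 + lambda*||x||_1.
% Each 1-D subproblem is solved exactly by soft thresholding (sketch).
function x = lasso_cd(A, b, lambda, iters)
    n = size(A, 2);
    x = zeros(n, 1);
    r = b - A * x;                        % running residual b - A*x
    for k = 1:iters
        for j = 1:n
            aj  = A(:, j);
            rho = aj' * (r + aj * x(j));  % correlation with x_j's effect removed
            xj  = sign(rho) * max(abs(rho) - lambda, 0) / (aj' * aj);
            r   = r - aj * (xj - x(j));   % incremental residual update
            x(j) = xj;
        end
    end
end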
13. Example
•Example MATLAB code
•Reuse the source code from http://www.mathworks.com/matlabcentral/fileexchange/35535-simplified-gradient-descent-optimization
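•The linked File Exchange code is external and not reproduced here; as a stand-in, a minimal self-contained demo of the coord_descent sketch given earlier (entirely my own, on an assumed convex quadratic):

% Demo: run the coord_descent sketch on a simple convex quadratic.
f = @(x) (x(1) - 1)^2 + 10 * (x(2) + 2)^2 + 0.5 * x(1) * x(2);
xmin = coord_descent(f, [0; 0], 1e-8, 200);
fprintf('minimizer: (%.4f, %.4f), f = %.6f\n', xmin(1), xmin(2), f(xmin));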