This document provides an overview of parallel coordinate descent algorithms. It discusses how naive parallelization of sequential coordinate descent will not always converge due to coordinate interactions. Two approaches for parallel coordinate descent are presented: Expected Separable Over-approximation (ESO) and Shotgun. ESO minimizes an overapproximated quadratic function to determine step sizes. Shotgun randomly selects coordinates to update in parallel each iteration. The document also notes limitations such as large communication overhead and inability to prove convergence without knowing the separability and smoothness of the objective function.
3. Definition
Coordinate-wise minimization of the objective function.
The objective function is of the form
F(x) = f(x) + Ω(x) (1)
where
f(x) is a partially separable function, and
Ω(x) is a simple block-separable function.
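For concreteness (an illustrative instance, not stated on the slides), the L1-regularised least-squares (Lasso) problem fits this form with
f(x) = (1/2)‖Ax − y‖² (the smooth, partially separable loss) and Ω(x) = λ‖x‖₁ (the separable penalty),
where A is a design matrix, y the targets, and λ > 0 the regularisation weight.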
4. Sequential Coordinate Descent (SCD)
Set x = 0 ∈ R^{2d}_+ ;
while not converged do
Choose j ∈ {1, ..., 2d} uniformly at random;
Set δx_j ← max{−x_j , −(∇F(x))_j / β};
Update x_j ← x_j + δx_j ;
end
Algorithm 1: Shooting: Sequential Coordinate Descent
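The following Python sketch mirrors Algorithm 1 for the non-negative Lasso reformulation over R^{2d}_+ (splitting x into positive and negative parts). The design matrix A, targets y, regularisation weight lam, and the conservative choice β = max_j ‖column j‖² are assumptions of this illustration, not taken from the slides.

import numpy as np

def shooting_scd(A, y, lam, n_iters=1000, seed=0):
    # Sketch of "Shooting" for min_{x >= 0} 0.5*||A_tilde @ x - y||^2 + lam*sum(x),
    # where A_tilde = [A, -A] doubles the d features into 2d non-negative weights.
    rng = np.random.default_rng(seed)
    A_tilde = np.hstack([A, -A])
    two_d = A_tilde.shape[1]
    beta = (A_tilde ** 2).sum(axis=0).max()   # conservative coordinate Lipschitz constant (assumed)
    x = np.zeros(two_d)                       # x = 0 in R^{2d}_+
    residual = A_tilde @ x - y
    for _ in range(n_iters):
        j = rng.integers(two_d)               # choose j uniformly at random
        grad_j = A_tilde[:, j] @ residual + lam    # (grad F(x))_j
        delta = max(-x[j], -grad_j / beta)    # step keeps x_j >= 0
        x[j] += delta
        residual += delta * A_tilde[:, j]     # cheap incremental residual update
    return x[:A.shape[1]] - x[A.shape[1]:]    # recover signed weights

Updating the residual incrementally keeps each iteration linear in the number of samples.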
5. Approach to Naive Parallelization
Each iteration of SCD minimizes over a single coordinate.
We can parallelize by letting different processors update
multiple coordinates in each iteration.
6. Why Naive Parallelization won’t work
1 It is proven theoretically that the "one at a time" update converges, while the "all at once" update may not.
2 Whether the parallel update still converges depends on the correlation among coordinates (a toy numerical illustration follows this list).
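The toy example below (all values are assumptions made for illustration) shows the effect: on a quadratic with strongly correlated coordinates, exact coordinate minimization one at a time converges, whereas applying all coordinate minimizers at once from the same starting point diverges.

import numpy as np

# Toy quadratic f(x) = 0.5 * x^T Q x with heavily correlated coordinates;
# n = 10 and rho = 0.5 are arbitrary illustrative choices.
n, rho = 10, 0.5
Q = (1 - rho) * np.eye(n) + rho * np.ones((n, n))

def coord_min(x, j):
    # Exact minimizer of f over coordinate j with all other coordinates fixed.
    return -(Q[j] @ x - Q[j, j] * x[j]) / Q[j, j]

x_seq = np.ones(n)   # "one at a time" updates (uses freshly updated coordinates)
x_par = np.ones(n)   # "all at once" updates (all computed from the old iterate)
for _ in range(20):
    for j in range(n):
        x_seq[j] = coord_min(x_seq, j)
    x_par = np.array([coord_min(x_par, j) for j in range(n)])

print("sequential ||x||:", np.linalg.norm(x_seq))   # shrinks towards the minimizer 0
print("parallel   ||x||:", np.linalg.norm(x_par))   # grows without bound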
7. Intuition behind Parallelization
If ∆x is the collective update to x in a single iteration of a naive parallel approach, then
F(x + ∆x) − F(x) ≤ −(1/2) Σ_{i_j ∈ P_t} (δx_{i_j})² + (1/2) Σ_{i_j, i_k ∈ P_t, j ≠ k} (AᵀA)_{i_j, i_k} δx_{i_j} δx_{i_k} (2)
where A is the design matrix of the L1-regularised loss function.
Therefore the step sizes of the parallel updates must be designed according to the amount of interference between the updated coordinates.
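The two terms of (2) can be evaluated directly for a candidate set P_t of simultaneous updates; the sketch below (the names A, deltas, and idx are assumptions of this illustration) makes the trade-off concrete: when the interference term outweighs the descent term, the naive parallel step is no longer guaranteed to decrease F.

import numpy as np

def bound_terms(A, deltas, idx):
    # Descent term and interference term of inequality (2) for the updates
    # deltas applied simultaneously to the coordinates listed in idx (the set P_t).
    # (The Shotgun analysis assumes unit-normalised columns of A.)
    G = A.T @ A                                  # Gram matrix of the design
    descent = -0.5 * float(np.sum(deltas ** 2))
    interference = 0.5 * sum(G[idx[j], idx[k]] * deltas[j] * deltas[k]
                             for j in range(len(idx))
                             for k in range(len(idx)) if j != k)
    return descent, interference

# Example: the sum of the two terms bounds F(x + dx) - F(x) from above.
A = np.random.randn(100, 50)
d_term, i_term = bound_terms(A, np.array([0.3, -0.2, 0.1]), [4, 7, 9])
print(d_term + i_term)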
8. Expected Separable Over-approximation (ESO)
1 Let the update rule be generally defined as
x ← x + (1/β) Σ_{i ∈ Ŝ} h_i e_i (3)
where h defines the update rule. Then
E[f(x + h_[Ŝ])] ≤ f(x) + (E[|Ŝ|] / n) ((∇f(x))ᵀ h + (β/2) ‖h‖_w²) (4)
where h_[Ŝ] = Σ_{i ∈ Ŝ} h_i e_i and ‖h‖_w² = Σ_{i=1}^{n} w_i (h_i)².
2 We overapproximate the function by a separable quadratic and minimize it; this yields the updates used in PCDM1 and PCDM2 (see the sketch below).
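For Ω(x) = λ‖x‖₁, minimizing the separable quadratic overapproximation reduces to an independent soft-thresholding step for each sampled coordinate. The following sketch assumes that choice of Ω, plus ESO parameters beta and w supplied by the caller; it is an illustration, not the paper's implementation.

import numpy as np

def eso_block_update(x, grad, beta, w, lam, sampled):
    # For each sampled coordinate i, minimize the separable overapproximation
    #   grad_i * t + (beta * w_i / 2) * t^2 + lam * |x_i + t|
    # which has a closed-form proximal (soft-thresholding) solution.
    h = np.zeros_like(x)
    for i in sampled:
        step = beta * w[i]
        z = x[i] - grad[i] / step                            # minimizer of the quadratic part
        x_new = np.sign(z) * max(abs(z) - lam / step, 0.0)   # soft threshold
        h[i] = x_new - x[i]
    return h   # apply as x <- x + h; only sampled coordinates move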
9. Shotgun: Parallel Coordinate Descent
Choose number of parallel updates P ≥ 1;
Set x = 0 ∈ R^{2d}_+ ;
while not converged do
Choose a random subset of P weights in {1, ..., 2d};
In parallel on P processors
Get assigned weight j;
Set δx_j ← max{−x_j , −(∇F(x))_j / β};
Update x_j ← x_j + δx_j ;
end
Algorithm 2: Shotgun: Parallel Coordinate Descent
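A minimal Python sketch of Algorithm 2; parallelism is only simulated here, in that the P deltas are computed from the same snapshot of x and then applied together. As in the sequential sketch above, A, y, lam, and the choice of β are assumptions of the illustration.

import numpy as np

def shotgun(A, y, lam, P=4, n_iters=500, seed=0):
    rng = np.random.default_rng(seed)
    A_tilde = np.hstack([A, -A])                # non-negative reformulation over R^{2d}_+
    two_d = A_tilde.shape[1]
    beta = (A_tilde ** 2).sum(axis=0).max()     # conservative coordinate Lipschitz constant (assumed)
    x = np.zeros(two_d)
    for _ in range(n_iters):
        residual = A_tilde @ x - y
        picks = rng.choice(two_d, size=P, replace=False)   # random subset of P weights
        # Each "processor" computes its delta from the same snapshot of x ...
        deltas = [max(-x[j], -(A_tilde[:, j] @ residual + lam) / beta) for j in picks]
        # ... and all P deltas are then applied together.
        for j, d in zip(picks, deltas):
            x[j] += d
    return x[:A.shape[1]] - x[A.shape[1]:]      # recover signed weights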
10. Parallel Coordinate Descent Method 1 (PCDM 1)
Choose initial point x_0 ∈ R^N;
for k = 0, 1, 2, ... do
Randomly generate a set of blocks S_k ⊂ {1, 2, ..., n};
x_{k+1} ← x_k + (h(x_k))_[S_k];
end
Algorithm 3: Parallel Coordinate Descent Method 1 (PCDM 1)
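A Python sketch of PCDM 1 for the L1-regularised least-squares problem with single-coordinate blocks, using the soft-thresholding update derived from the ESO above. The objective, the conservative choice β = τ (the number of blocks sampled per iteration), and w_i = ‖A_i‖² are assumptions of the illustration; the paper derives sharper values of β from the sampling Ŝ and the separability of f.

import numpy as np

def pcdm1(A, y, lam, tau=4, n_iters=500, seed=0):
    # Sketch of PCDM 1 for 0.5*||Ax - y||^2 + lam*||x||_1 (blocks = coordinates).
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    w = (A ** 2).sum(axis=0)                       # coordinate-wise Lipschitz constants of f
    beta = tau                                     # conservative ESO parameter (assumed)
    x = np.zeros(n)
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)                   # gradient of the smooth part
        S = rng.choice(n, size=tau, replace=False) # random set of blocks S_k
        for i in S:                                # apply (h(x_k))_[S_k]
            step = beta * w[i]
            z = x[i] - grad[i] / step
            x[i] = np.sign(z) * max(abs(z) - lam / step, 0.0)
    return x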
11. Parallel Coordinate Descent Method 2 (PCDM 2)
Choose initial point x_0 ∈ R^N;
for k = 0, 1, 2, ... do
Randomly generate a set of blocks S_k ⊂ {1, 2, ..., n};
x_{k+1} ← x_k + (h(x_k))_[S_k];
If F(x_{k+1}) > F(x_k), then x_{k+1} ← x_k;
end
Algorithm 4: Parallel Coordinate Descent Method 2 (PCDM 2)
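PCDM 2 differs from PCDM 1 only in the monotonicity test. In the sketch above this amounts to remembering the previous iterate and rejecting any step that increases F; the helpers below assume the same Lasso instance as the PCDM 1 sketch.

import numpy as np

def F_value(A, y, lam, x):
    # Composite objective F(x) = f(x) + Omega(x) for the assumed Lasso instance.
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

def monotone_step(A, y, lam, x_old, x_trial):
    # PCDM 2's extra test: if F(x_{k+1}) > F(x_k), keep x_k.
    return x_trial if F_value(A, y, lam, x_trial) <= F_value(A, y, lam, x_old) else x_old

In the PCDM 1 sketch this would wrap each iteration: keep x_old = x.copy() before the block updates and set x = monotone_step(A, y, lam, x_old, x) afterwards.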
12. Limitations
1 Each iteration performs very little computation, so the communication
overhead of synchronising parallel updates is comparatively large
(synchronous vs. asynchronous updates; optimal strong convexity
required for convergence).
2 Convergence cannot be proved if the nature of F(x), i.e. its
separability and smoothness, is not known.
13. References and Further Reading I
[1] Peter Richtárik and Martin Takáč (University of Edinburgh, United Kingdom).
Parallel Coordinate Descent Methods for Big Data Optimization.
[2] Joseph Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin (Carnegie Mellon University, Pittsburgh, USA).
Parallel Coordinate Descent for L1-Regularized Loss Minimization.
[3] Ji Liu and Stephen J. Wright (University of Wisconsin, Madison, USA).
Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties.