This document provides an overview of parallel coordinate descent algorithms. It discusses how naive parallelization of sequential coordinate descent will not always converge due to coordinate interactions. Two approaches for parallel coordinate descent are presented: Expected Separable Over-approximation (ESO) and Shotgun. ESO minimizes an overapproximated quadratic function to determine step sizes. Shotgun randomly selects coordinates to update in parallel each iteration. The document also notes limitations such as large communication overhead and inability to prove convergence without knowing the separability and smoothness of the objective function.
3. Definition
Coordinate-wise minimization of the objective function.
The objective function is of the form
F(x) = f(x) + Ω(x) (1)
where
f(x) is a partially separable function, and
Ω(x) is a simple block-separable function.
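For concreteness (an illustrative instance, not stated on the slides), the L1-regularised least-squares (Lasso) problem fits this form with
f(x) = (1/2)‖Ax − y‖² (the smooth, partially separable loss) and Ω(x) = λ‖x‖₁ (the separable penalty),
where A is a design matrix, y the targets, and λ > 0 the regularisation weight.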
4. Sequential Coordinate Descent (SCD)
Set x = 0 ∈ R^{2d}_+ ;
while not converged do
Choose j ∈ {1, ..., 2d} uniformly at random;
Set δx_j ← max{−x_j , −(∇F(x))_j / β};
Update x_j ← x_j + δx_j ;
end
Algorithm 1: Shooting: Sequential Coordinate Descent
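The following Python sketch mirrors Algorithm 1 for the non-negative Lasso reformulation over R^{2d}_+ (splitting x into positive and negative parts). The design matrix A, targets y, regularisation weight lam, and the conservative choice β = max_j ‖column j‖² are assumptions of this illustration, not taken from the slides.

import numpy as np

def shooting_scd(A, y, lam, n_iters=1000, seed=0):
    # Sketch of "Shooting" for min_{x >= 0} 0.5*||A_tilde @ x - y||^2 + lam*sum(x),
    # where A_tilde = [A, -A] doubles the d features into 2d non-negative weights.
    rng = np.random.default_rng(seed)
    A_tilde = np.hstack([A, -A])
    two_d = A_tilde.shape[1]
    beta = (A_tilde ** 2).sum(axis=0).max()   # conservative coordinate Lipschitz constant (assumed)
    x = np.zeros(two_d)                       # x = 0 in R^{2d}_+
    residual = A_tilde @ x - y
    for _ in range(n_iters):
        j = rng.integers(two_d)               # choose j uniformly at random
        grad_j = A_tilde[:, j] @ residual + lam    # (grad F(x))_j
        delta = max(-x[j], -grad_j / beta)    # step keeps x_j >= 0
        x[j] += delta
        residual += delta * A_tilde[:, j]     # cheap incremental residual update
    return x[:A.shape[1]] - x[A.shape[1]:]    # recover signed weights

Updating the residual incrementally keeps each iteration linear in the number of samples.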
5. Approach to Naive Parallelization
Each iteration of SCD minimizes over a single coordinate.
We can parallelize by letting different processors update
multiple coordinates in each iteration.
6. Why Naive Parallelization won’t work
1 It is proven theoretically that the "one at a time" update converges, while the "all at once" update may not.
2 Whether the parallel update still converges depends on the correlation among coordinates (a toy numerical illustration follows this list).
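The toy example below (all values are assumptions made for illustration) shows the effect: on a quadratic with strongly correlated coordinates, exact coordinate minimization one at a time converges, whereas applying all coordinate minimizers at once from the same starting point diverges.

import numpy as np

# Toy quadratic f(x) = 0.5 * x^T Q x with heavily correlated coordinates;
# n = 10 and rho = 0.5 are arbitrary illustrative choices.
n, rho = 10, 0.5
Q = (1 - rho) * np.eye(n) + rho * np.ones((n, n))

def coord_min(x, j):
    # Exact minimizer of f over coordinate j with all other coordinates fixed.
    return -(Q[j] @ x - Q[j, j] * x[j]) / Q[j, j]

x_seq = np.ones(n)   # "one at a time" updates (uses freshly updated coordinates)
x_par = np.ones(n)   # "all at once" updates (all computed from the old iterate)
for _ in range(20):
    for j in range(n):
        x_seq[j] = coord_min(x_seq, j)
    x_par = np.array([coord_min(x_par, j) for j in range(n)])

print("sequential ||x||:", np.linalg.norm(x_seq))   # shrinks towards the minimizer 0
print("parallel   ||x||:", np.linalg.norm(x_par))   # grows without bound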
7. Intuition behind Parallelization
If ∆x is the collective update to x in a single iteration of a naive parallel approach, then
F(x + ∆x) − F(x) ≤ −(1/2) Σ_{i_j ∈ P_t} (δx_{i_j})² + (1/2) Σ_{i_j, i_k ∈ P_t, j ≠ k} (AᵀA)_{i_j, i_k} δx_{i_j} δx_{i_k} (2)
where A is the design matrix of the L1-regularised loss function.
Therefore the step sizes of the parallel updates must be designed according to the amount of interference between the updated coordinates.
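The two terms of (2) can be evaluated directly for a candidate set P_t of simultaneous updates; the sketch below (the names A, deltas, and idx are assumptions of this illustration) makes the trade-off concrete: when the interference term outweighs the descent term, the naive parallel step is no longer guaranteed to decrease F.

import numpy as np

def bound_terms(A, deltas, idx):
    # Descent term and interference term of inequality (2) for the updates
    # deltas applied simultaneously to the coordinates listed in idx (the set P_t).
    # (The Shotgun analysis assumes unit-normalised columns of A.)
    G = A.T @ A                                  # Gram matrix of the design
    descent = -0.5 * float(np.sum(deltas ** 2))
    interference = 0.5 * sum(G[idx[j], idx[k]] * deltas[j] * deltas[k]
                             for j in range(len(idx))
                             for k in range(len(idx)) if j != k)
    return descent, interference

# Example: the sum of the two terms bounds F(x + dx) - F(x) from above.
A = np.random.randn(100, 50)
d_term, i_term = bound_terms(A, np.array([0.3, -0.2, 0.1]), [4, 7, 9])
print(d_term + i_term)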
8. Expected Separable Over-approximation (ESO)
1 Let the update rule be generally defined as
x ← x + (1/β) Σ_{i ∈ Ŝ} h_i e_i (3)
where h defines the update rule. Then
E[f(x + h_[Ŝ])] ≤ f(x) + (E[|Ŝ|] / n) ((∇f(x))ᵀ h + (β/2) ‖h‖_w²) (4)
where h_[Ŝ] = Σ_{i ∈ Ŝ} h_i e_i and ‖h‖_w² = Σ_{i=1}^{n} w_i (h_i)².
2 We overapproximate the function by a separable quadratic and minimize it; this yields the updates used in PCDM1 and PCDM2 (see the sketch below).
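For Ω(x) = λ‖x‖₁, minimizing the separable quadratic overapproximation reduces to an independent soft-thresholding step for each sampled coordinate. The following sketch assumes that choice of Ω, plus ESO parameters beta and w supplied by the caller; it is an illustration, not the paper's implementation.

import numpy as np

def eso_block_update(x, grad, beta, w, lam, sampled):
    # For each sampled coordinate i, minimize the separable overapproximation
    #   grad_i * t + (beta * w_i / 2) * t^2 + lam * |x_i + t|
    # which has a closed-form proximal (soft-thresholding) solution.
    h = np.zeros_like(x)
    for i in sampled:
        step = beta * w[i]
        z = x[i] - grad[i] / step                            # minimizer of the quadratic part
        x_new = np.sign(z) * max(abs(z) - lam / step, 0.0)   # soft threshold
        h[i] = x_new - x[i]
    return h   # apply as x <- x + h; only sampled coordinates move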
9. Shotgun: Parallel Coordinate Descent
Choose number of parallel updates P ≥ 1;
Set x = 0 ∈ R^{2d}_+ ;
while not converged do
Choose a random subset of P weights in {1, ..., 2d};
In parallel on P processors
Get assigned weight j;
Set δx_j ← max{−x_j , −(∇F(x))_j / β};
Update x_j ← x_j + δx_j ;
end
Algorithm 2: Shotgun: Parallel Coordinate Descent
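A minimal Python sketch of Algorithm 2; parallelism is only simulated here, in that the P deltas are computed from the same snapshot of x and then applied together. As in the sequential sketch above, A, y, lam, and the choice of β are assumptions of the illustration.

import numpy as np

def shotgun(A, y, lam, P=4, n_iters=500, seed=0):
    rng = np.random.default_rng(seed)
    A_tilde = np.hstack([A, -A])                # non-negative reformulation over R^{2d}_+
    two_d = A_tilde.shape[1]
    beta = (A_tilde ** 2).sum(axis=0).max()     # conservative coordinate Lipschitz constant (assumed)
    x = np.zeros(two_d)
    for _ in range(n_iters):
        residual = A_tilde @ x - y
        picks = rng.choice(two_d, size=P, replace=False)   # random subset of P weights
        # Each "processor" computes its delta from the same snapshot of x ...
        deltas = [max(-x[j], -(A_tilde[:, j] @ residual + lam) / beta) for j in picks]
        # ... and all P deltas are then applied together.
        for j, d in zip(picks, deltas):
            x[j] += d
    return x[:A.shape[1]] - x[A.shape[1]:]      # recover signed weights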
10. Parallel Coordinate Descent Method 1 (PCDM 1)
Choose initial point x_0 ∈ R^N;
for k = 0, 1, 2, ... do
Randomly generate a set of blocks S_k ⊂ {1, 2, ..., n};
x_{k+1} ← x_k + (h(x_k))_[S_k];
end
Algorithm 3: Parallel Coordinate Descent Method 1 (PCDM 1)
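A Python sketch of PCDM 1 for the L1-regularised least-squares problem with single-coordinate blocks, using the soft-thresholding update derived from the ESO above. The objective, the conservative choice β = τ (the number of blocks sampled per iteration), and w_i = ‖A_i‖² are assumptions of the illustration; the paper derives sharper values of β from the sampling Ŝ and the separability of f.

import numpy as np

def pcdm1(A, y, lam, tau=4, n_iters=500, seed=0):
    # Sketch of PCDM 1 for 0.5*||Ax - y||^2 + lam*||x||_1 (blocks = coordinates).
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    w = (A ** 2).sum(axis=0)                       # coordinate-wise Lipschitz constants of f
    beta = tau                                     # conservative ESO parameter (assumed)
    x = np.zeros(n)
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)                   # gradient of the smooth part
        S = rng.choice(n, size=tau, replace=False) # random set of blocks S_k
        for i in S:                                # apply (h(x_k))_[S_k]
            step = beta * w[i]
            z = x[i] - grad[i] / step
            x[i] = np.sign(z) * max(abs(z) - lam / step, 0.0)
    return x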
11. Parallel Coordinate Descent Method 2 (PCDM 2)
Choose initial point x_0 ∈ R^N;
for k = 0, 1, 2, ... do
Randomly generate a set of blocks S_k ⊂ {1, 2, ..., n};
x_{k+1} ← x_k + (h(x_k))_[S_k];
If F(x_{k+1}) > F(x_k), then x_{k+1} ← x_k;
end
Algorithm 4: Parallel Coordinate Descent Method 2 (PCDM 2)
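PCDM 2 differs from PCDM 1 only in the monotonicity test. In the sketch above this amounts to remembering the previous iterate and rejecting any step that increases F; the helpers below assume the same Lasso instance as the PCDM 1 sketch.

import numpy as np

def F_value(A, y, lam, x):
    # Composite objective F(x) = f(x) + Omega(x) for the assumed Lasso instance.
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

def monotone_step(A, y, lam, x_old, x_trial):
    # PCDM 2's extra test: if F(x_{k+1}) > F(x_k), keep x_k.
    return x_trial if F_value(A, y, lam, x_trial) <= F_value(A, y, lam, x_old) else x_old

In the PCDM 1 sketch this would wrap each iteration: keep x_old = x.copy() before the block updates and set x = monotone_step(A, y, lam, x_old, x) afterwards.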
12. Limitations
1 Each iteration performs very little computation, so the communication
overhead of synchronising parallel updates is comparatively large
(synchronous vs. asynchronous updates; optimal strong convexity
required for convergence).
2 Convergence cannot be proved if the nature of F(x), i.e. its
separability and smoothness, is not known.
13. References and Further Reading I
[1] Peter Richtárik and Martin Takáč (University of Edinburgh, United Kingdom).
Parallel Coordinate Descent Methods for Big Data Optimization.
[2] Joseph Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin (Carnegie Mellon University, Pittsburgh, USA).
Parallel Coordinate Descent for L1-Regularized Loss Minimization.
[3] Ji Liu and Stephen J. Wright (University of Wisconsin, Madison, USA).
Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties.