Distributed Subgradient Methods
for Saddle-Point Problems
David Mateos-Núñez and Jorge Cortés
University of California, San Diego
{dmateosn,cortes}@ucsd.edu
Conference on Decision and Control
Osaka, Japan, December 17, 2015
Context: decentralized, peer-to-peer systems
Sensor networks Medical diagnosis
Formation control Recommender systems
General agenda for today
Review of (consensus-based) distributed convex optimization
Part 1:
Distributed optimization with separable constraints
via agreement on the Lagrange multipliers
General saddle-point problems with explicit agreement
Part 2:
Convex-concave problems not arising from Lagrangians
e.g., strictly concave part
Distributed low-rank matrix completion
through a saddle-point characterization of the nuclear norm
Review: consensus-based distributed convex optimization
x^\ast \in \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^N f^i(x) \qquad \text{(basic unconstrained problem)}

Agent i has access to f^i
Agent i can share its estimate of x^\ast with “neighboring” agents

[Figure: communication digraph among the agents, with local objectives f^1, ..., f^5]

Adjacency matrix
A = \begin{pmatrix}
 & & a_{13} & a_{14} & \\
a_{21} & & & & a_{25} \\
a_{31} & a_{32} & & & \\
a_{41} & & & & \\
 & a_{52} & & &
\end{pmatrix}
Parallel computations: Tsitsiklis 84, Bertsekas and Tsitsiklis 95
Consensus: Jadbabaie et al. 03, Olfati-Saber, Murray 04, Boyd et al. 05
Distributed multi-agent optimization: A. Nedić and A. Ozdaglar 07
Review: the Laplacian matrix
L = \mathrm{diag}(A\mathbf{1}) - A = \begin{pmatrix}
a_{13}+a_{14} & & -a_{13} & -a_{14} & \\
-a_{21} & a_{21}+a_{25} & & & -a_{25} \\
-a_{31} & -a_{32} & a_{31}+a_{32} & & \\
-a_{41} & & & a_{41} & \\
 & -a_{52} & & & a_{52}
\end{pmatrix}

Nullspace is agreement ⇔ graph has a spanning tree
Consensus via feedback on disagreement: -[Lx]_i = \sum_{j=1}^N a_{ij}(x_j - x_i)
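To make the consensus step concrete, here is a minimal NumPy sketch; the adjacency weights are hypothetical placeholders matching the sparsity pattern above.

```python
import numpy as np

# Hypothetical weights on the edges of the digraph shown above
A = np.array([
    [0.0, 0.0, 0.3, 0.4, 0.0],   # agent 1 receives from agents 3 and 4
    [0.5, 0.0, 0.0, 0.0, 0.2],   # agent 2 receives from agents 1 and 5
    [0.6, 0.1, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.0, 0.0],
])

L = np.diag(A @ np.ones(5)) - A   # L = diag(A 1) - A

x = np.random.rand(5)             # each agent's current estimate of x*
x_next = x - 0.1 * (L @ x)        # -[L x]_i = sum_j a_ij (x_j - x_i)
```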
Part 1: Distributed constrained convex optimization
\min_{w^i \in \mathcal{W}_i\ \forall i,\ D \in \mathcal{D}} \ \sum_{i=1}^N f^i(w^i, D)
\quad \text{s.t.} \quad g^1(w^1, D) + \dots + g^N(w^N, D) \le 0
Constraints might couple decisions of agents that cannot communicate directly
e.g., \|w^i\|_2^2 - 10 \le 0
Agent i only knows how w^i enters the constraint through g^i
D is the usual decision vector the agents need to agree upon
Constraints are useful models for
Traffic and routing (Flow conservation)
Resource allocation (Budgets)
Optimal control (System evolution)
Network formation (Relative positions/angles)
Agenda for distributed constrained optimization
Previous work and limitations
Distributing the constraints through the Lagrangian decomposition
Idea: Agreement on the multiplier
General saddle-point problems with agreement constraints
Our distributed saddle-point dynamics with Laplacian averaging
Theorem of convergence: saddle-point evaluation error \sim \frac{1}{\sqrt{\#\,\text{iter.}}}
Previous work by type of constraint & info. structure
\min_{x} \ \sum_{i=1}^N f^i(x) \quad \text{s.t.} \quad g(x) \le 0

All agents know g
2011  D. Yuan, S. Xu, and H. Zhao
2012  M. Zhu and S. Martínez
Increasing literature

\min_{w^i \in \mathcal{W}_i} \ \sum_{i=1}^N f^i(w^i) \quad \text{s.t.} \quad \sum_{i=1}^N g^i(w^i) \le 0

Agent i knows only g^i
(when Aw \le 0, knows only column i, versus column i & row i of A)
Less studied information structure:
'10  D. Mosk-Aoyama, T. Roughgarden, and D. Shah (only linear constraints)
'13  M. Bürger, G. Notarstefano, and F. Allgöwer: dual cutting-plane consensus methods
'13  T.-H. Chang, A. Nedić, and A. Scaglione: primal-dual perturbation methods
Distributing the constraint via agreement on multipliers
\min_{w^i \in \mathcal{W}_i,\ D \in \mathcal{D}} \ \sum_{i=1}^N f^i(w^i, D)
\quad \text{s.t.} \quad g^1(w^1, D) + \dots + g^N(w^N, D) \le 0

same as

\min_{w^i \in \mathcal{W}_i,\ D \in \mathcal{D}} \ \max_{z \in \mathbb{R}^m_{\ge 0}} \ \sum_{i=1}^N f^i(w^i, D) + z^\top \sum_{i=1}^N g^i(w^i, D)

= \min_{w^i \in \mathcal{W}_i,\ D \in \mathcal{D}} \ \max_{\substack{z^i \in \mathbb{R}^m_{\ge 0} \\ z^i = z^j\ \forall i,j}} \ \sum_{i=1}^N \Big( f^i(w^i, D) + (z^i)^\top g^i(w^i, D) \Big)

= \min_{\substack{w^i \in \mathcal{W}_i,\ D^i \in \mathcal{D} \\ D^i = D^j\ \forall i,j}} \ \max_{\substack{z^i \in \mathbb{R}^m_{\ge 0} \\ z^i = z^j\ \forall i,j}} \ \sum_{i=1}^N \Big( f^i(w^i, D^i) + (z^i)^\top g^i(w^i, D^i) \Big)
Local terms, coupled through agreement

(Existence of saddle points ⇒ max-min property = strong duality)
Saddle-point problems with explicit agreement
A more general framework
\min_{\substack{w \in \mathcal{W} \\ (D^1,\dots,D^N) \in \mathcal{D}^N \\ D^i = D^j\ \forall i,j}} \ \max_{\substack{\mu \in \mathcal{M} \\ (z^1,\dots,z^N) \in \mathcal{Z}^N \\ z^i = z^j\ \forall i,j}} \ \phi\big(\underbrace{w, (D^1, \dots, D^N)}_{\text{convex}},\ \underbrace{\mu, (z^1, \dots, z^N)}_{\text{concave}}\big)
Distributed setting unstudied in the literature
Inspiration from A. Nedić and A. Ozdaglar, 09, and K. Arrow et al., 1958
Particularizes to...
Convex-concave functions arising from Lagrangians
The concave part is linear
Min-max formulation of nuclear norm regularization (later in talk)
The concave part is quadratic
Our general algorithm
Projected saddle-point subgradient algorithm with Laplacian averaging
(provided a saddle-point exists under agreement)
\hat{w}_{t+1} = w_t - \eta_t g_{w_t}
\hat{D}_{t+1} = D_t - \underbrace{\sigma L_t D_t}_{\text{Lap. aver.}} - \eta_t g_{D_t}
\hat{\mu}_{t+1} = \mu_t + \eta_t g_{\mu_t}
\hat{z}_{t+1} = z_t - \underbrace{\sigma L_t z_t}_{\text{Lap. aver.}} + \eta_t \hat{g}_{z_t}

(w_{t+1}, D_{t+1}, \mu_{t+1}, z_{t+1}) = \big(\mathcal{P}_{\mathcal{W}}(\hat{w}_{t+1}),\ \mathcal{P}_{\mathcal{D}^N}(\hat{D}_{t+1}),\ \mathcal{P}_{\mathcal{M}}(\hat{\mu}_{t+1}),\ \mathcal{P}_{\mathcal{Z}^N}(\hat{z}_{t+1})\big)

Orthogonal projections onto compact convex sets
g_{w_t}, g_{D_t}, g_{\mu_t}, \hat{g}_{z_t} are subgradients and supergradients of \phi(w, D, \mu, z)
Any initial conditions; not “anytime constraint satisfaction”
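A minimal Python/NumPy sketch of one iteration of this dynamics follows; the function names, the box used as a stand-in for the compact convex sets, and the way sub-/supergradients are passed in are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = diag(A 1) - A of a weighted adjacency matrix."""
    return np.diag(A.sum(axis=1)) - A

def project_box(v, lo=-10.0, hi=10.0):
    """Stand-in orthogonal projection onto a compact convex set (here, a box)."""
    return np.clip(v, lo, hi)

def saddle_point_step(w, D, mu, z, grads, A_t, eta, sigma):
    """One iteration of the projected saddle-point subgradient dynamics.

    grads = (g_w, g_D, g_mu, g_z): sub-/supergradients of phi at the current point.
    D and z stack the agents' copies row-wise and are mixed through the Laplacian
    of the time-varying communication digraph A_t.
    """
    g_w, g_D, g_mu, g_z = grads
    L_t = laplacian(A_t)
    w_hat = w - eta * g_w                        # subgradient descent (convex block)
    D_hat = D - sigma * (L_t @ D) - eta * g_D    # Laplacian averaging + descent
    mu_hat = mu + eta * g_mu                     # supergradient ascent (concave block)
    z_hat = z - sigma * (L_t @ z) + eta * g_z    # Laplacian averaging + ascent
    return tuple(project_box(v) for v in (w_hat, D_hat, mu_hat, z_hat))
```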
Theorem (Distributed saddle-point approximation)
Assume that
φ(w, D, µ, z) is convex in (w, D) ∈ \mathcal{W} \times \mathcal{D}^N and concave in (µ, z) ∈ \mathcal{M} \times \mathcal{Z}^N
The dynamics is bounded (may be achieved through projections)
The sequence of weight-balanced communication digraphs is
δ-nondegenerate (a_{ij} > δ whenever a_{ij} > 0)
B-jointly-connected (unions of length B are strongly connected)
For a suitable choice of consensus stepsize σ and (decreasing) subgradient stepsizes {η_t}, for any saddle point (w^\ast, D^\ast, \mu^\ast, z^\ast) of φ with D^\ast = D^\ast \otimes \mathbf{1} and z^\ast = z^\ast \otimes \mathbf{1} (i.e., with agreement in the D- and z-components),

-\frac{\alpha}{\sqrt{t-1}} \ \le\ \phi(w^{\mathrm{av}}_t, D^{\mathrm{av}}_t, \mu^{\mathrm{av}}_t, z^{\mathrm{av}}_t) - \phi(w^\ast, D^\ast, \mu^\ast, z^\ast) \ \le\ \frac{\alpha}{\sqrt{t-1}}
w^{\mathrm{av}}_{t+1} := \frac{1}{t+1}\sum_{s=1}^{t+1} w_s = \frac{t}{t+1}\, w^{\mathrm{av}}_t + \frac{1}{t+1}\, w_{t+1}, which can be computed recursively
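A small numerical check of this recursion (synthetic iterates, NumPy sketch):

```python
import numpy as np

iterates = [np.random.rand(3) for _ in range(100)]   # stand-in for w_1, w_2, ...

w_av = iterates[0].copy()
for t, w_next in enumerate(iterates[1:], start=1):
    w_av = (t / (t + 1)) * w_av + (1.0 / (t + 1)) * w_next   # recursive running average

assert np.allclose(w_av, np.mean(iterates, axis=0))          # matches the batch average
```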
Part 2: Beyond Lagrangians
Lagrangians are particular cases of convex-concave functions
– the concave part (Lagrange multipliers) is always linear
Other min-max problems can benefit from distributed formulations
– e.g., min-max formulations of the nuclear norm
Agenda for distributed optimization with nuclear norm
regularization
Definition of nuclear norm
Application to low-rank matrix completion
Our dynamics for distributed optimization with nuclear norm
Theorem of convergence (corollary from previous result)
Review: definition of nuclear norm
Given a matrix W = [\, w_1 \ \cdots \ w_N \,] \in \mathbb{R}^{d \times N},

\|W\|_\ast := \text{sum of singular values of } W = \mathrm{trace}\sqrt{W W^\top} = \mathrm{trace}\sqrt{\textstyle\sum_{i=1}^N w_i w_i^\top}

Optimization with nuclear norm regularization

\min_{w_i \in \mathcal{W}_i} \ \sum_{i=1}^N f^i(w_i) + \gamma \|W\|_\ast

favors vectors \{w_i\}_{i=1}^N belonging to a low-dimensional subspace
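A quick numerical check of these identities (NumPy sketch with a synthetic low-rank matrix):

```python
import numpy as np

d, N = 4, 6
W = np.random.randn(d, 2) @ np.random.randn(2, N)     # columns w_1, ..., w_N (rank 2)

nuc_svd = np.linalg.svd(W, compute_uv=False).sum()    # sum of singular values
# trace sqrt(W W^T) = sum of square roots of the eigenvalues of W W^T
eigs = np.clip(np.linalg.eigvalsh(W @ W.T), 0.0, None)
nuc_trace = np.sqrt(eigs).sum()
# same quantity written as trace sqrt(sum_i w_i w_i^T)
S = sum(np.outer(w, w) for w in W.T)
nuc_sum = np.sqrt(np.clip(np.linalg.eigvalsh(S), 0.0, None)).sum()

assert np.allclose([nuc_trace, nuc_sum], nuc_svd)
```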
Distributed low-rank matrix completion
[Table: ratings matrix with one column per user (Tara, Philip, Mauricio, Miroslav) and one row per movie (Toy Story, Jurassic Park, ...); W = [ W_{:,1} W_{:,2} W_{:,3} W_{:,4} ]]
Estimate W from the revealed entries {Zij }
\min_{W \in \mathbb{R}^{d \times N}} \ \sum_{(i,j) \in \text{revealed}} (W_{ij} - Z_{ij})^2 + \gamma \underbrace{\|W\|_\ast}_{\text{Nuclear norm}}
γ depends on application, dimensions... (Regularization, not penalty)
Netflix: users N ∼ 10^7, movies d ∼ 10^5
Why make it distributed? Because users may not want to share their ratings
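As an illustration, a hedged NumPy sketch (synthetic data, hypothetical variable names) of how this centralized objective can be evaluated:

```python
import numpy as np

d, N = 8, 20
Z = np.random.rand(d, N)                  # ratings matrix (stand-in data)
revealed = np.random.rand(d, N) < 0.6     # boolean mask of revealed entries
gamma = 0.1

def objective(W):
    fit = np.sum((W[revealed] - Z[revealed]) ** 2)       # squared error on revealed entries
    nuclear = np.linalg.svd(W, compute_uv=False).sum()   # nuclear norm ||W||_*
    return fit + gamma * nuclear
```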
Formulation of nuclear norm as saddle-point problem
Drawing from another paper of the authors (ignore details)
\min_{w_i \in \mathcal{W}_i} \ \sum_{i=1}^N f^i(w_i) + \gamma \,\big\|[\, W \mid \sqrt{}\, I_d \,]\big\|_\ast

= \min_{\substack{w_i \in \mathcal{W}_i,\ D_i \in \{D \succeq c I_d\} \\ D_i = D_j\ \forall i,j}} \ \sup_{\substack{x_i \in \mathbb{R}^d \\ Y_i \in \mathbb{R}^{d \times d}}} \ \sum_{i=1}^N F_i(\underbrace{w_i, D_i}_{\text{convex}}, \underbrace{x_i, Y_i}_{\text{concave}})

with convex-concave local functions, defined on \mathbb{R}^d \times \{D \succeq c I_d\} \times \mathbb{R}^d \times \mathbb{R}^{d \times d},

F_i(w, D, x, Y) := \underbrace{f_i(w)}_{\text{convex}}
+ \gamma\, \mathrm{trace}\big(D\,(\underbrace{-x x^\top - N\, Y Y^\top}_{\text{quadratic concave part because } D \succeq 0})\big)
\underbrace{\; -\, 2\gamma\, w^\top x - \tfrac{2\gamma}{N}\,\mathrm{trace}(Y) + \tfrac{1}{N}\,\mathrm{trace}(D)}_{\text{linear in each variable}}
See “Distributed optimization for multi-task learning via nuclear-norm approximation,” NecSys15, D. Mateos-Núñez, J. Cortés
Distributed saddle-point dynamics for nuclear optimization
w_i(k+1) = \mathcal{P}_{\mathcal{W}}\big[\, w_i(k) - \eta_k\big(g_i(k) - 2\gamma\, x_i(k)\big) \,\big]

D_i(k+1) = \mathcal{P}_{\{D \succeq c I_d\}}\Big[\, D_i(k) - \eta_k \gamma\big(-x_i x_i^\top - N\, Y_i Y_i^\top + \tfrac{1}{N} I_d\big)
  + \sigma \sum_{j=1}^N a_{ij,t}\big(D_j(k) - D_i(k)\big) \,\Big]
\qquad \text{“Only” communication, size } d \times d

x_i(k+1) = x_i(k) + \eta_k \gamma\big(-2 D_i(k)\, x_i(k) - 2 w_i(k)\big)

Y_i(k+1) = Y_i(k) + \eta_k \gamma\big(-\tfrac{2}{N}\, D_i(k)\, Y_i(k) - \tfrac{2}{N}\, I_d\big)
Convergence is a corollary of the previous theorem
User i does not need to share w_i with its neighbors!!
D_i \to \sum_{i=1}^N w_i w_i^\top + I_d conveys only mixed information
Complexity per iteration: orthogonal projection onto \{D \succeq c I_d\}
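Since the per-iteration bottleneck is precisely that projection, here is a minimal NumPy sketch of it (eigenvalue clipping; the function name is ours):

```python
import numpy as np

def project_onto_shifted_psd_cone(D, c):
    """Frobenius-norm projection of a square matrix onto {X symmetric : X >= c*I}."""
    D_sym = (D + D.T) / 2                      # symmetrize first
    eigvals, eigvecs = np.linalg.eigh(D_sym)
    eigvals = np.maximum(eigvals, c)           # clip the spectrum at c
    return (eigvecs * eigvals) @ eigvecs.T     # reassemble V diag(clipped) V^T
```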
Simulation of matrix completion
20 users × 8 movies. Each user rates 5 movies. Ratings are private
[Figure: three panels versus iteration k, comparing the distributed saddle-point dynamics against centralized subgradient descent]
Matrix fitting error: \|W(k) - Z\|_F / \|Z\|_F
Network cost function: \sum_{i=1}^N \sum_{j \in \Upsilon_i} (W_{ij}(k) - Z_{ij})^2 + \gamma \big\|[\, W(k) \mid \sqrt{}\, I_d \,]\big\|_\ast
Disagreement of local matrices: \big( \sum_{i=1}^N \| D_i(k) - \tfrac{1}{N}\sum_{i=1}^N D_i(k) \|_F^2 \big)^{1/2}
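For reference, the disagreement measure in the last panel could be computed as in this sketch (synthetic local matrices, hypothetical names):

```python
import numpy as np

d, N = 8, 20
D_local = [np.eye(d) + 0.01 * np.random.randn(d, d) for _ in range(N)]  # agents' D_i(k)

D_mean = sum(D_local) / N
disagreement = np.sqrt(sum(np.linalg.norm(Di - D_mean, 'fro') ** 2 for Di in D_local))
```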
Conclusions
More details in
arXiv “Distributed saddle-point subgradient algorithms with
Laplacian averaging,” submitted to Transactions on Automatic Control
Our algorithms particularize to deal with
Saddle-points of Lagrangians for distributed constrained optimization
Less studied type of constraints/information structure in the literature
Constraints couple decisions of agents that can’t communicate directly
Min-max distributed formulations of nuclear norm
“Distributed optimization for multi-task learning via
nuclear-norm approximation,” D. Mateos-Núñez, J. Cortés
First multi-agent treatment of nuclear norm regularization
Future directions
Bounds on Lagrange multipliers in a distributed way
Necessary to guarantee boundedness of the dynamics’ trajectories
One such procedure in arXiv version
Application to semidefinite constraints with chordal sparsity
agents update the entries corresponding to maximal cliques subject to
agreement on the intersections
Other applications that you can find...
IEEE Spectrum: Japan's orbital solar farm project
Thank you for listening!
(Back slide) Outline of the proof
Inequality techniques from A. Nedić and A. Ozdaglar, 2009
Saddle-point evaluation error
t\,\phi(w^{\mathrm{av}}_{t+1}, D^{\mathrm{av}}_{t+1}, \mu^{\mathrm{av}}_{t+1}, z^{\mathrm{av}}_{t+1}) - t\,\phi(w^\ast, D^\ast, \mu^\ast, z^\ast) \qquad (1)

at running-time averages, w^{\mathrm{av}}_{t+1} := \frac{1}{t}\sum_{s=1}^{t} w_s, etc.
Bound for (1) in terms of
initial conditions
bound on subgradients and states of the dynamics
disagreement
sum of learning rates
Input-to-state stability with respect to agreement
\|\mathbf{L}_{K} D_t\|_2 \ \le\ C_I \|D_1\|_2 \Big(1 - \tfrac{\tilde{\delta}}{4N^2}\Big)^{\frac{t-1}{B}} + C_U \max_{1 \le s \le t-1} \|d_s\|_2
(subgradients as disturbances)
Doubling Trick scheme: for m = 0, 1, 2, \dots, \lceil \log_2 t \rceil, take \eta_s = \tfrac{1}{\sqrt{2^m}} in each period of 2^m rounds, s = 2^m, \dots, 2^{m+1} - 1
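A minimal sketch of that stepsize schedule (the helper name is ours):

```python
import math

def doubling_trick_stepsizes(T):
    """eta_s = 1/sqrt(2^m) during the 2^m rounds s = 2^m, ..., 2^(m+1) - 1."""
    etas, m = [], 0
    while len(etas) < T:
        etas += [1.0 / math.sqrt(2 ** m)] * (2 ** m)   # one period of 2^m rounds
        m += 1
    return etas[:T]                                    # stepsizes for rounds 1..T
```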