1
ADMM
Jay Chang
Dec. 17, 2018
Distributed Stochastic ADMM for
Matrix Factorization
2
Z.-Q. Yu, X.-J. Shi, L. Yan, and W.-J. Li. "Distributed stochastic ADMM for matrix factorization." In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), pages 1259-1268, 2014.
3
Matrix Factorization
4
ALS-based Parallel MF Models
• ALS (Alternating Least Squares) allows the columns of both U and V to be updated independently according to the following equations:
Algorithm 1: Parallel ALS
Input: R ∈ ℝ^{m×n}, k, λ, T
Initialization: U ← randinit(m, k), V ← randinit(k, n)   ⊳ initial factors
for iter = 1 to T do
    parfor i = 1 to m do   ⊳ executed in parallel
        update U_{i∗}
    end parfor
    parfor j = 1 to n do   ⊳ executed in parallel
        update V_{∗j}
    end parfor
end for
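To make the per-row/per-column independence concrete, here is a minimal NumPy sketch of one ALS sweep using the ridge-regression updates derived on slide 6; all names (als_sweep, R, U, V, lam) are illustrative assumptions, and missing ratings are assumed to be stored as zeros.

```python
import numpy as np

def als_sweep(R, U, V, lam):
    """One ALS iteration: each row of U (and each column of V) solves an
    independent ridge-regression problem, so both loops can run in parallel."""
    m, n = R.shape
    k = U.shape[1]
    for i in range(m):                      # parallelizable over users
        idx = np.nonzero(R[i, :])[0]        # items rated by user i
        Vi = V[:, idx]                      # k x |idx|
        A = Vi @ Vi.T + lam * np.eye(k)
        b = Vi @ R[i, idx]
        U[i, :] = np.linalg.solve(A, b)
    for j in range(n):                      # parallelizable over items
        idx = np.nonzero(R[:, j])[0]        # users who rated item j
        Uj = U[idx, :]                      # |idx| x k
        A = Uj.T @ Uj + lam * np.eye(k)
        b = Uj.T @ R[idx, j]
        V[:, j] = np.linalg.solve(A, b)
    return U, V
```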
5
ALS-based Improvement
• CCD (Cyclic Coordinate Descent): Instead of optimizing the whole vector $U_{*i}$ or $V_{*j}$ at one time, CCD adopts coordinate descent to optimize each element of $U_{*i}$ or $V_{*j}$ separately, in order to avoid the matrix inverse.
• CCD++: Further improves CCD's performance by changing the updating sequence in CCD. Writing $U^T V = \sum_{d=1}^{k} U_{d*}^T V_{d*}$, CCD++ updates one element in $U_{d*}$ or $V_{d*}$ each time by using coordinate descent; a one-element update is sketched after the references below.
CCD: I. Pilászy, D. Zibriczky, and D. Tikk. "Fast ALS-based matrix factorization for explicit and implicit feedback datasets." In RecSys, pages 71-78, 2010.
CCD++: H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. “Scalable coordinate descent approaches to parallel matrix factorization for recommender systems.”
In ICDM, pages 765-774, 2012.
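As a rough illustration of the single-element coordinate update (not the authors' code, and with illustrative names), the one-variable subproblem has the closed-form scalar solution sketched below; the update for V follows by symmetry.

```python
import numpy as np

def ccd_update_user(i, d, R, U, V, lam):
    """Single-element CCD update of U[i, d] (U is m x k, V is k x n).
    Missing ratings are assumed to be stored as zeros in R."""
    idx = np.nonzero(R[i, :])[0]                 # observed ratings of user i
    res = R[i, idx] - U[i, :] @ V[:, idx]        # current residuals
    res += U[i, d] * V[d, idx]                   # remove only the d-th term
    num = res @ V[d, idx]
    den = lam + V[d, idx] @ V[d, idx]
    U[i, d] = num / den                          # closed-form scalar update
    return U
```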
6
Some Derivations for ALS
With $R \approx X Y^T$, where $R \in \mathbb{R}^{m \times n}$, $X \in \mathbb{R}^{m \times k}$, $Y \in \mathbb{R}^{n \times k}$:

$Loss(X, Y) = \sum_{u,i} \left( r_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$

$\frac{\partial Loss(X, Y)}{\partial x_u} = 0 \;\Rightarrow\; x_u = \left( \sum_i y_i y_i^T + \lambda I \right)^{-1} \sum_i r_{ui}\, y_i$

$\frac{\partial Loss(X, Y)}{\partial y_i} = 0 \;\Rightarrow\; y_i = \left( \sum_u x_u x_u^T + \lambda I \right)^{-1} \sum_u r_{ui}\, x_u$

(the sums run over the observed ratings).
• SGD (Stochastic Gradient Descent) randomly selects one rating index $(i, j)$ from $\Omega$ each time, then updates the corresponding variables, where $\epsilon_{ij} = R_{ij} - U_{*i}^T V_{*j}$ and $\eta$ is the learning rate.
• Conflicts exist between two nodes when their randomly selected ratings share either the same user index or the same item index.
• Hogwild!
• DSGD (Distributed SGD)
• FPSGD (Fast Parallel SGD)
7
SGD-based Parallel MF Models
Hogwild!: F. Niu, B. Recht, C. Re, and S. J. Wright. "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent." In NIPS, pages
693-701, 2011.
DSGD: R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. "Large-scale matrix factorization with distributed stochastic gradient descent." In KDD,
pages 69-77, 2011.
FPSGD: Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. "A fast parallel sgd for matrix factorization in shared memory systems." In RecSys, pages
249-256, 2013.
8
Pseudo Code
Input: the number of latent factors D, the learning rate η, regularization parameters λ_U, λ_V, the maximum number of iterations, and the rating matrix R
Initialization: initialize a random matrix for the user matrix U and the item matrix V
for t = 1, 2, ... do
    for each (i, j, R_ij) in R do
        make prediction: pr = U_i^T V_j
        error: e_ij = R_ij − pr
        update U_i and V_j:
            U_i ← (1 − ηλ_U) U_i + η e_ij V_j
            V_j ← (1 − ηλ_V) V_j + η e_ij U_i
    end for
end for
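A minimal NumPy sketch of one SGD epoch implementing the update rule above; `ratings`, `U`, `V`, and the hyper-parameter defaults are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sgd_epoch(ratings, U, V, eta=0.01, lam_u=0.1, lam_v=0.1):
    """One SGD epoch over observed ratings.

    `ratings` is an iterable of (i, j, r_ij) triples; U is m x D, V is n x D."""
    for i, j, r in ratings:
        e = r - U[i] @ V[j]                            # prediction error
        Ui_old = U[i].copy()                           # keep old U_i for the V update
        U[i] = (1 - eta * lam_u) * U[i] + eta * e * V[j]
        V[j] = (1 - eta * lam_v) * V[j] + eta * e * Ui_old
    return U, V
```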
9
Some Derivations for SGD
10
ADMM (Alternating Direction Methods of Multipliers)
• ADMM is used to solve constrained problems of the following form:
• First, form the augmented Lagrangian:
• The ADMM solution is obtained by repeating the following three steps (sketched below):
• If f(x) or g(z) is separable, the corresponding ADMM steps can be carried out in parallel.
• So ADMM can be used to design distributed learning algorithms for large-scale problems.
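The problem form, the augmented Lagrangian, and the three update steps referenced above appear as images in the original slides; in the standard notation of Boyd et al. (cited on slides 15-16) they are:

$$\min_{x, z}\; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c$$

$$L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$$

$$x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k), \qquad z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k), \qquad y^{k+1} = y^k + \rho\,(Ax^{k+1} + Bz^{k+1} - c)$$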
11
Data Split Strategy
P nodes; p ∈ {1, …, P} denotes the computer node id.
[Figure: the m×n rating matrix and the m×k user latent matrix are split by rows across the nodes into (m/p)×n and (m/p)×k blocks; each node holds a local k×n item latent matrix, and there is one global k×n item latent matrix.]
12
LocalMFSplit
parallel
• Based on split strategy LocalMFSplit, the MF problem can be reformulated as follows:
• Define objective function by using augmented Lagrangian method:
where
13
Distributed ADMM
where $\{\Omega_p\}_{p=1}^{P}$ denotes the $(i, j)$ indices of the ratings located in node $p$.
(Analogous to consensus ADMM, but here the objective function $L_p$ is non-convex, whereas the standard consensus formulation assumes convex terms.)
14
Distributed ADMM
• We can get:
• ADMM obtains the solutions by repeating the following three steps:
• The solution for the global V-update has a closed form, which can be calculated efficiently.
• The problem lies in getting the local U_p, V_p updates efficiently.
// Can we decouple U, V, and Θ so that the updates can be done in parallel?
// parallel: locally updated on each node
scatter U_p^t, V_p^t, Θ_p^t, V^t; update U_p^{t+1}, V_p^{t+1}   (in each node, in parallel)
gather U_p^{t+1}, V_p^{t+1}; update V^{t+1}   (global)
scatter V^{t+1}; update Θ_p^{t+1}   (in each node, in parallel)
The problem becomes how to construct a surrogate objective function that is convex, so that we can use the three ADMM update steps (the original objective is non-convex).
15
P.S. Consensus optimization via ADMM
S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. “Distributed optimization and statistical learning via the alternating direction method of multipliers.”
Foundations and Trends in Machine Learning, pages 1-122, 2011.
16
P.S. Consensus optimization via ADMM
S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. “Distributed optimization and statistical learning via the alternating direction method of multipliers.”
Foundations and Trends in Machine Learning, pages 1-122, 2011.
With regularization $g$, the averaging in the $z$-update is followed by a proximal step $\mathrm{prox}_{g,\rho}$.
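For reference, the consensus-ADMM formulas on these two slides are images; the standard global-consensus updates from Boyd et al. take the form

$$x_i^{k+1} = \arg\min_{x_i} \Big( f_i(x_i) + \frac{\rho}{2}\big\|x_i - z^k + u_i^k\big\|_2^2 \Big), \qquad
z^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\big(x_i^{k+1} + u_i^k\big), \qquad
u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1},$$

and with a regularizer $g$ on the consensus variable, the averaging step becomes $z^{k+1} = \mathrm{prox}_{g,\,N\rho}\big(\bar{x}^{k+1} + \bar{u}^{k}\big)$, i.e. the average is followed by a proximal step.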
• Batch learning: construct a surrogate objective function G_p, which is convex and decouples U and V from each other, but is still not very efficient.
• To improve efficiency, stochastic learning is proposed for DS-ADMM.
17
Stochastic Learning for Distributed ADMM
// The surrogate is convex.
// Batch learning must access all ratings related to U_{*i} to update each U_{*i}, which is slow.
18
DS-ADMM algorithm
19
Scheduler Comparison
20
Why does the x-axis use time rather than the number of iterations?
21
A Flexible and Efficient Algorithmic
Framework for
Constrained Matrix and Tensor
Factorization
Kejun Huang, Nicholas D Sidiropoulos, and Athanasios P Liavas. “A flexible and efficient algorithmic framework for constrained matrix and tensor
factorization." IEEE Transactions on Signal Processing, 64(19):5052–5065, 2016.
• Optimization problem
• Optimization problem is not convex in both W and H, but is convex in one if we fix the other,
thus alternating optimization (AO) is usually employed.
• ADMM for regularized least-squares
• The derivation gives the following iterates:
22
Constrained Matrix Factorization
where $r_W(\cdot)$ and $r_H(\cdot)$ are the latent-factor regularizations, and $Y \in \mathbb{R}^{m \times n}$, $W \in \mathbb{R}^{m \times k}$, $H \in \mathbb{R}^{n \times k}$, $\tilde{H} \in \mathbb{R}^{k \times n}$.
Update $H$ while fixing $W$ using ADMM:
cache $W^T Y$; take the Cholesky decomposition $W^T W + \rho I = L L^T$;
update $\tilde{H}$ by forward and backward substitution;
complexity $O(k^2 n)$.
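A rough NumPy/SciPy sketch of the cached-Cholesky ADMM inner loop for the H-subproblem described above; the splitting H = H̃ᵀ, the `prox` callback, and all names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def admm_update_H(Y, W, H, U, rho, prox, n_iters=10):
    """ADMM inner loop for  min_H 0.5*||Y - W H^T||_F^2 + r(H)
    via the split H = H_tilde^T; `prox(X)` applies prox_{r/rho}."""
    G = W.T @ W + rho * np.eye(W.shape[1])            # k x k, cached once
    L = cho_factor(G)                                 # Cholesky factor, cached once
    F = W.T @ Y                                       # k x n, cached once
    for _ in range(n_iters):
        H_tilde = cho_solve(L, F + rho * (H + U).T)   # forward/backward substitution
        H = prox(H_tilde.T - U)                       # proximal step on H
        U = U + H - H_tilde.T                         # scaled dual update
    return H, U
```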
23
Proximity Operator
• The derived iterates show that the H-update is the proximity operator of the function $(1/\rho)\, r(\cdot)$ evaluated at the point $\tilde{H}^T - U$.
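For reference (the iterates themselves are images on this slide), the standard definition of the proximity operator is $\mathrm{prox}_{f}(v) = \arg\min_x \big( f(x) + \tfrac{1}{2}\|x - v\|_2^2 \big)$, so the H-update reads $H \leftarrow \mathrm{prox}_{(1/\rho) r}\big(\tilde{H}^T - U\big)$.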
24
Most Commonly Used Constraints/Regularizations
• Non-negativity. $r(\cdot)$ is the indicator function of $\mathbb{R}_+$; the update is the projection onto the non-negative orthant.
• Lasso regularization. For $r(H) = \lambda \|H\|_1$ the update is the well-known soft-thresholding operator.
• Simplex constraint. In some probabilistic models we need to constrain the columns or rows to be elementwise non-negative and sum up to one.
• Smoothness regularization. We can encourage the columns of $H$ to be smooth by adding the regularization $r(H) = (\lambda/2)\|T H\|_F^2$.
• Projections onto non-convex constraints, e.g. cardinality constraints, can be handled by hard-thresholding, but then ADMM is not guaranteed to converge to the optimal solution.
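Minimal NumPy sketches of some of the proximal/projection steps listed above (illustrative code, not from the paper); each maps the ADMM argument to the corresponding constrained or regularized update.

```python
import numpy as np

def prox_nonneg(X):
    """Projection onto the non-negative orthant (non-negativity constraint)."""
    return np.maximum(X, 0.0)

def prox_soft_threshold(X, tau):
    """Soft-thresholding: prox of tau*||X||_1 (Lasso regularization)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def prox_hard_threshold(X, s):
    """Keep the s largest-magnitude entries per column (cardinality constraint);
    a non-convex projection, so ADMM convergence is not guaranteed."""
    Z = np.zeros_like(X)
    idx = np.argsort(-np.abs(X), axis=0)[:s, :]   # top-s row indices per column
    cols = np.arange(X.shape[1])
    Z[idx, cols] = X[idx, cols]
    return Z
```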
25
ADMM Algorithm
Define relative primal residual
and relative dual residual where H0 is H from the previous ADMM iteration.
26
AO-ADMM Algorithm
• We can initialize the current ADMM update using the previous results W and H.
• ADMM is used as a subroutine inside alternating optimization (AO), hence the name AO-ADMM.
27
General Loss Solution for ADMM
where $U$ is the scaled dual variable corresponding to the constraint $H = \tilde{H}^T$, and $V$ is the scaled dual variable corresponding to the equality constraint $\tilde{Y} = W \tilde{H}$.
28
Most Commonly Used Non-Least-Squares Loss Functions
• Missing values. In the case that only a subset of the entries in Y are available.
• Robust fitting. In the case that data entries are not uniformly corrupted by noise but only sparingly
corrupted by outliers, or when the noise is dense but heavy-tailed (e.g., Laplacian-distributed), we
can use the L1 norm as the loss function for robust fitting.
• Huber fitting. Another way to deal with possible outliers in Y is to use the Huber function to
measure the loss.
• Kullback-Leibler divergence. A commonly adopted loss function for non-negative integer data is
the Kullback-Leibler (K-L) divergence defined as
29
…(12)
General Loss ADMM Algorithm
Define relative primal residual
and relative dual residual where H0 is H from the previous ADMM iteration.
30
Extension to Tensor Factorization
$$A \leftarrow \arg\min_{A} \big\| Y_{(1)}^T - (C \odot B)\, A^T \big\|_F^2$$

where $\odot$ is the Khatri-Rao product, $W = C \odot B$, $Y_{(1)}$ is the mode-1 matricization of the tensor $\mathcal{Y}$, $*$ is the element-wise (Hadamard) product, $W^T W = C^T C * B^T B$, and $W^T Y_{(1)}^T = \big( Y_{(1)} (C \odot B) \big)^T$.
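A small sketch of the resulting least-squares update for A using SciPy's Khatri-Rao product; the exact column ordering of the mode-1 unfolding depends on the matricization convention, so treat this as illustrative rather than definitive.

```python
import numpy as np
from scipy.linalg import khatri_rao

def cp_update_A(Y1, B, C):
    """Least-squares update of factor A in a 3-way CP model from the mode-1
    normal equations above; Y1 is the mode-1 matricization of the tensor."""
    W = khatri_rao(C, B)                      # Khatri-Rao product, (K*J) x R
    G = (C.T @ C) * (B.T @ B)                 # R x R Gram matrix via Hadamard product
    return np.linalg.solve(G, W.T @ Y1.T).T   # A has shape I x R
```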
31
Tensor AO-ADMM Algorithm
…(1)
• The outer AO framework naturally provides a good initial point to the inner ADMM iteration; this is called the warm-start strategy.
LS loss function
General loss function
Low-Rank Regularized
Heterogeneous Tensor
Decomposition (LRRHTD) for
Subspace Clustering
32
J. Zhang, X. Li, P. Jing, J. Liu, Y. Su, "Low-rank regularized heterogeneous tensor decomposition for subspace clustering", IEEE Signal Process. Lett.,
vol. 25, no. 3, pp. 333-337, Mar. 2018.
33
Tucker Decomposition
tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ with core tensor $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ and factor matrices $A \in \mathbb{R}^{I \times P}$, $B \in \mathbb{R}^{J \times Q}$, $C \in \mathbb{R}^{K \times R}$
The factor matrices A, B, and C (which are usually orthogonal) are often referred to as the principal components in the respective tensor modes.
This results in a compression of $\mathcal{X}$, with $\mathcal{G}$ being the compressed version of $\mathcal{X}$.
$\otimes$ denotes the Kronecker product.
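The Tucker model itself appears only as a figure on this slide; in standard notation (a hedged reconstruction, consistent with the Kronecker-product remark above) it reads:

$$\mathcal{X} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C, \qquad X_{(1)} \approx A\, G_{(1)}\, (C \otimes B)^T .$$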
34
Tucker Decomposition
• Tucker model can be generalized to N-way tensors
• The concept of n-rank (denoted by $\mathrm{rank}_n(\mathcal{X})$) corresponds to the column rank of the n-th unfolding of the tensor $\mathcal{X}$.
• According to the type of constraints Tucker decomposition approaches can be roughly grouped
into three categories:
• orthogonal tensor decomposition
• non-negative tensor decomposition
• sparse tensor decomposition
• Almost all of the above algorithms decompose tensors based on the isotropy hypothesis (i.e.
orthogonal, non-negative...), meaning that the factor matrices are learned in an equivalent way
for all modes.
• Not suitable for heterogeneous tensor data.
$\mathrm{rank}_n(\mathcal{X}) = \mathrm{rank}\big(X_{(n)}\big)$
• For all but the last mode, LRRHTD seeks a set of orthogonal projection matrices to map the
original tensor data into a low-dimensional common subspace.
• But for the last mode, a low-rank projection matrix is learned by imposing a nuclear-norm so that
a lowest rank representation that reveals the global structure of samples is obtained for
performing clustering.
• M-th order tensors, N is the total number of samples:
• We concatenate the N tensors to yield an (M+1)-th order tensor
• The goal of LRRHTD is to find M orthogonal factor matrices for intrinsic
low-dimensional representation and the lowest rank representation using the mapped
low-dimensional tensor as a dictionary, and D < N.
35
Low-Rank Regularized Heterogeneous Tensor
Decomposition (LRRHTD)
• Tucker decomposition of the concatenated tensor $\mathcal{X}$ can be estimated in a general form as follows:
where is the core tensor and is the approximation error tensor.
• Cost function:
and
36
Low-Rank Regularized Heterogeneous Tensor
Decomposition (LRRHTD)
37
Low-Rank Regularized Heterogeneous Tensor
Decomposition (LRRHTD)
38
Thank you for your attention