PCA on graph/network
On PCA with network/graph constraints
Daisuke Yoneoka
October 15, 2018
1 Introduction and notations
We are interested in principal component analysis (PCA) with network constraints between features and/or samples. Given a data matrix X ∈ R^{p×n}, classical PCA can be formulated as learning the projection V ∈ R^{n×d} (principal components) of X onto a d-dimensional linear space spanned by an orthonormal basis U ∈ R^{p×d} (principal directions); the product L = UV^T ∈ R^{p×n} can then be viewed as a low-rank approximation of X. PCA finds the optimal U and V by minimizing

\min_{U,V} \|X - UV^T\|_F^2 \quad \text{s.t. } U^T U = I.  (1)
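For reference, the minimizer of (1) is given by the truncated SVD of X; below is a minimal numpy sketch (the function name and interface are my own, for illustration only):

```python
import numpy as np

def classical_pca(X, d):
    """Rank-d minimizer of ||X - U V^T||_F^2 with U^T U = I, for X of shape (p, n)."""
    P, S, Qt = np.linalg.svd(X, full_matrices=False)  # X = P diag(S) Q^T
    U = P[:, :d]                      # orthonormal basis, p x d
    V = (np.diag(S[:d]) @ Qt[:d]).T   # projections, n x d (V = X^T U)
    return U, V
```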
PCA provides an embedding of the data on a linear manifold. In contrast, Laplacian embedding provides an embedding of the data on a non-linear manifold and thus can preserve geometrical relationships [1]. Assume that there is a graph G = (S, E) with vertex set S and pairwise edges E. From the graph we can compute the adjacency matrix W, which encodes the weights of the edges E. The graph Laplacian matrix Ψ and the normalized graph Laplacian matrix Ψ_norm are then defined as Ψ = D − W and Ψ_norm = D^{-1/2}(D − W)D^{-1/2} = I − D^{-1/2} W D^{-1/2}, respectively, where D is the degree matrix D = diag(d_{ii}) with d_{ii} = Σ_j W_{ij}. Laplacian embedding finds Q ∈ R^{n×d}, the embedded coordinates of the n data points, by minimizing

\min_{Q} \sum_{i,j=1}^{n} \|q_i - q_j\|^2 W_{ij} = \operatorname{tr}\{Q^T (D - W) Q\} = \operatorname{tr}\{Q^T \Psi Q\} \quad \text{s.t. } Q^T Q = I,  (2)

where W ∈ R^{n×n} is the adjacency matrix of a graph with n nodes.
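As a concrete illustration, the solution of (2) under Q^T Q = I is given by the eigenvectors of Ψ with the smallest eigenvalues; a minimal numpy sketch follows (in practice the trivial constant eigenvector is usually discarded):

```python
import numpy as np

def laplacian_embedding(W, d):
    """d-dimensional Laplacian embedding from an adjacency matrix W (n x n)."""
    D = np.diag(W.sum(axis=1))
    Psi = D - W                          # unnormalized graph Laplacian
    evals, evecs = np.linalg.eigh(Psi)   # eigenvalues in ascending order
    return evecs[:, :d]                  # Q: embedded coordinates of the n nodes
```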
Throughout this review, ‖·‖_F, ‖·‖_*, ‖·‖_1, and ‖·‖_2 denote the Frobenius, nuclear, l_1, and l_2 norms, respectively.
2 Graph-Laplacian PCA: Closed-Form Solution and Robustness [5]
This is the first article to incorporate both the data X and a graph G into PCA. The authors propose graph-Laplacian PCA (gLPCA) and a robust version of gLPCA (RgLPCA). They focus on the adjacency matrix W ∈ R^{n×n} (or, equivalently, Ψ) containing the edge weights of a graph with n nodes.
2.1 Objective function for gLPCA
\min_{U,V} \|X - UV^T\|_F^2 + \alpha \operatorname{tr}(V^T \Psi V) \quad \text{s.t. } V^T V = I,  (3)

where α ≥ 0 is a parameter balancing the two terms.
2.2 Optimization for gLPCA
gLPCA has a closed-form solution:

Theorem 1. The optimal solution (U, V) of gLPCA is given by

V^* = (v_1, \dots, v_d),  (4)
U^* = X V^*,  (5)

where v_1, \dots, v_d are the eigenvectors corresponding to the d smallest eigenvalues of the matrix

G_\alpha = -X^T X + \alpha \Psi.  (6)
2.3 Objective function for RgLPCA
\min_{U,V} \|X - UV^T\|_{2,1} + \alpha \operatorname{tr}(V^T \Psi V) \quad \text{s.t. } V^T V = I,  (7)

where \|A\|_{2,1} = \sum_{j=1}^{n} \sqrt{\sum_{i=1}^{p} A_{ij}^2} [2].
2.4 Optimization for RgLPCA
RgLPCA is solved with an iterative algorithm based on the Augmented Lagrange Multiplier (ALM) method. ALM rewrites (7) as

\min_{U,V,E} \|E\|_{2,1} + \operatorname{tr}\{C^T (E - X + UV^T)\} + \frac{\mu}{2} \|E - X + UV^T\|_F^2 + \alpha \operatorname{tr}\{V^T \Psi V\} \quad \text{s.t. } E = X - UV^T,\; V^T V = I,  (8)
where C is the Lagrange multiplier and μ is the penalty parameter.
The algorithm alternates between two parts:
1. Solving the sub-problems. First, solve for U and V while fixing E. After some calculation, (8) becomes

\min_{U,V} \frac{\mu}{2} \left\| E - X + UV^T + \frac{C}{\mu} \right\|_F^2 + \alpha \operatorname{tr}\{V^T \Psi V\} \qquad (\text{using } \|A\|_F^2 = \operatorname{tr}(AA^T)),  (9)
which can be solved by Theorem 1. Second, solve for E while fixing U and V. (8) becomes

\min_{E} \|E\|_{2,1} + \frac{\mu}{2} \left\| E - X + UV^T + \frac{C}{\mu} \right\|_F^2.  (10)

Let e_i and a_i be the i-th columns of E and X - UV^T - \frac{C}{\mu}, respectively. (10) then decomposes into n independent problems

\min_{e_i} \|e_i\| + \frac{\mu}{2} \|e_i - a_i\|^2,  (11)

whose solution is known to be

e_i = \max\left(1 - \frac{1}{\mu \|a_i\|}, 0\right) a_i.  (12)
2. Updating the parameters. C and μ are updated as follows:

C = C + \mu (E - X + UV^T)  (13)
\mu = \rho \mu,  (14)

where ρ > 1 is a step-size parameter. A minimal sketch of the full ALM loop is given below.
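This sketch reuses the glpca() helper from the previous section; the initialization and hyper-parameters (mu, rho, iteration count) are illustrative choices, not values from the paper:

```python
import numpy as np

def rglpca(X, Psi, alpha, d, mu=1.0, rho=1.2, n_iter=50):
    """ALM sketch for RgLPCA (8); reuses glpca() from the gLPCA section."""
    p, n = X.shape
    E = np.zeros((p, n))
    C = np.zeros((p, n))
    for _ in range(n_iter):
        # (U, V) step: a gLPCA problem on the shifted data X - E - C/mu, cf. (9)
        U, V = glpca(X - E - C / mu, Psi, 2 * alpha / mu, d)
        # E step: column-wise shrinkage (12) applied to A = X - U V^T - C/mu
        A = X - U @ V.T - C / mu
        col_norms = np.maximum(np.linalg.norm(A, axis=0), 1e-12)
        E = A * np.maximum(1 - 1 / (mu * col_norms), 0)
        # multiplier and penalty updates (13)-(14)
        C = C + mu * (E - X + U @ V.T)
        mu = rho * mu
    return U, V, E
```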
2.5 What’s new?
• This is the first article to advocate PCA that accounts for the graph Laplacian.
• In summary, the idea is to add a (manifold/graph) regularization term, tr{V^T Ψ V}, to standard PCA to enforce graph smoothness of the principal components V.
3 Robust Principal Component Analysis on Graphs [7]
This work generalizes the robust PCA framework by incorporating graph-based data similarity. Note that the classical PCA formulation is susceptible to errors and outliers in the data because of its quadratic loss; robust PCA addresses this with a sparse penalty term. The first formulation of robust PCA was proposed by Candès et al. [2].
3.1 Objective function for Robust PCA on Graphs

\min_{L,S} \|L\|_* + \lambda \|S\|_1 + \gamma \operatorname{tr}(L \Psi L^T) \quad \text{s.t. } X = L + S,  (15)

where S ∈ R^{p×n} is the sparse matrix, and λ and γ control the amount of sparsity of S and the smoothness of L on the graph, respectively.
3.2 Optimization for Robust PCA on Graphs
The authors use the Alternating Direction Method of Multipliers (ADMM), a variant of the standard augmented Lagrangian method. Using the augmented Lagrangian, (15) can be rewritten as

\min_{L,S,W} \|L\|_* + \lambda \|S\|_1 + \gamma \operatorname{tr}(W \Psi W^T) + \operatorname{tr}(Z_1^T (X - L - S)) + \frac{r_1}{2} \|X - L - S\|_F^2 + \operatorname{tr}(Z_2^T (W - L)) + \frac{r_2}{2} \|W - L\|_F^2 \quad \text{s.t. } X = L + S,\; L = W,  (16)

where Z_i ∈ R^{p×n} (i = 1, 2) are the Lagrange multipliers. For Z_1 and Z_2, the augmented Lagrangian uses the iteration scheme

Z_1^{k+1} = Z_1^k + r_1 (X - L^{k+1} - S^{k+1})  (17)
Z_2^{k+1} = Z_2^k + r_2 (W^{k+1} - L^{k+1})  (18)
To update L, S, and W, the authors consider the proximity operator of a function f, given by the minimization problem

\operatorname{prox}_f(x) = \operatorname*{argmin}_{y} f(y) + \frac{1}{2} \|x - y\|_2^2,  (19)

where f ∈ F(R^n) and F(R^n) is the class of lower semi-continuous convex functions from R^n to ]-∞, +∞] with dom(f) ≠ ∅. The updates of L, S, and W at the (k+1)-th iteration can then be formulated using prox_f.
Update L. Solve for L while fixing the other terms. That is,

L^{k+1} = \operatorname*{argmin}_{L} \|L\|_* + \operatorname{tr}(Z_1^T (X - L - S)) + \frac{r_1}{2} \|X - L - S\|_F^2 + \operatorname{tr}(Z_2^T (W - L)) + \frac{r_2}{2} \|W - L\|_F^2
= \operatorname*{argmin}_{L} \|L\|_* + \frac{r_1 + r_2}{2} \left\| L - \frac{r_1 H_1^k + r_2 H_2^k}{r_1 + r_2} \right\|_F^2
= \operatorname{prox}_{\frac{2}{r_1 + r_2} \|L\|_*}\left( \frac{r_1 H_1^k + r_2 H_2^k}{r_1 + r_2} \right)
= P \,\Omega_{\frac{1}{r}}(\Sigma)\, Q^T,  (20)

where H_1^k = X - S^k + Z_1^k / r_1, H_2^k = W^k + Z_2^k / r_2, Ω_τ(x) is the element-wise soft-thresholding operator Ω_τ(x) = sgn(x) max(|x| − τ, 0), r = (r_1 + r_2)/2, and \frac{r_1 H_1^k + r_2 H_2^k}{r_1 + r_2} = P \Sigma Q^T is a singular value decomposition.
Update S. In a similar way, solve for S while fixing the other terms. That is,

S^{k+1} = \operatorname{prox}_{\frac{\lambda}{r_1} \|S\|_1}\left( X - L^{k+1} + \frac{Z_1^k}{r_1} \right) = \Omega_{\frac{\lambda}{r_1}}\left( X - L^{k+1} + \frac{Z_1^k}{r_1} \right)  (21)

Update W. For the update of W, (16) is a smooth function of W, so standard methods (such as the conjugate gradient method) can be used; the minimizer has the closed form

W^{k+1} = r_2 \left( L^{k+1} - \frac{Z_2^k}{r_2} \right) (\gamma \Psi + r_2 I)^{-1}.  (22)
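A minimal ADMM sketch of the updates (17)-(22); since Ψ is n × n and acts on the sample dimension, the inverse in the W step is applied on the right. Hyper-parameters and the iteration count are illustrative assumptions:

```python
import numpy as np

def soft_threshold(A, tau):
    """Element-wise soft-thresholding Omega_tau."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0)

def rpca_graph(X, Psi, lam, gamma, r1=1.0, r2=1.0, n_iter=100):
    """ADMM sketch for (15)-(16); X is p x n, Psi is the n x n graph Laplacian."""
    p, n = X.shape
    L = np.zeros((p, n)); S = np.zeros((p, n)); W = np.zeros((p, n))
    Z1 = np.zeros((p, n)); Z2 = np.zeros((p, n))
    r = (r1 + r2) / 2
    for _ in range(n_iter):
        # L update (20): singular-value thresholding of a weighted average
        H1 = X - S + Z1 / r1
        H2 = W + Z2 / r2
        P, sig, Qt = np.linalg.svd((r1 * H1 + r2 * H2) / (r1 + r2), full_matrices=False)
        L = P @ np.diag(soft_threshold(sig, 1 / r)) @ Qt
        # S update (21): element-wise soft-thresholding
        S = soft_threshold(X - L + Z1 / r1, lam / r1)
        # W update (22): closed form of the smooth subproblem
        W = r2 * (L - Z2 / r2) @ np.linalg.inv(gamma * Psi + r2 * np.eye(n))
        # multiplier updates (17)-(18)
        Z1 = Z1 + r1 * (X - L - S)
        Z2 = Z2 + r2 * (W - L)
    return L, S
```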
3.3 What’s new?
• Instead of learning the principal directions U and principal components V separately, it learns their product L = UV^T.
• It assumes smoothness of L instead of smoothness of the principal component matrix V.
• The problem is convex, so a global minimum can be obtained with standard methods such as ADMM.
• As with classical robust PCA, the rank of L does not have to be specified, because the nuclear norm enables automatic selection of an appropriate rank (which, however, depends on the choice of λ and γ).
4 Low-rank matrix approximation with manifold regularization [11]
Manifold Regularized Matrix Factorization (MMF) exploits an orthonormality constraint on the principal directions U to obtain a low-rank matrix L = UV^T. Unlike standard Nonnegative Matrix Factorization (NMF), it relaxes the nonnegativity restriction on U and V but imposes an orthonormality constraint on U.
4.1 Objective function for MMF

\min_{U,V} f(U, V) = \|X - UV^T\|_F^2 + \alpha \operatorname{tr}(V^T \Psi V) \quad \text{s.t. } U^T U = I,  (23)

where U is an orthonormal matrix whose columns form a principal orthogonal basis of the data space, and V can be viewed as a projection of the data into the low-dimensional space.
In addition, extensions of this model with an ensemble of graph and hypergraph regularization terms have been proposed by Tao et al. [10] and Jin et al. [6], which extend (23) as

\min_{U,V,\alpha} f(U, V, \alpha) = \|X - UV^T\|_F^2 + \operatorname{tr}\left(V^T \sum_{k} \alpha_k \Psi_k V\right) + \beta \|\alpha\|^2 \quad \text{s.t. } U^T U = I,\; \mathbf{1}^T \alpha = 1.  (24)
4.2 Optimization for MMF
The iterative algorithm is divided into two parts:
1. Updating V. First, solve for V while fixing U. The optimal solution is obtained by setting the derivative of (23) with respect to V to zero:

\frac{\partial f(U, V)}{\partial V} = 2\left(V - X^T U + \alpha \Psi V\right) = 0,  (25)

which holds if and only if V = (I + \alpha \Psi)^{-1} X^T U. Thus, update V^{k+1} = (I + \alpha \Psi)^{-1} X^T U^{k}.
2. Updating U. Solve for U while fixing V. (23) can be rewritten as

\min_{U} \|X - UV^T\|_F^2 \quad \text{s.t. } U^T U = I
\;\Leftrightarrow\; \min_{U} -2 \operatorname{tr}(U^T X V) \quad \text{s.t. } U^T U = I.  (26)

This is equivalent to maximizing tr(U^T X V), which can be expressed in terms of the singular values of XV. Let XV = G Σ S^T be the SVD of XV, with Σ a diagonal matrix of positive diagonal entries. We have

\operatorname{tr}(U^T X V) = \operatorname{tr}(U^T G \Sigma S^T) = \operatorname{tr}(S^T U^T G \,\Sigma) = \sum_{i=1}^{d} (S^T U^T G)_{ii} \Sigma_{ii}.  (27)

Because S^T U^T G is a product of matrices with orthonormal columns, its diagonal entries are at most 1 in absolute value, so the maximum of (27) is achieved when S^T U^T G = I, i.e. U = G S^T. In summary, update U^{k+1} = G S^T, where XV = G Σ S^T is the SVD of XV. A minimal sketch of the alternating loop is given below.
4.3 What’s new?
• This model has globally optimal solutions that can be computed directly (in the small-sample case) or iteratively (in the large-sample case).
• This model places the orthonormality constraint on the principal directions U instead of on V as in [5].
5 Joint L1/2-Norm Constraint and Graph-Laplacian PCA Method for Feature Extraction [3]
Jiang et al. proposed robust graph-Laplacian PCA (RgLPCA) using the L2,1 norm [5]. However, the main contribution of the L2,1 norm is to introduce sparseness on rows, and its effect is not always pronounced. Instead, Feng et al. [3] replace it with the L1/2 norm (the resulting method is called L1/2 gLPCA).
5.1 Objective function for L1/2 gLPCA

\min_{U,V} \|X - UV^T\|_{1/2}^{1/2} + \alpha \operatorname{tr}(V^T \Psi V) \quad \text{s.t. } V^T V = I,  (28)

where \|A\|_{1/2}^{1/2} = \sum_{j=1}^{n} \sum_{i=1}^{p} |a_{ij}|^{1/2}.
5.2 Optimization for L1/2 gLPCA
As with RgLPCA [5], the Augmented Lagrange Multiplier (ALM) method is used. ALM rewrites (28) as

\min_{U,V,E} \|E\|_{1/2}^{1/2} + \operatorname{tr}\{C^T (E - X + UV^T)\} + \frac{\mu}{2} \|E - X + UV^T\|_F^2 + \alpha \operatorname{tr}\{V^T \Psi V\} \quad \text{s.t. } E = X - UV^T,\; V^T V = I,  (29)
where C is the Lagrange multiplier and μ is the penalty parameter.
The algorithm is divided into three parts:
1. Updating E. First, solve for E while fixing U and V. After some calculation, (29) becomes

\min_{E} \|E\|_{1/2}^{1/2} + \frac{\mu}{2} \left\| E - X + UV^T + \frac{C}{\mu} \right\|_F^2.  (30)
This can be viewed as the proximal operator of the L1/2 norm, so the following lemma can be used (a numerical sketch of this operator is given after step 3 below):

Lemma 1. The proximal operator of the L1/2 norm solves

\operatorname{prox}_{\lambda \|\cdot\|_{1/2}^{1/2}}(A) = \operatorname*{argmin}_{X \in \mathbb{R}^{p \times n}} \|X - A\|_F^2 + \lambda \|X\|_{1/2}^{1/2},

and is given by

U \operatorname{diag}(H_\lambda(\sigma)) V^T,  (31)

where H_\lambda(\sigma) = (h_\lambda(\sigma_1), \dots, h_\lambda(\sigma_n))^T and h_\lambda(\sigma_i)\ (i = 1, \dots, n) is the half-thresholding operator,

h_\lambda(\sigma_i) = \begin{cases} \frac{2}{3} \sigma_i \left(1 + \cos\left(\frac{2\pi}{3} - \frac{2}{3} \varphi_\lambda(\sigma_i)\right)\right), & \text{if } |\sigma_i| > \frac{\sqrt[3]{54}}{4} \lambda^{2/3},\\ 0, & \text{otherwise,} \end{cases}  (32)

where \varphi_\lambda(\sigma_i) = \arccos\left(\frac{\lambda}{8} \left(\frac{|\sigma_i|}{3}\right)^{-3/2}\right).
2. Updating U and V. The same procedure as for gLPCA is used. First, solve for U while fixing the others, which gives

U = \left(X - E - \frac{C}{\mu}\right) V.  (33)

Then solve for V while fixing the others; the update is obtained by solving

\min_{V} \operatorname{tr}\left\{ V^T \left( -\left(X - E - \frac{C}{\mu}\right)^T \left(X - E - \frac{C}{\mu}\right) + \frac{2\alpha}{\mu} \Psi \right) V \right\} \quad \text{s.t. } V^T V = I.  (34)
3. Updating C and μ. C and μ are updated as follows:

C = C + \mu (E - X + UV^T)  (35)
\mu = \rho \mu,  (36)

where ρ > 1 is a step-size parameter.
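A numerical sketch of the half-thresholding operator h_λ in (32), assuming the standard L1/2 half-thresholding constants used above:

```python
import numpy as np

def half_threshold(sigma, lam):
    """Half-thresholding operator h_lambda, applied element-wise to a vector sigma."""
    sigma = np.asarray(sigma, dtype=float)
    out = np.zeros_like(sigma)
    thresh = (54 ** (1 / 3)) / 4 * lam ** (2 / 3)
    mask = np.abs(sigma) > thresh
    phi = np.arccos((lam / 8) * (np.abs(sigma[mask]) / 3) ** (-1.5))
    out[mask] = (2 / 3) * sigma[mask] * (1 + np.cos(2 * np.pi / 3 - 2 * phi / 3))
    return out
```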
5.3 What’s new?
• It extends RgLPCA by using the L1/2 norm instead of the L2,1 norm. The optimization algorithm is similar to that of RgLPCA.
6 Fast Robust PCA on Graphs [8]
This work introduces Fast Robust PCA on Graphs (FRPCAG), which extends Robust PCA on Graphs [7].
6.1 Objective function for FRPCAG

\min_{L,S} \|S\|_1 + \gamma_1 \operatorname{tr}(L \Psi_1 L^T) + \gamma_2 \operatorname{tr}(L^T \Psi_2 L) \quad \text{s.t. } X = L + S,  (37)

where L ∈ R^{p×n} is the low-rank noiseless matrix, S ∈ R^{p×n} is the sparse matrix, and Ψ_1 ∈ R^{n×n} and Ψ_2 ∈ R^{p×p} are the graph Laplacians of the graphs connecting the samples (columns of X) and the features (rows of X), respectively.
6.2 Optimization for FRPCAG
Problem (37) is solved with the Fast Iterative Soft-Thresholding Algorithm (FISTA). Let h : R^N → R be a convex function with proximity operator \operatorname{prox}_{\lambda h}(y) = \operatorname*{argmin}_{x} \frac{1}{2} \|x - y\|_F^2 + \lambda h(x). The goal is to minimize h(L) + g(L), where h(L) = \|X - L\|_1 and g(L) = \gamma_1 \operatorname{tr}(L \Psi_1 L^T) + \gamma_2 \operatorname{tr}(L^T \Psi_2 L), which is done with a proximal splitting method. The proximal operator of h(L) = \|X - L\|_1 is the element-wise l_1 soft-thresholding

\operatorname{prox}_{\lambda h}(L) = X + \operatorname{sgn}(L - X) \circ \max(|L - X| - \lambda, 0),  (38)

where ◦ is the Hadamard (element-wise) product. The gradient of g(L) is \nabla g(L) = 2(\gamma_1 L \Psi_1 + \gamma_2 \Psi_2 L).
With this notation, the update procedure for L is

L_j = \operatorname{prox}_{\lambda_j h}(Y_j - \lambda_j \nabla g(Y_j))
t_{j+1} = \frac{1 + \sqrt{1 + 4 t_j^2}}{2}  (39)
Y_{j+1} = L_j + \frac{t_j - 1}{t_{j+1}} (L_j - L_{j-1}),

where λ_j is the step size at the j-th iteration. The algorithm is stopped when \|Y_{j+1} - Y_j\|_F^2 < \epsilon \|Y_j\|_F^2, where ε is the stopping tolerance.
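A minimal FISTA sketch of (38)-(39) with a fixed step size (in practice the step is set from the Lipschitz constant of ∇g or by backtracking; all constants here are illustrative):

```python
import numpy as np

def frpcag(X, Psi1, Psi2, gamma1, gamma2, step=1e-3, n_iter=200):
    """FISTA sketch for FRPCAG (37); X is p x n, Psi1 is n x n, Psi2 is p x p."""
    def prox_h(L, lam):
        # proximal operator of h(L) = ||X - L||_1, cf. (38)
        return X + np.sign(L - X) * np.maximum(np.abs(L - X) - lam, 0)

    def grad_g(L):
        # gradient of g(L) = gamma1 tr(L Psi1 L^T) + gamma2 tr(L^T Psi2 L)
        return 2 * (gamma1 * L @ Psi1 + gamma2 * Psi2 @ L)

    L_prev = X.copy(); Y = X.copy(); t = 1.0
    for _ in range(n_iter):
        L = prox_h(Y - step * grad_g(Y), step)
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        Y = L + (t - 1) / t_next * (L - L_prev)
        L_prev, t = L, t_next
    return L, X - L          # low-rank part L and sparse part S = X - L
```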
6.3 What’s new?
• It directly targets the recovery of the low-rank matrix L.
• It uses graph smoothness assumptions both between the samples and between the features.
• The problem is convex but non-smooth; nevertheless, it can be solved efficiently (in time linear in the number of samples) with the FISTA algorithm.
• It is parallelizable and scalable to large datasets because it requires only multiplications of sparse matrices and element-wise soft-thresholding operations.
7 Nonlinear dimensionality reduction on graphs [9]
This work introduces an approach to graph-adaptive nonlinear dimensionality reduction that can be viewed as an extension of kernel PCA with graph regularization. The authors call it graph-adaptive nonlinear kernel PCA (GRAD KPCA).
7.1 Objective function for GRAD KPCA

\min_{B} \|K_x - B^T B\|_F^2 + \gamma \operatorname{tr}(B \Psi B^T) \quad \text{s.t. } B B^T = I_d,  (40)

where K_x = X^T X ∈ R^{n×n} is the Gram matrix and B = U^T X ∈ R^{d×n}.
In addition, instead of using Ψ directly, one can use a family of graph kernels r^*(\Psi) = U_G r^*(\Lambda) U_G^T, where r^*(\cdot) is a non-decreasing scalar function of the eigenvalues of Ψ and U_G contains the eigenvectors of Ψ. Using r^*(\Psi) as the kernel matrix, (40) can be rewritten as

\min_{B} \|K_x - B^T B\|_F^2 + \gamma \operatorname{tr}(B\, r^*(\Psi)\, B^T) \quad \text{s.t. } B B^T = I_d.  (41)
7.2 Optimization for GRAD KPCA
Since kernel PCA can be written as a trace minimization problem, (40) reduces to

\min_{B} -\operatorname{tr}(B K_x B^T) + \gamma \operatorname{tr}(B \Psi B^T) \quad \text{s.t. } B B^T = I_d
\;\Leftrightarrow\; \min_{B} -\operatorname{tr}\left(B (K_x - \gamma \Psi) B^T\right) \quad \text{s.t. } B B^T = I_d.  (42)

The optimal B in (42) is given by the d largest eigenvalues and corresponding eigenvectors of K_x - \gamma \Psi = \bar{V} \Lambda \bar{V}^T. A closed-form solution is therefore B = \bar{V}_d^T, where \bar{V}_d is the sub-matrix of \bar{V} formed by the columns corresponding to the d largest eigenvalues.
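A minimal sketch of the closed-form solution of (42) with the linear kernel K_x = X^T X (a nonlinear kernel matrix could be substituted for K_x):

```python
import numpy as np

def grad_kpca(X, Psi, gamma, d):
    """Closed-form solution of (42): X is p x n, Psi is the n x n graph Laplacian."""
    Kx = X.T @ X                              # Gram (kernel) matrix, n x n
    evals, evecs = np.linalg.eigh(Kx - gamma * Psi)
    Vd = evecs[:, -d:]                        # eigenvectors of the d largest eigenvalues
    return Vd.T                               # B = V_d^T, d x n, with B B^T = I_d
```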
7.3 What’s new?
• It extends kernel PCA with graph regularization.
8 Locality Preserving Projections [4]
Locality Preserving Projections (LPP) is not a PCA-based method, but it uses a similar idea and provides useful insight for interpreting PCA. Note that Section 3.4 of the paper (Geometrical Justification) gives a good explanation of why the Laplacian matrix should be used: it can be viewed as an analogue of the Laplace-Beltrami operator on compact Riemannian manifolds [1].
8.1 Objective function for LPP

\min_{y} \sum_{i,j} (y_i - y_j)^2 W_{ij} \quad \text{s.t. } y^T D y = 1,  (43)

where y = (y_1, \dots, y_n)^T is the map of the graph G onto a line such that connected points stay as close together as possible, and D is the degree matrix D = diag(d_{ii}) with d_{ii} = \sum_j W_{ij}. Let a be a transformation vector with y^T = a^T X, i.e. y_i = a^T x_i. By a simple calculation, the objective function can be reduced to
\min_{a} \sum_{i,j} (a^T x_i - a^T x_j)^2 W_{ij} \quad \text{s.t. } a^T X D X^T a = 1
\;\Leftrightarrow\; \min_{a} \sum_{i} a^T x_i D_{ii} x_i^T a - \sum_{i,j} a^T x_i W_{ij} x_j^T a \quad \text{s.t. } a^T X D X^T a = 1
\;\Leftrightarrow\; \min_{a} a^T X (D - W) X^T a = a^T X \Psi X^T a \quad \text{s.t. } a^T X D X^T a = 1.  (44)
In addition, suppose that the Euclidean space R^p is mapped to a Hilbert space H through a nonlinear mapping function φ : R^p → H, and consider the kernel function K(x_i, x_j) = φ(x_i)^T φ(x_j) on H. After some simple algebra, the kernel version of (44) leads to the eigenvector problem

K \Psi K b = \lambda K D K b.  (45)
8.2 Optimization for LPP
The optimal a that minimizes the objective function is given by the minimum-eigenvalue solution of the generalized eigenvalue problem

X \Psi X^T a = \lambda X D X^T a.  (46)
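A minimal sketch of the generalized eigenvalue problem (46) (X D X^T is assumed nonsingular here; in practice a PCA projection is often applied first when it is not):

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, d):
    """LPP projection vectors from (46); X is p x n, W is the n x n adjacency matrix."""
    D = np.diag(W.sum(axis=1))
    Psi = D - W
    A = X @ Psi @ X.T                  # X Psi X^T, p x p
    B = X @ D @ X.T                    # X D X^T,  p x p (assumed positive definite)
    evals, evecs = eigh(A, B)          # generalized eigenproblem, ascending eigenvalues
    return evecs[:, :d]                # columns are the projection vectors a_1, ..., a_d
```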
8.3 What’s new?
• It imposes a new constraint a^T X D X^T a = 1: the matrix D provides a natural measure on the data points. The bigger the value D_{ii} (corresponding to y_i), the more “important” y_i is.
• It is linear, which makes it fast and suitable for large practical datasets.
• It may be carried out in the original space or in a reproducing kernel Hilbert space (RKHS) into which the data points are mapped.
References
[1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pages 585–591, 2002.
[2] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[3] Chun-Mei Feng, Ying-Lian Gao, Jin-Xing Liu, Juan Wang, Dong-Qin Wang, and Chang-Gang Wen. Joint-norm constraint and graph-Laplacian PCA method for feature extraction. BioMed Research International, 2017, 2017.
[4] Xiaofei He and Partha Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems, pages 153–160, 2004.
[5] Bo Jiang, Chris Ding, Bin Luo, and Jin Tang. Graph-Laplacian PCA: Closed-form solution and robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3492–3498, 2013.
[6] Taisong Jin, Jun Yu, Jane You, Kun Zeng, Cuihua Li, and Zhengtao Yu. Low-rank matrix factorization with multiple hypergraph regularizer. Pattern Recognition, 48(3):1011–1022, 2015.
[7] Nauman Shahid, Vassilis Kalofolias, Xavier Bresson, Michael Bronstein, and Pierre Vandergheynst. Robust principal component analysis on graphs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2812–2820, 2015.
[8] Nauman Shahid, Nathanael Perraudin, Vassilis Kalofolias, Gilles Puy, and Pierre Vandergheynst. Fast robust PCA on graphs. IEEE Journal of Selected Topics in Signal Processing, 10(4):740–756, 2016.
[9] Yanning Shen, Panagiotis A. Traganitis, and Georgios B. Giannakis. Nonlinear dimensionality reduction on graphs. In 2017 IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 1–5. IEEE, 2017.
[10] Liang Tao, Horace H. S. Ip, Yinglin Wang, and Xin Shu. Low rank approximation with sparse integration of multiple manifolds for data representation. Applied Intelligence, 42(3):430–446, 2015.
[11] Zhenyue Zhang and Keke Zhao. Low-rank matrix approximation with manifold regularization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1717–1729, 2013.