Subspace Clustering with Missing Data
Lianli Liu Dejiao Zhang Jiabei Zheng
Abstract
Subspace clustering with missing data can be seen as a combination of subspace clustering and low-rank matrix completion; it is essentially equivalent to high-rank matrix completion under the assumption that the columns of the matrix X ∈ R^{d×N} belong to a union of subspaces. The problem is challenging both computationally and statistically. In this report we study two efficient algorithms proposed for solving it, an EM-type algorithm [1] and a k-means-form algorithm, k-GROUSE [2], and implement both algorithms on simulated and real datasets. Besides that, we give a brief description of a recently developed theorem [1] on the sampling complexity of subspace clustering, which is a great improvement over the previously existing theoretical analysis [3] that requires an impractically large sample size, e.g. a number of observed vectors per subspace that is super-polynomial in the dimension d.
1 Problem Statement
Consider a matrix X ∈ R^{d×N}, with entries missing at random, whose columns lie in the union of at most K unknown subspaces of R^d, each of dimension at most r_k < d. The goal of subspace clustering is to infer the underlying K subspaces from the partially observed matrix X_Ω, where Ω denotes the indices of the observed entries, and to cluster the columns of X_Ω into groups that lie in the same subspaces.
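To make the setting concrete, the following short sketch (all names are ours, not from [1] or [2]) generates a partially observed matrix whose columns lie in a union of K random r-dimensional subspaces, with a fixed number of observed entries per column:

```python
import numpy as np

def generate_data(d=100, K=4, r=5, n_per=200, obs_per_col=24, seed=0):
    """Sample columns from K random r-dimensional subspaces of R^d and
    keep only obs_per_col entries per column (missing at random)."""
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for k in range(K):
        U, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal basis of S_k
        X.append(U @ rng.standard_normal((r, n_per)))     # n_per points in S_k
        labels += [k] * n_per
    X = np.hstack(X)
    mask = np.zeros_like(X, dtype=bool)                   # True = observed entry
    for j in range(X.shape[1]):
        mask[rng.choice(d, size=obs_per_col, replace=False), j] = True
    return X, mask, np.array(labels)
```

The boolean `mask` plays the role of Ω; algorithms below only get to see `X[mask]`.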
2 Sampling Complexity for Generic Subspace Clustering and EM Algorithm
2.1 Sampling Complexity
[1] shows that subspace clustering is possible without the impractically large sample size required by previous work [3]. The main assumptions used in the theoretical analysis of [1] differ from those arising in traditional low-rank matrix completion, e.g. [3]: it assumes the columns of X are drawn from a non-atomic distribution supported on a union of low-dimensional generic subspaces, and that entries of the columns are missing uniformly at random. Before stating the main result of [1], we first provide the following definition.
Definition 1. We denote the set of d × n matrices of rank r by M(r, d × n). A generic (d × n) matrix of rank r is a continuous M(r, d × n)-valued random variable. We say a subspace S is generic if a matrix whose columns are drawn i.i.d. according to a non-atomic distribution with support on S is a.s. generic.
The key assumptions of this method are the following:
• A1. The columns of the d × N data matrix X are drawn according to a non-atomic distribution with support on the union of at most K generic subspaces. The subspaces, denoted S = {S_k | k = 0, . . . , K − 1}, each have dimension exactly r < d.
• A2. The probability that a column is drawn from subspace k is ρ_k. Let ρ* be a lower bound on min_k{ρ_k}.
1 Authors by alphabetical order
• A3. We observe X only on a set of entries Ω and denote the observation X_Ω. Each entry in X_Ω is sampled independently with probability p.
We now state the main result of [1], which implies that if we observe at least on the order of d^{r+1}(\log(d/r) + \log K) columns and at least O(r \log^2 d) entries in each column, then identification of the subspaces is possible with high probability. Although by assumption A1 the dimension of each subspace is exactly r, this result can be generalized to a relaxed version in which the dimension of each subspace is only upper bounded by r.
Theorem 1. [1] Suppose A1–A3 hold. Let ε > 0 be given. Assume the number of subspaces K ≤ (ε/6) e^{d/4}, the total number of columns N ≥ (2d + 4M)/ρ*, and

p ≥ \frac{128\, \mu_1^2\, r\, \beta_0 \log^2(2d)}{d}, \qquad \beta_0 = 1 + \frac{\log(6K/ε) + \log(12\log d)}{2\log(2d)},

M = \left(\frac{de}{r+1}\right)^{r+1} \left[ (r+1)\log\left(\frac{de}{r+1}\right) + \log\frac{8K}{ε} \right]   (1)

where \mu_1^2 := \max_k \frac{d^2}{r} \|U_k V_k^*\|_\infty^2 and U_k \Sigma_k V_k^* is the singular value decomposition of X̃[k]^2. Then with probability at least 1 − ε, S can be uniquely determined from X_Ω.
2.2 EM Algorithm
In [4], computing the principal subspaces of a set of data vectors is formulated in a probabilistic framework and solved by maximum likelihood. Here we use the same model but extend it to the case of missing data:

\begin{bmatrix} x_i^o \\ x_i^m \end{bmatrix} = \sum_{k=1}^K 1_{\{z_i=k\}} \left( \begin{bmatrix} W_k^{o_i} \\ W_k^{m_i} \end{bmatrix} y_i + \begin{bmatrix} \mu_k^{o_i} \\ \mu_k^{m_i} \end{bmatrix} \right) + \eta_i   (2)

where z_i ∈ {1, · · · , K}, z_i ∼ ρ independently of y_i ∼ N(0, I), W_k is a d × r matrix whose span is S_k, and \eta_i | z_i ∼ N(0, \sigma_{z_i}^2 I) is the noise in the z_i-th subspace. To find the maximum likelihood estimate of θ = (W, μ, ρ, σ²), i.e.

\hat\theta = \arg\max_\theta l(\theta; x^o) = \arg\max_\theta \sum_{i=1}^N \log\left( \sum_{k=1}^K \rho_k P(x_i^o | z_i = k; \theta) \right)   (3)

an EM algorithm is proposed, where X^o := \{x_i^o\}_{i=1}^N is the observed data and X^m := \{x_i^m\}_{i=1}^N, Y := \{y_i\}_{i=1}^N, Z := \{z_i\}_{i=1}^N are hidden variables. The iterates of the algorithm are as follows, with a detailed derivation provided in Appendix A.
W_k = \left( \sum_{i=1}^N p_{i,k} E_k[x_i y_i^T] - \frac{\left(\sum_{i=1}^N p_{i,k} E_k[x_i]\right)\left(\sum_{i=1}^N p_{i,k} E_k[y_i]\right)^T}{\sum_{i=1}^N p_{i,k}} \right) \left( \sum_{i=1}^N p_{i,k} E_k[y_i y_i^T] - \frac{\left(\sum_{i=1}^N p_{i,k} E_k[y_i]\right)\left(\sum_{i=1}^N p_{i,k} E_k[y_i]\right)^T}{\sum_{i=1}^N p_{i,k}} \right)^{-1}   (4)

\mu_k = \frac{\sum_{i=1}^N p_{i,k} \left( E_k[x_i] - W_k E_k[y_i] \right)}{\sum_{i=1}^N p_{i,k}}   (5)

\sigma_k^2 = \frac{1}{d \sum_{i=1}^N p_{i,k}} \sum_{i=1}^N p_{i,k} \left[ \mathrm{tr}(E_k[x_i x_i^T]) - 2\mu_k^T E_k[x_i] + \mu_k^T \mu_k - 2\,\mathrm{tr}(E_k[y_i x_i^T] W_k) + 2\mu_k^T W_k E_k[y_i] + \mathrm{tr}(E_k[y_i y_i^T] W_k^T W_k) \right]   (6)
2 X̃[k] denotes the columns of X corresponding to the k-th subspace
\rho_k = \frac{1}{N} \sum_{i=1}^N p_{i,k}, \qquad p_{i,k} := P(z_i = k \,|\, x_i^o, \hat\theta) = \frac{\hat\rho_k P(x_i^o \,|\, z_i = k, \hat\theta)}{\sum_{j=1}^K \hat\rho_j P(x_i^o \,|\, z_i = j, \hat\theta)}   (7)

where E_k[·] denotes the conditional expectation E[· \,|\, x_i^o, z_i = k, \hat\theta]. This conditional expectation can be easily calculated from the corresponding conditional distribution. Denoting M_k := \sigma^2 I + W_k^{o_i T} W_k^{o_i}, by the Gauss–Markov theorem,

\begin{bmatrix} x_i^m \\ y_i \end{bmatrix} \Big|\, x_i^o, z_i = k, \theta \;\sim\; N\left( \begin{bmatrix} \mu_k^{m_i} + W_k^{m_i} M_k^{-1} W_k^{o_i T}(x_i^o - \mu_k^{o_i}) \\ M_k^{-1} W_k^{o_i T}(x_i^o - \mu_k^{o_i}) \end{bmatrix}, \; \sigma^2 \begin{bmatrix} I + W_k^{m_i} M_k^{-1} W_k^{m_i T} & W_k^{m_i} M_k^{-1} \\ M_k^{-1} W_k^{m_i T} & M_k^{-1} \end{bmatrix} \right)   (8)

A detailed derivation is provided in Appendix B.
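As a sanity check on the conditional moments above, they can be compared against direct conditioning of the joint Gaussian. The sketch below (function name is ours) computes E[y_i | x_i^o, z_i = k] and E[x_i^m | x_i^o, z_i = k] through M_k, under the stated model:

```python
import numpy as np

def e_step_moments(x_obs, obs, W, mu, sigma2):
    """Conditional means of (x_missing, y) for one column, under the model
    x = W y + mu + eta, y ~ N(0, I), eta ~ N(0, sigma2 I).
    `obs` is a boolean mask of observed coordinates; x_obs = x[obs]."""
    Wo, Wm = W[obs], W[~obs]
    M = sigma2 * np.eye(W.shape[1]) + Wo.T @ Wo        # M_k = sigma^2 I + Wo^T Wo
    Ey = np.linalg.solve(M, Wo.T @ (x_obs - mu[obs]))  # E[y | x_obs]
    Exm = mu[~obs] + Wm @ Ey                           # E[x_missing | x_obs]
    return Exm, Ey
```

By the push-through identity W_o^T(W_o W_o^T + σ²I)^{-1} = M^{-1} W_o^T, this agrees with conditioning on the observed block directly.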
3 Projection Residual Based K-MEANS Form Algorithm
[2] proposes a form of k-means algorithm adapted to k subspaces clustering. It alternats between assigning vectors to
subspaces based on the projection residuals and updates each subspace by performing an incremental gradient descent
along the geodesic curve of Grassmannian 3
manifold(GROUSE[6]).
3.1 Subspace Assignment
For subspace assignment, [2] proves that the incomplete data vector can be assigned to the correct subspace with high
probability if the angles between the vector and the various subspaces satisfied some mild condition.
For each data vector, when there is no entry missing, it’s natural to use the following
v − PSi v 2
2 < v − PSj v 2
2 ∀j = i, and i, j ∈ {0, . . . , K − 1}. (9)
where PSi is the projection operator onto subspace Si
, to decide whether v can be assigned to Si
or not. Now suppose
we only observe the a subset of the entries of v, denoted as vΩ, where Ω ⊂ {1, . . . , d}, |Ω| = m < d. And define the
projection operator restricted to Ω as
PSΩ
= (UT
Ω UΩ)−1
UT
Ω (10)
In [7], the authors prove that UT
Ω UΩ is invertible w.h.p as long as the assumptions of Theorem.2 are satisfied.
Let U_k ∈ R^{d×r_k} have orthonormal columns spanning the r_k-dimensional subspace S_k. A natural question is under what conditions we can use a decision rule similar to (9) for subspace assignment in the incomplete-data case. To answer this question, we first restate a theorem of [2] showing that the restricted projection residual concentrates around the full-data residual, up to a scaling factor, once the sample size m satisfies a certain condition. Let v = x + y, where x ∈ S and y ∈ S^⊥.

Theorem 2. [2]4 Let δ > 0 and m ≥ \frac{8}{3} r \mu(S) \log\frac{2r}{\delta}. Then with probability at least 1 − 3δ,

\frac{m(1-\alpha) - r\mu(S)\frac{(1+\beta)^2}{1-\gamma}}{d} \|v - P_S v\|_2^2 \;\le\; \|v_\Omega - P_{S_\Omega} v_\Omega\|_2^2   (11)

and with probability at least 1 − δ,

\|v_\Omega - P_{S_\Omega} v_\Omega\|_2^2 \;\le\; (1+\alpha)\frac{m}{d} \|v - P_S v\|_2^2   (12)

where \alpha = \sqrt{\frac{2\mu^2(y)}{m}\log\frac{1}{\delta}}, \beta = \sqrt{2\mu(y)\log\frac{1}{\delta}} and \gamma = \sqrt{\frac{8 r \mu(S)}{3m}\log\frac{2r}{\delta}}.
3 The Grassmannian is a compact Riemannian manifold, and its geodesics can be explicitly computed [5]
4 In this theorem, μ(S) and μ(y) are the coherence parameters of the subspace S and the vector y; we follow the definitions proposed in [8]: \mu(S) := \frac{d}{r} \max_j \|P_S e_j\|_2^2 and \mu(y) = \frac{d \|y\|_\infty^2}{\|y\|_2^2}
Next we state the requirement on the angles between the vector and the subspaces under which the incomplete data vector can be assigned to the correct subspace w.h.p. Define the angle between the vector v and its projection onto the r_k-dimensional subspace S_k, k = 0, . . . , K − 1, as \theta_k = \sin^{-1} \frac{\|v - P_{S_k} v\|_2}{\|v\|_2}, and let

C_k(m) = \frac{m(1-\alpha_k) - r_k \mu(S_k)\frac{(1+\beta_k)^2}{1-\gamma_k}}{m(1+\alpha_0)},

where \alpha_k, \beta_k and \gamma_k are defined as in Theorem 2 for the corresponding r_k. Notice that C_k(m) < 1 and C_k(m) → 1 as m → ∞. Without loss of generality, we assume \theta_0 < \theta_k for all k ≠ 0.

Corollary 1. [2] Let m ≥ \frac{8}{3} \max_{i \neq 0} r_i \mu(S_i) \log\frac{2 r_i}{\delta} for fixed δ > 0. Assume that

\sin^2(\theta_0) \le C_i(m) \sin^2(\theta_i), \quad \forall i \neq 0.

Then with probability at least 1 − 4(K − 1)δ,

\|v_\Omega - P_{S_{0\Omega}} v_\Omega\|_2^2 < \|v_\Omega - P_{S_{i\Omega}} v_\Omega\|_2^2, \quad \forall i \neq 0.

Proof. A detailed proof is given in Appendix C.
3.2 Subspace Update
GROUSE [6] is a very efficient algorithm for single-subspace estimation, although full theoretical support for the method is still unavailable5. k-GROUSE [2] can be seen as a combination of k-subspaces and GROUSE for multiple-subspace estimation, detailed in Algorithm 1.
Algorithm 1 k-subspaces with the GROUSE
Require: A collection of vectors v_Ω(t), t = 1, . . . , T, and the observed indices Ω(t); an integer number of subspaces K and dimensions r_k, k = 1, . . . , K; a maximum number of iterations, maxIter; a fixed step size η.
Initialize subspaces: Zero-fill the vectors and collect them in a matrix V. Initialize the K subspace estimates using probabilistic farthest insertion.
Calculate orthonormal bases U_k, k = 1, . . . , K, and let Q_{kΩ} = (U_{kΩ}^T U_{kΩ})^{-1} U_{kΩ}^T.
for i = 1, . . . , maxIter do
  Select a vector at random, v_Ω.
  for k = 1, . . . , K do
    Calculate the weights of the projection onto the k-th subspace: w(k) = Q_{kΩ} v_Ω.
    Calculate the projection residual to the k-th subspace: r_Ω(k) = v_Ω − U_{kΩ} w(k).
  end for
  Select the minimum residual: k = \arg\min_k \|r_Ω(k)\|_2^2. Set w = w(k). Define p = U_k w and r_{Ω^c} = 0, i.e. r is the zero-filled r_Ω(k).
  Update subspace:

    U_k := U_k + \left( (\cos(\sigma\eta) - 1)\frac{p}{\|p\|} + \sin(\sigma\eta)\frac{r}{\|r\|} \right) \frac{w^T}{\|w\|},   (13)

  where σ = \|r\|\|p\|.
end for
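The update (13) in isolation can be sketched as follows (our own illustration, assuming an orthonormal U and a fixed step size η). Because w is the least-squares fit on the observed rows, the zero-filled residual r is orthogonal to the span of U, and the update moves along a geodesic, so the basis stays orthonormal:

```python
import numpy as np

def grouse_step(U, v_obs, obs, eta):
    """One incremental gradient step on the Grassmannian, eq. (13).
    U: d x r orthonormal basis; v_obs: observed entries of v; obs: boolean mask.
    Assumes U[obs] has full column rank so the normal equations are solvable."""
    Uo = U[obs]
    w = np.linalg.solve(Uo.T @ Uo, Uo.T @ v_obs)   # projection weights w
    p = U @ w                                      # predicted full vector
    r = np.zeros(U.shape[0])
    r[obs] = v_obs - Uo @ w                        # zero-filled residual
    rnorm, pnorm, wnorm = np.linalg.norm(r), np.linalg.norm(p), np.linalg.norm(w)
    if rnorm < 1e-12:                              # vector already on the subspace
        return U
    sigma = rnorm * pnorm
    step = (np.cos(sigma * eta) - 1) * p / pnorm + np.sin(sigma * eta) * r / rnorm
    return U + np.outer(step, w / wnorm)
```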
4 Experiment on simulated data
In this section we explore the performance of k-GROUSE and the EM algorithm on simulated data. Before presenting the results, we first describe two key points of the implementation. First, we use a k-means++-like method to initialize the subspaces for both k-GROUSE and the EM algorithm. Specifically, we randomly pick a point x_0 ∈ X, select its q 6 nearest neighbors, and compute the best S_0 w.r.t. these selected q points. Next, we recursively initialize S_j by first picking a point x with probability proportional to
5 A local convergence result for GROUSE is available in [9]
6 q is a nonnegative parameter; usually we set q = 2 max_k r_k
min[dist(x, S_0), . . . , dist(x, S_{j−1})], and then finding the best fit S_j of its q-neighborhood. Second, the clustering error is calculated by examining the cluster assignments of each set of vectors that come from the same (known) underlying subspace. For each set, we take the cluster ID owning the largest number of vectors as the true one, and count the remaining vectors as incorrectly clustered.
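This majority-vote error metric can be written compactly (our own sketch):

```python
import numpy as np

def clustering_error(assigned, true_labels):
    """Within each true group, the most common assigned cluster ID is taken
    as correct; every other vector in the group counts as misclassified."""
    assigned, true_labels = np.asarray(assigned), np.asarray(true_labels)
    wrong = 0
    for g in np.unique(true_labels):
        ids, counts = np.unique(assigned[true_labels == g], return_counts=True)
        wrong += counts.sum() - counts.max()   # everything outside the majority
    return wrong / len(true_labels)
```

For example, a perfect labeling gives error 0, and flipping one of four labels gives 0.25.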
4.1 EM Algorithm
The first experiment for the EM algorithm deals with simulated data generated by the Gaussian mixture model of Section 2. We used d = 100, K = 4 and r = 5 for this part. For the simulation, each of the K subspaces is the span of an orthonormal basis generated from r i.i.d. standard Gaussian d-dimensional vectors. The performance of the EM algorithm strongly depends on initialization: with a poor initial estimate of W and μ, the estimated θ gets stuck at a local maximum of l(θ; x^o). Fig.1(a) gives an example of processing the same X_Ω with different initial estimates, where we observe an obvious difference in performance.
We ran 50 trials for each setting of |ω|, N_k and σ², where |ω| is the number of observed entries in each column, N_k is the number of columns per subspace, and σ² is the noise level. The stopping criterion is l(θ^{(t+1)}; x^o) − l(θ^{(t)}; x^o) < 10^{−4} l(θ^{(t)}; x^o). The results are summarized in Fig.1(b)-(d). The difference between our results and those reported in [1] mainly comes from different initialization strategies. Generally, larger |ω| and N_k and smaller σ² give fewer misclassified points, which is reasonable. In some cases, all misclassified points come from getting stuck at a local maximum.
Figure 1: (a) Good initial estimate: converges after 83 iterations, label correct rate = 100%; bad initial estimate: converges after 395 iterations, label correct rate = 45.6% (N_k = 200, |ω| = 24, σ² = 10^{−4}). (b)-(d) Proportion of misclassified points: (b) as a function of N_k with |ω| = 24, σ² = 10^{−4}; (c) as a function of |ω| with σ² = 10^{−4}, N_k = 300; (d) as a function of σ² with |ω| = 24, N_k = 210.
4.2 k-GROUSE
In this section we explore the performance of k-GROUSE in two simulation scenarios. We set the ambient dimension d = 40, and the data matrix X consists of 80 vectors per subspace. In the first case, we let the number of subspaces be K_1 = 4 with r = 5 for each subspace, i.e. the sum of the subspace dimensions is R = \sum_{k=1}^K r_k < d. In the second case, we set the number of subspaces to K_2 = 5 and make one of them dependent on the other four.
From Fig.2(a) we can see that k-GROUSE performs well as long as the number of observations is a little more than r \log^2 d, which is within the order predicted by the theory. However, the performance of k-GROUSE also depends on the initialization: although the k-means++-like initialization can significantly improve the convergence of k-GROUSE, it can still converge to a local optimum in the worst case.
Figure 2: (a) Performance of k-GROUSE with independent and dependent subspaces (d = 40, r = 5, K_1 = 4, K_2 = 5), obtained over 50 trials; (b) k-GROUSE vs. the EM algorithm (d = 100, r = 5, K = 4), obtained over 50 trials.
4.3 k-GROUSE vs. EM Algorithm
We now compare the performance of k-GROUSE and the EM algorithm on simulated data. In this setting, K = 4, r = 5, d = 100. From Fig.2(b) we can see that both k-GROUSE and the EM algorithm perform well in this situation, while k-GROUSE is more efficient. However, as we will see in the next section, the performance gap between the two becomes obvious when we apply them to worse-conditioned real data: when the condition number of the data matrix X is large, k-GROUSE can perform poorly while the EM algorithm still works under the same conditions. In traditional matrix completion, the required sampling density can depend heavily on the condition number of the matrix, and this dependence seems to be inherited by k-GROUSE.
5 Experiment on real data
5.1 EM algorithm
We test the algorithm on real data by applying it to image compression. A 264×536 image is used, the same as the one used for mixtures of probabilistic PCA [4]. As in [4], the image is segmented into 8×8 non-overlapping blocks, giving a dataset of 2211 64-dimensional vectors. In the experiment we set the number of subspaces to 3 and test the performance of the algorithm under different missing rates (number of missing pixels per image block) and compression ratios. To make sure the resulting coded image has the same bit rate as a conventional SVD-coded image, after the EM algorithm converges the image coding is performed in a hard fashion, i.e. the coding coefficients of each image block are estimated using only the subspace it has the largest probability of belonging to.
An initialization scheme slightly different from the simulated experiments is used: after setting all missing data to zero, a simple k-means clustering is performed on the dataset. The parameters μ_k, W_k and σ_k of each subspace are initialized from the k-th data cluster {x_i} ⊂ R^d using probabilistic PCA [4], where μ_k is initialized by the sample mean. Denoting the singular value decomposition of the sample covariance matrix by S_k = U Λ V^T, W_k ∈ R^{d×r} is initialized as

W_k = U_r (\Lambda_r - \sigma^2 I)^{1/2}   (14)
where σ² is the initial value of \sigma_k^2 = \frac{1}{d-r}\sum_{j=r+1}^d \lambda_j, U_r ∈ R^{d×r} consists of the first r columns of U (the leading left singular vectors), and \Lambda_r ∈ R^{r×r} is the corresponding diagonal matrix of singular values.
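The initialization (14) can be sketched as follows (function name ours; we use the eigendecomposition of the sample covariance, which coincides with its SVD since the covariance is symmetric PSD):

```python
import numpy as np

def ppca_init(Xk, r):
    """Initialize (mu, W, sigma2) for one cluster from zero-filled data Xk (d x n),
    following probabilistic PCA: W = U_r (Lambda_r - sigma2 I)^(1/2)."""
    mu = Xk.mean(axis=1)
    S = np.cov(Xk)                      # d x d sample covariance
    lam, U = np.linalg.eigh(S)          # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]      # sort descending
    sigma2 = lam[r:].mean()             # average of the discarded eigenvalues
    W = U[:, :r] * np.sqrt(np.maximum(lam[:r] - sigma2, 0.0))
    return mu, W, sigma2
```

The columns of the resulting W are orthogonal (scaled eigenvectors), which matches the PPCA maximum-likelihood solution up to rotation [4].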
Sample images reconstructed under different missing rates (equivalently, the number of observed pixels per block ω) and compression ratios (number of components r in each subspace) are shown in Fig.3. The mean squared error (MSE) between the compressed image and the original image is plotted in Fig.5. As the compression ratio decreases, the quality of the image improves significantly, as we can see from Fig.3, and the mean squared error also decreases. For different missing rates, on the other hand, the images appear visually the same: no significant difference is observed between the missing-data case and the complete-data case (though the mean squared error decreases slightly). This validates the theory in [1] that, as long as certain conditions are met and a reasonable number of entries are observed, with high probability we can recover the true subspaces from an incomplete dataset.
(a) ω=14, r = 4 (b) ω=64, r = 4 (c) ω=14, r = 12 (d) ω=64, r = 12 (e) ω=14, r = 32 (f) ω=64, r = 32
Figure 3: Image compression using the EM algorithm
5.2 k-GROUSE algorithm
We also test the k-GROUSE algorithm on image compression, using the same block transformation scheme and number of subspaces as for the EM algorithm. Sample compressed images are shown in Fig.4.
6 Conclusion
From this project we learned different subspace learning techniques, especially in the missing-data case. We also studied the theoretical requirements for recovering subspaces when data is incomplete, and practiced the implementation of the EM algorithm and k-GROUSE.
7 Contribution
Dejiao Zhang was in charge of the abstract, Sections 1, 2.1, 3, 4.2, 4.3 and Appendix C. Jiabei Zheng rederived the EM algorithm for the Gaussian mixture model (Appendix A), did coding and made the plots in Section 4.1. Lianli Liu rederived the conditional distribution (Appendix B), did the experiments on real data (Section 5) and debugged the EM algorithm.
(a) missing rate 30%, number of modes 32 (b) missing rate 30%, number of modes 50
Figure 4: Image compression using k-GROUSE
(a) MSE under different missing rates, mode = 12 (b) MSE under different compression ratios, missing rate = 78%
Figure 5: Plot of mean squared error
References
[1] D. Pimentel and R. Nowak, “On the sample complexity of subspace clustering with missing data.”
[2] L. Balzano, A. Szlam, B. Recht, and R. Nowak, “k-subspaces with missing data,” in Statistical Signal Processing
Workshop (SSP), 2012 IEEE. IEEE, 2012, pp. 612–615.
[3] B. Eriksson, L. Balzano, and R. Nowak, “High-rank matrix completion and subspace clustering with missing
data,” arXiv preprint arXiv:1112.5629, 2011.
[4] M. Tipping and C. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural computation,
vol. 11, no. 2, pp. 443–482, 1999.
[5] A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality constraints,” SIAM
journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
[6] L. Balzano, R. Nowak, and B. Recht, “Online identification and tracking of subspaces from highly incomplete
information,” in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on.
IEEE, 2010, pp. 704–711.
[7] L. Balzano, B. Recht, and R. Nowak, “High-dimensional matched subspace detection when data are missing,” in
Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1638–1642.
[8] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[9] L. Balzano and S. J. Wright, “Local convergence of an algorithm for subspace identification from partial data,”
arXiv preprint arXiv:1306.3391, 2013.
Appendix A
Here is the complete derivation of the EM algorithm. The complete-data probability is

Pr(x, y, z; \theta) = \prod_{i=1}^N Pr(x_i, y_i, z_i) = \prod_{i=1}^N Pr(x_i | y_i, z_i) Pr(y_i | z_i) Pr(z_i) = \prod_{i=1}^N Pr(x_i | y_i, z_i) Pr(y_i) Pr(z_i) = \prod_{i=1}^N \rho_{z_i} \phi(x_i; \mu_{z_i} + W_{z_i} y_i, \sigma_{z_i}^2 I)\, \phi(y_i; 0, I)   (15)

The complete-data log-likelihood is

l(\theta; x, y, z) = \sum_{i=1}^N \log\left( \rho_{z_i} \phi(x_i; \mu_{z_i} + W_{z_i} y_i, \sigma_{z_i}^2 I)\, \phi(y_i; 0, I) \right) = \sum_{i=1}^N \sum_{k=1}^K \Delta_{ik} \log\left( \rho_k \phi(x_i; \mu_k + W_k y_i, \sigma_k^2 I)\, \phi(y_i; 0, I) \right)   (16)

where

\Delta_{ik} = \begin{cases} 1 & \text{if } z_i = k \\ 0 & \text{else.} \end{cases}   (17)

The E-step is

Q(\theta; \hat\theta) = E[l(\theta; x, y, z) \,|\, x^o, \hat\theta]
= \sum_{k=1}^K \sum_{i=1}^N E[\Delta_{ik} \log(\rho_k \phi(x_i; \mu_k + W_k y_i, \sigma_k^2 I)\phi(y_i; 0, I)) \,|\, x^o, \hat\theta]
= \sum_{k=1}^K \sum_{i=1}^N \sum_{l=1}^K E[\Delta_{ik} \log(\rho_k \phi(x_i; \mu_k + W_k y_i, \sigma_k^2 I)\phi(y_i; 0, I)) \,|\, x^o, z_i = l, \hat\theta]\, Pr(z_i = l \,|\, x^o, \hat\theta)
= \sum_{k=1}^K \sum_{i=1}^N E[\log(\rho_k \phi(x_i; \mu_k + W_k y_i, \sigma_k^2 I)\phi(y_i; 0, I)) \,|\, x^o, z_i = k, \hat\theta]\, Pr(z_i = k \,|\, x^o, \hat\theta)
= \sum_{k=1}^K \sum_{i=1}^N p_{ik} \left\{ \log\rho_k - \frac{d+r}{2}\log(2\pi) - \frac{d}{2}\log\sigma_k^2 - \frac{1}{2\sigma_k^2} E_k[(x_i - \mu_k - W_k y_i)^T (x_i - \mu_k - W_k y_i)] - \frac{1}{2} E_k[y_i^T y_i] \right\}
= \sum_{k=1}^K \left(\sum_{i=1}^N p_{ik}\right) \left\{ \log\rho_k - \frac{d+r}{2}\log(2\pi) - \frac{d}{2}\log\sigma_k^2 - \frac{1}{2\sigma_k^2} U_k - \frac{1}{2} E_k\langle y^T y\rangle \right\}   (18)

where

p_{ik} = Pr(z_i = k \,|\, x_i^o, \hat\theta)   (19)
E_k[\cdot] = E[\cdot \,|\, x_i^o, z_i = k, \hat\theta]   (20)

E_k\langle u \rangle = \frac{\sum_{i=1}^N p_{ik} E_k[u_i]}{\sum_{i=1}^N p_{ik}}   (21)

E_k\langle u^T v \rangle = \frac{\sum_{i=1}^N p_{ik} E_k[u_i^T v_i]}{\sum_{i=1}^N p_{ik}}   (22)

U_k = E_k\langle x^T x \rangle + \mu_k^T \mu_k + \mathrm{tr}(E_k\langle y y^T \rangle W_k^T W_k) - 2\mu_k^T E_k\langle x \rangle + 2\mu_k^T W_k E_k\langle y \rangle - 2\,\mathrm{tr}(E_k\langle y x^T \rangle W_k)   (23)

The M-step is

\tilde\theta = \arg\max_\theta Q(\theta; \hat\theta)   (24)

First we derive ρ, which satisfies \sum_{k=1}^K \rho_k = 1. Using a Lagrange multiplier λ:

\frac{\partial\left(Q + \lambda \sum_{k=1}^K \rho_k\right)}{\partial \rho_k} = 0   (25)

\sum_{i=1}^N p_{ik} + \lambda \rho_k = 0   (26)

\sum_{k=1}^K \sum_{i=1}^N p_{ik} + \lambda = 0   (27)

\lambda = -N   (28)

Substituting this back into (26), we have

\rho_k = \frac{1}{N}\sum_{i=1}^N p_{ik}   (29)

Then we eliminate σ² with W fixed:

\frac{\partial Q}{\partial \sigma_k^2} = 0 \;\Rightarrow\; \sigma_k^2 = \frac{U_k}{d}   (30)

\sigma_k^2 = \frac{1}{d}\left( E_k\langle x^T x \rangle + \mu_k^T \mu_k + \mathrm{tr}(E_k\langle y y^T \rangle W_k^T W_k) - 2\mu_k^T E_k\langle x \rangle + 2\mu_k^T W_k E_k\langle y \rangle - 2\,\mathrm{tr}(E_k\langle y x^T \rangle W_k) \right)   (31)

The elimination of μ is similar:

\frac{\partial Q}{\partial \mu_k} = 0 \;\Rightarrow\; \frac{\partial U_k}{\partial \mu_k} = 0   (32)
\mu_k = E_k\langle x \rangle - W_k E_k\langle y \rangle   (33)

Then we derive the expression for W. From Eq. (18) we have

\frac{\partial Q}{\partial W_k} = -\frac{d}{2\sigma_k^2}\frac{\partial \sigma_k^2}{\partial W_k} - \frac{1}{2\sigma_k^2}\frac{\partial U_k}{\partial W_k} + \frac{U_k}{2\sigma_k^4}\frac{\partial \sigma_k^2}{\partial W_k} = -\frac{1}{2\sigma_k^2}\frac{\partial U_k}{\partial W_k} = 0   (34)

As a result,

\frac{\partial Q}{\partial W_k} = 0 \;\Rightarrow\; \frac{\partial U_k}{\partial W_k} = 0   (35)

Plugging the expression for \mu_k into U_k, we get

\frac{\partial}{\partial W_k}\left\{ E_k\langle x^T x \rangle + (E_k\langle x \rangle - W_k E_k\langle y \rangle)^T (E_k\langle x \rangle - W_k E_k\langle y \rangle) + \mathrm{tr}(E_k\langle y y^T \rangle W_k^T W_k) - 2(E_k\langle x \rangle - W_k E_k\langle y \rangle)^T E_k\langle x \rangle + 2(E_k\langle x \rangle - W_k E_k\langle y \rangle)^T W_k E_k\langle y \rangle - 2\,\mathrm{tr}(E_k\langle y x^T \rangle W_k) \right\} = 0   (36)

W_k = (E_k\langle x y^T \rangle - E_k\langle x \rangle E_k\langle y \rangle^T)(E_k\langle y y^T \rangle - E_k\langle y \rangle E_k\langle y \rangle^T)^{-1}   (37)

Finally, p_{ik} is calculated by Bayes' rule:

p_{ik} = Pr(z_i = k \,|\, x_i^o, \hat\theta) = \frac{\hat\rho_k P(x_i^o \,|\, z_i = k; \hat\theta)}{\sum_{j=1}^K \hat\rho_j P(x_i^o \,|\, z_i = j; \hat\theta)} = \frac{\hat\rho_k\, \phi(x_i^o; \mu_k^{o_i}, W_k^{o_i} W_k^{o_i T} + \sigma_k^2 I)}{\sum_{j=1}^K \hat\rho_j\, \phi(x_i^o; \mu_j^{o_i}, W_j^{o_i} W_j^{o_i T} + \sigma_j^2 I)}   (38)
Appendix B
By the model definition,

\begin{bmatrix} x_i^o \\ x_i^m \\ y_i \end{bmatrix} \Big|\, z_i = k, \theta = \begin{bmatrix} W_k^{o_i} \\ W_k^{m_i} \\ I \end{bmatrix} y_i + \begin{bmatrix} \mu_k^{o_i} \\ \mu_k^{m_i} \\ 0 \end{bmatrix} + \begin{bmatrix} \eta^{o_i} \\ \eta^{m_i} \\ 0 \end{bmatrix}   (39)

Eq. (39) is a linear transform of y_i ∼ N(0, I). By the properties of the multivariate Gaussian distribution,

\begin{bmatrix} x_i^o \\ x_i^m \\ y_i \end{bmatrix} \Big|\, z_i = k, \theta \sim N\left( \begin{bmatrix} \mu_k^{o_i} \\ \mu_k^{m_i} \\ 0 \end{bmatrix}, \begin{bmatrix} W_k^{o_i} W_k^{o_i T} + \sigma^2 I & W_k^{o_i} W_k^{m_i T} & W_k^{o_i} \\ W_k^{m_i} W_k^{o_i T} & W_k^{m_i} W_k^{m_i T} + \sigma^2 I & W_k^{m_i} \\ W_k^{o_i T} & W_k^{m_i T} & I \end{bmatrix} \right)   (40)
By the Bayesian Gauss–Markov theorem,

\begin{bmatrix} x_i^m \\ y_i \end{bmatrix} \Big|\, x_i^o, z_i = k, \theta \sim N\left( \mu_{my_i} + R_{my_i o} R_{o_i}^{-1} (x_i^o - \mu_k^{o_i}),\; R_{my_i} - R_{my_i o} R_{o_i}^{-1} R_{o\,my_i} \right)   (41)

where

\mu_{my_i} = \begin{bmatrix} \mu_k^{m_i} \\ 0 \end{bmatrix}, \quad R_{my_i o} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix}, \quad R_{my_i} = \begin{bmatrix} W_k^{m_i} W_k^{m_i T} + \sigma^2 I & W_k^{m_i} \\ W_k^{m_i T} & I \end{bmatrix}   (42)

R_{o_i} = W_k^{o_i} W_k^{o_i T} + \sigma^2 I, \quad R_{o\,my_i} = \begin{bmatrix} W_k^{o_i} W_k^{m_i T} & W_k^{o_i} \end{bmatrix}   (43)

Plugging Eq. (42) and Eq. (43) into Eq. (41),

R_{my_i o} R_{o_i}^{-1} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix} (W_k^{o_i} W_k^{o_i T} + \sigma^2 I)^{-1}   (44)
By the matrix identity

(I + PQ)^{-1} P = P (I + QP)^{-1},   (45)

we have

W_k^{o_i T} (W_k^{o_i} W_k^{o_i T} + \sigma^2 I)^{-1} = (W_k^{o_i T} W_k^{o_i} + \sigma^2 I)^{-1} W_k^{o_i T} = M_k^{-1} W_k^{o_i T}   (46)

Putting Eq. (46) and Eq. (44) together,

R_{my_i o} R_{o_i}^{-1} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix} (W_k^{o_i} W_k^{o_i T} + \sigma^2 I)^{-1} = \begin{bmatrix} W_k^{m_i} M_k^{-1} W_k^{o_i T} \\ M_k^{-1} W_k^{o_i T} \end{bmatrix}   (47)

Therefore

\mu_{my_i} + R_{my_i o} R_{o_i}^{-1} (x_i^o - \mu_k^{o_i}) = \begin{bmatrix} \mu_k^{m_i} + W_k^{m_i} M_k^{-1} W_k^{o_i T} (x_i^o - \mu_k^{o_i}) \\ M_k^{-1} W_k^{o_i T} (x_i^o - \mu_k^{o_i}) \end{bmatrix}   (48)

which is exactly the formula for the mean given in the paper.
To prove the formula for the variance, by Eq. (47),

R_{my_i o} R_{o_i}^{-1} R_{o\,my_i} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix} (W_k^{o_i} W_k^{o_i T} + \sigma^2 I)^{-1} \begin{bmatrix} W_k^{o_i} W_k^{m_i T} & W_k^{o_i} \end{bmatrix} = \begin{bmatrix} W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T} & W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i} \\ M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T} & M_k^{-1} W_k^{o_i T} W_k^{o_i} \end{bmatrix}   (49)

Thus

R_{my_i} - R_{my_i o} R_{o_i}^{-1} R_{o\,my_i} = \begin{bmatrix} \sigma^2 I + W_k^{m_i} (I - M_k^{-1} W_k^{o_i T} W_k^{o_i}) W_k^{m_i T} & W_k^{m_i} (I - M_k^{-1} W_k^{o_i T} W_k^{o_i}) \\ (I - M_k^{-1} W_k^{o_i T} W_k^{o_i}) W_k^{m_i T} & I - M_k^{-1} W_k^{o_i T} W_k^{o_i} \end{bmatrix}   (50)

Notice that

M_k - W_k^{o_i T} W_k^{o_i} = \sigma^2 I \;\Rightarrow\; I - M_k^{-1} W_k^{o_i T} W_k^{o_i} = \sigma^2 M_k^{-1}   (51)

Thus Eq. (50) simplifies to

R_{my_i} - R_{my_i o} R_{o_i}^{-1} R_{o\,my_i} = \sigma^2 \begin{bmatrix} I + W_k^{m_i} M_k^{-1} W_k^{m_i T} & W_k^{m_i} M_k^{-1} \\ M_k^{-1} W_k^{m_i T} & M_k^{-1} \end{bmatrix}   (52)

which is exactly the formula for the variance given in the paper.
Appendix C
Before giving the proof of Corollary 1, we examine the case of binary subspace assignment, i.e. K = 2. Let v ∈ R^d and S_0, S_1 ⊂ R^d with dimensions r_0, r_1 respectively. Define the angles between v and its projections onto the subspaces S_0, S_1 as

\theta_0 = \sin^{-1} \frac{\|v - P_{S_0} v\|_2}{\|v\|_2} \quad \text{and} \quad \theta_1 = \sin^{-1} \frac{\|v - P_{S_1} v\|_2}{\|v\|_2}   (53)

Theorem 3. [2] Let δ > 0 and m ≥ \frac{8}{3} r_1 \mu(S_1) \log\frac{2 r_1}{\delta}. Assume that

\sin^2(\theta_0) < C(m) \sin^2(\theta_1).

Then with probability at least 1 − 4δ,

\|v_\Omega - P_{S_{0\Omega}} v_\Omega\|_2^2 < \|v_\Omega - P_{S_{1\Omega}} v_\Omega\|_2^2.

Proof. By Theorem 2 and a union bound, the following two statements hold simultaneously with probability at least 1 − 4δ:

\|v_\Omega - P_{S_{0\Omega}} v_\Omega\|_2^2 \le (1+\alpha_0)\frac{m}{d}\|v - P_{S_0} v\|_2^2 \quad \text{and} \quad \frac{m(1-\alpha_1) - r_1\mu(S_1)\frac{(1+\beta_1)^2}{1-\gamma_1}}{d}\|v - P_{S_1} v\|_2^2 \le \|v_\Omega - P_{S_{1\Omega}} v_\Omega\|_2^2   (54)