Subspace Clustering with Missing Data
Lianli Liu Dejiao Zhang Jiabei Zheng
Abstract
Subspace clustering with missing data can be seen as the combination of subspace clustering
and low-rank matrix completion; it is essentially equivalent to high-rank matrix completion
under the assumption that the columns of the matrix X ∈ R^{d×N} belong to a union of subspaces. It is
a challenging problem, both in terms of computation and inference. In this report, we study two
efficient algorithms proposed for solving this problem, the EM-type algorithm [1] and the k-means
form algorithm k-GROUSE [2], and implement both algorithms on simulated and real
datasets. In addition, we give a brief description of a recently developed theorem [1] on
the sampling complexity of subspace clustering, which is a great improvement over the previously
existing theoretical analysis [3] that requires an impractically large sample size, e.g. the number of observed
vectors per subspace should be super-polynomial in the dimension d.
¹Authors listed in alphabetical order.
1 Problem Statement
Consider a matrix X ∈ R^{d×N}, with entries missing at random, whose columns lie in the union of at most K unknown
subspaces of R^d, each of which has dimension at most r_k < d. The goal of subspace clustering is to infer the underlying K
subspaces from the partially observed matrix X_Ω, where Ω denotes the index set of the observed entries, and to cluster
the columns of X_Ω into groups that lie in the same subspaces.
2 Sampling Complexity for Generic Subspace Clustering and EM Algorithm
2.1 Sampling Complexity
[1] shows that subspace clustering is possible without the impractically large sample size required by previous work
[3]. The main assumptions used in the theoretical analysis of [1] differ from those arising in traditional low-rank
matrix completion, e.g. [3]. It assumes that the columns of X are drawn from a non-atomic distribution supported
on a union of low-dimensional generic subspaces, and that entries in the columns are missing uniformly at random. Before
stating the main result of [1], we first provide the following definition.
Definition 1. We denote the set of d × n matrices with rank r by M(r, d × n). A generic (d × n) matrix of rank r
is a continuous M(r, d × n)-valued random variable. We say a subspace S is generic if a matrix whose columns are
drawn i.i.d. according to a non-atomic distribution with support on S is generic a.s.
The key assumptions of this method are the following:
• A1. The columns of the d × N data matrix X are drawn according to a non-atomic distribution with support
on the union of at most K generic subspaces. The subspaces, denoted by S = {S_k | k = 0, . . . , K − 1}, each
have rank exactly r < d.
• A2. The probability that a column is drawn from subspace k is ρ_k. Let ρ* be a lower bound on min_k{ρ_k}.
• A3. We observe X only on a set of entries Ω and denote the observation by X_Ω. Each entry of X is observed
independently with probability p.
We now state the main result of [1], which implies that if we observe at least on the order of $d^{r+1}(\log(d/r) + \log K)$ columns
and at least $O(r \log^2 d)$ entries in each column, then identification of the subspaces is possible with high probability.
Although by assumption A1 the rank of each subspace is exactly r, this result can be generalized to a relaxed version
in which the dimension of each subspace is only upper bounded by r.
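To get a concrete feel for these two order-of-magnitude requirements, the short sketch below simply evaluates them for example values of d, r and K; the values are illustrative and not taken from the experiments in this report, and constants and log bases are ignored.

```python
import numpy as np

# Headline sample-complexity quantities from the discussion above
# (illustrative values of d, r, K only).
d, r, K = 100, 5, 4

columns_needed = d ** (r + 1) * (np.log(d / r) + np.log(K))   # ~ d^(r+1)(log(d/r) + log K)
entries_per_column = r * np.log(d) ** 2                       # ~ r log^2(d)

print(f"columns needed      ~ {columns_needed:.3e}")
print(f"entries per column  ~ {entries_per_column:.1f}")
```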
Theorem 1. [1] Suppose A1-A3 hold. Let ε > 0 be given. Assume the number of subspaces $K \leq \frac{\epsilon}{6} e^{d/4}$, the total
number of columns N ≥ (2d + 4M)/ρ*, and
$$p \;\geq\; \frac{1}{d}\,128\,\mu_1^2\, r\,\beta_0 \log^2(2d), \qquad \beta_0 = 1 + \frac{\log\frac{6K}{\epsilon} + \frac{1}{2}\log(d)}{2\log(2d)},$$
$$M \;=\; \left(\frac{de}{r+1}\right)^{r+1}\left((r+1)\log\frac{de}{r+1} + \log\frac{8K}{\epsilon}\right) \quad (1)$$
where $\mu_1^2 := \max_k \frac{d^2}{r}\,\|U_k V_k^*\|_\infty^2$ and $U_k \Sigma_k V_k^*$ is the singular value decomposition of $\tilde{X}^{[k]}$.² Then with probability at
least 1 − ε, S can be uniquely determined from X_Ω.
²$\tilde{X}^{[k]}$ denotes the columns of X corresponding to the k-th subspace.
2.2 EM Algorithm
In [4], computing the principal subspaces of a set of data vectors is formulated in a probabilistic framework and
obtained by a maximum-likelihood method. Here we use the same model but extend it to the case of missing data:
$$\begin{bmatrix} x_i^{o}\\ x_i^{m} \end{bmatrix} \;=\; \sum_{k=1}^{K} \mathbf{1}_{\{z_i = k\}}\left( \begin{bmatrix} W_k^{o_i}\\ W_k^{m_i} \end{bmatrix} y_i + \begin{bmatrix} \mu_k^{o_i}\\ \mu_k^{m_i} \end{bmatrix}\right) + \eta_i \quad (2)$$
where $z_i \sim \rho$ over $\{1, \dots, K\}$, independently of $y_i \sim N(0, I)$; $W_k$ is a d × r matrix whose span is $S_k$, and $\eta_i \mid z_i \sim N(0, \sigma^2_{z_i} I)$ is the
noise in the $z_i$-th subspace. To find the maximum likelihood estimate of $\theta = \{W, \mu, \rho, \sigma^2\}$, i.e.
$$\hat{\theta} = \arg\max_\theta\; l(\theta, x^o) = \arg\max_\theta\; \sum_{i=1}^{N} \log\Big(\sum_{k=1}^{K} \rho_k\, P(x_i^o \mid z_i = k; \theta)\Big) \quad (3)$$
an EM algorithm is proposed, where $X^o := \{x_i^o\}_{i=1}^N$ is the observed data and $X^m := \{x_i^m\}_{i=1}^N$, $Y := \{y_i\}_{i=1}^N$, $Z := \{z_i\}_{i=1}^N$
are hidden variables. The iterates of the algorithm are as follows, with a detailed derivation provided in Appendix A.
$$W_k = \left(\sum_{i=1}^{N} p_{i,k} E_k[x_i y_i^T] - \frac{\big(\sum_{i=1}^{N} p_{i,k} E_k[x_i]\big)\big(\sum_{i=1}^{N} p_{i,k} E_k[y_i]\big)^T}{\sum_{i=1}^{N} p_{i,k}}\right)\left(\sum_{i=1}^{N} p_{i,k} E_k[y_i y_i^T] - \frac{\big(\sum_{i=1}^{N} p_{i,k} E_k[y_i]\big)\big(\sum_{i=1}^{N} p_{i,k} E_k[y_i]\big)^T}{\sum_{i=1}^{N} p_{i,k}}\right)^{-1} \quad (4)$$
$$\mu_k = \frac{\sum_{i=1}^{N} p_{i,k}\,\big(E_k[x_i] - W_k E_k[y_i]\big)}{\sum_{i=1}^{N} p_{i,k}} \quad (5)$$
$$\sigma_k^2 = \frac{1}{d \sum_{i=1}^{N} p_{i,k}}\sum_{i=1}^{N} p_{i,k}\Big(\mathrm{tr}\big(E_k[x_i x_i^T]\big) - 2\mu_k^T E_k[x_i] + \mu_k^T \mu_k - 2\,\mathrm{tr}\big(E_k[y_i x_i^T] W_k\big) + 2\mu_k^T W_k E_k[y_i] + \mathrm{tr}\big(E_k[y_i y_i^T] W_k^T W_k\big)\Big) \quad (6)$$
$$\rho_k = \frac{1}{N}\sum_{i=1}^{N} p_{i,k}\,, \qquad p_{i,k} := P_{z_i \mid x_i^o, \hat{\theta}}(k) = \frac{\hat{\rho}_k\, P_{x_i^o \mid z_i = k, \hat{\theta}}(x_i^o)}{\sum_{j=1}^{K} \hat{\rho}_j\, P_{x_i^o \mid z_i = j, \hat{\theta}}(x_i^o)} \quad (7)$$
where $E_k[\cdot]$ denotes the conditional expectation $E[\,\cdot \mid x_i^o, z_i = k, \hat{\theta}\,]$. This conditional expectation can be easily calculated
from the corresponding conditional distribution. Denoting $M_k := \sigma_k^2 I + W_k^{o_i T} W_k^{o_i}$, by the Gauss-Markov theorem,
$$\begin{bmatrix} x_i^{m}\\ y_i \end{bmatrix}\bigg|_{x_i^o,\, z_i = k,\, \theta} \sim N\!\left(\begin{bmatrix} \mu_k^{m_i} + W_k^{m_i} M_k^{-1} W_k^{o_i T}(x_i^o - \mu_k^{o_i})\\ M_k^{-1} W_k^{o_i T}(x_i^o - \mu_k^{o_i}) \end{bmatrix},\; \sigma_k^2\begin{bmatrix} I + W_k^{m_i} M_k^{-1} W_k^{m_i T} & W_k^{m_i} M_k^{-1}\\ M_k^{-1} W_k^{m_i T} & M_k^{-1} \end{bmatrix}\right) \quad (8)$$
A detailed derivation is provided in Appendix B.
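As an illustration of how the conditional expectations entering (4)-(7) can be computed from (8), here is a minimal NumPy/SciPy sketch of the E-step quantities for a single column and a single subspace. The function and variable names are our own, and the observed-data likelihood used for the responsibilities follows (38) in Appendix A; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step_column(x, obs, W, mu, sigma2):
    """Conditional moments of (x, y) given x^o and z_i = k (cf. Eq. (8)).

    x      : (d,) data vector; only the entries where obs is True are used
    obs    : (d,) boolean mask of observed entries
    W, mu  : (d, r) and (d,) parameters of subspace k
    sigma2 : noise variance of subspace k
    Returns E[x], E[y], E[x y^T], E[y y^T], tr E[x x^T] and the observed-data
    log-likelihood log N(x^o; mu^o, W^o W^oT + sigma2 I) used in Eq. (7)/(38).
    """
    d, r = W.shape
    Wo, Wm = W[obs], W[~obs]                      # rows restricted to observed / missing
    mo, mm = mu[obs], mu[~obs]
    M = sigma2 * np.eye(r) + Wo.T @ Wo            # M_k in the text
    Minv = np.linalg.inv(M)
    resid = x[obs] - mo

    Ey = Minv @ Wo.T @ resid                      # posterior mean of y
    Ex = x.copy()
    Ex[~obs] = mm + Wm @ Ey                       # posterior mean of x^m

    Cy = sigma2 * Minv                            # Cov(y | x^o, z=k)
    Cxy = np.zeros((d, r))                        # Cov(x, y | x^o): zero on observed rows
    Cxy[~obs] = sigma2 * Wm @ Minv
    Cxx_m = sigma2 * (np.eye((~obs).sum()) + Wm @ Minv @ Wm.T)

    Exy = np.outer(Ex, Ey) + Cxy                  # E[x y^T]
    Eyy = np.outer(Ey, Ey) + Cy                   # E[y y^T]
    Exx_tr = Ex @ Ex + np.trace(Cxx_m)            # tr E[x x^T], needed in Eq. (6)

    loglik = multivariate_normal.logpdf(
        x[obs], mo, Wo @ Wo.T + sigma2 * np.eye(obs.sum()))
    return Ex, Ey, Exy, Eyy, Exx_tr, loglik
```

The M-step updates (4)-(7) then only require accumulating these per-column quantities weighted by the responsibilities p_{i,k}.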
3 Projection Residual Based k-Means Form Algorithm
[2] proposes a form of the k-means algorithm adapted to clustering into k subspaces. It alternates between assigning vectors to
subspaces based on their projection residuals and updating each subspace by performing an incremental gradient descent
along a geodesic curve of the Grassmannian manifold³ (GROUSE [6]).
3.1 Subspace Assignment
For subspace assignment, [2] proves that an incomplete data vector can be assigned to the correct subspace with high
probability if the angles between the vector and the various subspaces satisfy some mild conditions.
For each data vector, when no entry is missing, it is natural to use the rule
$$\|v - P_{S^i} v\|_2^2 < \|v - P_{S^j} v\|_2^2 \quad \forall j \neq i, \ \text{and}\ i, j \in \{0, \dots, K-1\}, \quad (9)$$
where $P_{S^i}$ is the projection operator onto subspace $S^i$, to decide whether v should be assigned to $S^i$ or not. Now suppose
we only observe a subset of the entries of v, denoted $v_\Omega$, where Ω ⊂ {1, . . . , d}, |Ω| = m < d, and define the
projection operator restricted to Ω as
$$P_{S_\Omega} = U_\Omega (U_\Omega^T U_\Omega)^{-1} U_\Omega^T. \quad (10)$$
In [7], the authors prove that $U_\Omega^T U_\Omega$ is invertible w.h.p. as long as the assumptions of Theorem 2 are satisfied.
Let $U^k \in \mathbb{R}^{d \times r_k}$ be matrices whose orthonormal columns span the $r_k$-dimensional subspaces $S^k$, respectively. A natural
question is under what conditions we can use a decision rule similar to (9) for subspace assignment in the incomplete
data case. To answer this question, we first restate a theorem from [2] showing that the projection residual
concentrates around that of the full data case, up to a scaling factor, if the sample size m satisfies a certain
condition. Let v = x + y, where $x \in S$, $y \in S^\perp$.
Theorem 2. [2]⁴ Let δ > 0 and $m \geq \frac{8}{3} r\, \mu(S) \log\frac{2r}{\delta}$. Then with probability at least 1 − 3δ,
$$\frac{m(1-\alpha) - r\mu(S)\frac{(1+\beta)^2}{1-\gamma}}{d}\,\|v - P_S v\|_2^2 \;\leq\; \|v_\Omega - P_{S_\Omega} v_\Omega\|_2^2 \quad (11)$$
and with probability at least 1 − δ,
$$\|v_\Omega - P_{S_\Omega} v_\Omega\|_2^2 \;\leq\; (1+\alpha)\,\frac{m}{d}\,\|v - P_S v\|_2^2 \quad (12)$$
where $\alpha = \sqrt{\frac{2\mu(y)^2}{m}\log\frac{1}{\delta}}$, $\beta = \sqrt{2\mu(y)\log\frac{1}{\delta}}$ and $\gamma = \sqrt{\frac{8 r \mu(S)}{3m}\log\frac{2r}{\delta}}$.
³The Grassmannian is a compact Riemannian manifold, and its geodesics can be explicitly computed [5].
⁴In this theorem, µ(S) and µ(y) are the coherence parameters of the subspace S and the vector y; here we follow the definitions proposed in [8],
$\mu(S) := \frac{d}{r}\max_j \|P_S e_j\|_2^2$ and $\mu(y) = \frac{d\,\|y\|_\infty^2}{\|y\|_2^2}$.
Next we state the requirement on the angles between the vector and the subspaces under which the incomplete data
vector can be assigned to the correct subspace w.h.p. Define the angle between the vector v and its projection onto the
$r_k$-dimensional subspace $S^k$, k = 0, . . . , K − 1, as $\theta_k = \sin^{-1}\frac{\|v - P_{S^k} v\|_2}{\|v\|_2}$, and let
$$C_k(m) = \frac{m(1-\alpha_k) - r_k\,\mu(S^k)\,\frac{(1+\beta_k)^2}{1-\gamma_k}}{m(1+\alpha_0)},$$
where $\alpha_k$, $\beta_k$ and $\gamma_k$ are defined as in Theorem 2 for the different $r_k$. Notice that $C_k(m) < 1$ and $C_k(m) \to 1$ as
m → ∞. Without loss of generality, we assume $\theta_0 < \theta_k$, ∀k ≠ 0.
Corollary 1. [2] Let $m \geq \frac{8}{3}\max_{i\neq 0} r_i\,\mu(S^i)\log\frac{2 r_i}{\delta}$ for fixed δ > 0. Assume that
$$\sin^2(\theta_0) \leq C_i(m)\,\sin^2(\theta_i), \quad \forall i \neq 0.$$
Then with probability at least 1 − 4(K − 1)δ,
$$\|v_\Omega - P_{S^0_\Omega} v_\Omega\|_2^2 < \|v_\Omega - P_{S^i_\Omega} v_\Omega\|_2^2, \quad \forall i \neq 0.$$
Proof. A detailed proof is given in Appendix C.
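The decision rule of Corollary 1 is straightforward to implement: compute the restricted projection residual from (10) for each candidate subspace and pick the smallest one. Below is a minimal NumPy sketch under that reading; the function names are ours and the orthonormal bases are assumed to be given.

```python
import numpy as np

def residual_sq(v_omega, U_omega):
    """||v_Omega - P_{S_Omega} v_Omega||_2^2 with P_{S_Omega} as in Eq. (10)."""
    # Least-squares weights w = (U_Omega^T U_Omega)^{-1} U_Omega^T v_Omega
    w, *_ = np.linalg.lstsq(U_omega, v_omega, rcond=None)
    return float(np.sum((v_omega - U_omega @ w) ** 2))

def assign_subspace(v, omega, bases):
    """Incomplete-data analogue of rule (9): assign the vector v, observed only
    on the index set omega, to the subspace with the smallest restricted residual."""
    residuals = [residual_sq(v[omega], U[omega]) for U in bases]
    return int(np.argmin(residuals)), residuals
```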
3.2 Subspace Update
GROUSE [6] is a very efficient algorithm for single-subspace estimation, although full theoretical support for this method
is still unavailable⁵. k-GROUSE [2] can be seen as a combination of k-subspaces and GROUSE for multiple-subspace
estimation, and is detailed in Algorithm 1.
Algorithm 1 k-subspaces with the GROUSE
Require: A collection of vectors v_Ω(t), t = 1, . . . , T, and the observed indices Ω(t). An integer number of
subspaces K and dimensions r_k, k = 1, . . . , K. A maximum number of iterations, maxIter. A fixed step size η.
Initialize Subspaces: Zero-fill the vectors and collect them in a matrix V. Initialize the K subspace estimates using
probabilistic farthest insertion.
Calculate Orthonormal Bases U_k, k = 1, . . . , K. Let $Q_{k\Omega} = (U_{k\Omega}^T U_{k\Omega})^{-1} U_{k\Omega}^T$.
for i = 1, . . . , maxIter do
  Select a vector at random, v_Ω.
  for k = 1, . . . , K do
    Calculate the weights of the projection onto the k-th subspace: $w(k) = Q_{k\Omega}\, v_\Omega$.
    Calculate the projection residual to the k-th subspace: $r_\Omega(k) = v_\Omega - U_{k\Omega}\, w(k)$.
  end for
  Select the minimum residual: $k = \arg\min_k \|r_\Omega(k)\|_2^2$. Set w = w(k). Define $p = U_k w$ and $r_{\Omega^c} = 0$, i.e., r is the zero-filled $r_\Omega(k)$.
  Update Subspace:
  $$U_k := U_k + \left((\cos(\sigma\eta) - 1)\,\frac{p}{\|p\|} + \sin(\sigma\eta)\,\frac{r}{\|r\|}\right)\frac{w^T}{\|w\|} \quad (13)$$
  where $\sigma = \|r\|\,\|p\|$.
end for
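The geodesic update (13) is the only non-trivial step of the inner loop; a minimal NumPy sketch of it is given below. The function name and the small-residual guard are our own additions. In exact arithmetic the update keeps U_k orthonormal, but an occasional QR re-orthonormalization is a cheap safeguard against numerical drift.

```python
import numpy as np

def grouse_update(U, v_omega, omega, eta):
    """One incremental geodesic GROUSE step, Eq. (13), for a basis U (d x r)
    given a vector observed on the index set omega."""
    U_omega = U[omega]
    w, *_ = np.linalg.lstsq(U_omega, v_omega, rcond=None)   # projection weights
    p = U @ w                                               # predicted full vector
    r = np.zeros(U.shape[0])
    r[omega] = v_omega - U_omega @ w                        # zero-filled residual
    norm_r, norm_p, norm_w = (np.linalg.norm(r), np.linalg.norm(p),
                              np.linalg.norm(w))
    if norm_r < 1e-12 or norm_w < 1e-12:
        return U                                            # already on the subspace
    sigma = norm_r * norm_p
    step = ((np.cos(sigma * eta) - 1) * p / norm_p
            + np.sin(sigma * eta) * r / norm_r)
    return U + np.outer(step, w / norm_w)
```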
4 Experiment on simulated data
In this section, we explore the performance of k-GROUSE and the EM algorithm on simulated data. Before presenting
the results we first describe two key points of the implementation. Firstly,
we use a k-means++-like method to initialize the subspaces for both k-GROUSE and the EM algorithm. Specifically,
we randomly pick a point $x_0 \in X$, select its q⁶ nearest neighbors and compute the best-fit $S^0$ w.r.t. these
selected q points.
⁵A local convergence result for GROUSE is available in [9].
⁶q is a nonnegative parameter; we usually set q = 2 max_k r_k.
Next, we recursively initialize $S^j$ by first picking a point x at random with probability proportional to
min[dist(x, S^0), . . . , dist(x, S^{j-1})], and then finding the best fit $S^j$ of its q-neighborhood; a sketch of this
initialization is given below. Secondly, the error is calculated by examining the cluster assignments of each set of
vectors that come from the same underlying subspace (which is known). For each set, we take the cluster ID owning
the largest number of vectors as the true one, and count the others as incorrectly clustered.
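Here is a minimal NumPy sketch of that k-means++-like initialization, under our reading of the description above (zero-filled data matrix, distance to the already-chosen subspaces as sampling weights); the helper names and random-number handling are our own choices, not the authors' code.

```python
import numpy as np

def fit_subspace(points, r):
    """Best-fit r-dimensional subspace (orthonormal basis) of the given columns,
    via a truncated SVD."""
    U, _, _ = np.linalg.svd(points, full_matrices=False)
    return U[:, :r]

def init_subspaces(X, K, r, q, rng=np.random.default_rng(0)):
    """k-means++-like seeding: fit S^0 to a random point and its q nearest
    neighbours, then seed each further subspace from a point drawn with
    probability proportional to its distance to the subspaces chosen so far.
    X is the zero-filled d x N data matrix."""
    d, N = X.shape
    bases = []
    idx = rng.integers(N)
    for j in range(K):
        x = X[:, idx]
        nn = np.argsort(np.linalg.norm(X - x[:, None], axis=0))[:q + 1]
        bases.append(fit_subspace(X[:, nn], r))
        if j + 1 == K:
            break
        # distance of every column to the closest already-initialized subspace
        dists = np.min([np.linalg.norm(X - U @ (U.T @ X), axis=0) for U in bases],
                       axis=0)
        idx = rng.choice(N, p=dists / dists.sum())
    return bases
```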
4.1 EM Algorithm
The first experiment applies the EM algorithm to simulated data generated by the Gaussian mixture model of
Section 2. We used d = 100, K = 4 and r = 5 for this part. In the simulation, each of the K subspaces is the span of
an orthonormal basis generated from r i.i.d. standard Gaussian d-dimensional vectors. The performance of the EM
algorithm strongly depends on initialization. With a poor initial estimate of W and µ, the estimated θ gets stuck at
a local maximum of $l(\theta, x^o)$. Fig. 1(a) gives an example of processing the same X_Ω with different initial estimates,
where we observe an obvious difference in performance.
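For reference, a minimal NumPy sketch of this data-generation setup (random orthonormal bases, Gaussian coefficients, additive noise, and a fixed number of observed entries per column) is shown below; the function name and the NaN convention for missing entries are our own choices.

```python
import numpy as np

def make_union_of_subspaces(d=100, K=4, r=5, n_per=200, n_obs=24, sigma2=1e-4,
                            rng=np.random.default_rng(0)):
    """Columns drawn from K random r-dimensional subspaces of R^d with Gaussian
    coefficients and N(0, sigma2) noise; only n_obs entries per column are kept
    (missing entries are set to np.nan). Returns data, masks, labels, bases."""
    bases = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(K)]
    cols, masks, labels = [], [], []
    for k, U in enumerate(bases):
        Xk = (U @ rng.standard_normal((r, n_per))
              + np.sqrt(sigma2) * rng.standard_normal((d, n_per)))
        for x in Xk.T:
            obs = np.zeros(d, dtype=bool)
            obs[rng.choice(d, size=n_obs, replace=False)] = True
            x = x.copy()
            x[~obs] = np.nan
            cols.append(x); masks.append(obs); labels.append(k)
    return np.array(cols).T, np.array(masks).T, np.array(labels), bases
```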
We ran 50 trials for each setting of |ω|, N_k and σ², where |ω| is the number of observed entries in each column, N_k is
the number of columns per subspace and σ² is the noise level. The stopping criterion is
$l(\theta^{(t+1)}, x^o) - l(\theta^{(t)}, x^o) < 10^{-4}\, l(\theta^{(t)}, x^o)$. The results are summarized in Fig. 1(b)-(d). The difference between our results and the
results reported in [1] mainly comes from different initialization strategies. Generally, larger |ω| and N_k and smaller
σ² give fewer misclassified points, which is reasonable. In some cases, all misclassified points come from getting
stuck at a local maximum.
[Figure 1] (a) Log-likelihood vs. iterations for a good and a bad initial estimate. Good initial estimate:
converges after 83 iterations, label correct rate = 100%. Bad initial estimate: converges after 395 iterations,
label correct rate = 45.6% (N_k = 200, |ω| = 24, σ² = 10⁻⁴). (b)-(d) Proportion of misclassified points:
(b) as a function of N_k with |ω| = 24, σ² = 10⁻⁴; (c) as a function of |ω| with σ² = 10⁻⁴, N_k = 300;
(d) as a function of σ² with |ω| = 24, N_k = 210.
4.2 k-GROUSE
In this section, we explore the performance of k-GROUSE in two simulation scenarios. We set the ambient
dimension d = 40, and the data matrix X consists of 80 vectors per subspace. In the first case, we let the number of
subspaces be K1 = 4 with r = 5 for each subspace, i.e., the sum of the dimensions of the subspaces $R = \sum_{k=1}^{K} r_k < d$.
In the second case, we set the number of subspaces to K2 = 5 and make one of them dependent on the other four.
From Fig. 2(a) we can see that k-GROUSE performs well as long as the number of observations is a little more than
$r \log^2 d$, which is within the order predicted by the theory. However, the performance of k-GROUSE also depends
on the initialization. Although the k-means++ initialization can significantly improve the convergence of k-GROUSE, it
can still converge to a poor local point in the worst case.
[Figure 2] Probability of error vs. sampling density: (a) performance of k-GROUSE with independent (K1 = 4) and
dependent (K2 = 5) subspaces, d = 40, subspace dimension 5, obtained over 50 trials; (b) k-GROUSE vs. the EM
algorithm, d = 100, subspace dimension 5, K = 4, obtained over 50 trials.
4.3 k-GROUSE vs EM Algorithm
Now we compare the performance of k-GROUSE and the EM algorithm on simulated data. In this setting, we set
K = 4, r = 5, d = 100. From Fig. 2(b) we can see that both k-GROUSE and the EM algorithm perform well in
this situation, while k-GROUSE is more efficient. However, as we will see in the next section, the performance gap
between the two becomes obvious when we apply them to worse-conditioned real data: when the condition number of
the data matrix X is large, k-GROUSE can perform poorly while the EM algorithm still works under the same
conditions. In traditional matrix completion, the required sampling density can depend heavily on the condition number
of the matrix, and this dependence seems to be inherited here by k-GROUSE.
5 Experiment on real data
5.1 EM algorithm
We test the algorithm on real data by applying it to image compression. An image of 264×536 pixels is used, which is the
same as the one used for mixtures of probabilistic PCA [4]. As in [4], the image was segmented into 8×8 non-
overlapping blocks, giving a dataset of 2211 64-dimensional vectors. In the experiment, we set the number of
subspaces to 3, and test the performance of the algorithm under different missing rates (number of missing pixels in
each image block) and compression ratios. To make sure the resulting coded image has the same bit rate as a conventional
SVD-coded image, after the EM algorithm converges the image coding is performed in a hard fashion, i.e. the
coding coefficients of each image block are estimated using only the subspace it has the largest probability of belonging to.
An initialization scheme slightly different from the one in the simulated experiments is used: after setting all missing
data to zero, a simple k-means clustering is performed on the data set. The parameters µ_k, W_k and σ_k of each subspace
are initialized from the k-th data cluster $\{x_i\} \subset \mathbb{R}^d$ using probabilistic PCA [4], where µ_k is initialized by the sample
mean. Denoting the singular value decomposition of the sample covariance matrix as $S_k = U\Lambda V^T$, $W_k \in \mathbb{R}^{d\times r}$ is initialized as
$$W_k = U_r(\Lambda_r - \sigma^2 I)^{1/2} \quad (14)$$
where σ² is the initial value of $\sigma_k^2 = \frac{1}{d-r}\sum_{j=r+1}^{d}\lambda_j$, $U_r \in \mathbb{R}^{d\times r}$ consists of the leading r columns of U (the leading
left singular vectors), and $\Lambda_r$ is the corresponding r × r singular value matrix.
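A minimal NumPy sketch of this per-cluster probabilistic-PCA initialization, following (14) and the residual-variance formula above, is shown below; the function name is our own, and the cluster is assumed to be a d × n matrix of zero-filled block vectors.

```python
import numpy as np

def ppca_init(cluster, r):
    """Initialize (mu_k, W_k, sigma_k^2) from one zero-filled k-means cluster
    (columns of `cluster` are data vectors), following Eq. (14)."""
    d = cluster.shape[0]
    mu = cluster.mean(axis=1)
    centered = cluster - mu[:, None]
    S = centered @ centered.T / cluster.shape[1]       # sample covariance
    U, lam, _ = np.linalg.svd(S)                       # S = U diag(lam) U^T
    sigma2 = lam[r:].sum() / (d - r)                   # average discarded variance
    W = U[:, :r] @ np.diag(np.sqrt(lam[:r] - sigma2))  # W_k = U_r (Lambda_r - sigma^2 I)^{1/2}
    return mu, W, sigma2
```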
Sample images reconstructed under different missing rates (equivalently, different numbers of observed pixels per block ω)
and compression ratios (number of components r in each subspace) are shown in Fig. 3. The mean squared error (MSE)
between the compressed image and the original image is plotted in Fig. 5. As the compression ratio decreases, the quality
of the image improves significantly, and as we can see from Fig. 3 the mean squared error also decreases. For different
missing rates, the images appear visually the same, and no significant difference is observed between the missing data case
and the complete data case (though the mean squared error decreases slightly). This supports the theory in [1] that as long as
certain conditions are met and a reasonable number of data points are observed, with high probability we can recover
the true subspaces from an incomplete dataset.
[Figure 3] Image compression using the EM algorithm: (a) ω = 14, r = 4; (b) ω = 64, r = 4; (c) ω = 14, r = 12;
(d) ω = 64, r = 12; (e) ω = 14, r = 32; (f) ω = 64, r = 32.
5.2 k-GROUSE algorithm
We also test the k-GROUSE algorithm on image compression. We use the same block transformation scheme and
number of subspaces as for the EM algorithm. Sample compressed images are shown in Fig. 4.
6 Conclusion
In this project, we studied different subspace learning techniques, especially in the missing data case, as well as the
theoretical requirements for recovering subspaces when data is incomplete. We also practiced the implementation
of the EM algorithm and k-GROUSE.
7 Contribution
Dejiao Zhang was in charge of the abstract, Sections 1, 2.1, 3, 4.2, 4.3 and Appendix C. Jiabei Zheng rederived the EM
algorithm for the Gaussian mixture model (Appendix A), did the coding and made the plots in Section 4.1. Lianli Liu
rederived the conditional distribution (Appendix B), did the experiments on the real dataset (Section 5) and debugged
the EM algorithm.
[Figure 4] Image compression using k-GROUSE: (a) missing rate 30%, number of modes 32; (b) missing rate 30%,
number of modes 50.
[Figure 5] Plot of mean squared error: (a) MSE under different missing rates, mode = 12; (b) MSE under different
compression ratios, missing rate = 78%.
References
[1] D. Pimentel and R. Nowak, “On the sample complexity of subspace clustering with missing data.”
[2] L. Balzano, A. Szlam, B. Recht, and R. Nowak, “k-subspaces with missing data,” in Statistical Signal Processing
Workshop (SSP), 2012 IEEE. IEEE, 2012, pp. 612–615.
[3] B. Eriksson, L. Balzano, and R. Nowak, “High-rank matrix completion and subspace clustering with missing
data,” arXiv preprint arXiv:1112.5629, 2011.
[4] M. Tipping and C. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural computation,
vol. 11, no. 2, pp. 443–482, 1999.
[5] A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality constraints,” SIAM
journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
[6] L. Balzano, R. Nowak, and B. Recht, “Online identification and tracking of subspaces from highly incomplete
information,” in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on.
IEEE, 2010, pp. 704–711.
[7] L. Balzano, B. Recht, and R. Nowak, “High-dimensional matched subspace detection when data are missing,” in
Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1638–1642.
[8] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational
Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[9] L. Balzano and S. J. Wright, “Local convergence of an algorithm for subspace identification from partial data,”
arXiv preprint arXiv:1306.3391, 2013.
Appendix A
Here is the complete derivation of the EM algorithm. The complete data probability is
$$P(x, y, z; \theta) = \prod_{i=1}^{N} P(x_i, y_i, z_i) = \prod_{i=1}^{N} P(x_i \mid y_i, z_i)\,P(y_i \mid z_i)\,P(z_i) = \prod_{i=1}^{N} P(x_i \mid y_i, z_i)\,P(y_i)\,P(z_i) = \prod_{i=1}^{N} \rho_{z_i}\,\phi(x_i; \mu_{z_i} + W_{z_i} y_i, \sigma_{z_i}^2 I)\,\phi(y_i; 0, I) \quad (15)$$
The complete data log-likelihood is
$$l(\theta; x, y, z) = \sum_{i=1}^{N} \log\big(\rho_{z_i}\,\phi(x_i; \mu_{z_i} + W_{z_i} y_i, \sigma_{z_i}^2 I)\,\phi(y_i; 0, I)\big) = \sum_{i=1}^{N}\sum_{k=1}^{K} \Delta_{ik}\,\log\big(\rho_k\,\phi(x_i; \mu_k + W_k y_i, \sigma_k^2 I)\,\phi(y_i; 0, I)\big) \quad (16)$$
where
$$\Delta_{ik} = \begin{cases} 1 & \text{if } z_i = k\\ 0 & \text{else.}\end{cases} \quad (17)$$
The E-step is
$$\begin{aligned}
Q(\theta; \hat{\theta}) &= E[\,l(\theta; x, y, z) \mid x^o, \hat{\theta}\,]\\
&= \sum_{k=1}^{K}\sum_{i=1}^{N} E[\,\Delta_{ik}\log(\rho_k\phi(x_i;\mu_k + W_k y_i,\sigma_k^2 I)\phi(y_i;0,I)) \mid x^o, \hat{\theta}\,]\\
&= \sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{l=1}^{K} E[\,\Delta_{ik}\log(\rho_k\phi(x_i;\mu_k + W_k y_i,\sigma_k^2 I)\phi(y_i;0,I)) \mid x^o, z_i = l, \hat{\theta}\,]\,P(z_i = l \mid x^o, \hat{\theta})\\
&= \sum_{k=1}^{K}\sum_{i=1}^{N} E[\,\log(\rho_k\phi(x_i;\mu_k + W_k y_i,\sigma_k^2 I)\phi(y_i;0,I)) \mid x^o, z_i = k, \hat{\theta}\,]\,P(z_i = k \mid x^o, \hat{\theta})\\
&= \sum_{k=1}^{K}\sum_{i=1}^{N} p_{ik}\Big\{\log\rho_k - \frac{d+r}{2}\log(2\pi) - \frac{d}{2}\log\sigma_k^2 - \frac{1}{2\sigma_k^2}\, E_k[(x_i - \mu_k - W_k y_i)^T(x_i - \mu_k - W_k y_i)] - \frac{1}{2}\, E_k[y_i^T y_i]\Big\}\\
&= \sum_{k=1}^{K}\Big(\sum_{i=1}^{N} p_{ik}\Big)\Big\{\log\rho_k - \frac{d+r}{2}\log(2\pi) - \frac{d}{2}\log\sigma_k^2 - \frac{1}{2\sigma_k^2}\, U_k - \frac{1}{2}\, E_k\langle y^T y\rangle\Big\} \quad (18)
\end{aligned}$$
where
$$p_{ik} = P(z_i = k \mid x_i^o, \hat{\theta}) \quad (19)$$
$$E_k[\,\cdot\,] = E[\,\cdot \mid x_i^o, z_i = k, \hat{\theta}\,] \quad (20)$$
$$E_k\langle u\rangle = \frac{\sum_{i=1}^{N} p_{ik}\, E_k[u_i]}{\sum_{i=1}^{N} p_{ik}} \quad (21)$$
$$E_k\langle u^T v\rangle = \frac{\sum_{i=1}^{N} p_{ik}\, E_k[u_i^T v_i]}{\sum_{i=1}^{N} p_{ik}} \quad (22)$$
$$U_k = E_k\langle x^T x\rangle + \mu_k^T\mu_k + \mathrm{tr}\big(E_k\langle y y^T\rangle W_k^T W_k\big) - 2\mu_k^T E_k\langle x\rangle + 2\mu_k^T W_k E_k\langle y\rangle - 2\,\mathrm{tr}\big(E_k\langle y x^T\rangle W_k\big) \quad (23)$$
The M-step is
$$\tilde{\theta} = \arg\max_\theta\, Q(\theta, \hat{\theta}) \quad (24)$$
First we derive ρ, which must satisfy $\sum_{k=1}^{K}\rho_k = 1$. Using a Lagrange multiplier λ:
$$\frac{\partial\big(Q + \lambda\sum_{k=1}^{K}\rho_k\big)}{\partial \rho_k} = 0 \quad (25)$$
$$\sum_{i=1}^{N} p_{ik} + \lambda\rho_k = 0 \quad (26)$$
Summing (26) over k and using $\sum_k \rho_k = 1$ and $\sum_k p_{ik} = 1$,
$$\sum_{k=1}^{K}\sum_{i=1}^{N} p_{ik} + \lambda = N + \lambda = 0 \quad (27)$$
$$\lambda = -N \quad (28)$$
Substituting this back into (26), we have
$$\rho_k = \frac{1}{N}\sum_{i=1}^{N} p_{ik} \quad (29)$$
Next we solve for σ² with W held fixed:
$$\frac{\partial Q}{\partial \sigma_k^2} = 0 \;\Rightarrow\; \sigma_k^2 = \frac{U_k}{d} \quad (30)$$
$$\sigma_k^2 = \frac{1}{d}\Big(E_k\langle x^T x\rangle + \mu_k^T\mu_k + \mathrm{tr}\big(E_k\langle y y^T\rangle W_k^T W_k\big) - 2\mu_k^T E_k\langle x\rangle + 2\mu_k^T W_k E_k\langle y\rangle - 2\,\mathrm{tr}\big(E_k\langle y x^T\rangle W_k\big)\Big) \quad (31)$$
The elimination of µ is similar:
$$\frac{\partial Q}{\partial \mu_k} = 0 \;\Rightarrow\; \frac{\partial U_k}{\partial \mu_k} = 0 \quad (32)$$
$$\mu_k = E_k\langle x\rangle - W_k E_k\langle y\rangle \quad (33)$$
Then we derive the expression for W. From Eq. (18) we have
$$\frac{\partial Q}{\partial W_k} = -\frac{d}{2\sigma_k^2}\frac{\partial \sigma_k^2}{\partial W_k} - \frac{1}{2\sigma_k^2}\frac{\partial U_k}{\partial W_k} + \frac{U_k}{2\sigma_k^4}\frac{\partial \sigma_k^2}{\partial W_k} = -\frac{1}{2\sigma_k^2}\frac{\partial U_k}{\partial W_k} = 0 \quad (34)$$
As a result,
$$\frac{\partial Q}{\partial W_k} = 0 \;\Rightarrow\; \frac{\partial U_k}{\partial W_k} = 0 \quad (35)$$
Plugging the expression for µ_k into U_k, we get
$$\frac{\partial}{\partial W_k}\Big\{E_k\langle x^T x\rangle + \big(E_k\langle x\rangle - W_k E_k\langle y\rangle\big)^T\big(E_k\langle x\rangle - W_k E_k\langle y\rangle\big) + \mathrm{tr}\big(E_k\langle y y^T\rangle W_k^T W_k\big) - 2\big(E_k\langle x\rangle - W_k E_k\langle y\rangle\big)^T E_k\langle x\rangle + 2\big(E_k\langle x\rangle - W_k E_k\langle y\rangle\big)^T W_k E_k\langle y\rangle - 2\,\mathrm{tr}\big(E_k\langle y x^T\rangle W_k\big)\Big\} = 0 \quad (36)$$
$$W_k = \big(E_k\langle x y^T\rangle - E_k\langle x\rangle E_k\langle y^T\rangle\big)\big(E_k\langle y y^T\rangle - E_k\langle y\rangle E_k\langle y^T\rangle\big)^{-1} \quad (37)$$
Finally, $p_{ik}$ is calculated with Bayes' rule:
$$p_{ik} = P(z_i = k \mid x_i^o, \hat{\theta}) = \frac{\hat{\rho}_k\, P(x_i^o \mid z_i = k; \hat{\theta})}{\sum_{j=1}^{K}\hat{\rho}_j\, P(x_i^o \mid z_i = j; \hat{\theta})} = \frac{\hat{\rho}_k\,\phi\big(x_i^o;\, \mu_k^{o_i},\, W_k^{o_i} W_k^{o_i T} + \sigma_k^2 I\big)}{\sum_{j=1}^{K}\hat{\rho}_j\,\phi\big(x_i^o;\, \mu_j^{o_i},\, W_j^{o_i} W_j^{o_i T} + \sigma_j^2 I\big)} \quad (38)$$
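Equation (38) only ever evaluates Gaussians on the observed coordinates, which keeps the E-step cheap when most entries are missing. A small NumPy/SciPy sketch for one column is given below; the parameter packaging and function name are our own, and a max-subtraction is added for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, obs, params):
    """p_ik of Eq. (38) for one column: the posterior probability of each
    subspace given only the observed entries x[obs].
    `params` is a list of (rho_k, mu_k, W_k, sigma2_k) tuples."""
    log_terms = []
    for rho, mu, W, sigma2 in params:
        Wo, mo = W[obs], mu[obs]
        cov = Wo @ Wo.T + sigma2 * np.eye(obs.sum())   # covariance of x^o given z=k
        log_terms.append(np.log(rho) + multivariate_normal.logpdf(x[obs], mo, cov))
    log_terms = np.array(log_terms)
    log_terms -= log_terms.max()                       # stabilise before exponentiating
    p = np.exp(log_terms)
    return p / p.sum()
```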
Appendix B
By the model definition,
$$\begin{bmatrix} x_i^{o}\\ x_i^{m}\\ y_i \end{bmatrix}\bigg|_{z_i = k,\,\theta} = \begin{bmatrix} W_k^{o_i}\\ W_k^{m_i}\\ I \end{bmatrix} y_i + \begin{bmatrix} \mu_k^{o_i}\\ \mu_k^{m_i}\\ 0 \end{bmatrix} + \begin{bmatrix} \eta^{o_i}\\ \eta^{m_i}\\ 0 \end{bmatrix} \quad (39)$$
Eq. (39) is a linear transform of $y_i \sim N(0, I)$. By the properties of the multivariate Gaussian distribution,
$$\begin{bmatrix} x_i^{o}\\ x_i^{m}\\ y_i \end{bmatrix}\bigg|_{z_i = k,\,\theta} \sim N\left(\begin{bmatrix} \mu_k^{o_i}\\ \mu_k^{m_i}\\ 0 \end{bmatrix},\; \begin{bmatrix} W_k^{o_i} W_k^{o_i T} + \sigma^2 I & W_k^{o_i} W_k^{m_i T} & W_k^{o_i}\\ W_k^{m_i} W_k^{o_i T} & W_k^{m_i} W_k^{m_i T} + \sigma^2 I & W_k^{m_i}\\ W_k^{o_i T} & W_k^{m_i T} & I \end{bmatrix}\right) \quad (40)$$
By the Bayesian Gauss-Markov theorem,
$$\begin{bmatrix} x_i^{m}\\ y_i \end{bmatrix}\bigg|_{x_i^o,\, z_i = k,\,\theta} \sim N\big(\mu_{my_i} + R_{my o_i} R_{o_i}^{-1}(x_i^o - \mu_k^{o_i}),\; R_{my_i} - R_{my o_i} R_{o_i}^{-1} R_{o my_i}\big) \quad (41)$$
where
$$\mu_{my_i} = \begin{bmatrix} \mu_k^{m_i}\\ 0 \end{bmatrix}, \quad R_{my o_i} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T}\\ W_k^{o_i T} \end{bmatrix}, \quad R_{my_i} = \begin{bmatrix} W_k^{m_i} W_k^{m_i T} + \sigma^2 I & W_k^{m_i}\\ W_k^{m_i T} & I \end{bmatrix} \quad (42)$$
$$R_{o_i} = W_k^{o_i} W_k^{o_i T} + \sigma^2 I, \qquad R_{o my_i} = \big[\,W_k^{o_i} W_k^{m_i T} \;\; W_k^{o_i}\,\big] \quad (43)$$
Plugging Eq. (42) and Eq. (43) into Eq. (41),
$$R_{my o_i} R_{o_i}^{-1} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T}\\ W_k^{o_i T} \end{bmatrix}\big(W_k^{o_i} W_k^{o_i T} + \sigma^2 I\big)^{-1} \quad (44)$$
By the matrix identity
$$(I + PQ)^{-1} P = P\,(I + QP)^{-1} \quad (45)$$
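Identity (45) is the push-through identity and holds for any conformable P and Q; the quick numerical check below (with arbitrary random matrices of our choosing) is just a sanity test of this algebraic step.

```python
import numpy as np

# Sanity check of the push-through identity (45): (I + PQ)^{-1} P = P (I + QP)^{-1}.
rng = np.random.default_rng(0)
P = rng.standard_normal((6, 3))
Q = rng.standard_normal((3, 6))
lhs = np.linalg.solve(np.eye(6) + P @ Q, P)
rhs = P @ np.linalg.inv(np.eye(3) + Q @ P)
assert np.allclose(lhs, rhs)
```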
Applying (45),
$$W_k^{o_i T}\big(W_k^{o_i} W_k^{o_i T} + \sigma^2 I\big)^{-1} = \big(W_k^{o_i T} W_k^{o_i} + \sigma^2 I\big)^{-1} W_k^{o_i T} = M_k^{-1} W_k^{o_i T} \quad (46)$$
Putting Eq. (46) and Eq. (44) together,
$$R_{my o_i} R_{o_i}^{-1} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T}\\ W_k^{o_i T} \end{bmatrix}\big[W_k^{o_i} W_k^{o_i T} + \sigma^2 I\big]^{-1} = \begin{bmatrix} W_k^{m_i} M_k^{-1} W_k^{o_i T}\\ M_k^{-1} W_k^{o_i T} \end{bmatrix} \quad (47)$$
Therefore
$$\mu_{my_i} + R_{my o_i} R_{o_i}^{-1}(x_i^o - \mu_k^{o_i}) = \begin{bmatrix} \mu_k^{m_i} + W_k^{m_i} M_k^{-1} W_k^{o_i T}(x_i^o - \mu_k^{o_i})\\ M_k^{-1} W_k^{o_i T}(x_i^o - \mu_k^{o_i}) \end{bmatrix} \quad (48)$$
which is exactly the formula for the mean given in (8).
To derive the formula for the variance, by Eq. (47)
$$R_{my o_i} R_{o_i}^{-1} R_{o my_i} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T}\\ W_k^{o_i T} \end{bmatrix}\big(W_k^{o_i} W_k^{o_i T} + \sigma^2 I\big)^{-1}\big[\,W_k^{o_i} W_k^{m_i T} \;\; W_k^{o_i}\,\big] = \begin{bmatrix} W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T} & W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i}\\ M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T} & M_k^{-1} W_k^{o_i T} W_k^{o_i} \end{bmatrix} \quad (49)$$
Thus
$$R_{my_i} - R_{my o_i} R_{o_i}^{-1} R_{o my_i} = \begin{bmatrix} \sigma^2 I + W_k^{m_i}\big(I - M_k^{-1} W_k^{o_i T} W_k^{o_i}\big) W_k^{m_i T} & W_k^{m_i}\big(I - M_k^{-1} W_k^{o_i T} W_k^{o_i}\big)\\ \big(I - M_k^{-1} W_k^{o_i T} W_k^{o_i}\big) W_k^{m_i T} & I - M_k^{-1} W_k^{o_i T} W_k^{o_i} \end{bmatrix} \quad (50)$$
Notice that
$$M_k - W_k^{o_i T} W_k^{o_i} = \sigma^2 I \;\Rightarrow\; I - M_k^{-1} W_k^{o_i T} W_k^{o_i} = \sigma^2 M_k^{-1} \quad (51)$$
Thus Eq. (50) simplifies to
$$R_{my_i} - R_{my o_i} R_{o_i}^{-1} R_{o my_i} = \sigma^2\begin{bmatrix} I + W_k^{m_i} M_k^{-1} W_k^{m_i T} & W_k^{m_i} M_k^{-1}\\ M_k^{-1} W_k^{m_i T} & M_k^{-1} \end{bmatrix} \quad (52)$$
which is exactly the formula for the variance given in (8).
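As a small self-contained verification of (48) and (52), the NumPy check below builds the joint Gaussian of (x^m, y, x^o) from Eq. (40), conditions on x^o with the generic Gaussian-conditioning formula, and compares against the closed forms; the dimensions and random generator are arbitrary choices of ours.

```python
import numpy as np

# Check Eq. (48) (conditional mean gain) and Eq. (52) (conditional covariance).
rng = np.random.default_rng(1)
no, nm, r, sigma2 = 4, 3, 2, 0.1
Wo, Wm = rng.standard_normal((no, r)), rng.standard_normal((nm, r))
M = sigma2 * np.eye(r) + Wo.T @ Wo
Minv = np.linalg.inv(M)

# Joint covariance blocks of (x^m, y) and x^o (means removed), per Eq. (40).
C_my = np.block([[Wm @ Wm.T + sigma2 * np.eye(nm), Wm],
                 [Wm.T,                            np.eye(r)]])
C_myo = np.vstack([Wm @ Wo.T, Wo.T])
C_o = Wo @ Wo.T + sigma2 * np.eye(no)

gain_generic = C_myo @ np.linalg.inv(C_o)            # R_{myo} R_o^{-1}
gain_closed = np.vstack([Wm @ Minv @ Wo.T,           # Eq. (47)
                         Minv @ Wo.T])
assert np.allclose(gain_generic, gain_closed)

cov_generic = C_my - C_myo @ np.linalg.inv(C_o) @ C_myo.T
cov_closed = sigma2 * np.block([[np.eye(nm) + Wm @ Minv @ Wm.T, Wm @ Minv],
                                [Minv @ Wm.T,                   Minv]])  # Eq. (52)
assert np.allclose(cov_generic, cov_closed)
```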
Appendix C
Before giving the proof of Corollary 1, we examine the case of binary subspace assignment, i.e., K = 2. Let
$v \in \mathbb{R}^d$ and let $S^0, S^1 \subset \mathbb{R}^d$ have dimensions $r_0$, $r_1$ respectively. Define the angles between v and its projections onto
the subspaces $S^0$, $S^1$ as $\theta_0$, $\theta_1$:
$$\theta_0 = \sin^{-1}\frac{\|v - P_{S^0} v\|_2}{\|v\|_2} \quad\text{and}\quad \theta_1 = \sin^{-1}\frac{\|v - P_{S^1} v\|_2}{\|v\|_2} \quad (53)$$
Theorem 3. [2] Let δ > 0 and $m \geq \frac{8}{3} r_1\,\mu(S^1)\log\frac{2 r_1}{\delta}$. Assume that
$$\sin^2(\theta_0) < C(m)\,\sin^2(\theta_1).$$
Then with probability at least 1 − 4δ,
$$\|v_\Omega - P_{S^0_\Omega} v_\Omega\|_2^2 < \|v_\Omega - P_{S^1_\Omega} v_\Omega\|_2^2.$$
Proof. By Theorem 2 and the union bound, the following two statements hold simultaneously with probability at least
1 − 4δ:
$$\|v_\Omega - P_{S^0_\Omega} v_\Omega\|_2^2 \leq (1+\alpha_0)\,\frac{m}{d}\,\|v - P_{S^0} v\|_2^2 \quad\text{and}\quad \frac{m(1-\alpha_1) - r_1\,\mu(S^1)\frac{(1+\beta_1)^2}{1-\gamma_1}}{d}\,\|v - P_{S^1} v\|_2^2 \leq \|v_\Omega - P_{S^1_\Omega} v_\Omega\|_2^2 \quad (54)$$
By (54), the desired inequality $\|v_\Omega - P_{S^0_\Omega} v_\Omega\|_2^2 < \|v_\Omega - P_{S^1_\Omega} v_\Omega\|_2^2$ holds whenever
$$(1+\alpha_0)\,\frac{m}{d}\,\|v - P_{S^0} v\|_2^2 \;<\; \frac{m(1-\alpha_1) - r_1\,\mu(S^1)\frac{(1+\beta_1)^2}{1-\gamma_1}}{d}\,\|v - P_{S^1} v\|_2^2,$$
i.e., whenever $\|v - P_{S^0} v\|_2^2 < C(m)\,\|v - P_{S^1} v\|_2^2$ with $C(m) = \frac{m(1-\alpha_1) - r_1\,\mu(S^1)\frac{(1+\beta_1)^2}{1-\gamma_1}}{m(1+\alpha_0)}$, which by (53) holds if and only if
$$\sin^2(\theta_0) < C(m)\,\sin^2(\theta_1),$$
which completes the proof.
Proof of Corollary 1. The result of Theorem 3 generalizes to the situation with multiple subspaces $S^i$, i = 0, . . . , K − 1.
Again, without loss of generality, assume that $\theta_0 < \theta_i$, ∀i, and define
$$C_i(m) = \frac{m(1-\alpha_i) - r_i\,\mu(S^i)\frac{(1+\beta_i)^2}{1-\gamma_i}}{m(1+\alpha_0)}.$$
Then the conclusion of Corollary 1 is obtained by applying the union bound and following an argument similar to that in the proof of Theorem 3.
13

Weitere ähnliche Inhalte

Was ist angesagt?

2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach
nozomuhamada
 
Solving the energy problem of helium final report
Solving the energy problem of helium final reportSolving the energy problem of helium final report
Solving the energy problem of helium final report
JamesMa54
 
New data structures and algorithms for \\post-processing large data sets and ...
New data structures and algorithms for \\post-processing large data sets and ...New data structures and algorithms for \\post-processing large data sets and ...
New data structures and algorithms for \\post-processing large data sets and ...
Alexander Litvinenko
 
2012 mdsp pr13 support vector machine
2012 mdsp pr13 support vector machine2012 mdsp pr13 support vector machine
2012 mdsp pr13 support vector machine
nozomuhamada
 
Fast and efficient exact synthesis of single qubit unitaries generated by cli...
Fast and efficient exact synthesis of single qubit unitaries generated by cli...Fast and efficient exact synthesis of single qubit unitaries generated by cli...
Fast and efficient exact synthesis of single qubit unitaries generated by cli...
JamesMa54
 
2012 mdsp pr04 monte carlo
2012 mdsp pr04 monte carlo2012 mdsp pr04 monte carlo
2012 mdsp pr04 monte carlo
nozomuhamada
 

Was ist angesagt? (20)

Inner product spaces
Inner product spacesInner product spaces
Inner product spaces
 
Integration
IntegrationIntegration
Integration
 
Fixed point theorem in fuzzy metric space with e.a property
Fixed point theorem in fuzzy metric space with e.a propertyFixed point theorem in fuzzy metric space with e.a property
Fixed point theorem in fuzzy metric space with e.a property
 
2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach
 
Basic terminology description in convex optimization
Basic terminology description in convex optimizationBasic terminology description in convex optimization
Basic terminology description in convex optimization
 
Solving the energy problem of helium final report
Solving the energy problem of helium final reportSolving the energy problem of helium final report
Solving the energy problem of helium final report
 
New data structures and algorithms for \\post-processing large data sets and ...
New data structures and algorithms for \\post-processing large data sets and ...New data structures and algorithms for \\post-processing large data sets and ...
New data structures and algorithms for \\post-processing large data sets and ...
 
2012 mdsp pr13 support vector machine
2012 mdsp pr13 support vector machine2012 mdsp pr13 support vector machine
2012 mdsp pr13 support vector machine
 
Conformable Chebyshev differential equation of first kind
Conformable Chebyshev differential equation of first kindConformable Chebyshev differential equation of first kind
Conformable Chebyshev differential equation of first kind
 
Ee693 questionshomework
Ee693 questionshomeworkEe693 questionshomework
Ee693 questionshomework
 
Fast and efficient exact synthesis of single qubit unitaries generated by cli...
Fast and efficient exact synthesis of single qubit unitaries generated by cli...Fast and efficient exact synthesis of single qubit unitaries generated by cli...
Fast and efficient exact synthesis of single qubit unitaries generated by cli...
 
2012 mdsp pr04 monte carlo
2012 mdsp pr04 monte carlo2012 mdsp pr04 monte carlo
2012 mdsp pr04 monte carlo
 
Third-kind Chebyshev Polynomials Vr(x) in Collocation Methods of Solving Boun...
Third-kind Chebyshev Polynomials Vr(x) in Collocation Methods of Solving Boun...Third-kind Chebyshev Polynomials Vr(x) in Collocation Methods of Solving Boun...
Third-kind Chebyshev Polynomials Vr(x) in Collocation Methods of Solving Boun...
 
Solovay Kitaev theorem
Solovay Kitaev theoremSolovay Kitaev theorem
Solovay Kitaev theorem
 
A Common Fixed Point Theorem on Fuzzy Metric Space Using Weakly Compatible an...
A Common Fixed Point Theorem on Fuzzy Metric Space Using Weakly Compatible an...A Common Fixed Point Theorem on Fuzzy Metric Space Using Weakly Compatible an...
A Common Fixed Point Theorem on Fuzzy Metric Space Using Weakly Compatible an...
 
Liner algebra-vector space-1 introduction to vector space and subspace
Liner algebra-vector space-1   introduction to vector space and subspace Liner algebra-vector space-1   introduction to vector space and subspace
Liner algebra-vector space-1 introduction to vector space and subspace
 
Answers withexplanations
Answers withexplanationsAnswers withexplanations
Answers withexplanations
 
Solution 3.
Solution 3.Solution 3.
Solution 3.
 
NUMERICAL METHODS -Iterative methods(indirect method)
NUMERICAL METHODS -Iterative methods(indirect method)NUMERICAL METHODS -Iterative methods(indirect method)
NUMERICAL METHODS -Iterative methods(indirect method)
 
Common fixed point theorems with continuously subcompatible mappings in fuzz...
 Common fixed point theorems with continuously subcompatible mappings in fuzz... Common fixed point theorems with continuously subcompatible mappings in fuzz...
Common fixed point theorems with continuously subcompatible mappings in fuzz...
 

Andere mochten auch (9)

Digital Contents Report 2012/06/18
Digital Contents Report 2012/06/18Digital Contents Report 2012/06/18
Digital Contents Report 2012/06/18
 
Sariin medee 12 sar
Sariin medee 12 sarSariin medee 12 sar
Sariin medee 12 sar
 
Of. s-pd j-2016-26697-sea-educação
Of. s-pd j-2016-26697-sea-educaçãoOf. s-pd j-2016-26697-sea-educação
Of. s-pd j-2016-26697-sea-educação
 
PKM-Project Start Up Bussiness
PKM-Project Start Up Bussiness PKM-Project Start Up Bussiness
PKM-Project Start Up Bussiness
 
CFD Analysis of Gerotor Lubricating Pumps at High Speed: Geometric Features I...
CFD Analysis of Gerotor Lubricating Pumps at High Speed: Geometric Features I...CFD Analysis of Gerotor Lubricating Pumps at High Speed: Geometric Features I...
CFD Analysis of Gerotor Lubricating Pumps at High Speed: Geometric Features I...
 
Forense Digital - Conceitos e Técnicas
Forense Digital - Conceitos e TécnicasForense Digital - Conceitos e Técnicas
Forense Digital - Conceitos e Técnicas
 
Final Report
Final ReportFinal Report
Final Report
 
Pemanfaatan arang sekam sebagai media tanaman sistem vertiminaponik
Pemanfaatan arang sekam sebagai media tanaman  sistem vertiminaponikPemanfaatan arang sekam sebagai media tanaman  sistem vertiminaponik
Pemanfaatan arang sekam sebagai media tanaman sistem vertiminaponik
 
In-Service 12-6-16
In-Service 12-6-16In-Service 12-6-16
In-Service 12-6-16
 

Ähnlich wie machinelearning project

Electromagnetic Scattering from Objects with Thin Coatings.2016.05.04.02
Electromagnetic Scattering from Objects with Thin Coatings.2016.05.04.02Electromagnetic Scattering from Objects with Thin Coatings.2016.05.04.02
Electromagnetic Scattering from Objects with Thin Coatings.2016.05.04.02
Luke Underwood
 
20070823
2007082320070823
20070823
neostar
 

Ähnlich wie machinelearning project (20)

Cluster
ClusterCluster
Cluster
 
Estimating structured vector autoregressive models
Estimating structured vector autoregressive modelsEstimating structured vector autoregressive models
Estimating structured vector autoregressive models
 
Multivariate Methods Assignment Help
Multivariate Methods Assignment HelpMultivariate Methods Assignment Help
Multivariate Methods Assignment Help
 
Double & triple integral unit 5 paper 1 , B.Sc. 2 Mathematics
Double & triple integral unit 5 paper 1 , B.Sc. 2 MathematicsDouble & triple integral unit 5 paper 1 , B.Sc. 2 Mathematics
Double & triple integral unit 5 paper 1 , B.Sc. 2 Mathematics
 
Rosser's theorem
Rosser's theoremRosser's theorem
Rosser's theorem
 
Linear models for classification
Linear models for classificationLinear models for classification
Linear models for classification
 
BNL_Research_Report
BNL_Research_ReportBNL_Research_Report
BNL_Research_Report
 
Electromagnetic Scattering from Objects with Thin Coatings.2016.05.04.02
Electromagnetic Scattering from Objects with Thin Coatings.2016.05.04.02Electromagnetic Scattering from Objects with Thin Coatings.2016.05.04.02
Electromagnetic Scattering from Objects with Thin Coatings.2016.05.04.02
 
Relative superior mandelbrot and julia sets for integer and non integer values
Relative superior mandelbrot and julia sets for integer and non integer valuesRelative superior mandelbrot and julia sets for integer and non integer values
Relative superior mandelbrot and julia sets for integer and non integer values
 
Relative superior mandelbrot sets and relative
Relative superior mandelbrot sets and relativeRelative superior mandelbrot sets and relative
Relative superior mandelbrot sets and relative
 
maths convergence.pdf
maths convergence.pdfmaths convergence.pdf
maths convergence.pdf
 
20070823
2007082320070823
20070823
 
Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
 
Lecture_note2.pdf
Lecture_note2.pdfLecture_note2.pdf
Lecture_note2.pdf
 
On the Odd Gracefulness of Cyclic Snakes With Pendant Edges
On the Odd Gracefulness of Cyclic Snakes With Pendant EdgesOn the Odd Gracefulness of Cyclic Snakes With Pendant Edges
On the Odd Gracefulness of Cyclic Snakes With Pendant Edges
 
On approximate bounds of zeros of polynomials within
On approximate bounds of zeros of polynomials withinOn approximate bounds of zeros of polynomials within
On approximate bounds of zeros of polynomials within
 
ON RUN-LENGTH-CONSTRAINED BINARY SEQUENCES
ON RUN-LENGTH-CONSTRAINED BINARY SEQUENCESON RUN-LENGTH-CONSTRAINED BINARY SEQUENCES
ON RUN-LENGTH-CONSTRAINED BINARY SEQUENCES
 
FITTED OPERATOR FINITE DIFFERENCE METHOD FOR SINGULARLY PERTURBED PARABOLIC C...
FITTED OPERATOR FINITE DIFFERENCE METHOD FOR SINGULARLY PERTURBED PARABOLIC C...FITTED OPERATOR FINITE DIFFERENCE METHOD FOR SINGULARLY PERTURBED PARABOLIC C...
FITTED OPERATOR FINITE DIFFERENCE METHOD FOR SINGULARLY PERTURBED PARABOLIC C...
 
FITTED OPERATOR FINITE DIFFERENCE METHOD FOR SINGULARLY PERTURBED PARABOLIC C...
FITTED OPERATOR FINITE DIFFERENCE METHOD FOR SINGULARLY PERTURBED PARABOLIC C...FITTED OPERATOR FINITE DIFFERENCE METHOD FOR SINGULARLY PERTURBED PARABOLIC C...
FITTED OPERATOR FINITE DIFFERENCE METHOD FOR SINGULARLY PERTURBED PARABOLIC C...
 

machinelearning project

  • 1. 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 Subspace Clustering with Missing Data Lianli Liu Dejiao Zhang Jiabei Zheng Abstract 1 Subspace clustering with missing data can be seen as the combination of subspace clustering and low rank matrix completion, which is essentially equivalent to high-rank matrix completion under the assumption that columns of the matrix X ∈ Rd×N belong to a union of subspaces. It’s a challenging problem, both in terms of computation and inference. In this report, we study two efficient algorithms proposed for solving this problem, the EM-type algorithm [1] and k-means form algorithm – k-GROUSE[2], and implement the two algorithms on both simulated and real dataset. Besides that we will also give a brief description of recently developed theorem [1] on the sampling complexity for subspace clustering, which is a great improvement over the previously existing theoretic analysis [3] that requires impractically sample size, e.g. the number of observed vectors per subspace should be super-polynomial in the dimension d. 1 Problem Statement Consider a matrix X ∈ Rd×N , with entries missing at random, whose columns lie in the union of at most K unknown subspaces of Rd , each has dimension at most rk < d. The goal of subspace clustering is to infer the underlying K subspaces from the partially observed matrix XΩ, where Ω denote the indexes of the observed entries, and to cluster the columns of XΩ into groups that lie in the same subspaces. 2 Sampling Complexity for Generic Subspace Clustering and EM Algorithm 2.1 Sampling Complexity [1] shows that subspace clustering is possible without impractically large sample size as that required by previous work [3]. The main assumptions used in the theoretic analysis of [1] is different from those arising from traditional Low- Rank Matrix Completion, e.g. [3]. It assumes the columns of X are drawn from a non-atomic distribution supported on a union of low-dimensional generic subspaces, and entries in the columns are missing uniformly at random. Before stating the main result in [1], we provide the following definition first. Definition 1. We denote the set of d × n matrices with rank r by M(r, d × n). A generic (d × n) matrix of rank r is a continuous M(r, d × n) valued random variable. We say a subspace S is generic if a matrix whose columns are drawn i.i.d according to a non-atomic distribution with support on S is generic a.s. The key assumptions of this method is following: • A1. The columns of the d × N data matrix X are drawn according to a non-atomic distribution with support on the union of at most K generic subspaces. The subspaces, denoted by S = {Sk|k = 0, . . . , K − 1}, each has rank exactly r < d. • A2. The probability that a column is drawn from subspace k is ρk. Let ρ∗ be the bound on mink{ρk}. 1 Authors by alphabetical order 1
  • 2. 052 053 054 055 056 057 058 059 060 061 062 063 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095 096 097 098 099 100 101 102 103 • A3. We observe X only on a set of entries Ω and denote the observation XΩ. Each entry in XΩ is sampled independently with probability p. We now state the main results in [1], which implies that if we observe at least order dr+1 (log d/r + log K) columns and at least O(r log2 d) entries in each column, then identification of the subspaces is possible with high probability. Although by assumption A2 the rank of each subspace is exactly r, this result can be generalized to a relax version that the dimension of each subspace is upper bounded by r. Theorem 1. [1] Suppose A1-A3 hold. Let > 0 be given. Assume the number of subspaces K ≤ 6 ed/4 , the total number of columns N ≥ (2d + 4M)/ρ∗, and p ≥ 1 d 128µ2 1rβ0 log2 (2d), β0 = 1 + log 6K 12 log(d) 2 log(2d) , M = de r + 1 r+1 (r + 1) log de r + 1 + log 8K (1) where µ2 1 := maxk d2 r UkV ∗ k 2 ∞ and UkΣkV ∗ k is the singular value decomposition of ˜X[k]2 . Then with probability at least 1 − , S can be uniquely determined from XΩ. 2.2 EM Algorithm In [4], computing the principal subspaces of a set of data vectors is formulated in a probabilistic framework and obtained by a maximum-likelihood method. Here we use the same model but extend it to the case of missing data: xo i xm i = K i=1 1zi=k Woi k Wmi k yi + µoi k µmi k + ηi (2) where 1, · · · , K zi ∼ ρ ⊥ yi ∼ N(0, I). Wk is a d × r matrix whose span is Sk, and ηi|zi ∼ N(0, σ2 zi I) is the noise in the zth i subspace. To find the Maximum Likelihood Estimate of θ = W, µ, ρ, σ2 , i.e. ˆθ = arg max θ l(θ, xo ) = arg max θ N i=1 log( K k=1 ρkP(xo i |zi = k; θ)) (3) an EM algorithm is proposed where Xo := {xo i }N i=1 is the observed data, Xm := {xm i }N i=1, Y := {yi}N i=1, Z := {zi}N i=1 are hidden variables. The iterates of the algorithm are as follows, with detailed derivation provided in Ap- pendix A. Wk = N i=1 pi,kEk xiyT i − N i=1 pi,kEk xi N i=1 pi,kEk yi T N i=1 pi,kEk yi T N i=1 pi,kEk yiyT i − N i=1 pi,kEk yi N i=1 pi,kEk yi T N i=1 pi,k −1 (4) µk = N i=1 pi,k Ek xi − WkEk yi N i=1 pi,k (5) σ2 k = 1 d N i=1 pi,k N i=1 pi,k tr(E xixT i ) − 2µT k Ek xi + µk T µk − 2tr(Ek yixT i Wk) + 2µT k WkEk yi + tr(Ek yiyT i Wk T Wk) (6) 2 ˜X[k] denote the columns of X corresponding to the kth subspace 2
  • 3. 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 ρk = 1 N N i=1 pi,k , pi,k := Pzi|xo i ,ˆθ(k) = ˆρkPxo i |zi=k,ˆθ(xo i ) K j=1 ˆρjPxo i |zi=j,ˆθ(xo i ) (7) where Ek · denotes the conditional expectation E·|xo i ,zi=k,ˆθ[·]. This conditional expectation can be easily calculated from its conditional distribution. Denote Mk := σ2 I + WoiT k Woi k , by Gauss-Markov theorem, xm i yi xo i ,zi=k,θ ∼ N µmi k + Wmi k M−1 k WoiT k (xo i − µoi k ) M−1 k WoiT k (xo i − µoi k ) , σ2 I + Wmi k M−1 k WmiT k Wmi k M−1 k M−1 k WmiT k M−1 k (8) Detailed derivation is provided in Appendix B. 3 Projection Residual Based K-MEANS Form Algorithm [2] proposes a form of k-means algorithm adapted to k subspaces clustering. It alternats between assigning vectors to subspaces based on the projection residuals and updates each subspace by performing an incremental gradient descent along the geodesic curve of Grassmannian 3 manifold(GROUSE[6]). 3.1 Subspace Assignment For subspace assignment, [2] proves that the incomplete data vector can be assigned to the correct subspace with high probability if the angles between the vector and the various subspaces satisfied some mild condition. For each data vector, when there is no entry missing, it’s natural to use the following v − PSi v 2 2 < v − PSj v 2 2 ∀j = i, and i, j ∈ {0, . . . , K − 1}. (9) where PSi is the projection operator onto subspace Si , to decide whether v can be assigned to Si or not. Now suppose we only observe the a subset of the entries of v, denoted as vΩ, where Ω ⊂ {1, . . . , d}, |Ω| = m < d. And define the projection operator restricted to Ω as PSΩ = (UT Ω UΩ)−1 UT Ω (10) In [7], the authors prove that UT Ω UΩ is invertible w.h.p as long as the assumptions of Theorem.2 are satisfied. Let Uk ∈ Rd×rk whose orthonormal columns span the rk dimensional subspaces Sk respectively. Then an natural question arises as under what condition we can use a similar decision rule as that of (9) for the subspace assignment for the incomplete data case. To answer this question, we will first restate a theorem in [2] to show that the projection residual will concentrated around that of the full data case by a scaling factor if the sampling size m satisfies certain condition. Let v = x + y, where x ∈ S, y ∈ S⊥ . Theorem 2. [2]4 Let δ > 0 and m ≥ 8 3 µ(S)r log 2r δ , then with probability at least 1 − 3δ, m(1 − α) − dµ(S)(1+β)2 1−γ d v − PSv 2 2 ≤ vΩ − PSΩ vΩ 2 2 (11) and with probability at least 1 − δ, vΩ − PSΩ vΩ 2 2 ≤ (1 + α) m d v − PSv 2 2 (12) where α = 2µ2(y) m log 1 δ , β = 2µ(y) log 1 δ and γ = 8rµ(S) 3m log 2r δ . 3 Grassmannian is a compact Riemannian manifold, and it’s geodesics can be explicitly computed [5] 4 In this theorem, µ(S), µ(y) is coherence parameter of subspace S and vector y, here we follow the definitions proposed in [8], µS := d r maxj PS ej 2 2 and µ(y) = d z 2 ∞ z 2 2 3
  • 4. 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 Next we will state the requirement on the angles between the vector and the subspaces, by which the incomplete data vector can be assigned to the correct subspace w.h.p. Define the angle between the vector v and its projection into the rk dimensional subspace Sk , k = 0, . . . , K −1 as θi = sin−1 v−PSk v 2 v 2 . And let Ck(m) = m(1−αk)−rkµ(Sk ) (1+βi)2 1−γi m(1+α0) , where αk, βk and γk are defined as that in Theorem.2 for different rk. Notice that Ck(m) < 1 and Ck(m) 1 as m → ∞. Without loss of generality, we assume θ0 < θk, ∀k = 0. Corollary 1. [2] Let m ≥ 8 3 maxi=0 riµ(Si ) log 2ri δ for fixed δ > 0. Assume that sin2 (θ0) ≤ Ci(m) sin2 (θi), ∀i = 0. Then with probability at least 1 − 4(K − 1)δ, vΩ − PS0 vΩ 2 2 < vΩ − PSi vΩ 2 2, ∀i = 0 Proof. Detail proof is available in Appendix C. 3.2 Subspace Update GROUSE [6] is a very efficient algorithm for single subspace estimation, although the theoretic support of this method is still unavailable5 . k-GROUSE [2] can be seen as a combination of k-subspaces and GROUSE for multiple subspace estimation, which is detailed in Algorithm.1. Algorithm 1 k-subspaces with the GROUSE Require: A collection of vectors vΩ(t), t = 1, . . . , T, and the observed indices Ω(t). An integer number of subspaces k and dimensions rk, k = 1, . . . , K. A maximum number of iterations, maxIter. A fixed step size η. Initialize Subspaces: Zero-fill the vectors and collect them in a mtrix V . Initialize k subspaces estimates using probabilistic farthest insertion. Calculate Orthonormal Bases Uj, k = 1, . . . , K. Let QjΩ = (UT kΩ UkΩ )−1 UT kΩ for i = 1, . . . , maxIter do Select a vector at random, vΩ. for k = 1, ..., K do Calculate weights of projection onto kth subspace: w(k) = QkΩ vΩ. Calculate Projection Residuals to kth subspace: rΩ(k) = vΩ − UkΩ w(k). end for Select min residual: k = arg mink rΩ(k) 2 2. Set w = w(k). Define p = Ukw and rΩc = 0, i.e, r is the zero filled rΩ(k) Update Subspace: Uk := Uk + (cos(ση) − 1) p p + sin(ση) r r wT w . (13) where σ = r p . end for 4 Experiment on simulated data In this section, we will explore the performance of k-GROUSE and EM algorithm on simulated data. Before we pre- sented the results we will first give a brief description about two key points of the implement of the algorithm. Firstly, we use a k-means++ like method to initialize the subspaces for both k-GROUSE and EM algorithm. Specifically, we randomly pick a point x0 ∈ X, and select it’s q 6 nearest nearest neighbors and calculate the best S0 w.r.t these selected q points. Next, we recursively initialize Sj by randomly fist pick a point x with probability proportional to 5 Local convergence of GROUSE is available in [9] 6 q is a nonnegative parameter, usually we set q = 2 maxk rk 4
  • 5. 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 min[dist(x, S0 ), . . . , dist(x, Sj−1 )], and then find the best fit Sj of its q neighborhood. Secondly, the error is calcu- lated by examining the cluster assignments if a set of vectors who come from the same underlying subspace(which is known). For each set, we consider the cluster ID own the largest number of vectors to the true one, and the others as incorrectly clustered. 4.1 EM Algorithm The first experiment of EM algorithm is dealing with the simulated data generated by the Gaussian mixture model in section 2. We used d = 100, K = 4 and r = 5 for this part. For simulation, each of K subspaces is the span of an orthonormal basis generated from r i.i.d. standard Gaussian d-dimensional vectors. The performance of the EM algorithm strongly depends on initialization. With poor initial estimation of W and µ, estimated θ will get stuck by a local minimum of l(θ, xo ). Fig.1(a) gives an example of processing the same XΩ with different initial estimations, where we observed an obvious difference in performance. We ran 50 trials for each set of |ω|, Nk and σ2 , where |ω| is the number of observed entries in each column, Nk is the number of columns of each subspace and σ2 is the noise level. The stopping criteria is chosen as l(θ(t+1) , xo ) − l(θ(t) , xo ) < 10−4 l(θ(t) , xo ). The result are summarized in Fig.1(b)-(d). The difference between our results and results reported in [1] mainly comes from different initialization strategies. Generally, larger |ω| and Nk, and smaller σ2 gives less misclassified points, which is reasonable. For some cases, all misclassified points come from getting stuck at a local minimum. 0 50 100 150 200 250 300 350 400 −4 −2 0 2 4 6 x 10 4 Iterations LogLikelihood Good Bad (a) 0 50 100 150 200 250 0 0.2 0.4 0.6 0.8 1 N k Prop.ofmisclassifiedpoints (b) 10 15 20 25 30 35 40 0 0.2 0.4 0.6 0.8 1 |w| Prop.ofmisclassifiedpoints (c) −4 −3 −2 −1 0 0 0.2 0.4 0.6 0.8 1 log(σ 2 ) Prop.ofmisclassifiedpoints (d) Figure 1: (a) Good initial estimation: converges after 83 iterations, label correct rate = 100%. Bad initial estimation: converges after 395 iterations, label correct rate = 45.6%. (Nk = 200, ω = 24, σ2 = 10−4 ). (b)-(d) shows proportion of misclassified points: (b) as a function of Nk with |ω| = 24, σ2 = 10−4 . (c) as a function of |ω| with σ2 = 10−4 , Nk = 300. (d) as a function of σ2 with |ω| = 24, Nk = 210. 5
  • 6. 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 4.2 k-GROUSE In this section, we will explore the performance of k-GROUSE in two simulation scenarios. We set the ambient dimension d = 40 and the data matrix X consists of 80 vectors per subspace. In the first case, we let the number of subspaces K1 = 4 and and r = 5 for each subspace, i.e, the sum of the dimensions of the subspaces R = K k=1 rk < d. In the second case, we set the number of subspaces K2 = 5 and make one of them depended on the other four. From Fig.2(a) we can see that k-GROUSE perform well as long as the number of observations is a little bit more that r log2 d, which is within the order predicted by the theory. However, the performance of K-GROUSE is also depended on the initialization. Although k-means++ initialization can significantly improve the convergence of k-GROUSE, it can converge to some local point in the worse case. 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Sampling density probabilityoferror d = 40 dim = 5 K1 = 4 K2 = 5 (a) 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 Sampling densityprobabilityoferror d = 100 dim = 5 K = 4 k−GROUSE EM (b) Figure 2: (a) Performance of k-GROUSE with independent and dependent subspaces, obtained over 50 trials; (b) k -GROUSE vs EM Algorithm, obtained over 50 trials 4.3 k-GROUSE vs EM Algorithm Now we compare the performance of k-GROUSE and EM Algorithm in simulation data. In this setting, we set K = 4, r = 5, d = 100. From Figure.2(a) we can see that both k-GROUSE and EM Algorithm perform well in this situation, while k-GROUSE is more efficient. However, as we will see in the next section, the performance gap between the two will become obvious when we apply them to some worse conditioned real data, specifically when the conditional number of the data matrix X is large, k-GROUSE can perform poorly while EM algorithm still work under the same condition. In traditional matrix completion, the sampling density can heavily depend on the condition number of the matrix, it seems that it’s inherent here by k-GROUSE. 5 Experiment on real data 5.1 EM algorithm We test the algorithm on real data by applying it to image compression. An image of 264×536 is used, which is the same as the one used in mixture probabilistic PCA[4]. Similar with [4], the image was segmented into 8×8 non- overlapping blocks, giving a total dataset of 2211×64-dimensional vectors. In the experiment, we set the number of subspaces as 3, and test the performance of the algorithm under different missing rate (number of missing pixels in image block) and compression ratio. To make sure the resulting coded image has the same bit rate as the conventional SVD coded image, after the EM algorithm converges, the image coding was performed in a hard fashion, i.e. the coding efficient of each image block is only estimated using the subspace it has the largest probability of belonging to. We test the algorithm under different missing rate (number of missing pixels in each image block) and compression ratio. An initialization scheme that is slightly different from the simulated experiment is used: after setting all missing data to zero, a simple k-means clustering is performed on the data set. 
An initialization scheme slightly different from the one used in the simulated experiments is adopted: after setting all missing entries to zero, a simple k-means clustering is performed on the data set. The parameters µk, Wk and σk of each subspace are then initialized from the kth data cluster {xi} ⊂ Rd using probabilistic PCA [4], with µk initialized by the sample mean. Denoting the SVD of the sample covariance matrix by Sk = UΛV^T, Wk ∈ Rd×r is initialized as

Wk = Ur (Λr − σ2 I)^{1/2}    (14)

where σ2 is the initial value of σ2k = (1/(d − r)) Σ_{j=r+1}^{d} λj, Ur ∈ Rd×r consists of the leading r left singular vectors of U, and Λr is the corresponding r×r diagonal matrix of singular values. A minimal sketch of this initialization follows.
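As a concrete illustration, here is a minimal NumPy sketch of this probabilistic-PCA initialization for a single k-means cluster. The function name and argument layout are our own; the update itself follows Eq. (14) and the σ2 formula above.

```python
import numpy as np

def ppca_init(X_cluster, r):
    """Initialize (mu_k, W_k, sigma2_k) from one k-means cluster via probabilistic PCA [4].

    X_cluster : (n_k, d) zero-filled data vectors assigned to cluster k."""
    mu = X_cluster.mean(axis=0)                      # sample mean
    S = np.cov(X_cluster, rowvar=False)              # d x d sample covariance
    # Eigendecomposition of the symmetric covariance; sort eigenvalues descending.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[r:].mean()                      # average of the trailing d - r eigenvalues
    # W_k = U_r (Lambda_r - sigma^2 I)^{1/2}, as in Eq. (14); clip for numerical safety.
    W = eigvecs[:, :r] * np.sqrt(np.maximum(eigvals[:r] - sigma2, 0.0))
    return mu, W, sigma2
```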
Sample images reconstructed under different missing rates (equivalently, different numbers of observed pixels per block ω) and compression ratios (number of components r in each subspace) are shown in Fig. 3, and the mean squared error (MSE) between the compressed image and the original image is plotted in Fig. 5. As the compression ratio decreases, the image quality improves significantly, as can be seen from Fig. 3, and the mean squared error decreases accordingly. Across different missing rates the images appear visually the same: no significant difference is observed between the missing-data case and the complete-data case (though the mean squared error decreases slightly as more pixels are observed). This is consistent with the theory in [1]: as long as certain conditions are met and a reasonable number of data points is observed, the true subspaces can be recovered from an incomplete data set with high probability.

[Figure 3: six reconstructed images; (a) ω = 14, r = 4; (b) ω = 64, r = 4; (c) ω = 14, r = 12; (d) ω = 64, r = 12; (e) ω = 14, r = 32; (f) ω = 64, r = 32.]

Figure 3: Image compression using the EM algorithm.

5.2 k-GROUSE algorithm

We also test the k-GROUSE algorithm on image compression, using the same block transformation scheme and number of subspaces as for the EM algorithm. Sample compressed images are shown in Fig. 4.

6 Conclusion

From this project we learned different subspace learning techniques, with an emphasis on the missing-data case, and studied the theoretical requirements for recovering subspaces when data are incomplete. We also implemented and experimented with the EM algorithm and k-GROUSE.

7 Contribution

Dejiao Zhang was in charge of the abstract, Sections 1, 2.1, 3, 4.2 and 4.3, and Appendix C. Jiabei Zheng rederived the EM algorithm for the Gaussian mixture model (Appendix A), wrote the code and made the plots in Section 4.1. Lianli Liu rederived the conditional distribution (Appendix B), ran the experiments on real data (Section 5) and debugged the EM algorithm.
[Figure 4: two compressed images; (a) missing rate 30%, number of modes 32; (b) missing rate 30%, number of modes 50.]

Figure 4: Image compression using k-GROUSE.

[Figure 5: two panels; (a) MSE under different missing rates, mode = 12; (b) MSE under different compression ratios, missing rate = 78%.]

Figure 5: Plot of mean squared error.

References

[1] D. Pimentel and R. Nowak, "On the sample complexity of subspace clustering with missing data."
[2] L. Balzano, A. Szlam, B. Recht, and R. Nowak, "k-subspaces with missing data," in Statistical Signal Processing Workshop (SSP), 2012 IEEE. IEEE, 2012, pp. 612–615.
[3] B. Eriksson, L. Balzano, and R. Nowak, "High-rank matrix completion and subspace clustering with missing data," arXiv preprint arXiv:1112.5629, 2011.
[4] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analyzers," Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
[5] A. Edelman, T. A. Arias, and S. T. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
[6] L. Balzano, R. Nowak, and B. Recht, "Online identification and tracking of subspaces from highly incomplete information," in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on. IEEE, 2010, pp. 704–711.
[7] L. Balzano, B. Recht, and R. Nowak, "High-dimensional matched subspace detection when data are missing," in Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1638–1642.
[8] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[9] L. Balzano and S. J. Wright, "Local convergence of an algorithm for subspace identification from partial data," arXiv preprint arXiv:1306.3391, 2013.
Appendix A

Here we give the complete derivation of the EM algorithm. The complete-data probability is

\Pr(x, y, z; \theta) = \prod_{i=1}^{N} \Pr(x_i, y_i, z_i) = \prod_{i=1}^{N} \Pr(x_i \mid y_i, z_i)\Pr(y_i \mid z_i)\Pr(z_i) = \prod_{i=1}^{N} \Pr(x_i \mid y_i, z_i)\Pr(y_i)\Pr(z_i) = \prod_{i=1}^{N} \rho_{z_i}\,\phi(x_i;\, \mu_{z_i} + W_{z_i} y_i,\, \sigma_{z_i}^2 I)\,\phi(y_i;\, 0, I)    (15)

The complete-data log-likelihood is

l(\theta; x, y, z) = \sum_{i=1}^{N} \log\big(\rho_{z_i}\,\phi(x_i;\, \mu_{z_i} + W_{z_i} y_i,\, \sigma_{z_i}^2 I)\,\phi(y_i;\, 0, I)\big) = \sum_{i=1}^{N} \sum_{k=1}^{K} \Delta_{ik}\,\log\big(\rho_k\,\phi(x_i;\, \mu_k + W_k y_i,\, \sigma_k^2 I)\,\phi(y_i;\, 0, I)\big)    (16)

where

\Delta_{ik} = \begin{cases} 1 & \text{if } z_i = k \\ 0 & \text{otherwise.} \end{cases}    (17)

The E-step computes

Q(\theta; \hat\theta) = E[l(\theta; x, y, z) \mid x^o, \hat\theta]
= \sum_{k=1}^{K} \sum_{i=1}^{N} E\big[\Delta_{ik}\,\log\big(\rho_k\,\phi(x_i;\, \mu_k + W_k y_i,\, \sigma_k^2 I)\,\phi(y_i;\, 0, I)\big) \mid x^o, \hat\theta\big]
= \sum_{k=1}^{K} \sum_{i=1}^{N} \sum_{l=1}^{K} E\big[\Delta_{ik}\,\log\big(\rho_k\,\phi(x_i;\, \mu_k + W_k y_i,\, \sigma_k^2 I)\,\phi(y_i;\, 0, I)\big) \mid x^o, z_i = l, \hat\theta\big]\,\Pr(z_i = l \mid x^o, \hat\theta)
= \sum_{k=1}^{K} \sum_{i=1}^{N} E\big[\log\big(\rho_k\,\phi(x_i;\, \mu_k + W_k y_i,\, \sigma_k^2 I)\,\phi(y_i;\, 0, I)\big) \mid x^o, z_i = k, \hat\theta\big]\,\Pr(z_i = k \mid x^o, \hat\theta)
= \sum_{k=1}^{K} \sum_{i=1}^{N} p_{ik}\Big\{\log\rho_k - \tfrac{d+r}{2}\log(2\pi) - \tfrac{d}{2}\log\sigma_k^2 - \tfrac{1}{2\sigma_k^2}\, E_k\big[(x_i - \mu_k - W_k y_i)^T (x_i - \mu_k - W_k y_i)\big] - \tfrac{1}{2}\, E_k\big[y_i^T y_i\big]\Big\}
= \sum_{k=1}^{K} \Big(\sum_{i=1}^{N} p_{ik}\Big)\Big\{\log\rho_k - \tfrac{d+r}{2}\log(2\pi) - \tfrac{d}{2}\log\sigma_k^2 - \tfrac{1}{2\sigma_k^2}\, U_k - \tfrac{1}{2}\, E_k\langle y^T y\rangle\Big\}    (18)

where

p_{ik} = \Pr(z_i = k \mid x_i^o, \hat\theta)    (19)
and E_k[\cdot] denotes the conditional expectation, while E_k\langle\cdot\rangle denotes its p_{ik}-weighted average over the data:

E_k[\,\cdot\,] = E[\,\cdot \mid x_i^o, z_i = k, \hat\theta\,]    (20)

E_k\langle u \rangle = \frac{\sum_{i=1}^{N} p_{ik}\, E_k[u_i]}{\sum_{i=1}^{N} p_{ik}}    (21)

E_k\langle u^T v \rangle = \frac{\sum_{i=1}^{N} p_{ik}\, E_k[u_i^T v_i]}{\sum_{i=1}^{N} p_{ik}}    (22)

U_k = E_k\langle x^T x\rangle + \mu_k^T \mu_k + \mathrm{tr}\big(E_k\langle y^T y\rangle W_k^T W_k\big) - 2\mu_k^T E_k\langle x\rangle + 2\mu_k^T W_k E_k\langle y\rangle - 2\,\mathrm{tr}\big(E_k\langle y x^T\rangle W_k\big)    (23)

The M-step maximizes Q:

\tilde\theta = \arg\max_{\theta} Q(\theta, \hat\theta)    (24)

First we derive ρ, which satisfies \sum_{k=1}^{K} \rho_k = 1. Using a Lagrange multiplier λ:

\frac{\partial\big(Q + \lambda \sum_{k=1}^{K} \rho_k\big)}{\partial \rho_k} = 0    (25)

\sum_{i=1}^{N} p_{ik} + \lambda \rho_k = 0    (26)

Summing (26) over k gives

\sum_{k=1}^{K} \sum_{i=1}^{N} p_{ik} + \lambda = 0    (27)

and since \sum_{k=1}^{K} p_{ik} = 1 for every i, this yields

\lambda = -N    (28)

Substituting this back into (26), we have

\rho_k = \frac{1}{N}\sum_{i=1}^{N} p_{ik}    (29)

Next we eliminate σ2_k with Wk held fixed:

\frac{\partial Q}{\partial \sigma_k^2} = 0 \;\Rightarrow\; \sigma_k^2 = \frac{U_k}{d}    (30)

\sigma_k^2 = \frac{1}{d}\Big(E_k\langle x^T x\rangle + \mu_k^T \mu_k + \mathrm{tr}\big(E_k\langle y^T y\rangle W_k^T W_k\big) - 2\mu_k^T E_k\langle x\rangle + 2\mu_k^T W_k E_k\langle y\rangle - 2\,\mathrm{tr}\big(E_k\langle y x^T\rangle W_k\big)\Big)    (31)

The elimination of µk is similar:

\frac{\partial Q}{\partial \mu_k} = 0 \;\Rightarrow\; \frac{\partial U_k}{\partial \mu_k} = 0    (32)
\mu_k = E_k\langle x\rangle - W_k E_k\langle y\rangle    (33)

Then we derive the expression for Wk. From Eq. (18) we have

\frac{\partial Q}{\partial W_k} = -\frac{d}{2\sigma_k^2}\frac{\partial \sigma_k^2}{\partial W_k} - \frac{1}{2\sigma_k^2}\frac{\partial U_k}{\partial W_k} + \frac{U_k}{2\sigma_k^4}\frac{\partial \sigma_k^2}{\partial W_k} = -\frac{1}{2\sigma_k^2}\frac{\partial U_k}{\partial W_k}    (34)

As a result,

\frac{\partial Q}{\partial W_k} = 0 \;\Rightarrow\; \frac{\partial U_k}{\partial W_k} = 0    (35)

Plugging the expression for µk into Uk, we get

\frac{\partial}{\partial W_k}\Big\{E_k\langle x^T x\rangle + \big(E_k\langle x\rangle - W_k E_k\langle y\rangle\big)^T\big(E_k\langle x\rangle - W_k E_k\langle y\rangle\big) + \mathrm{tr}\big(E_k\langle y^T y\rangle W_k^T W_k\big) - 2\big(E_k\langle x\rangle - W_k E_k\langle y\rangle\big)^T E_k\langle x\rangle + 2\big(E_k\langle x\rangle - W_k E_k\langle y\rangle\big)^T W_k E_k\langle y\rangle - 2\,\mathrm{tr}\big(E_k\langle y x^T\rangle W_k\big)\Big\} = 0    (36)

W_k = \big(E_k\langle x y^T\rangle - E_k\langle x\rangle E_k\langle y\rangle^T\big)\big(E_k\langle y y^T\rangle - E_k\langle y\rangle E_k\langle y\rangle^T\big)^{-1}    (37)

Finally, pik is calculated with Bayes' rule:

p_{ik} = \Pr(z_i = k \mid x_i^o, \hat\theta) = \frac{\hat\rho_k\, \Pr(x_i^o \mid z_i = k; \hat\theta)}{\sum_{j=1}^{K} \hat\rho_j\, \Pr(x_i^o \mid z_i = j; \hat\theta)} = \frac{\hat\rho_k\,\phi\big(x_i^o;\, \mu_k^{o_i},\, W_k^{o_i} W_k^{o_i T} + \sigma_k^2 I\big)}{\sum_{j=1}^{K} \hat\rho_j\,\phi\big(x_i^o;\, \mu_j^{o_i},\, W_j^{o_i} W_j^{o_i T} + \sigma_j^2 I\big)}    (38)

Appendix B

By the model definition,

\begin{bmatrix} x_i^o \\ x_i^m \\ y_i \end{bmatrix}\Bigg|_{z_i = k, \theta} = \begin{bmatrix} W_k^{o_i} \\ W_k^{m_i} \\ I \end{bmatrix} y_i + \begin{bmatrix} \mu_k^{o_i} \\ \mu_k^{m_i} \\ 0 \end{bmatrix} + \begin{bmatrix} \eta^{o_i} \\ \eta^{m_i} \\ 0 \end{bmatrix}    (39)

Eq. (39) is a linear transform of yi ∼ N(0, I). By the properties of the multivariate Gaussian distribution,

\begin{bmatrix} x_i^o \\ x_i^m \\ y_i \end{bmatrix}\Bigg|_{z_i = k, \theta} \sim N\left(\begin{bmatrix} \mu_k^{o_i} \\ \mu_k^{m_i} \\ 0 \end{bmatrix},\; \begin{bmatrix} W_k^{o_i} W_k^{o_i T} + \sigma^2 I & W_k^{o_i} W_k^{m_i T} & W_k^{o_i} \\ W_k^{m_i} W_k^{o_i T} & W_k^{m_i} W_k^{m_i T} + \sigma^2 I & W_k^{m_i} \\ W_k^{o_i T} & W_k^{m_i T} & I \end{bmatrix}\right)    (40)

By the Bayesian Gauss–Markov theorem,

\begin{bmatrix} x_i^m \\ y_i \end{bmatrix}\Bigg|_{x_i^o, z_i = k, \theta} \sim N\big(\mu_{my_i} + R_{myo_i} R_{o_i}^{-1}(x_i^o - \mu_k^{o_i}),\; R_{my_i} - R_{myo_i} R_{o_i}^{-1} R_{omy_i}\big)    (41)

where

\mu_{my_i} = \begin{bmatrix} \mu_k^{m_i} \\ 0 \end{bmatrix},\quad R_{myo_i} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix},\quad R_{my_i} = \begin{bmatrix} W_k^{m_i} W_k^{m_i T} + \sigma^2 I & W_k^{m_i} \\ W_k^{m_i T} & I \end{bmatrix}    (42)

R_{o_i} = W_k^{o_i} W_k^{o_i T} + \sigma^2 I,\quad R_{omy_i} = \big[\, W_k^{o_i} W_k^{m_i T} \;\; W_k^{o_i} \,\big]    (43)

Plugging Eq. (42) and Eq. (43) into Eq. (41),

R_{myo_i} R_{o_i}^{-1} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix}\big(W_k^{o_i} W_k^{o_i T} + \sigma^2 I\big)^{-1}    (44)
By the matrix identity

(I + PQ)^{-1} P = P(I + QP)^{-1}    (45)

we have

W_k^{o_i T}\big(W_k^{o_i} W_k^{o_i T} + \sigma^2 I\big)^{-1} = \big(W_k^{o_i T} W_k^{o_i} + \sigma^2 I\big)^{-1} W_k^{o_i T} = M_k^{-1} W_k^{o_i T}    (46)

Putting Eq. (46) and Eq. (44) together,

R_{myo_i} R_{o_i}^{-1} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix}\big[W_k^{o_i} W_k^{o_i T} + \sigma^2 I\big]^{-1} = \begin{bmatrix} W_k^{m_i} M_k^{-1} W_k^{o_i T} \\ M_k^{-1} W_k^{o_i T} \end{bmatrix}    (47)

Therefore

\mu_{my_i} + R_{myo_i} R_{o_i}^{-1}\big(x_i^o - \mu_k^{o_i}\big) = \begin{bmatrix} \mu_k^{m_i} + W_k^{m_i} M_k^{-1} W_k^{o_i T}\big(x_i^o - \mu_k^{o_i}\big) \\ M_k^{-1} W_k^{o_i T}\big(x_i^o - \mu_k^{o_i}\big) \end{bmatrix}    (48)

which is exactly the formula for the mean given in the paper. To prove the formula for the covariance, by Eq. (47)

R_{myo_i} R_{o_i}^{-1} R_{omy_i} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix}\big(W_k^{o_i} W_k^{o_i T} + \sigma^2 I\big)^{-1}\big[\, W_k^{o_i} W_k^{m_i T} \;\; W_k^{o_i} \,\big] = \begin{bmatrix} W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T} & W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i} \\ M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T} & M_k^{-1} W_k^{o_i T} W_k^{o_i} \end{bmatrix}    (49)

Thus

R_{my_i} - R_{myo_i} R_{o_i}^{-1} R_{omy_i} = \begin{bmatrix} \sigma^2 I + W_k^{m_i}\big(I - M_k^{-1} W_k^{o_i T} W_k^{o_i}\big) W_k^{m_i T} & W_k^{m_i}\big(I - M_k^{-1} W_k^{o_i T} W_k^{o_i}\big) \\ \big(I - M_k^{-1} W_k^{o_i T} W_k^{o_i}\big) W_k^{m_i T} & I - M_k^{-1} W_k^{o_i T} W_k^{o_i} \end{bmatrix}    (50)

Notice that

M_k - W_k^{o_i T} W_k^{o_i} = \sigma^2 I \;\Rightarrow\; I - M_k^{-1} W_k^{o_i T} W_k^{o_i} = \sigma^2 M_k^{-1}    (51)

Thus Eq. (50) simplifies to

R_{my_i} - R_{myo_i} R_{o_i}^{-1} R_{omy_i} = \sigma^2 \begin{bmatrix} I + W_k^{m_i} M_k^{-1} W_k^{m_i T} & W_k^{m_i} M_k^{-1} \\ M_k^{-1} W_k^{m_i T} & M_k^{-1} \end{bmatrix}    (52)

which is exactly the formula for the covariance given in the paper.

Appendix C

Before we give the proof of Corollary 1, we examine the case of binary subspace assignment, i.e. K = 2. Let v ∈ Rd and let S0, S1 ⊂ Rd be subspaces of dimensions r0 and r1 respectively. Define the angles between v and its projections onto S0 and S1 as θ0 and θ1:

\theta_0 = \sin^{-1}\frac{\|v - P_{S^0} v\|_2}{\|v\|_2} \quad\text{and}\quad \theta_1 = \sin^{-1}\frac{\|v - P_{S^1} v\|_2}{\|v\|_2}    (53)

Theorem 3 ([2]). Let δ > 0 and m ≥ \frac{8}{3} d_1 \mu(S^1)\log\frac{2d_1}{\delta}. Assume that \sin^2(\theta_0) < C(m)\sin^2(\theta_1). Then, with probability at least 1 − 4δ,

\|v_\Omega - P_{S^0_\Omega} v_\Omega\|_2^2 < \|v_\Omega - P_{S^1_\Omega} v_\Omega\|_2^2

Proof. By Theorem 2 and the union bound, the following two statements hold simultaneously with probability at least 1 − 4δ:

\|v_\Omega - P_{S^0_\Omega} v_\Omega\|_2^2 \le (1 + \alpha_0)\frac{m}{d}\|v - P_{S^0} v\|_2^2 \quad\text{and}\quad \frac{m(1 - \alpha_1) - r_1\mu(S^1)\frac{(1+\beta_1)^2}{1-\gamma_1}}{d}\|v - P_{S^1} v\|_2^2 \le \|v_\Omega - P_{S^1_\Omega} v_\Omega\|_2^2    (54)
Combining (54) with the definitions in (53), the condition

\|v - P_{S^0} v\|_2^2 < C(m)\,\|v - P_{S^1} v\|_2^2

which guarantees the claimed inequality, holds if and only if

\sin^2(\theta_0) < C(m)\,\sin^2(\theta_1)

which completes the proof.

Proof of Corollary 1. The result of Theorem 3 can be generalized to the situation where there are multiple subspaces S^i, i = 0, . . . , K − 1. Again, without loss of generality, assume that θ0 < θi for all i, and define

C_i(m) = \frac{m(1 - \alpha_i) - r_i\mu(S^i)\frac{(1 + \beta_i)^2}{1 - \gamma_i}}{m(1 + \alpha_0)}

Then the conclusion of Corollary 1 is obtained by applying the union bound and following an argument similar to that in the proof of Theorem 3.
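To make the residual comparison in Theorem 3 concrete, the following is a minimal numerical sketch (our own illustration, not code from [2]) of the incomplete-data assignment rule it justifies: a vector is assigned to the subspace whose observed-coordinate projection residual is smallest, which is also the cluster-assignment step used by k-GROUSE.

```python
import numpy as np

def residual_on_observed(v_obs, U_obs):
    """Squared residual of the observed entries v_Omega after least-squares
    projection onto the span of the observed rows U_Omega of a basis U."""
    coeffs, *_ = np.linalg.lstsq(U_obs, v_obs, rcond=None)
    return np.sum((v_obs - U_obs @ coeffs) ** 2)

# Small numerical check: a vector drawn from S0 should (with enough observed
# entries) have a smaller observed-coordinate residual w.r.t. S0 than w.r.t. S1.
rng = np.random.default_rng(0)
d, r, m = 40, 5, 20                                 # ambient dim, subspace dim, observed entries
U0, _ = np.linalg.qr(rng.standard_normal((d, r)))   # basis of S0
U1, _ = np.linalg.qr(rng.standard_normal((d, r)))   # basis of S1
v = U0 @ rng.standard_normal(r)                     # v lies in S0, so theta_0 = 0 < theta_1
omega = rng.choice(d, size=m, replace=False)        # observed coordinates
res0 = residual_on_observed(v[omega], U0[omega])
res1 = residual_on_observed(v[omega], U1[omega])
print(res0 < res1)   # expected to print True with high probability
```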