This document discusses clustering and factorization techniques in SystemML. It begins by describing k-means clustering, including how it takes as input a matrix of records and clusters them to minimize within-cluster sum of squares. It also discusses k-means++ initialization and the standard k-means algorithm. The document then describes weighted non-negative matrix factorization, which approximates a data matrix as the product of two non-negative matrices to find latent topics. It discusses optimizations for WNMF like operator fusion to reduce computation.
2. K-means Clustering
• INPUT: n records x1, x2, …, xn as the rows of matrix X
– Each xi is m-dimensional: xi = (xi1, xi2, …, xim)
– Matrix X is (n × m)-dimensional
• INPUT: k, an integer in {1, 2, …, n}
• OUTPUT: Partition the records into k clusters S1, S2, …, Sk
– May use n labels y1, y2, …, yn in {1, 2, …, k}
– NOTE: The same clustering can be labeled in k! different ways – important when checking correctness (don’t just compare “predicted” and “true” labels elementwise)
• METRIC: Minimize within-cluster sum of squares (WCSS)
• Cluster “means” are k vectors that capture as much variance
in the data as possible
WCSS = ∑i≤n ‖ xi – mean(Sj : xi ∈ Sj) ‖²
3. K-means Clustering
• K-means is loosely analogous to linear regression:
– Linear regression error = ∑i≤n (yi – xi·β)²
– BUT: Clustering describes the xi ’s themselves, not the yi ’s given the xi ’s
• K-means can work in “linearization space” (like kernel SVM)
• How to pick k ?
– Try k = 1, 2, …, up to some limit; check for overfitting
– Pick the best k in the context of the whole task
• Caveats for k-means
– It does NOT estimate a mixture of Gaussians
• The EM algorithm does that
– The k clusters tend to be of similar size
• Do NOT use for imbalanced clusters!
WCSS = ∑i≤n ‖ xi – mean(Sj : xi ∈ Sj) ‖²
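The “try k = 1, 2, …” advice above can be made concrete with a small sweep over k. A minimal Python/scikit-learn sketch (outside SystemML, purely illustrative; the placeholder data, the range of k, and the elbow-style comparison are my assumptions, not part of the slides):

# Hypothetical sketch: sweep k and record WCSS (scikit-learn exposes it as inertia_)
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 5)                  # placeholder data; any (n x m) matrix works
wcss = {}
for k in range(1, 11):                      # "try k = 1, 2, ..., up to some limit"
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                   # within-cluster sum of squares for this k
# WCSS always decreases as k grows, so look at where the marginal drop flattens,
# and pick the best k in the context of the whole task, as the slide advises.
for k in range(2, 11):
    print(k, wcss[k - 1] - wcss[k])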
4. The K-means Algorithm
• Pick k “centroids” c1, c2, …, ck from the records {x1, x2, …, xn}
– Try to pick centroids far from each other
• Assign each record to the nearest centroid:
– For each xi compute di = min {dist(xi , cj) over all cj }
– Cluster Sj ← { xi : dist(xi , cj) = di }
• Reset each centroid to its cluster’s mean:
– Centroid cj ← mean(Sj) = ∑i≤n (xi in Sj?) ·xi / |Sj|
• Repeat “assign” and “reset” steps until convergence
• Loss decreases: WCSSold ≥ C-WCSSnew ≥ WCSSnew
– Converges to local optimum (often, not global)
C-WCSS = ∑i≤n ‖ xi – centroid(Sj : xi ∈ Sj) ‖²
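The assign/reset loop above translates almost line for line into code. A minimal numpy sketch of the plain algorithm (not the SystemML script, which follows below; uniform random seeding stands in for “pick centroids far from each other”):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k centroids from the records (uniformly here; k-means++ would push them apart)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign step: label each record with its nearest centroid
        D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (n x k) squared distances
        labels = D.argmin(axis=1)
        # Reset step: move each centroid to the mean of its cluster
        C_new = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else C[j]
                          for j in range(k)])
        # Repeat until convergence (WCSS is non-increasing, so this reaches a local optimum)
        if np.allclose(C_new, C):
            break
        C = C_new
    return C, labels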
6. K-means: DML Implementation
C = All_C [(k * (run - 1) + 1) : (k * run), ];
iter = 0; term_code = 0; wcss = 0;
while (term_code == 0) {
    D = -2 * (X %*% t(C)) + t(rowSums (C ^ 2));
    minD = rowMins (D); wcss_old = wcss;
    wcss = sumXsq + sum (minD);
    if (wcss_old - wcss < eps * wcss & iter > 0) {
        term_code = 1;  # Convergence is reached
    } else {
        if (iter >= max_iter) {
            term_code = 2;
        } else {
            iter = iter + 1;
            P = ppred (D, minD, "<=");
            P = P / rowSums (P);
            if (sum (ppred (colSums (P), 0.0, "<=")) > 0) {
                term_code = 3;  # "Runaway" centroid
            } else {
                C = t(P / colSums (P)) %*% X;
            }
        }
    }
}
All_C [(k * (run - 1) + 1) : (k * run), ] = C;
final_wcss [run, 1] = wcss; t_code [run, 1] = term_code;
(Slide callouts into the script: “Want smooth assign? Edit here”; “tensor avoidance maneuver”; “ParFor I/O”.)
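The D computed in the script is not the full squared distance: since ‖xi – cj‖² = ‖xi‖² – 2 xi·cj + ‖cj‖², the per-row ‖xi‖² term is constant per row and can be dropped before taking the row-wise minimum, then added back once as sumXsq. A small numpy check of that algebra (my own illustration, not part of the script):

import numpy as np

X = np.random.rand(50, 4)
C = np.random.rand(3, 4)

# Full pairwise squared distances between rows of X and rows of C
full = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

# The script's D omits the per-row sum(X^2) term ...
D = -2 * X @ C.T + (C ** 2).sum(axis=1)

# ... so adding it back reproduces the loss, exactly as in wcss = sumXsq + sum(minD)
sumXsq = (X ** 2).sum()
assert np.allclose(full.min(axis=1).sum(), sumXsq + D.min(axis=1).sum())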
7. K-means++ Initialization Heuristic
• Picks centroids from X at random, pushing them far apart
• Gets WCSS down to O(log k) × optimal in expectation
• How to pick centroids:
– Centroid c1: Pick uniformly at random from X-rows
– Centroid c2: Prob[c2 ← xi] = (1/Σ) · dist(xi, c1)²
– Centroid cj: Prob[cj ← xi] = (1/Σ) · min{dist(xi, c1)², …, dist(xi, cj–1)²}
– The probability of picking a row is proportional to its squared distance to the nearest centroid chosen so far (see the sketch below)
• If X is huge, we use a sample of X, different across runs
– Otherwise picking k centroids requires k passes over X
David Arthur, Sergei Vassilvitskii, “k-means++: The Advantages of Careful Seeding”, in SODA 2007
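A minimal numpy sketch of the seeding rule described above (illustration only; it does not include the sampling of X that the slide recommends for huge inputs):

import numpy as np

def kmeanspp_seed(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Centroid c1: picked uniformly at random from the rows of X
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance of each row to its nearest centroid chosen so far
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Pick the next centroid with probability proportional to that squared distance
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)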
8. K-means Predict Script
• Predictor and Evaluator in one:
– Given X (data) and C (centroids), assigns cluster labels prY
– Compares 2 clusterings, “predicted” prY and “specified” spY
• Computes WCSS, as well as Between-Cluster Sum of Squares
(BCSS) and Total Sum of Squares (TSS)
– Dataset X must be available
– If centroids C are given, also computes C-WCSS and C-BCSS
• Two ways to compare prY and spY :
– Same-cluster and different-cluster PAIRS from prY and spY
– For each prY-cluster find best-matching spY-cluster, and vice versa
– All reported as counts as well as percentages of the full count
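For reference, the three sums of squares relate as TSS = WCSS + BCSS when the centroids are the cluster means. A small numpy sketch of computing them from X, labels, and centroids C (my reading of the standard definitions, not the predict script itself):

import numpy as np

def sums_of_squares(X, labels, C):
    mu = X.mean(axis=0)                            # overall mean of the data
    tss = ((X - mu) ** 2).sum()                    # total sum of squares
    # Within-cluster SS w.r.t. C: this is C-WCSS for given centroids, WCSS if C are cluster means
    wcss = ((X - C[labels]) ** 2).sum()
    sizes = np.bincount(labels, minlength=len(C))
    bcss = (sizes[:, None] * (C - mu) ** 2).sum()  # between-cluster sum of squares
    return wcss, bcss, tss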
9. Weighted Non-Negative Matrix Factorization (WNMF)
• INPUT: X is non-negative (n × m)-matrix
– Example: Xij = 1 if person #i clicked ad #j, else Xij = 0
• INPUT (OPTIONAL): W is penalty (n × m)-matrix
– Example: Wij = 1 if person #i saw ad #j, else Wij = 0
• OUTPUT: (n × k)-matrix U, (m × k)-matrix V such that:
– k topics: Uic = affinity(person #i, topic #c), Vjc = affinity(ad #j, topic #c)
– Approximation: Xij ≈ Ui1 · Vj1 + Ui2 · Vj2 + … + Uik · Vjk
– Predict a “click” if for some #c both Uic and Vjc are high
min over U, V of ∑i≤n ∑j≤m Wij · (Xij – [U Vᵀ]ij)²   s.t. U ≥ 0, V ≥ 0
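To make the approximation and the “predict a click” rule concrete, a tiny numpy example (the data, sizes, and threshold choice are hypothetical, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, m, k = 100, 40, 5                              # persons, ads, latent topics
X = (rng.random((n, m)) < 0.05).astype(float)     # X[i, j] = 1 if person i clicked ad j
W = (rng.random((n, m)) < 0.30).astype(float)     # W[i, j] = 1 if person i saw ad j
U = rng.random((n, k))                            # U[i, c] = affinity(person i, topic c)
V = rng.random((m, k))                            # V[j, c] = affinity(ad j, topic c)

# The WNMF objective: sum over (i, j) of W[i, j] * (X[i, j] - [U V^T][i, j])^2
loss = np.sum(W * (X - U @ V.T) ** 2)

# Predict a "click" for (i, j) when the reconstructed entry U[i] . V[j] is high
i, j = 3, 17
score = U[i] @ V[j]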
11. WNMF: Multiplicative Update
• Easy to parallelize using SystemML
• Multiple runs help avoid bad local optima
• Must specify k: run for k = 1, 2, 3, … (as in k-means)
V ← V ∗ [ (W ∗ X)ᵀ U ] / [ (W ∗ (U Vᵀ))ᵀ U + ε ]
U ← U ∗ [ (W ∗ X) V ] / [ (W ∗ (U Vᵀ)) V + ε ]
(∗ and / denote elementwise multiplication and division; ε avoids division by zero)
Daniel D. Lee, H. Sebastian Seung, “Algorithms for Non-negative Matrix Factorization”, in NIPS 2000
12. Inside A Run of (W)NMF
• Assume that W is a sparse matrix
Unweighted NMF:

U = RND_U [, (r-1)*k + 1 : r*k];
V = RND_V [, (r-1)*k + 1 : r*k];
f_old = 0; i = 0;
f_new = sum ((X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * (X %*% V) / (U %*% (t(V) %*% V) + eps);
    V = V * t(t(U) %*% X) / (V %*% (t(U) %*% U) + eps);
    f_new = sum ((X - U %*% t(V)) ^ 2);
    i = i + 1;
}

Weighted NMF (sparse W):

U = RND_U [, (r-1)*k + 1 : r*k];
V = RND_V [, (r-1)*k + 1 : r*k];
f_old = 0; i = 0;
f_new = sum (W * (X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * ((W * X) %*% V) / ((W * (U %*% t(V))) %*% V + eps);
    V = V * (t(W * X) %*% U) / (t(W * (U %*% t(V))) %*% U + eps);
    f_new = sum (W * (X - U %*% t(V)) ^ 2);
    i = i + 1;
}
13. Sum-Product Rewrites
• Matrix chain product optimization
– Example: (U %*% t(V)) %*% V = U %*% (t(V) %*% V)
• Moving operators from big matrices to smaller ones
– Example: t(X) %*% U = t(t(U) %*% X)
• Opening brackets in expressions (ongoing research)
– Example: sum ((X – U %*% t(V))^2) = sum (X^2) – 2 * sum (X * (U %*% t(V))) + sum ((U %*% t(V))^2)
– K-means: D = rowSums (X ^ 2) – 2 * (X %*% t(C)) + t(rowSums (C ^ 2))
• Indexed sum rearrangements:
– sum ((U %*% t(V))^2) = sum ((t(U) %*% U) * (t(V) %*% V))
– sum (U %*% t(V)) = sum (colSums(U) * colSums(V))
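The two indexed-sum rearrangements are easy to sanity-check numerically; a small numpy check (mine, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
U = rng.random((500, 8))
V = rng.random((300, 8))

# sum((U %*% t(V))^2) == sum((t(U) %*% U) * (t(V) %*% V)) -- avoids the big 500 x 300 intermediate
assert np.isclose(np.sum((U @ V.T) ** 2), np.sum((U.T @ U) * (V.T @ V)))

# sum(U %*% t(V)) == sum(colSums(U) * colSums(V))
assert np.isclose(np.sum(U @ V.T), np.sum(U.sum(axis=0) * V.sum(axis=0)))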
14. Operator Fusion: W. Sq. Loss
• Weighted Squared Loss: sum (W * (X – U %*% t(V))^2)
– Common pattern for factorization algorithms
– W and X are usually very sparse (density < 0.001)
– Problem: the “outer” product U %*% t(V) creates three dense intermediates of the size of X
→ Fused w.sq.loss operator:
– Key observations: the sparse “W ∗” allows selective computation, and the “sum” aggregate significantly reduces memory requirements (see the sketch below)
(Diagram: fused operator tree evaluating sum (W ∗ (X – U %*% t(V))^2) directly from W, X, U, and t(V).)
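The effect of the fused operator can be sketched outside SystemML: visit only the nonzero cells of W, compute the needed entry of U %*% t(V) on the fly, and accumulate the sum, so no dense intermediate of the size of X is ever materialized. A minimal numpy/scipy illustration (not the actual fused operator; sizes and densities below are made up):

import numpy as np
from scipy.sparse import random as sprand

rng = np.random.default_rng(0)
n, m, k = 2000, 1000, 10
W = sprand(n, m, density=0.001, format="coo", random_state=0)   # very sparse weights
X = sprand(n, m, density=0.001, format="csr", random_state=1)   # very sparse data
U = rng.random((n, k))
V = rng.random((m, k))

# Naive evaluation: three dense n x m intermediates (U V^T, the difference, its square)
naive = np.sum(W.toarray() * (X.toarray() - U @ V.T) ** 2)

# "Fused" evaluation: only the nonzeros of W contribute, so reconstruct just those entries
loss = 0.0
for i, j, w in zip(W.row, W.col, W.data):
    loss += w * (X[i, j] - U[i] @ V[j]) ** 2

assert np.isclose(loss, naive)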