This document summarizes an academic paper on optimal interval clustering and its applications to Bregman clustering and statistical mixture learning. It begins by introducing hard clustering and center-based clustering approaches. It then describes how k-means clustering is NP-hard in higher dimensions but polynomial-time in 1D using dynamic programming. The document outlines an optimal interval clustering algorithm using dynamic programming with runtime O(n2kT1(n)) or O(n2T1(n)) using a lookup table. It discusses how this can be applied to 1D Bregman clustering and learning statistical mixtures, providing experimental results on Gaussian mixture models. Finally, it considers perspectives on hierarchical clustering, dynamic clustering maintenance, and streaming approximations.
Optimal interval clustering and statistical mixture learning via dynamic programming
1. Optimal interval clustering: Application to
Bregman clustering and statistical mixture
learning
Frank Nielsen1 Richard Nock2
1Sony Computer Science Laboratories/Ecole Polytechnique
2UAG-CEREGMIA/NICTA
May 2014
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 1/26
2. Hard clustering: Partitioning the data set
Partition X = {x1, ..., xn} ⊂ X into k clusters
C1 ⊂ X, ..., Ck ⊂ X:
X =
k
i=1
Ci
Center-based (prototype) hard clustering: k-means [2],
k-medians, k-center, r -center [10], etc.
Model-based hard clustering: statistical mixtures maximizing
the complete likelihood (prototype=model paramter).
k-means: NP-hard when d > 1 and k > 1 [11, 7, 1].
k-medians and k-centers: NP-hard [12] (1984)
In 1D, k-means is polynomial [3, 15]: O(n2k).
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 2/26
3. Euclidean 1D k-means
1D k-means [8] has contiguous partition.
Solved by enumerating all n−1
k−1 partitions in 1D (1958).
Better than Stirling numbers of the second kind S(n, k) that
count all partitions.
Polynomial in time O(n2k) using Dynamic Programming
(DP) [3] (sketched in 1973 in two pages).
R package Ckmeans.1d.dp [15] (2011).
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 3/26
4. Interval clustering: Structure
Sort X ∈ X with respect to total order < on X in O(n log n).
Output represented by:
k intervals Ii = [xli
, xri
] such that Ci = Ii ∩ X.
or better k − 1 delimiters li (i ∈ {2, ..., k}) since ri = li+1 − 1
(i < k and rk = n) and l1 = 1.
[x1...xl2−1]
C1
[xl2
...xl3−1]
C2
... [xlk
...xn]
Ck
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 4/26
5. Objective function for interval clustering
Scalars x1 < ... < xn are partitioned contiguously
into k clusters: C1 < ... < Ck.
Clustering objective function:
min ek(X) =
k
j=1
e1(Cj )
c1(·): intra-cluster cost/energy
⊕: inter-cluster cost/energy (commutative, associative)
n = kp + 1 1D points equally distributed → k different optimal
clustering partitions
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 5/26
6. Examples of objective functions
In arbitrary dimension X = Rd :
r -clustering (r ≥ 1): =
e1(Cj ) = min
p∈X
⎛
⎝
x∈Cj
d(x, p)r
⎞
⎠
(argmin=prototype pj is the same whether we take power of 1
r
of sum or not)
Euclidean r -clustering: r = 1 median, r = 2 means.
k-center (limr→∞): = max
e1(Ci ) = min
p∈X
max
x∈Cj
d(x, p)
Discrete clustering: Search space in min is Cj instead of X.
Note that in 1D, s-norm distance is always d(p, q) = |p − q|,
independent of s ≥ 1.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 6/26
7. Optimal interval clustering by Dynamic Programming
Xj,i = {xj , ..., xi } (j ≤ i)
Xi = X1,i = {x1, ..., xi }
E = [ei,j]: n × k cost matrix, O(n × k) memory
ei,m = em(Xi )
Optimality equation:
ei,m = min
m≤j≤i
{ej−1,m−1 ⊕ e1(Xj,i )}
Associative/commutative operator ⊕ (+ or max).
Initialize with ci,1 = c1(Xi )
E: compute from left to right column, from bottom to top.
Best clustering solution cost is at en,k.
Time: n × k × O(n) × T1(n)=O(n2kT1(n)), O(nk) memory
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 7/26
8. Retrieving the solution: Backtracking
Use an auxiliary matrix S = [si.j] for storing the argmin.
Backtrack in O(k) time.
Left index lk of Ck stored at sn,k: lk = sn,k.
Iteratively retrieve the previous left interval indexes at entries
lj−1 = slj −1,i for j = k − 1, ..., j = 1.
Note that lj − 1 = n − k
l=j nl and lj − 1 = j−1
l=1 nl .
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 8/26
9. Optimizing time with a Look Up Table (LUT)
Save time when computing e1(Xj,i ) since we perform n × k × O(n)
such computations.
Look Up Table (LUT): Add extra n × n matrix E1 with
E1[j][i] = e1(Xj,i ).
Build in O(n2T1(n))...
Then DP in O(n2k)=O(n2T1(n))).
→ quadratic amount of memory (n > 10000...)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 9/26
10. DP solver with cluster size constraints
n−
i and n+
i : lower/upper bound constraints on ni = |Ci |
k
l=1 = n−
i ≤ n and k
l=1 = n+
i ≥ n.
When no constraints: add dummy constraints n−
i = 1 and
n+
i = n − k + 1.
nm = |Cm| = i − j + 1 such that n−
m ≤ nm ≤ n+
m.
→ j ≤ i + 1 − n−
m and j ≥ i + 1 − n+
m.
ei,m = min
max{1+ m−1
l=1 n−
l ,i+1−n+
m}≤j
j≤i+1−n−
m
{ej−1,m−1 ⊕ e1(Xj,i )} ,
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 10/26
11. Model selection from the DP table
m(k) = ek (X)
e1(X) decreases with k and reaches minimum when k = n.
Model selection: trade-off choose best model among all the models
with k ∈ [1, n].
Regularized objective function: ek (X) = ek(X) + f (k), f (k)
related to model complexity.
Compute the DP table for k = n, ..., 1 and avoids redundant
computations.
Then compute the criterion for the last line (indexed by n) and
choose the argmin of ek.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 11/26
12. A Voronoi cell condition for DP optimality
elements → interval clusters → prototypes
interval clusters ← prototypes
Voronoi diagram:
Partition X wrt. P = {p1, ..., pk }.
Voronoi cell:
V (pj) = {x ∈ X : dr
(x, pj ) ≤ dr
(x, pl ) ∀l ∈ {1, ..., k}}.
xr is a monotonically increasing function on R+, equivalent to
V (pj ) = {x ∈ X : d(x : pj ) < d(x : pl )}
DP guarantees optimal clustering when ∀P, V (pj ) is an interval
2-clustering exhibits the Voronoi bisector.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 12/26
13. 1-mean (centroid): O(n) time
min
p
n
i=1
(xi − p)2
D(x, p) = (x − p)2
, D (x, p) = 2(x − p), D (x, p) = 2
Convex optimization (existence and unique solution)
n
i=1
D (x, p) = 0 ⇒
n
i=1
xi − np = 0
Center of mass p = 1
n
n
i=1 xi − (barycenter)
Extends to Bregman divergence:
DF (x, p) = F(x) − F(p) − (x − p)F (p)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 13/26
14. 2-means: O(n log n) time
Find xl2
(n − 1 potential locations for xl : from x2 to xn):
min
xl2
{e1(C1) + e1(C2)}
Browse from left to right l2 = x2, ...., xn.
Update cost in constant time E2(l + 1) from E2(l) (SATs also
O(1)):
E2(l) = e2(x1...xl−1|xl ...xn)
μ1(l + 1) =
(l − 1)μ1(l) + xl
l
, μ2(l + 1) =
(n − l + 1)μ2(l) − xl
n − l
v1(l + 1) =
l
i=1
(xi − μ1(l + 1))2
=
l
i=1
x2
i − lμ2
1(l + 1)
ΔE2(l) =
l − 1
l
μ1(l) − xl
2
+
n − l + 1
n − l
μ2(l) − xl
2
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 14/26
15. 2-means: Experiments
Intel Win7 i7-4800
n Brute force SAT Incremental
300000 155.022 0.010 0.0091
1000000 1814.44 0.018 0.015
Do we need sorting and Ω(n log n) time? (k = 1 is linear time)
Note that MaxGap does not yield the separator (because
centroid is sum of squared distance minimizer)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 15/26
16. Optimal 1D Bregman k-means
Bregman information [2] e1 (generalizes cluster variance):
e1(Cj ) = min
xl ∈Cj
wl BF (xl : pj ). (1)
Expressed as [14]:
e1(Cj ) = xl ∈Cj
wl (pj F (pj ) − F(pj )) + xl ∈Cj
wl F(xl ) −
F (pj ) x∈Cj
wl x
process using Summed Area Tables [6] (SATs)
S1(j) = j
l=1 wl , S2(j) = j
l=1 wl xl , and S3(j) = j
l=1 wl F(xl ) in
O(n) time at preprocessing stage.
Evaluate the Bregman information e1(Xj,i ) in constant time O(1).
For example, i
l=j wl F(xl ) = S3(i) − S3(j − 1) with S3(0) = 0.
Bregman Voronoi diagrams have connected cells [4] thus DP yields
optimal interval clustering.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 16/26
17. Exponential families in statistics
Family of probability distributions:
F = {pF (x; θ) : θ ∈ Θ}
Exponential families [13]:
pF (x|θ) = exp (t(x)θ − F(θ) + k(x)) ,
For example:
univariate Rayleigh R(σ), t(x) = x2, k(x) = log x, θ = − 1
2σ2 ,
η = −1
θ , F(θ) = log − 1
2θ and F∗(η) = −1 + log 2
η .
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 17/26
18. Uniorder exponential families: MLE
Maximum Likelihood Estimator (MLE) [13]:
e1(Xj,i ) = ˆl(xj , ..., xi ) = F∗
(ˆηj,i ) +
1
i − j + 1
i
l=j
k(xl ).
with ˆηj,i = 1
i−j+1
i
l=j t(xl ).
By making a change of variable yl = t(xl ), and not accounting the
k(xl ) terms that are constant for any clustering, we get
e1(Xj,i ) ≡ F∗
⎛
⎝ 1
i − j + 1
i
l=j
yl
⎞
⎠
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 18/26
19. Hard clustering for learning statistical mixtures
Expectation-Maximization learns monotonically from an
initialization by maximizing the incomplete log-likelihood.
Mixture maximizing the complete log-likelihood:
lc(X; L, Ω) =
n
i=1
log(αli
p(xi ; θli
)),
L = {li }i : hidden labels.
max lc ≡ min
θ1,...,θk
n
i=1
k
min
j=1
(− log p(xi ; θj ) − log αj ).
Given fixed α and − log pF (x; θ) amounts to a dual Bregman
divergence[2].
Run Bregman k-means and DP yields optimal partition since
additively-weighted Bregman Voronoi diagrams are interval [4].
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 19/26
20. Hard clustering for learning statistical mixtures
Location families:
F = {f (x; μ) =
1
σ
f0(
x − μ
σ
), μ ∈ R}
f0 standard density, σ > 0 fixed. Cauchy or Laplacian families have
density graphs intersecting in exactly one point.
→ singly-connected Maximum Likelihood Voronoi cells.
Model selection: Akaike Information Criterion [5] (AIC):
AIC(x1, ..., xn) = −2l(x1, ..., xn) + 2k +
2k(k + 1)
n − k − 1
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 20/26
21. Experiments with: Gaussian Mixture Models (GMMs)
gmm1 score = −3.0754314021966658 (Euclidean k-means) gmm2
score = −3.038795325884112 (Bregman k-means, better)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 21/26
22. Conclusion
Generic DP for solving interval clustering:
O(n2
kT1(n))-time using O(nk) memory
O(n2
T1(n)) time using O(n2
) memory
Refine DP by adding minimum/maximum cluster size
constraints
Model selection from DP table
Two applications:
1D Bregman r -clustering. 1D Bregman k-means in O(n2
k)
time using O(nk) memory using Summed Area Tables (SATs)
Mixture learning maximizing the complete likelihood:
For uni-order exponential families amount to a dual Bregman
k-means on Y = {yi = t(xi )}i
For location families with density graph intersecting pairwise
in one point (Cauchy, Laplacian: ∈ exponential families)
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 22/26
23. Perspectives
Ω(n log n) for sorting.
Hierarchical center-based clustering with single-linkage: clustering
tree.
Best k-partition pruning using DP [?]:
Optimal for α = 2 +
√
3-perturbation resilient instances.
Time O(nk2 + nT1(n))
Question: How to maintain dynamically an optimal contiguous
clustering? (core-set approximation in the streaming model [9])
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 23/26
24. Bibliography I
Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat.
Np-hardness of Euclidean sum-of-squares clustering.
Machine Learning, 75(2):245–248, 2009.
Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh.
Clustering with Bregman divergences.
Journal of Machine Learning Research, 6:1705–1749, 2005.
Richard Bellman.
A note on cluster analysis and dynamic programming.
Mathematical Biosciences, 18(3-4):311 – 312, 1973.
Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.
Bregman Voronoi diagrams.
Discrete Computational Geometry, 44(2):281–307, September 2010.
J. Cavanaugh.
Unifying the derivations for the Akaike and corrected Akaike information criteria.
Statistics & Probability Letters, 33(2):201–208, April 1997.
Franklin C. Crow.
Summed-area tables for texture mapping.
In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques,
SIGGRAPH ’84, pages 207–212, New York, NY, USA, 1984. ACM.
Sanjoy Dasgupta.
The hardness of k-means clustering.
Technical Report CS2008-0916.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 24/26
25. Bibliography II
Walter D Fisher.
On grouping for maximum homogeneity.
Journal of the American Statistical Association, 53(284):789–798, 1958.
Sariel Har-Peled and Akash Kushal.
Smaller coresets for k-median and k-means clustering.
In Proceedings of the Twenty-first Annual Symposium on Computational Geometry, SCG ’05, pages
126–134, New York, NY, USA, 2005. ACM.
Meizhu Liu, Baba C. Vemuri, Shun ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2407–2419, 2012.
Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan.
The planar k-means problem is NP-hard.
Theoretical Computer Science, 442:13–21, 2012.
Nimrod Megiddo and Kenneth J Supowit.
On the complexity of some common geometric location problems.
SIAM journal on computing, 13(1):182–196, 1984.
Frank Nielsen.
k-mle: A fast algorithm for learning statistical mixture models.
CoRR, abs/1203.5181, 2012.
Frank Nielsen and Richard Nock.
Sided and symmetrized Bregman centroids.
IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 25/26
26. Bibliography III
Haizhou Wang and Mingzhou Song.
Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming.
R Journal, 3(2), 2011.
c 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc. 26/26