This document summarizes a presentation on cluster stability estimation and determining the optimal number of clusters in a dataset. The presentation proposes a method that draws random samples from the dataset and compares the partitions obtained from each sample to estimate cluster stability. It quantifies the consistency between partitions using minimal spanning trees and the Friedman-Rafsky test statistic. Experiments on synthetic and real-world datasets show that the method can accurately determine the true number of clusters by finding the partition that maximizes cluster stability.
Methods from Mathematical Data Mining (Supported by Optimization)
1. 4th International Summer School
Achievements and Applications of Contemporary
Informatics, Mathematics and Physics
National Technical University of Ukraine
Kiev, Ukraine, August 5-16, 2009
Gerhard-Wilhelm Weber * and Başak Akteke-Öztürk
Institute of Applied Mathematics
Middle East Technical University, Ankara, Turkey
* Faculty of Economics, Management and Law, University of Siegen, Germany
Center for Research on Optimization and Control, University of Aveiro, Portugal
EURO CBBM, EURO ORD, EURO CE*OC
August 8, 2009
2. Clustering Theory
Cluster Number and Cluster Stability Estimation
Z. Volkovich
Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
Z. Barzily
Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
G.-W. Weber
Departments of Scientific Computing, Financial Mathematics and Actuarial Sciences,
Institute of Applied Mathematics, Middle East Technical University, 06531, Ankara, Turkey
D. Toledano-Kitai
Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
3. Clustering
• Cluster analysis is an essential tool for “unsupervised” learning: it categorizes data
(objects, instances) into groups such that the similarity within a group
is much higher than the similarity between groups.
• This resemblance is often described by a
distance function.
4. Clustering
For a given set S ⊂ ℝ^d, a clustering algorithm CL
constructs a clustered set:
CL(S, int-part, k) = Π(S) = (π1(S) ,…, πk (S)),
such that CL(x) = CL(y) = i, if x and y are similar:
x, y ∈ πi(S), for some i=1,…,k;
and CL(x) ≠ CL(y), if x and y are dissimilar.
5. Clustering
The disjoint subsets πi(S), i = 1, …, k, are named
clusters:

π1(S) ∪ … ∪ πk(S) = S, and πi ∩ πj = ∅ for i ≠ j.
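As an illustration of such a partitioning operator CL, the following is a minimal Lloyd-style k-means sketch (illustrative only, not the authors' implementation; the seed-index initialization stands in for the int-part parameter):

```python
import numpy as np

def cluster(S, int_part, k, n_iter=50):
    """Minimal Lloyd-style CL(S, int-part, k): returns one cluster label per row of S."""
    centers = S[int_part].astype(float)              # initial partition given as k seed indices
    for _ in range(n_iter):
        d = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                    # assign each point to its nearest center
        for i in range(k):
            if np.any(labels == i):                  # skip empty clusters
                centers[i] = S[labels == i].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
S = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
labels = cluster(S, int_part=[0, 99], k=2)
# The labels induce disjoint clusters pi_1, ..., pi_k whose union is S.
```

Each point receives exactly one label, so the induced subsets are disjoint and cover S, as required above.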
7. Clustering
The iterative clustering process is usually carried out in two phases:
a partitioning phase and a quality assessment phase.
In the partitioning phase, a label is assigned to each element
in view of the assumption that, in addition to the observed features,
for each data item, there is a hidden, unobserved feature
representing cluster membership.
The quality assessment phase measures the grouping quality.
The outcome of the clustering process is the partition that achieves
the highest quality score.
Except for the data itself, two essential input parameters are
typically required: an initial partition and a suggested number of
clusters. Here, the parameters are denoted as
• int-part ;
• k.
8. The Problem
Partitions generated by the iterative algorithms are commonly
sensitive to initial partitions fed in as an input parameter.
Selection of “good” initial partitions is an essential
clustering problem.
Another problem arising here is choosing the right number of the
clusters. It is well known that this key task of the cluster analysis
is ill posed. For instance, the “correct” number of clusters in a
data set can depend on the scale in which the data are measured.
In this talk, we address the latter problem: determining
the number of clusters.
10. The Problem
Many approaches to this problem exploit the within-cluster
dispersion matrix (defined according to the pattern of a
covariance matrix). The span of this matrix (its column space)
usually decreases as the number of groups rises, and may have
a point at which it “falls”. Such an “elbow” in the graph locates,
in several known methods, the “true” number of clusters.
Stability-based approaches to the cluster validation problem
evaluate the partitions’ variability under repeated applications
of a clustering algorithm. Low variability is understood as
high consistency of the results obtained, and the number of clusters
that maximizes cluster stability is accepted as an estimate of the
“true” number of clusters.
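As an illustration of the elbow heuristic mentioned above (not part of the authors' stability method), one can track a within-cluster dispersion summary as k grows, here via SciPy's k-means distortion:

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

rng = np.random.default_rng(1)
# Three well-separated 2-d Gaussian groups, so the "true" k is 3.
X = whiten(np.vstack([rng.normal(c, 0.1, (100, 2)) for c in (0.0, 2.0, 4.0)]))
# Mean within-cluster distortion for k = 1..6; it drops sharply until k = 3,
# then flattens: the "elbow" locates the number of clusters.
distortion = [kmeans(X, k, seed=2)[1] for k in range(1, 7)]
```

The curve itself, not any single value, carries the information: the largest drop ends at the true number of groups.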
11. The Concept
In the current talk, the problem of determining the
true number of clusters is addressed by the cluster
stability approach.
We propose a method that assesses the geometrical
stability of a partition.
• We draw samples from the source data and estimate
the clusters by means of each of the drawn samples.
• We compare pairs of the partitions obtained.
• A pair is considered consistent if the two partitions
are close.
12. The Concept
• We quantify this closeness by the number of edges
connecting points from different samples in a
minimal spanning tree (MST) constructed for each one
of the clusters.
• We use the Friedman-Rafsky two-sample test
statistic, which measures these quantities. Under the
null hypothesis of homogeneity of the source data,
this statistic is approximately normally distributed.
So, the case of well-mingled samples within the clusters
leads to a normal distribution of the considered statistic.
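Counting these cross-sample MST edges (the Friedman-Rafsky count) can be sketched with SciPy; the helper name is illustrative:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def cross_edges(S, T):
    """Count MST edges joining a point of S to a point of T (the Friedman-Rafsky count)."""
    X = np.vstack([S, T])
    label = np.array([0] * len(S) + [1] * len(T))    # which sample each point came from
    mst = minimum_spanning_tree(cdist(X, X)).tocoo() # MST of the pooled sample
    return int(np.sum(label[mst.row] != label[mst.col]))

rng = np.random.default_rng(0)
S, T = rng.normal(0, 1, (100, 2)), rng.normal(0, 1, (100, 2))  # same distribution
far = rng.normal(10, 1, (100, 2))                              # shifted distribution
# cross_edges(S, T) is near the null mean 2*m*n/(m+n) = 100 (samples well mingled);
# cross_edges(S, far) is tiny, since a single long edge joins the two clouds.
```

Many cross edges indicate that the two samples are well mingled; very few indicate that they occupy distinct regions.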
14. The Concept
The left-side picture is an example of “a good cluster”,
where the number of edges connecting points from
different samples (marked by solid red lines) is
relatively large.
The right-side picture depicts a “poor situation”, where
only one (long) edge connects the (sub-)clusters.
15. The Two-Sample MST-Test
Henze and Penrose (1999) considered the asymptotic behavior of R_mn:
the number of edges of the minimal spanning tree which connect a point of S to a point of T.
Suppose that |S| = m → ∞ and |T| = n → ∞ such that
m/(m+n) → p ∈ (0, 1).
Introducing q = 1 − p and r = 2pq, they obtained:

( R_mn − 2mn/(m+n) ) / √(m+n)  →  N(0, σ_d²),

where the convergence is in distribution and N(0, σ_d²) denotes
the normal distribution with expectation 0 and variance

σ_d² := r (r + C_d (1 − 2r)), for some constant C_d

depending only on the space’s dimension d.
16. Concept
• Resting upon this fact, the standard score

Y_j := ( R_j − m/K ) / √(2m/K)

of the mentioned edge quantity is calculated
for each cluster j = 1, …, K,
where m is the sample size and
K denotes the number of clusters.
• The partition quality Ỹ is represented by the
worst cluster, corresponding to the
minimal standard score value obtained.
17. Concept
• It is natural to expect that the true number of
clusters is characterized by the empirical
distribution of the partition standard score
having the shortest left tail.
• The proposed methodology amounts to sequentially
creating the described distribution and estimating
its left asymmetry.
18. Concept
One of the important problems appearing here is the
so-called cluster coordination problem.
Actually, the same cluster can be tagged differently
across repeated runs of the algorithm.
This fact results from the inherent symmetry of
the partitions with respect to their cluster labels.
19. Concept
We solve this problem in the following way:
Let S = S1 ∪ S2. Consider three categorizations:
Π_K := Cl(S, K),
Π_K,1 := Cl(S1, K),
Π_K,2 := Cl(S2, K).
Thus, we get two partitions for each of the samples
Si, i = 1, 2. The first one is induced by Π_K and the
second one is Π_K,i, i = 1, 2.
20. Concept
For each one of the samples i = 1, 2, our purpose is
to find the permutation ψ of the set {1, …, K} which
minimizes the number of misclassified items:

ψ_i* = arg min_ψ Σ_{x ∈ X} I( ψ(α_K,i(x)) ≠ α_K(x) ),  i = 1, 2,

where I(z) is the indicator function of the event z and
α_K, α_K,i are the assignments defined by Π_K, Π_K,i,
correspondingly.
21. Concept
The well-known Hungarian method for solving
this problem has computational complexity of O(K³).
After changing the cluster labels of the partitions
Π_K,i, i = 1, 2, consistently with ψ_i*, i = 1, 2,
we can assume that these partitions are coordinated,
i.e., the clusters are consistently designated.
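This coordination step can be sketched with SciPy's Hungarian-method implementation (illustrative code, not the authors'; the helper name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def coordinate_labels(ref, other, K):
    """Relabel `other` so that it agrees with `ref` on as many items as possible.
    Solved as an assignment problem over the K x K agreement matrix
    (Hungarian method, O(K^3))."""
    agree = np.zeros((K, K), dtype=int)
    for a, b in zip(ref, other):
        agree[a, b] += 1                         # items labeled a by ref and b by other
    rows, cols = linear_sum_assignment(-agree)   # maximize total agreement
    psi = np.empty(K, dtype=int)
    psi[cols] = rows                             # permutation: other-label -> ref-label
    return psi[other]

ref = np.array([0, 0, 1, 1, 2, 2])
other = np.array([2, 2, 0, 0, 1, 1])             # same partition, permuted labels
# coordinate_labels(ref, other, 3) recovers [0, 0, 1, 1, 2, 2].
```

After this relabeling the two partitions use the same names for the same clusters, so per-cluster quantities can be compared directly.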
22. Algorithm
1. Choose the parameters: K*, J, m, Cl.
2. For K = 2 to K*
3. For j = 1 to J
4. S_j,1 = sample(X, m),  S_j,2 = sample(X \ S_j,1, m)
5. Calculate
Π_K,j = Cl(S_j,1 ∪ S_j,2, K),
Π_K,j,1 = Cl(S_j,1, K),
Π_K,j,2 = Cl(S_j,2, K).
6. Solve the coordination problem.
23. Algorithm
7. Calculate Y_j(k), k = 1, …, K, and Ỹ_j(K).
8. end for j
9. Calculate an asymmetry index (percentile) I_K
for { Ỹ_j(K) | j = 1, …, J }.
10. end for K
11. The “true” number of clusters is selected as the K
which yields the maximal value of the index.
Here, sample(S, m) is a procedure which selects a
random sample of size m from the set S, without
replacement.
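The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: SciPy's kmeans2 stands in for the clustering algorithm Cl (the experiments use PAM), the two samples have equal size m, and the 25% percentile serves as the asymmetry index I_K:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.optimize import linear_sum_assignment
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def coordinate(ref, other, K):
    # Permutation of {0..K-1} maximizing agreement with ref (Hungarian method).
    agree = np.zeros((K, K), dtype=int)
    for a, b in zip(ref, other):
        agree[a, b] += 1
    rows, cols = linear_sum_assignment(-agree)
    psi = np.empty(K, dtype=int)
    psi[cols] = rows
    return psi[other]

def worst_score(S1, S2, L1, L2, m, K):
    # Minimal per-cluster standard score of the cross-sample MST edge count.
    Y = []
    for j in range(K):
        A, B = S1[L1 == j], S2[L2 == j]
        if len(A) + len(B) < 3:
            Y.append(-np.inf)                    # degenerate cluster: worst possible
            continue
        X = np.vstack([A, B])
        sid = np.r_[np.zeros(len(A)), np.ones(len(B))]
        mst = minimum_spanning_tree(cdist(X, X)).tocoo()
        R = np.sum(sid[mst.row] != sid[mst.col])
        Y.append((R - m / K) / np.sqrt(2 * m / K))
    return min(Y)

def estimate_k(X, K_star=7, J=20, m=100, seed=0):
    rng = np.random.default_rng(seed)
    I = {}
    for K in range(2, K_star + 1):
        Y = []
        for _ in range(J):
            idx = rng.choice(len(X), 2 * m, replace=False)  # two disjoint samples
            S1, S2 = X[idx[:m]], X[idx[m:]]
            sd = int(rng.integers(1 << 30))
            _, L = kmeans2(np.vstack([S1, S2]), K, minit='++', seed=sd)   # Pi_K
            _, L1 = kmeans2(S1, K, minit='++', seed=sd)                   # Pi_K,1
            _, L2 = kmeans2(S2, K, minit='++', seed=sd)                   # Pi_K,2
            L1 = coordinate(L[:m], L1, K)        # coordinate sample partitions with Pi_K
            L2 = coordinate(L[m:], L2, K)
            Y.append(worst_score(S1, S2, L1, L2, m, K))
        I[K] = np.percentile(Y, 25)              # asymmetry index I_K
    return max(I, key=I.get)                     # K maximizing the index
```

The candidate K whose score distribution has the shortest left tail, and hence the largest percentile, is returned as the estimated number of clusters.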
24. Numerical Experiments
We have carried out various numerical experiments on synthetic
and real data sets. We choose K* = 7 in all tests, and we perform
10 trials for each experiment.
The results are presented via error-bar plots of the mean of the
sample percentiles within the trials. The error bars span two
standard deviations of the results, computed within the trials.
The standard version of the Partitioning Around Medoids (PAM)
algorithm has been used for clustering.
The empirical percentiles of 25%, 75% and 90% have been used
as the asymmetry indexes.
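The role of a percentile as a left-asymmetry index can be illustrated on synthetic score samples (the function name and data are ours, for illustration only): a heavy left tail pulls the percentile down, a short tail keeps it high.

```python
import numpy as np

def asymmetry_index(scores, pct=25):
    """Empirical percentile used as the left-asymmetry index I_K."""
    return float(np.percentile(scores, pct))

rng = np.random.default_rng(0)
short_tail = rng.normal(0, 1, 500)               # scores as expected at the "true" K
long_tail = np.concatenate([short_tail,
                            -np.abs(rng.normal(3, 1, 100))])  # extra low scores mixed in
# The true K is the one maximizing I_K: asymmetry_index(short_tail)
# exceeds asymmetry_index(long_tail).
```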
25. Numerical Experiments – Synthetic Data
The synthesized data are mixtures of 2-dimensional
Gaussian distributions with independent coordinates
sharing the same standard deviation σ.
The mean values of the components are placed on the
unit circle, at an angular distance of 2π/k̂ between neighbors.
Each data set contains 4000 items.
Here, we took J = 100 (J: number of samples) and
m = 200 (m: size of samples).
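A generator of this kind might be sketched as follows (illustrative; the function name is ours):

```python
import numpy as np

def make_ring_mixture(k_hat, sigma, n=4000, seed=0):
    """Mixture of k_hat 2-d Gaussians with independent coordinates and common
    standard deviation sigma; the means are equally spaced on the unit circle,
    at angular distance 2*pi/k_hat between neighbors."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * np.arange(k_hat) / k_hat
    means = np.column_stack([np.cos(angles), np.sin(angles)])
    comp = rng.integers(0, k_hat, size=n)        # component of each item
    return means[comp] + rng.normal(0, sigma, (n, 2))

X = make_ring_mixture(k_hat=4, sigma=0.3)        # the parameters of Example 1 below
```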
26. Synthetic Data - Example 1
The first data set has the parameters k̂ = 4 and σ = 0.3.
As we see, all three indexes clearly indicate
four clusters.
27. Synthetic Data - Example 2
The second synthetic data set has the parameters k̂ = 5
and σ = 0.3.
The components obviously overlap in this case.
28. Synthetic Data - Example 2
As can be seen, the true number of clusters has been
successfully found by all indexes.
29. Numerical Experiments – Real-World Data
First Data Sets
The first real data set was chosen from the text collection
http://ftp.cs.cornell.edu/pub/smart/ .
This set consists of the following three sub-collections:
DC0: Medlars Collection (1033 medical abstracts),
DC1: CISI Collection (1460 information science abstracts),
DC2: Cranfield Collection (1400 aerodynamics abstracts).
30. Numerical Experiments – Real-World Data
First Data Sets
We picked the 600 “best” terms, following the common
bag-of-words method.
It is known that this collection is well separated
by means of its first two leading principal components.
Here, we also took J=100 and m=200.
31. Real-World Data - First Data Sets
All the indexes attain their maximal values at K = 3,
i.e., the number of clusters is properly determined.
32. Numerical Experiments – Real-World Data
Second Data Set
Another considered data set is the famous
Iris Flower Data Set, available, for example, at
http://archive.ics.uci.edu/ml/datasets/Iris .
This data set is composed of 150 4-dimensional
feature vectors of three equally sized sets of iris flowers.
We choose J = 200 and a sample size of m = 70.
33. Real-World Data – Iris Flower Data Set
Our method reveals a three-cluster structure.
34. Conclusions -
The Rationale of Our Approach
• In this paper, we propose a novel approach, based on
the minimal spanning tree two-sample test, for
cluster stability assessment.
• The method quantifies the partitions’ features
through the test statistic computed within the clusters
built by means of sample pairs.
• The worst cluster, determined by the lowest
standardized statistic value, characterizes the
partition quality.
35. Conclusions -
The Rationale of Our Approach
• The departure from the theoretical model, which
suggests well-mingled samples within the clusters,
is described by the left tail of the score distribution.
• The shortest tail corresponds to the “true” number
of clusters.
• All presented experiments detect the true number
of clusters.
36. Conclusions
• In the case of the five-component Gaussian data set,
the true number of clusters was found even though
the clusters overlap to a certain extent.
• The four-component Gaussian data set contains
sufficiently separated components; therefore,
it is no surprise that the true number of clusters
is attained here.
37. Conclusions
• The analysis of the abstracts data set was carried out
with 600 terms, and the true number of clusters
was also detected.
• The Iris Flower data set is relatively difficult to
analyze, because two of its clusters are not
linearly separable. However, the true number
of clusters was found here as well.
38. References
Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., Cluster stability using minimal spanning trees,
ISI Proceedings of 20th Mini-EURO Conference Continuous Optimization and Knowledge-Based Technologies
(Neringa, Lithuania, May 20-23, 2008) 248-252.
Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., On a minimal spanning tree approach in the
cluster validation problem, to appear in the special issue of INFORMATICA at the occasion of 20th Mini-EURO
Conference Continuous Optimization and Knowledge Based Technologies (Neringa, Lithuania, May 20-23, 2008),
Dzemyda, G., Miettinen, K., and Sakalauskas, L., guest editors.
Volkovich, V., Barzily, Z., Weber, G.-W., and Toledano-Kitai, D., Cluster stability estimation based on a minimal
spanning trees approach, Proceedings of the Second Global Conference on Power Control and Optimization, AIP
Conference Proceedings 1159, Bali, Indonesia, 1-3 June 2009, Subseries: Mathematical and Statistical Physics; ISBN
978-0-7354-0696-4 (August 2009) 299-305; Hakim, A.H., Vasant, P., and Barsoum, N., guest eds.