A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Assessing the quality of a clustering
Christian Hennig
1. A short introduction to cluster analysis
Cluster analysis is about finding groups in data.
[Figure: scatterplot matrix of four variables (var 1 to var 4), axes roughly −0.4 to 0.4; points are plotted as their cluster numbers (1 to 9).]
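1.1 Cluster analysis methods
1.1.1 k-means (Steinhaus 1956; MacQueen 1967)
$$\sum_{i=1}^{n} \left\| x_i - \bar{x}_{C(i)} \right\|^2 = \min!$$
Here $C(i)$ is the cluster of observation $x_i$ and $\bar{x}_{C(i)}$ is the mean of that cluster.
Represents all objects by their cluster centroid;
favours “compact” clusters.
Variant: don’t square the distances, and use centroids other than the mean (“pam”).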
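A minimal sketch of the criterion above, assuming Euclidean data stored as tuples. The helper names (`kmeans_objective`, `lloyd_step`) are illustrative, and the alternating assign/update step is the common Lloyd-style heuristic, which the slides do not spell out.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_objective(points, labels):
    """Sum over i of ||x_i - mean of cluster C(i)||^2."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    centroids = {c: mean(members) for c, members in clusters.items()}
    return sum(sq_dist(p, centroids[c]) for p, c in zip(points, labels))


def lloyd_step(points, centroids):
    """One iteration: assign each point to its nearest centroid, then update means."""
    labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
              for p in points]
    new_centroids = []
    for j in range(len(centroids)):
        members = [p for p, c in zip(points, labels) if c == j]
        # Keep the old centroid if a cluster happens to be empty.
        new_centroids.append(mean(members) if members else centroids[j])
    return labels, new_centroids


# Toy data (made up for illustration): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels, centroids = lloyd_step(points, [(1.0, 1.0), (9.0, 1.0)])
objective = kmeans_objective(points, labels)
```

Repeating `lloyd_step` until the labels stop changing gives a local, not necessarily global, minimum of the objective.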
[Figure: scatterplot of a clustering on the axes MDS 1 and MDS 2 (both roughly −0.4 to 0.4); points are plotted as their cluster numbers.]
1.1.2 Gaussian mixture model (Pearson 1894)
$$f(x) = \sum_{j=1}^{k} \pi_j \, \varphi_{a_j, \Sigma_j}(x).$$
Clusters are described by Gaussian distributions.
Elliptical clusters, flexible size and shape.
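A minimal one-dimensional sketch of the mixture density, assuming the parameters are already known (fitting them, for instance by the EM algorithm, is not shown). The function names are illustrative; `posterior_probs` shows the usual way of turning a mixture into cluster assignments, which is an addition to what the slide states.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, weights, means, variances):
    """f(x) = sum_j pi_j * phi_{a_j, sigma_j^2}(x)."""
    return sum(w * normal_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


def posterior_probs(x, weights, means, variances):
    """Probability that x belongs to each component, given the mixture."""
    parts = [w * normal_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(parts)
    return [p / total for p in parts]


# Hypothetical parameters: two equally weighted unit-variance components.
weights, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
density_mid = mixture_density(2.0, weights, means, variances)
post = posterior_probs(0.0, weights, means, variances)
```

Assigning each observation to the component with the largest posterior probability yields a clustering.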
[Figure: scatterplot of a clustering on the axes MDS 1 and MDS 2 (both roughly −0.4 to 0.4); points are plotted as their cluster numbers (1 to 9).]
1.1.3 Classical hierarchical methods
Operate on dissimilarity matrices;
compute dissimilarity measure for every pair of observations.
Can use Euclidean distance,
but also tailor-made distances for other data formats.
“Cluster”: a collection of similar objects,
dissimilar to the others.
Genetic data: 236 Tetragonula bees, 13 allele pairs
[,1] [,2] [,3] [,4] [,5] [,6] (...)
[1,] "NO" "AA" "PP" "HH" "EH" "FF"
[2,] "EO" "AA" "PP" "HH" "GH" "FF"
[3,] "NQ" "AA" "PT" "HH" "GF" "EF"
[4,] "OO" "AA" "PP" "GH" "GH" "EF"
[5,] "OO" "AA" "PP" "GH" "GH" "EF"
[6,] "LN" "AA" "PP" "HH" "EG" "FE"
(...)
Compute “shared allele distance”.
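A sketch of one common definition of a shared allele distance: one minus the proportion of alleles two individuals share, taken over all loci. The exact definition used in the talk may differ, and the function names are illustrative. The rows are the first six loci of bees 1, 2, 4 and 5 from the table above.

```python
from collections import Counter


def shared_alleles(g1, g2):
    """Number of alleles (0, 1 or 2) that two genotypes such as "EH" and "GH" share."""
    return sum((Counter(g1) & Counter(g2)).values())


def shared_allele_distance(ind1, ind2):
    """One minus the proportion of shared alleles over all loci."""
    shared = sum(shared_alleles(a, b) for a, b in zip(ind1, ind2))
    total = 2 * len(ind1)  # two alleles per locus
    return 1.0 - shared / total


bee1 = ["NO", "AA", "PP", "HH", "EH", "FF"]
bee2 = ["EO", "AA", "PP", "HH", "GH", "FF"]
bee4 = ["OO", "AA", "PP", "GH", "GH", "EF"]
bee5 = ["OO", "AA", "PP", "GH", "GH", "EF"]

d12 = shared_allele_distance(bee1, bee2)  # differ in one allele at two loci
d45 = shared_allele_distance(bee4, bee5)  # identical on these six loci
```

Because only six of the 13 loci are used here, the values need not match a distance computed from the full data.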
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00 0.21 0.33 0.29 0.25
[2,] 0.21 0.00 0.33 0.25 0.21
[3,] 0.33 0.33 0.00 0.29 0.33 (...)
[4,] 0.29 0.25 0.29 0.00 0.08
[5,] 0.25 0.21 0.33 0.08 0.00
(...)
The dataset seen before is a Euclidean approximation
(multidimensional scaling, “MDS”) of this distance matrix.
1.1.3 Classical hierarchical methods
Operate on dissimilarities and produce hierarchical trees
(originally motivated by biological classification).
Differ in definition of “dissimilarity between clusters”.
[Figure: Cluster Dendrogram, hclust (*, "single") applied to as.dist(tai$distmat); vertical axis Height, 0.0 to 0.6; leaves labelled by observation number.]
Single Linkage (Florek and Perkal 1951):
$$\tilde{d}(A, B) = \min_{a \in A,\, b \in B} d(a, b)$$
Complete Linkage:
$$\tilde{d}(A, B) = \max_{a \in A,\, b \in B} d(a, b)$$
Average Linkage:
$$\tilde{d}(A, B) = \operatorname{ave}_{a \in A,\, b \in B} d(a, b)$$
These can deliver quite different clusterings
(complete linkage: very compact clusters;
single linkage: separated but possibly widespread clusters).
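The three between-cluster dissimilarities can be sketched directly on a dissimilarity matrix. The matrix below is the 5 × 5 excerpt of the bee distances shown earlier; the function names and the choice of the two index sets are illustrative.

```python
# Excerpt of the shared allele distance matrix (first five bees).
D = [
    [0.00, 0.21, 0.33, 0.29, 0.25],
    [0.21, 0.00, 0.33, 0.25, 0.21],
    [0.33, 0.33, 0.00, 0.29, 0.33],
    [0.29, 0.25, 0.29, 0.00, 0.08],
    [0.25, 0.21, 0.33, 0.08, 0.00],
]


def pair_distances(D, A, B):
    """All dissimilarities d(a, b) with a in A and b in B (index lists)."""
    return [D[a][b] for a in A for b in B]


def single_linkage(D, A, B):
    return min(pair_distances(D, A, B))


def complete_linkage(D, A, B):
    return max(pair_distances(D, A, B))


def average_linkage(D, A, B):
    dists = pair_distances(D, A, B)
    return sum(dists) / len(dists)


# Zero-based indices: bees 1 and 2 versus bees 4 and 5.
A, B = [0, 1], [3, 4]
single = single_linkage(D, A, B)      # smallest of 0.29, 0.25, 0.25, 0.21
complete = complete_linkage(D, A, B)  # largest of the same four values
average = average_linkage(D, A, B)    # their mean
```

An agglomerative method repeatedly merges the two clusters with the smallest such value; only the definition of the value changes between the three variants.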
1.1.4 Spectral clustering (Shi and Malik 2000)
Dissimilarity-based nonlinear dimension reduction
for k-means.
[Figure: scatterplot of a clustering on the axes ctai$points[,1] and ctai$points[,2] (both roughly −0.4 to 0.4); points are plotted as their cluster numbers (1 to 7).]
1.1.5 Density-based methods
such as “DBSCAN” (Ester et al. 1996):
join each observation with all its neighbouring points,
and merge neighbourhoods if they share enough points.
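A compact sketch of the DBSCAN idea, assuming Euclidean points and the two usual tuning parameters: a radius `eps` and a minimum neighbourhood size `min_pts`. It is a simplified illustration, not the reference implementation; the label "N" for points that end up in no cluster is a choice made here.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, "N" for noise."""
    n = len(points)
    # Neighbourhood of i: all points within distance eps (including i itself).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = "N"  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == "N":
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Toy data (made up): two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Unlike k-means, the number of clusters is not fixed in advance; it follows from `eps` and `min_pts`.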
[Figure: scatterplot of a clustering on the axes ctai$points[,1] and ctai$points[,2] (both roughly −0.4 to 0.4); points are plotted as their cluster numbers (1 to 7) or as “N”.]
1.1.6 Further issues in cluster analysis
Number of clusters
Cluster validation
Dissimilarity definition
Choice of method
2. Benchmarking and measurement of quality
Outline: Which clustering is better?; Benchmarking approaches; Cluster validation indexes; My general philosophy; Typical clustering aims
Which clustering is better?
(Old Faithful geyser data)
[Figure: two scatterplots with axes waiting and duration, panels titled “mclust” and “pam”.]
Which clustering is better?
[Figure: two scatterplots, each with axes xy[,1] (−10 to 30) and xy[,2] (0 to 40).]
Benchmarking approaches:
Real datasets with known classes
Simulated datasets from mixture distributions
Real datasets without known classes
With a known truth, one can compute misclassification rates.
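Cluster labels are arbitrary, so a misclassification rate needs a matching between found clusters and true classes. The sketch below tries every one-to-one matching and keeps the best; this brute-force search and the function name are assumptions made for illustration, and it is only practical for a small number of clusters.

```python
from itertools import permutations


def misclassification_rate(truth, clustering):
    """Smallest error rate over all one-to-one matchings of clusters to classes."""
    true_labels = sorted(set(truth))
    found_labels = sorted(set(clustering))
    if len(found_labels) > len(true_labels):
        raise ValueError("this sketch assumes no more clusters than true classes")
    best_errors = len(truth)
    # Try every assignment of distinct true classes to the found clusters.
    for perm in permutations(true_labels, len(found_labels)):
        mapping = dict(zip(found_labels, perm))
        errors = sum(1 for t, c in zip(truth, clustering) if mapping[c] != t)
        best_errors = min(best_errors, errors)
    return best_errors / len(truth)


# Made-up example: cluster 2 matches class "a" except for one observation.
truth = ["a", "a", "a", "b", "b", "b"]
clustering = [2, 2, 1, 1, 1, 1]
rate = misclassification_rate(truth, clustering)
```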
Disadvantages of benchmarking with known truth
In datasets with known classes,
clustering is not of real scientific interest.
Such datasets deviate systematically from real clustering problems.
The fact that we know certain true classes
doesn’t preclude other legitimate/“true” clusterings.
Classes in supervised classification problems
may not qualify as data analytic clusters.
So there could be better truths than the known one.
How true are the true given classes?
(Hennig and Liao 2013, social stratification data)
7 standard occupation classes such as “manual workers”,
“managerials and professionals”, “not working”
These are not “data analytic clusters”.
Mixture components aren’t always “data analytic clusters” either.
[Figure: scatterplot with axes xdata$x[,1] (0 to 100) and xdata$x[,2] (−20 to 40); points are plotted as numbers 1 to 5.]
Using a known truth is useful and fair enough,
but we also want to evaluate clusterings
on data for which the truth is not known.
There is a range of cluster validation indexes
measuring clustering quality, such as the
Average silhouette width (ASW)
(Kaufman and Rousseeuw 1990):
$$sw(i, C) = \frac{b(i, C) - a(i, C)}{\max(a(i, C),\, b(i, C))},$$
$$a(i, C) = \frac{1}{|C_j| - 1} \sum_{x \in C_j} d(x_i, x), \qquad b(i, C) = \min_{l:\, x_i \notin C_l} \frac{1}{|C_l|} \sum_{x \in C_l} d(x_i, x),$$
where $C_j$ is the cluster containing $x_i$.
Maximum average sw ⇒ good C.
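A direct sketch of the average silhouette width on a dissimilarity matrix. Setting the width of a point in a singleton cluster to 0 is the usual convention and an assumption here; the sketch also assumes at least two clusters and that not all dissimilarities are zero.

```python
def average_silhouette_width(D, labels):
    """Mean of sw(i, C) over all observations, for dissimilarity matrix D."""
    n = len(labels)
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    widths = []
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a: average dissimilarity to the other members of the own cluster.
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest average dissimilarity to any other cluster.
        b = min(sum(D[i][j] for j in members) / len(members)
                for c, members in clusters.items() if c != labels[i])
        widths.append((b - a) / max(a, b))
    return sum(widths) / n


# Made-up one-dimensional data with two obvious groups.
x = [0.0, 1.0, 10.0, 11.0]
D = [[abs(u - v) for v in x] for u in x]
asw_good = average_silhouette_width(D, [1, 1, 2, 2])
asw_bad = average_silhouette_width(D, [1, 2, 1, 2])
```

The natural grouping gets a value close to 1, while the interleaved grouping gets a negative value.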
“One size fits all” approach.
Homogeneity will normally dominate here:
[Figure: two scatterplots, each with axes xy[,1] (−10 to 30) and xy[,2] (0 to 40).]
My general philosophy
There are various different aims of clustering.
Measure them separately to characterise
what a method does best,
instead of producing a single ranking.
Typical clustering aims
Between-cluster separation
Within-cluster homogeneity (low distances)
Within-cluster homogeneous distributional shape
Good representation of data by centroids
Little loss of information
from original distance between objects.
Clusters are regions of high density
without within-cluster gaps
Uniform cluster sizes
Stability
These may be in conflict with each other.
[Figure: two scatterplots, each with axes xy[,1] (−10 to 30) and xy[,2] (0 to 40).]
E.g., pattern recognition in images
requires separation,
clustering for information reduction requires
good representation by centroids,
groups in social network analysis shouldn’t have
large within-cluster gaps,
underlying “true” classes (biological species)
may cause homogeneous distributional shapes.
3. Cluster quality statistics
Outline: Measuring between-cluster separation; Measuring “density mountains vs. valleys”; Other statistics
Measuring between-cluster separation
There are several ways of measuring separation (as for the other aims).
Straightforward: the minimum distance between any two clusters,
or the distance between centroids (e.g., k-means).
[Figure: scatterplot with axes waiting and duration, shown twice; in the second version two points are marked “M”.]
These measure quite different concepts of separation.
(min distance relies on only two points;
centroid distance ignores what goes on at border.)
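The two straightforward measures can disagree strongly, as a small sketch with Euclidean points shows. The function names and the toy data are illustrative.

```python
import math


def min_separation(points, labels):
    """Smallest distance between two points from different clusters."""
    return min(
        math.dist(p, q)
        for p, cp in zip(points, labels)
        for q, cq in zip(points, labels)
        if cp != cq
    )


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def centroid_distance(points, labels, c1, c2):
    """Distance between the centroids of clusters c1 and c2."""
    g1 = [p for p, c in zip(points, labels) if c == c1]
    g2 = [p for p, c in zip(points, labels) if c == c2]
    return math.dist(centroid(g1), centroid(g2))


# Made-up data: the clusters nearly touch, but their centroids are far apart.
points = [(0.0, 0.0), (4.0, 0.0), (5.0, 0.0), (15.0, 0.0)]
labels = [1, 1, 2, 2]
gap = min_separation(points, labels)
between_centroids = centroid_distance(points, labels, 1, 2)
```

Here the minimum distance depends only on the two border points, while the centroid distance does not reflect how close the clusters come at the border.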
p-separation index:
A more stable version of “min distance”:
the average distance to the nearest point in a different cluster,
taken over the p = 10% “border” points of each cluster.
[Figure: scatterplot with axes waiting and duration; a subset of the points is marked “X”.]
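A sketch of the p-separation index as described above, assuming Euclidean points and at least two clusters. The border points of a cluster are taken to be the proportion p of its points that lie closest to another cluster; the rounding rule (round up, at least one point per cluster) is an assumption made here.

```python
import math


def p_separation_index(points, labels, p=0.1):
    """Average nearest-other-cluster distance over each cluster's border points."""
    n = len(points)
    # For every point: distance to the closest point of a different cluster.
    nearest_other = [
        min(math.dist(points[i], points[j])
            for j in range(n) if labels[j] != labels[i])
        for i in range(n)
    ]
    border = []
    for c in set(labels):
        dists = sorted(nearest_other[i] for i in range(n) if labels[i] == c)
        k = max(1, math.ceil(p * len(dists)))  # number of border points
        border.extend(dists[:k])
    return sum(border) / len(border)


# Made-up data: two rows of ten points with a gap of width 3 between them.
points = [(float(x), 0.0) for x in range(10)] + \
         [(float(x), 0.0) for x in range(12, 22)]
labels = ["a"] * 10 + ["b"] * 10
index_10 = p_separation_index(points, labels, p=0.1)
index_20 = p_separation_index(points, labels, p=0.2)
```

With p = 0.1 only the single closest point of each cluster counts; larger p averages over more of the border.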
Measuring “density mountains vs. valleys”
Index that measures whether clusters correspond
to “density mountains”,
and whether “valleys” are between clusters.
Note: This is current research and may be revised.
Two aspects:
(a) Density goes down from mode;
no gaps and valleys within clusters.
(b) Cluster borders are valleys;
they don’t run through mountains.
Estimate density by weighted count of close points
(“kernel density”).
[Figure: kernel function k(x) plotted against x, with the 10% quantile of within-cluster distances marked.]
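The “weighted count of close points” can be sketched as follows. The kernel shape is an assumption, since the plot did not survive extraction: a weight that falls linearly from 1 at distance 0 to 0 at the 10% quantile of within-cluster distances. The name `kernel_density` is hypothetical, and each point counts itself with weight 1.

```python
import numpy as np

def kernel_density(dist, labels, p=0.1):
    """Weighted count of close points, with a linearly decreasing kernel (assumed)."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    # Bandwidth: p-quantile of all within-cluster distances (must be > 0).
    q = np.quantile(dist[same & off_diag], p)
    weights = np.clip(1.0 - dist / q, 0.0, None)  # zero beyond the bandwidth
    return weights.sum(axis=1)

# Toy data: three close points and one far away, all in one cluster.
pos = np.array([0.0, 1.0, 2.0, 10.0])
dmat = np.abs(pos[:, None] - pos[None, :])
dens = kernel_density(dmat, np.zeros(4, dtype=int), p=0.5)
print(dens)  # the middle point of the close group gets the largest value
```

The example uses p = 0.5 only because the toy data set is tiny; the slide uses the 10% quantile.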
[Figure: single linkage clustering (sinlink), g = 2; duration against waiting.]
Start from cluster modes
[Figure: sinlink, g = 2, Step 1; duration against waiting, one point marked “X”.]
Connect closest point to cluster
[Figure: sinlink, g = 2, Step 3; duration against waiting, one point marked “X”.]
As long as density goes down, no penalty
[Figure: sinlink, g = 2, Step 6; duration against waiting, one point marked “X”.]
Penalty for density increase
[Figures: sinlink, g = 2, Steps 98 to 108 and Step 297; duration against waiting, one point marked “X” in each.]
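The walk through a cluster (start from the mode, connect the closest point, penalise density increases) might look roughly like this. Since the index is described as current research, the specifics are assumptions: the density of a newly connected point is compared with that of the already connected point it attaches to, and the penalty is the squared increase. The function name is hypothetical.

```python
import numpy as np

def density_increase_penalty(dist, labels, density):
    """Sum of squared density increases met while growing each cluster from its mode."""
    labels = np.asarray(labels)
    density = np.asarray(density, dtype=float)
    penalty = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        d = dist[np.ix_(idx, idx)]
        h = density[idx]
        visited = np.zeros(len(idx), dtype=bool)
        visited[np.argmax(h)] = True  # start from the cluster mode
        while not visited.all():
            # Closest not yet connected point to the connected set.
            sub = d[np.ix_(visited, ~visited)]
            i, j = np.unravel_index(np.argmin(sub), sub.shape)
            old = np.flatnonzero(visited)[i]
            new = np.flatnonzero(~visited)[j]
            if h[new] > h[old]:  # density goes up again: penalise
                penalty += (h[new] - h[old]) ** 2
            visited[new] = True
    return penalty

# Toy data: four points on a line in one cluster, with hand-picked densities.
pos = np.arange(4.0)
dmat = np.abs(pos[:, None] - pos[None, :])
one_cluster = np.zeros(4, dtype=int)
print(density_increase_penalty(dmat, one_cluster, [3.0, 2.0, 1.0, 0.5]))  # 0.0
print(density_increase_penalty(dmat, one_cluster, [3.0, 2.0, 1.0, 2.0]))  # 1.0
```

In the second call the density rises from 1 to 2 at the last point, which suggests a second “mountain” inside the cluster and is penalised.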
Add penalty: density × density from other clusters
[Figure: spectral clustering (specc), g = 3; duration against waiting, with points marked “P”.]
Other statistics
Within-cluster average distance
Within-cluster similarity measure to normal/uniform
Within-cluster (squared) distance to centroid
ρ(distance, cluster-induced distance) (Hubert’s Γ)
Entropy of cluster sizes
Average largest within-cluster gap
Variation of clusterings on bootstrapped data
Standardise all indexes to [0, 1] so that “large is good”.
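Two of the statistics listed above are simple enough to sketch directly. Assumptions: the cluster-induced distance is taken as 0 for pairs in the same cluster and 1 otherwise, ρ is the Pearson correlation, and the entropy is divided by log(K) to bring it into [0, 1]. The function names are hypothetical.

```python
import numpy as np

def hubert_gamma(dist, labels):
    """Pearson correlation between pairwise distances and 0/1 cluster-induced distances."""
    labels = np.asarray(labels)
    upper = np.triu_indices(len(labels), k=1)  # each pair once
    induced = (labels[:, None] != labels[None, :]).astype(float)
    return float(np.corrcoef(dist[upper], induced[upper])[0, 1])

def size_entropy(labels, normalise=True):
    """Entropy of the cluster size proportions, optionally scaled to [0, 1]."""
    _, counts = np.unique(labels, return_counts=True)
    prop = counts / counts.sum()
    h = float(-(prop * np.log(prop)).sum())
    if normalise and len(counts) > 1:
        h /= np.log(len(counts))
    return h

# Toy data: two tight, well separated clusters of equal size.
pos = np.array([0.0, 1.0, 10.0, 11.0])
lab = np.array([0, 0, 1, 1])
dmat = np.abs(pos[:, None] - pos[None, :])
print(hubert_gamma(dmat, lab))  # close to 1
print(size_entropy(lab))        # equal sizes give the maximum value
```

Both values already follow the “large is good” convention: a high Γ means the clustering reflects the distance structure, and a high entropy means balanced cluster sizes.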
4. Examples
[Figure: two scatterplots of xy[,2] against xy[,1].]
                3-means  mclust-3
ave within        0.811     0.643
sep index         0.163     0.306
density index     0.977     0.978
within gap        0.927     0.949
[Figure: six scatterplots of duration against waiting, one per clustering: mclust, pam, spectral, ave.linkage, single linkage, pdfCluster (3).]
            mclust   pam  spect  ave.l  sing.l  pdf3
ave within    0.71  0.95   0.82   0.90    0.04  0.98
sep index     0.98  0.30   0.94   0.60    0.99  0.78
density       0.99  0.44   0.70   0.63    0.59  0.99
gap           0.14  0.46   0.46   0.46    0.99  0.48
gamma         0.81  0.91   0.92   0.96    0.06  0.98
normality     0.69  0.44   0.45   0.48    0.11  0.52
Note: These values are quantile-standardised;
the implementation of this in fpc is still to come.
[Figure: quantile-calibrated density index (dindex, scale 0 to 1) against number of clusters (2 to 5) for kmeans, avelink, sinlink, comlink, mclust, pam, specc and pdfclus.]
6. Discussion
Clustering quality is multidimensional.
Provide multidimensional evaluation, characterising a method’s behaviour.
Can aggregate criteria by weighted mean given well-justified weights.
Benchmarking without known truth and comparison of clusterings in practice.
Required: standardisation to compare different indexes and numbers of clusters.
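As an illustration of aggregation by weighted mean, the sketch below reuses the quantile-standardised values from the six-method table in the Examples section. The weights are purely hypothetical and are not justified by the talk; in practice they would have to reflect the aims of the clustering.

```python
import numpy as np

methods = ["mclust", "pam", "spect", "ave.l", "sing.l", "pdf3"]
criteria = ["ave within", "sep index", "density", "gap", "gamma", "normality"]
# Rows: criteria, columns: methods (values copied from the Examples table).
scores = np.array([
    [0.71, 0.95, 0.82, 0.90, 0.04, 0.98],
    [0.98, 0.30, 0.94, 0.60, 0.99, 0.78],
    [0.99, 0.44, 0.70, 0.63, 0.59, 0.99],
    [0.14, 0.46, 0.46, 0.46, 0.99, 0.48],
    [0.81, 0.91, 0.92, 0.96, 0.06, 0.98],
    [0.69, 0.44, 0.45, 0.48, 0.11, 0.52],
])

def aggregate(scores, weights):
    """Weighted mean over criteria, one value per method."""
    weights = np.asarray(weights, dtype=float)
    return weights @ scores / weights.sum()

equal = aggregate(scores, np.ones(len(criteria)))
# Hypothetical weighting for a user who cares mostly about separation.
sep_heavy = aggregate(scores, [1, 3, 1, 1, 1, 1])

for name, a, b in zip(methods, equal, sep_heavy):
    print(f"{name:8s} equal={a:.3f} separation-heavy={b:.3f}")
```

Aggregation is only meaningful because all indexes are on a common [0, 1] scale with “large is good”; different weights can change the ranking of the methods.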
Much of this is implemented in the R package fpc;
more will follow.
Soon to come:
IFCS Cluster Benchmarking Repository
(Iven Van Mechelen, Nema Dean, Isabelle Guyon,
Anne-Laure Boulesteix, Doug Steinley, Friedrich Leisch,
Christian Hennig, Rainer Dangl)
This work is supported by EPSRC Grant EP/K033972/1.
A bit of marketing:
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 

Recently uploaded (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 

Christian Hennig- Assessing the quality of a clustering

  • 1. Assessing the quality of a clustering. Christian Hennig. Outline: A short introduction to cluster analysis; Benchmarking and measurement of quality; Cluster quality statistics; Examples; Discussion.
  • 2. 1. A short introduction to cluster analysis. Cluster analysis is about finding groups in data. [Figure: scatterplot matrix of var 1, var 2, var 3 and var 4; points are labelled by cluster number.]
  • 3-5. 1.1 Cluster analysis methods. 1.1.1 k-means (Fix & Hodges 1951): $\sum_{i=1}^{n} \|x_i - \bar{x}_{C(i)}\|^2 = \min!$ Represents all objects by the centroid of their cluster; gives "compact" clusters. Variant: do not square the distances, and use centroids other than the mean ("pam").
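The k-means criterion above can be checked numerically. The following is a minimal sketch in Python with NumPy (neither appears in the slides), and the function name `kmeans_objective` and the toy data are made up for illustration. It only evaluates the sum of squared distances to the cluster means for a given assignment C(i); it does not search for the minimising clustering.

```python
import numpy as np

def kmeans_objective(x, labels):
    """Sum over all points of the squared Euclidean distance to the mean
    of the cluster the point is assigned to (the k-means criterion)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = x[labels == c]          # points assigned to cluster c
        centroid = members.mean(axis=0)   # cluster mean
        total += ((members - centroid) ** 2).sum()
    return total

# Toy data (hypothetical): two well separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = kmeans_objective(points, [0, 0, 1, 1])  # groups the nearby points
bad = kmeans_objective(points, [0, 1, 0, 1])   # groups the distant points
```

The assignment that groups nearby points gives the smaller criterion value, which is what "compact" clusters means here.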
  • 6. [Figure: clustering plotted on MDS 1 against MDS 2; points are labelled by cluster number.]
  • 7. 1.1.2 Gaussian mixture model (Pearson 1894): $f(x) = \sum_{j=1}^{k} \pi_j \varphi_{a_j,\Sigma_j}(x)$. Clusters are described by Gaussian distributions. Elliptical clusters, flexible size and shape.
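To make the mixture density concrete, here is a small Python/NumPy sketch that evaluates f(x) for given proportions π_j, means a_j and covariance matrices Σ_j. The function names and the example parameters are illustrative assumptions, not taken from the talk, and the sketch only evaluates the density; it does not fit the model to data.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian with the given mean and covariance."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    d = mean.size
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)  # (x - a)' Sigma^{-1} (x - a)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)

def mixture_density(x, weights, means, covs):
    """f(x) = sum_j pi_j * phi_{a_j, Sigma_j}(x)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, covs))

# Hypothetical two-component mixture in two dimensions.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [3.0, 3.0]]
covs = [np.eye(2), [[2.0, 0.5], [0.5, 1.0]]]
density_at_origin = mixture_density([0.0, 0.0], weights, means, covs)
```

The second covariance matrix is not a multiple of the identity, so that component has elliptical contours, which is the flexibility in size and shape mentioned on the slide.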
  • 8. [Figure: clustering plotted on MDS 1 against MDS 2; points are labelled by cluster number.]
  • 9-10. 1.1.3 Classical hierarchical methods. Operate on dissimilarity matrices; compute a dissimilarity measure for every pair of observations. Can use Euclidean distance, but also tailor-made distances for other data formats. "Cluster": a collection of similar objects, dissimilar to the others.
  • 11. Genetic data: 236 Tetragonula bees, 13 allele pairs.
         [,1] [,2] [,3] [,4] [,5] [,6] (...)
    [1,] "NO" "AA" "PP" "HH" "EH" "FF"
    [2,] "EO" "AA" "PP" "HH" "GH" "FF"
    [3,] "NQ" "AA" "PT" "HH" "GF" "EF"
    [4,] "OO" "AA" "PP" "GH" "GH" "EF"
    [5,] "OO" "AA" "PP" "GH" "GH" "EF"
    [6,] "LN" "AA" "PP" "HH" "EG" "FE"
    (...)
    Compute "shared allele distance".
  • 12. Shared allele distance matrix (excerpt):
         [,1] [,2] [,3] [,4] [,5]
    [1,] 0.00 0.21 0.33 0.29 0.25
    [2,] 0.21 0.00 0.33 0.25 0.21
    [3,] 0.33 0.33 0.00 0.29 0.33 (...)
    [4,] 0.29 0.25 0.29 0.00 0.08
    [5,] 0.25 0.21 0.33 0.08 0.00
    (...)
    The dataset seen before is a Euclidean approximation ("MDS") of this.
  • 13. 1.1.3 Classical hierarchical methods. Operate on dissimilarities and produce hierarchical trees (originally motivated by biological classification). Differ in the definition of "dissimilarity between clusters". [Figure: cluster dendrogram from hclust (*, "single") on as.dist(tai$distmat); vertical axis: Height, from 0.0 to 0.6.]
  • 14. Single Linkage (Florek and Perkal 1951): $\tilde{d}(A, B) = \min_{a \in A, b \in B} d(a, b)$. Complete Linkage: $\tilde{d}(A, B) = \max_{a \in A, b \in B} d(a, b)$. Average Linkage: $\tilde{d}(A, B) = \operatorname{ave}_{a \in A, b \in B} d(a, b)$. These can deliver quite different clusterings. (Complete Linkage: very compact clusters; Single Linkage: separated but possibly widespread clusters.)
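The three between-cluster dissimilarities differ only in how they summarise the pairwise dissimilarities d(a, b). A minimal Python/NumPy sketch, using the five-by-five excerpt of the shared allele distance matrix shown earlier; the function name `linkage_dissimilarities` and the choice of the two index sets A and B are illustrative assumptions:

```python
import numpy as np

def linkage_dissimilarities(d, a, b):
    """Single, complete and average linkage dissimilarity between the
    clusters with index lists a and b, given a full dissimilarity matrix d."""
    d = np.asarray(d, dtype=float)
    block = d[np.ix_(a, b)]  # all d(a_i, b_j) with a_i in A, b_j in B
    return {
        "single": block.min(),    # closest pair
        "complete": block.max(),  # farthest pair
        "average": block.mean(),  # mean over all pairs
    }

# Excerpt of the distance matrix from the transcript (observations 1 to 5).
d = np.array([
    [0.00, 0.21, 0.33, 0.29, 0.25],
    [0.21, 0.00, 0.33, 0.25, 0.21],
    [0.33, 0.33, 0.00, 0.29, 0.33],
    [0.29, 0.25, 0.29, 0.00, 0.08],
    [0.25, 0.21, 0.33, 0.08, 0.00],
])
# Hypothetical clusters: A = observations 1 and 2, B = observations 4 and 5
# (zero-based indices 0, 1 and 3, 4).
result = linkage_dissimilarities(d, [0, 1], [3, 4])
```

A hierarchical method merges, at each step, the two clusters with the smallest value of the chosen dissimilarity, so the same data can produce different trees under the three definitions.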
  • 15. 1.1.4 Spectral clustering (Shi and Malik 2000). Dissimilarity-based nonlinear dimension reduction for k-means.
  • 16. [Figure: clustering plotted on ctai$points[,1] against ctai$points[,2]; points are labelled by cluster number.]
  • 17. 1.1.5 Density-based methods, such as "DBSCAN" (Ester et al. 1996), join observations with all neighbouring points, and join neighbourhoods if they share enough points.
  • 18. [Figure: clustering plotted on ctai$points[,1] against ctai$points[,2]; points are labelled by cluster number or by "N".]
  • 19. 1.1.6 Further issues in cluster analysis: number of clusters; cluster validation; dissimilarity definition; choice of method.
  • 20. 2. Benchmarking and measurement of quality. Section outline: Which clustering is better?; Benchmarking approaches; Cluster validation indexes; My general philosophy; Typical clustering aims. Which clustering is better? (Old Faithful geyser data) [Figure: two panels titled "mclust" and "pam"; axes: waiting and duration.]
  • 21. Which clustering is better? [Figure: two panels, each plotting xy[,1] against xy[,2].]
  • 22. Benchmarking approaches: real datasets with known classes; simulated datasets from mixture distributions; real datasets without known classes. With known truth, misclassification rates can be computed.
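Cluster labels are arbitrary, so a misclassification rate against known classes first needs a matching between cluster labels and class labels. The slides do not say how this matching is done; the Python sketch below assumes the common convention of taking the best one-to-one matching, found here by brute force over permutations (only practical for a small number of clusters). The function name and the toy labels are hypothetical.

```python
from itertools import permutations

def misclassification_rate(truth, clusters):
    """Smallest fraction of misassigned objects over all one-to-one
    matchings of cluster labels to true class labels."""
    if len(truth) != len(clusters):
        raise ValueError("truth and clusters must have the same length")
    true_labels = sorted(set(truth))
    cluster_labels = sorted(set(clusters))
    # If there are more clusters than classes, pad with None: a cluster
    # matched to None counts every one of its members as an error.
    padding = max(0, len(cluster_labels) - len(true_labels))
    targets = true_labels + [None] * padding
    best = len(truth)
    for perm in permutations(targets, len(cluster_labels)):
        mapping = dict(zip(cluster_labels, perm))
        errors = sum(mapping[c] != t for t, c in zip(truth, clusters))
        best = min(best, errors)
    return best / len(truth)

# Toy example: one object of class "a" ends up in the other cluster.
truth = ["a", "a", "a", "b", "b", "b"]
clusters = [1, 1, 0, 0, 0, 0]
rate = misclassification_rate(truth, clusters)
```

A clustering that reproduces the classes exactly but with different label names gets a rate of zero under this convention.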
  • 23-26. Disadvantages of benchmarking with known truth: In datasets with known classes, clustering is not of real scientific interest. Such datasets deviate systematically from real clustering problems. The fact that we know certain true classes doesn’t preclude other legitimate/“true” clusterings. Classes in supervised classification problems may not qualify as data analytic clusters. So there could be better truths than the known one.
  • 27. How true are the true given classes? (Hennig and Liao 2013, social stratification data)
  • 28. 7 standard occupation classes such as “manual workers”, “managerials and professionals”, “not working”.
  • 29. These are not “data analytic clusters”.
  • 30. Mixture components aren’t always “data analytic clusters” either. [Figure: scatterplot with points labelled 1 to 5; axes: xdata$x[,1], xdata$x[,2].]
  • 31. Using a known truth is useful and fair enough, but we also want to evaluate clusterings on data for which the truth is not known.
  • 32. There is a range of cluster validation indexes measuring clustering quality, such as the average silhouette width (ASW) (Kaufman and Rousseeuw 1990): $sw(i, C) = \frac{b(i,C) - a(i,C)}{\max(a(i,C),\, b(i,C))}$, with $a(i, C) = \frac{1}{|C_j| - 1} \sum_{x \in C_j} d(x_i, x)$ for the cluster $C_j$ containing $x_i$, and $b(i, C) = \min_{l:\, x_i \notin C_l} \frac{1}{|C_l|} \sum_{x \in C_l} d(x_i, x)$. Maximum average $sw$ ⇒ good $C$.
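The silhouette formula can be traced in a few lines of code. This is a sketch only: it assumes Euclidean distance for d, at least two clusters, no duplicate points, and the common convention that a point in a singleton cluster contributes sw = 0. The function name is hypothetical.

```python
import math

def average_silhouette_width(points, labels):
    """Average of sw(i, C) over all points, with Euclidean d (assumption)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, own in enumerate(labels):
        if len(clusters[own]) == 1:
            continue  # convention assumed here: a singleton contributes sw = 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in clusters[own] if j != i)
        a /= len(clusters[own]) - 1
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for lab, members in clusters.items() if lab != own
        )
        total += (b - a) / max(a, b)
    return total / len(points)
```

For two tight, well-separated groups the value is close to 1; splitting the same points across the groups makes it negative.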
  • 33-34. “One size fits all” approach. Homogeneity will normally dominate here: [Figure: two scatterplots; axes: xy[,1], xy[,2].]
  • 35. My general philosophy: There are various different aims of clustering. Measure them separately to characterise what a method does best, instead of producing a single ranking.
  • 36-43. Typical clustering aims: between-cluster separation; within-cluster homogeneity (low distances); within-cluster homogeneous distributional shape; good representation of data by centroids; little loss of information from original distance between objects; clusters are regions of high density without within-cluster gaps; uniform cluster sizes; stability.
  • 44. These may be in conflict with each other. [Figure: two scatterplots; axes: xy[,1], xy[,2].]
  • 45-48. E.g., pattern recognition in images requires separation, clustering for information reduction requires good representation by centroids, groups in social network analysis shouldn’t have large within-cluster gaps, underlying “true” classes (biological species) may cause homogeneous distributional shapes.
  • 49. 3. Cluster quality statistics (subsections: Measuring between-cluster separation; Measuring “density mountains vs. valleys”; Other statistics). Measuring between-cluster separation: There are several ways of measuring separation (as for the other aims). Straightforward: minimum distance between any two clusters, or distance between centroids (e.g., k-means).
  • 50. [Figure: scatterplot; axes: waiting, duration.]
  • 51. [Figure: scatterplot with two points marked M; axes: waiting, duration.]
  • 52. Measuring between-cluster separation: There are several ways of measuring separation (as for the other aims). Straightforward: minimum distance between any two clusters, or distance between centroids (e.g., k-means). These measure quite different concepts of separation (minimum distance relies on only two points; centroid distance ignores what goes on at the border).
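A small sketch makes the difference between the two separation measures concrete. It assumes Euclidean distance; the function names and the toy data are hypothetical and not taken from the slides.

```python
import math

def min_between_distance(cluster_a, cluster_b):
    """Smallest distance between a point of one cluster and a point of the other."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def centroid_distance(cluster_a, cluster_b):
    """Distance between the coordinate-wise means of the two clusters."""
    def centroid(cluster):
        return [sum(coord) / len(cluster) for coord in zip(*cluster)]
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical toy data: two elongated clusters whose ends nearly touch.
left = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
right = [(3.5, 0.0), (4.5, 0.0), (5.5, 0.0), (6.5, 0.0)]
print(min_between_distance(left, right))  # 0.5
print(centroid_distance(left, right))     # 3.5
```

The minimum distance is determined entirely by the two closest points, while the centroid distance stays large even though the clusters almost touch at the border.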
  • 53. p-separation index: a more stable version of “min distance”: the average distance to the nearest point in a different cluster for the p = 10% “border” points in any cluster. [Figure: scatterplot with points marked X; axes: waiting, duration.]
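One way to read the p-separation index is sketched below. The details are assumptions, not the fpc implementation: the “border” points of a cluster are taken to be the fraction p of its points that lie closest to any other cluster, the count is rounded up with at least one point per cluster, and the final value averages over all border points. The distance is Euclidean and the function name is hypothetical.

```python
import math

def p_separation_index(points, labels, p=0.1):
    """Mean nearest-other-cluster distance over each cluster's border points."""
    nearest_other = {}
    for x, lab in zip(points, labels):
        # Distance from x to the closest point that belongs to a different cluster.
        d = min(math.dist(x, y) for y, other in zip(points, labels) if other != lab)
        nearest_other.setdefault(lab, []).append(d)
    border = []
    for dists in nearest_other.values():
        n_border = max(1, math.ceil(p * len(dists)))  # assumption: round up, at least one
        border.extend(sorted(dists)[:n_border])       # the points closest to other clusters
    return sum(border) / len(border)
```

Because the value averages over several border points per cluster, moving a single point has less influence than it has on the plain minimum distance.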
  • 54. Measuring “density mountains vs. valleys”: an index that measures whether clusters correspond to “density mountains”, and whether “valleys” are between clusters. Note: This is current research and may be revised.
  • 55. Two aspects: (a) Density goes down from the mode; no gaps and valleys within clusters. (b) Cluster borders are valleys; they don’t run through mountains. Estimate density by weighted count of close points (“kernel density”). [Figure: kernel k(x) plotted against x from 0 to 2, annotated “10% quantile of within-cluster distances”.]
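The “weighted count of close points” can be sketched as follows. The kernel shape and bandwidth are assumptions: the slide only shows a kernel plot annotated with the 10% quantile of within-cluster distances, so this sketch uses a linearly decreasing kernel that reaches zero at that quantile. It also assumes Euclidean distance, distinct points, and at least one cluster with two or more points. All names are hypothetical.

```python
import math

def quantile(values, q):
    """Empirical quantile: the value at position floor(q * (n - 1)) of the sorted list."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]

def kernel_densities(points, labels, q=0.1):
    """Weighted count of close points for every point, with an assumed linear kernel."""
    n = len(points)
    within = [
        math.dist(points[i], points[j])
        for i in range(n) for j in range(i + 1, n) if labels[i] == labels[j]
    ]
    h = quantile(within, q)  # assumed bandwidth: q-quantile of within-cluster distances

    def kernel(d):
        return max(0.0, 1.0 - d / h)  # weight 1 at distance 0, falling to 0 at h

    return [
        sum(kernel(math.dist(x, y)) for j, y in enumerate(points) if j != i)
        for i, x in enumerate(points)
    ]
```

The point with the largest value in a cluster would play the role of the cluster mode in the step-by-step procedure on the following slides.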
  • 56. [Figure: sinlink, g = 2; axes: waiting, duration.]
  • 57. Start from cluster modes. [Figure: sinlink, g = 2, step 1.]
  • 58. Connect closest point to cluster. [Figure: sinlink, g = 2, step 3.]
  • 59. As long as density goes down, no penalty. [Figure: sinlink, g = 2, step 6.]
  • 60. Penalty for density increase. [Figure: sinlink, g = 2, step 98.]
  • 61-71. [Figures: sinlink, g = 2, steps 99 to 108 and step 297; axes: waiting, duration.]
  • 72. Add penalty density × density from other clusters. [Figure: specc, g = 3, with points marked P; axes: waiting, duration.]
  • 73-80. Other statistics: within-cluster average distance; within-cluster similarity measure to normal/uniform; within-cluster (squared) distance to centroid; ρ(distance, cluster induced distance) (Hubert’s Γ); entropy of cluster sizes; average largest within-cluster gap; variation of clusterings on bootstrapped data. Standardise all indexes to [0, 1] so that “large is good”.
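Two of these statistics are simple enough to sketch directly. Both sketches rest on assumptions: the entropy is divided by log(k) as one possible way to bring it into [0, 1], and ρ is taken to be the Pearson correlation between the pairwise Euclidean distances and the 0/1 indicator “in different clusters”. Function names are hypothetical, and this is not the fpc code.

```python
import math

def normalised_entropy(labels):
    """Entropy of the cluster size proportions divided by log(k); 1 means equal sizes."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    if len(counts) < 2:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))

def pearson_gamma(points, labels):
    """Pearson correlation of pairwise distances with the 'different cluster' indicator."""
    dists, differ = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dists.append(math.dist(points[i], points[j]))
            differ.append(1.0 if labels[i] != labels[j] else 0.0)
    m = len(dists)
    mean_d, mean_c = sum(dists) / m, sum(differ) / m
    cov = sum((d - mean_d) * (c - mean_c) for d, c in zip(dists, differ))
    var_d = sum((d - mean_d) ** 2 for d in dists)
    var_c = sum((c - mean_c) ** 2 for c in differ)
    return cov / math.sqrt(var_d * var_c)
```

The correlation is high when large distances coincide with pairs split across clusters, which is the sense in which a clustering loses little information from the original distances.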
  • 81. 4. Examples [Figure: two scatterplots; axes: xy[,1], xy[,2].]
                         3-means   mclust-3
        ave within         0.811      0.643
        sep index          0.163      0.306
        density index      0.977      0.978
        within gap         0.927      0.949
  • 82. [Figure: six scatterplots, panels mclust, pam, spectral, ave.linkage, single linkage and pdfCluster (3); axes: waiting, duration.]
  • 83.
                      mclust    pam   spect   ave.l   sing.l   pdf3
        ave within      0.71   0.95    0.82    0.90     0.04   0.98
        sep index       0.98   0.30    0.94    0.60     0.99   0.78
        density         0.99   0.44    0.70    0.63     0.59   0.99
        gap             0.14   0.46    0.46    0.46     0.99   0.48
        gamma           0.81   0.91    0.92    0.96     0.06   0.98
        normality       0.69   0.44    0.45    0.48     0.11   0.52
        Note: These values are quantile-standardised; the implementation of this in fpc is still to come.
  • 84. [Figure: quantile-calibrated density index (dindex, from 0 to 1) against number of clusters (2 to 5) for kmeans, avelink, sinlink, comlink, mclust, pam, specc and pdfclus.]
  • 85-89. 6. Discussion: Clustering quality is multidimensional. Provide multidimensional evaluation, characterising a method’s behaviour. Can aggregate criteria by weighted mean given well-justified weights. Benchmarking without known truth and comparison of clusterings in practice. Required: standardisation to compare different indexes and numbers of clusters.
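Aggregation by weighted mean can be illustrated with the standardised values from the table on slide 83 (mclust and pam columns). The two weight sets are hypothetical and only show that the ranking of methods depends on which aims are weighted; they are not recommendations from the talk.

```python
def aggregate(index_values, weights):
    """Weighted mean of standardised index values; weights need not sum to one."""
    total_weight = sum(weights.values())
    return sum(weights[name] * index_values[name] for name in weights) / total_weight

# Index values for two methods, copied from the table on slide 83.
mclust = {"ave within": 0.71, "sep index": 0.98, "density": 0.99,
          "gap": 0.14, "gamma": 0.81, "normality": 0.69}
pam = {"ave within": 0.95, "sep index": 0.30, "density": 0.44,
       "gap": 0.46, "gamma": 0.91, "normality": 0.44}

# Hypothetical weights for two different clustering aims.
separation_focus = {"sep index": 2.0, "density": 2.0, "ave within": 1.0}
homogeneity_focus = {"ave within": 3.0, "gamma": 1.0}

for name, weights in [("separation", separation_focus), ("homogeneity", homogeneity_focus)]:
    print(name, round(aggregate(mclust, weights), 3), round(aggregate(pam, weights), 3))
```

With the separation-oriented weights mclust scores higher; with the homogeneity-oriented weights pam does, which is why the weights need to be justified by the aim of the analysis.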
  • 90-91. Much of this is implemented in the R package fpc; more will be. Soon to come: IFCS Cluster Benchmarking Repository (Iven Van Mechelen, Nema Dean, Isabelle Guyon, Anne-Laure Boulesteix, Doug Steinley, Friedrich Leisch, Christian Hennig, Rainer Dangl). This work is supported by EPSRC Grant EP/K033972/1.
  • 92. A bit of marketing: