Here we give an overview of the clustering problem, explain why it differs from supervised learning, and discuss how to select the number of clusters. We cover the main approaches to clustering: k-Means, hierarchical clustering, and DBSCAN.
4. Clustering Problem formulation
Problem formulation
The main task of cluster analysis is to group instances into subgroups (clusters) of similar instances.
These groups can be:
Partitions
Hierarchies
Fuzzy partitions
Biclusters
Mixtures of distributions
5. Clustering Applications
Applications
Biology and medicine
Gene expression analysis
Tomography clustering
Humanitarian sciences
Sociology and anthropology
Psychology
Technical systems
Telemetry
Image segmentation
Marketing
Customer segmentation
Subgroup behavioral analysis
Text analytics
News clustering
Social networks
Community detection
6. Clustering methods
Plan
1 Clustering
Problem formulation
Applications
2 Clustering methods
k-Means
Hierarchical methods
Agglomerative clustering
Density-based methods
7. Clustering methods
How to measure dissimilarity of instances
Instances $x \in \mathbb{R}^m$ are represented as rows of a feature matrix:
$$
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
\Longleftrightarrow
\begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^m \\
x_2^1 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \ddots & \vdots \\
x_n^1 & x_n^2 & \cdots & x_n^m
\end{pmatrix}
$$
Minkowski distance
$$d(x, y) = \left( \sum_{i=1}^{m} |x^i - y^i|^p \right)^{1/p}$$
Cosine distance
$$d(x, y) = 1 - \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle}\,\sqrt{\langle y, y \rangle}}$$
Hamming distance
$$d(x, y) = \frac{1}{m} \sum_{i=1}^{m} \left[ x^i \neq y^i \right]$$
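As an illustration, here is a minimal NumPy sketch of these three distances (the function names are ours, not from the slides):

```python
import numpy as np

def minkowski(x, y, p=2):
    # (sum_i |x^i - y^i|^p)^(1/p); p=2 gives the Euclidean distance
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_distance(x, y):
    # 1 - <x, y> / (sqrt(<x, x>) * sqrt(<y, y>))
    return 1.0 - np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

def hamming(x, y):
    # fraction of coordinates where x and y disagree
    return np.mean(x != y)

x, y = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
print(minkowski(x, y, p=1), cosine_distance(x, y), hamming(x, y))
```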
8. Clustering methods
k-Means
k-Means is an iterative algorithm to split data into k clusters.
The centroid of each cluster $C_j$, denoted $c_j$, is the mean of its instances:
$$c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$$
The objective is the sum of squared distances between each instance and the centroid of the cluster to which it belongs:
$$J(C) = \sum_{j=1}^{k} \sum_{i \in C_j} d(x_i, c_j)^2$$
9. Clustering methods
k-Means
The algorithm
Input: data and the number of clusters k (a hyperparameter)
Output: partition of the data into k clusters
* * *
1. Initialization: choose k points as the initial centroids.
2. Update clusters: given the k centroids, assign each instance to its nearest centroid. All instances assigned to centroid $c_j$ ($j = 1, \ldots, k$) form cluster $C_j$.
3. Update centroids: for each cluster $C_j$, recompute the centroid as the mean of all instances in this cluster.
Steps 2-3 are repeated until convergence.
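A minimal NumPy sketch of these steps (centroids initialized by sampling k instances; it assumes no cluster ever becomes empty):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k instances as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Update clusters: assign each instance to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update centroids: mean of the instances in each cluster
        # (assumes every cluster keeps at least one instance)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence
            break
        centroids = new_centroids
    return labels, centroids
```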
17. Clustering methods
Clustering quality and the number of clusters
Elbow method
For each k we can calculate J(C).
Then, we choose the k beyond which further increasing it does not decrease J “too much”.
Formally, we look for k that minimizes the following D(k):
$$D(k) = \frac{|J(k) - J(k+1)|}{|J(k-1) - J(k)|}$$
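A sketch of this heuristic, assuming the `k_means` function from the earlier sketch and consecutive integer values of k:

```python
import numpy as np

def elbow_k(X, k_values):
    # J(k): sum of squared distances from instances to their assigned centroids
    J = {}
    for k in k_values:
        labels, centroids = k_means(X, k)
        J[k] = np.sum((X - centroids[labels]) ** 2)
    # D(k) = |J(k) - J(k+1)| / |J(k-1) - J(k)|; pick the k that minimizes it
    D = {k: abs(J[k] - J[k + 1]) / abs(J[k - 1] - J[k])
         for k in list(k_values)[1:-1]}
    return min(D, key=D.get), J, D
```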
18. Clustering methods
Clustering quality and the number of clusters
Elbow method
[Figure: scatter plot of the sample data and the Elbow Method curve of J against k for k = 2, ..., 10.]
19. Clustering methods
Clustering quality and the number of clusters
Silhouette
The silhouette of an instance $x_i$ in a cluster $C$ is
$$s(i) = \frac{b_i - a_i}{\max(a_i, b_i)},$$
where $a_i$ is the mean distance from $x_i$ to all other instances in $C$, and $b_i$ is the mean distance from $x_i$ to the instances of the nearest other cluster.
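As an illustration, scikit-learn provides this score directly; a sketch on random data with an assumed k-means labelling:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(0).normal(size=(200, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_samples(X, labels)[:5])  # s(i) for the first five instances
print(silhouette_score(X, labels))        # mean silhouette over all instances
```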
24. Clustering methods Hierarchical methods
Agglomerative clustering
Sequential merging of the most similar clusters:
0 Start with each instance in its own cluster
1 Find the two closest clusters
2 Merge them
Repeat steps 1-2 until all instances are in the same cluster
How to define distance between clusters?
25. Clustering methods Hierarchical methods
Agglomerative clustering
Linkage
1 Single Linkage
$$d(A, B) = \min_{x \in A,\, y \in B} d(x, y)$$
2 Complete Linkage
$$d(A, B) = \max_{x \in A,\, y \in B} d(x, y)$$
26. Clustering methods Hierarchical methods
Agglomerative clustering
Linkage
3 Average Linkage
$$d(A, B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d(x_i, y_j)$$
4 Weighted Average Linkage
Let cluster A be the union of clusters p and q. Then
$$d(A, B) = \frac{d(p, B) + d(q, B)}{2}$$
5 Centroid Linkage
$$d(A, B) = \|c_A - c_B\|_2$$
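A small NumPy sketch of the single, complete, and average linkage rules above (the helper names are ours, for illustration):

```python
import numpy as np

def pairwise_dists(A, B):
    # matrix of Euclidean distances d(x, y) for every x in A and y in B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):
    return pairwise_dists(A, B).min()

def complete_linkage(A, B):
    return pairwise_dists(A, B).max()

def average_linkage(A, B):
    # (1 / |A||B|) times the sum of all pairwise distances
    return pairwise_dists(A, B).mean()
```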
27. Clustering methods Hierarchical methods
Agglomerative clustering
Merging clusters can be depicted with a dendrogram.
Let us take a look at a 1D sample: { 1, 2, 3, 7, 10, 12, 25, 29 }
[Figure: dendrogram of the sample. The x-axis lists the objects, the y-axis shows cluster distances; the height at which clusters A and B are joined is the distance between them.]
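Such a dendrogram can be reproduced with SciPy; a sketch for the sample above (single linkage chosen here for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

objects = [1, 2, 3, 7, 10, 12, 25, 29]
X = np.array(objects, dtype=float).reshape(-1, 1)  # 1D sample as an (n, 1) matrix
Z = linkage(X, method="single")                    # agglomerative merges and their distances
dendrogram(Z, labels=objects)
plt.xlabel("Objects")
plt.ylabel("Cluster distances")
plt.show()
```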
28. Clustering methods Density-based methods
Density-based methods
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
29. Clustering methods Density-based methods
DBSCAN algorithm
All points can be divided into core points (elements of dense regions), border points, and noise
(we skip the formal definitions here).
30. Clustering methods Density-based methods
DBSCAN. Example
Hyperparameters: M = 4 (minimum number of points in an Eps-neighborhood), Eps > 0 (neighborhood radius)
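A minimal scikit-learn sketch with these hyperparameters (in sklearn, M corresponds to min_samples and Eps to eps; the dataset and the eps value of 0.2 are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=4).fit(X)   # Eps = 0.2 (assumed), M = 4

labels = db.labels_                           # cluster index per point, -1 means noise
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True     # core points (elements of dense regions)
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0),
      "noise points:", int(np.sum(labels == -1)))
```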
34. Clustering methods Density-based methods
DBSCAN. Pros and cons
Pros
+ Can find clusters of any shape
+ Easy to implement
+ Can find noise in data
+ Good complexity: O(n log n) with a suitable data structure (otherwise O(n²))
Cons
- Parametric (Eps and M must be chosen)
- Doesn’t work well when clusters differ in density
- Depends on the chosen metric
35. Clustering methods Density-based methods
Contacts
Questions
Thanks!
Please ask your questions in the OpenDataScience Slack team.
http://ods.ai