3. DEFINITION
• Cluster Analysis is a way of grouping cases of data
based on the similarity of responses to several
variables.
▪ The fundamental problem clustering addresses is to
divide the data into meaningful groups (clusters).
Factor Analysis: groups variables together
Cluster Analysis: groups cases
4/17/2020 DR ATHAR KHAN 3
12. Unsupervised learning is a machine learning technique in which you do not need to
supervise the model. Instead, you allow the model to work on its own to
discover information: there is only input data (X) and no corresponding output variables.
13. Types of Data
▪ The data used in cluster analysis can be interval,
ordinal or categorical.
▪ However, having a mixture of different types of
variable will make the analysis more complicated.
▪ This is because in cluster analysis you need to have
some way of measuring the distance between
observations and the type of measure used will
depend on what type of data you have.
14. Measures of Distance
▪ A number of different methods have been proposed
for handling ’distance’ in categorical data:
▪ K-Means algorithm for categorical data, ROCK, LIMBO,
CLICKS, Ward’s agglomerative algorithm
▪ In hierarchical clustering, the most commonly used method is Ward’s.
▪ For interval data, the most widely used measure of
distance between objects is Euclidean distance.
15. Euclidean Distance, d
Euclidean distance is the geometric distance
between two objects (or cases). Therefore, if we
were to call George subject i and Zippy subject j,
then we could express their Euclidean distance in
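terms of the following equation:
$$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$
where x_{ik} and x_{jk} are the scores of subjects i and j on variable k, and p is the
number of variables.
With Euclidean distances, the smaller the distance, the
more similar the cases.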
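As a minimal sketch of the Euclidean distance described above, the Python function below computes the distance between two cases from their scores on each variable. The scores given to George and Zippy are hypothetical and chosen only to make the arithmetic easy to check; they are not data from the slides.

```python
import math

def euclidean_distance(case_i, case_j):
    """Geometric distance between two cases measured on the same variables."""
    if len(case_i) != len(case_j):
        raise ValueError("both cases need a score on every variable")
    # Square the difference on each variable, add them up, take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(case_i, case_j)))

# Hypothetical scores on two variables (not taken from the slides).
george = [3.0, 4.0]
zippy = [0.0, 0.0]
print(euclidean_distance(george, zippy))  # 5.0
```

A distance of 0 would mean the two cases have identical scores on every variable.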
16. Measures of Distance
▪ When using a measure such as the Euclidean
distance, the scale of measurement of the variables
under consideration is an issue, as changing the scale
will affect the distance between subjects
(e.g. a difference of 10 cm becomes a difference of
100 mm).
▪ To get around this problem each variable can be
standardized (converted to z-scores).
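The effect of scale, and how z-scores remove it, can be illustrated with a short Python sketch. The heights are made-up values, and the sample standard deviation (n - 1 denominator) is an assumption; statistical packages may differ on this point.

```python
import math
import statistics

def z_scores(values):
    """Standardize one variable: subtract its mean, divide by its sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD with n - 1 denominator
    return [(v - mean) / sd for v in values]

# Hypothetical heights of three subjects, recorded in two different units.
heights_cm = [150.0, 160.0, 170.0]
heights_mm = [h * 10 for h in heights_cm]

# Raw differences between the first two subjects depend on the unit.
raw_gap_cm = heights_cm[1] - heights_cm[0]
raw_gap_mm = heights_mm[1] - heights_mm[0]

z_cm = z_scores(heights_cm)
z_mm = z_scores(heights_mm)

# After standardization the unit no longer matters.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(z_cm, z_mm))
print(raw_gap_cm, raw_gap_mm, same)
```

Because both versions of the variable give the same z-scores, distances computed on the standardized values do not change with the unit of measurement.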
17. Approaches to Cluster Analysis
▪ There are a number of different methods that can be
used to carry out a cluster analysis:
▪ Hierarchical methods
   – Agglomerative methods
   – Divisive methods
▪ Non-hierarchical methods (often known as k-means
clustering methods)
18. Agglomerative Methods
▪ Agglomerative clustering is a bottom-up technique: it starts by
considering each data point as its own cluster and
then merges them into progressively larger groups
until all points form a single giant cluster.
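The bottom-up procedure can be sketched in a few lines of Python. This is an illustrative, unoptimized version: it assumes Euclidean distance and the nearest neighbour rule for the distance between clusters (the slide does not fix either choice at this point), and the four cases are hypothetical.

```python
import math

def nearest_neighbour(cluster_a, cluster_b, points):
    """Distance between two clusters = distance between their closest members."""
    return min(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)

def agglomerate(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # every case starts alone
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters that are currently closest.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = nearest_neighbour(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        # Replace the two clusters with their union.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

# Hypothetical cases measured on two variables.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
history = agglomerate(points)
for cluster_a, cluster_b, distance in history:
    print(cluster_a, "+", cluster_b, "at distance", distance)
```

With n cases the loop always runs n - 1 times, ending with one cluster that contains every case.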
19. Divisive Clustering
▪ Divisive clustering is the opposite: it starts with one
cluster, which is then divided in two as a function of the
similarities or distances in the data. These new clusters
are then divided, and so on until each case is a cluster.
Agglomerative methods are used more often than divisive methods.
21. Hierarchical agglomerative methods
Within this approach to cluster analysis there are a number of different
methods used to determine which clusters should be joined at each stage.
Linkage Function/Creating the Clusters
22. Nearest neighbour method (single linkage method)
In this method the distance between two clusters is defined to be the distance
between the two closest members, or neighbours.
Furthest neighbour method (complete linkage method)
In this case the distance between two clusters is defined to be the maximum
distance between members — i.e. the distance between the two subjects that
are furthest apart.
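The two definitions above differ only in whether the closest or the furthest pair of members is used. A minimal Python sketch, assuming Euclidean distance and two hypothetical clusters of two cases each:

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Nearest neighbour: distance between the two closest members."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def complete_linkage(cluster_a, cluster_b):
    """Furthest neighbour: distance between the two members furthest apart."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical clusters, each holding two cases measured on two variables.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 6.0
```

The same pair of clusters is 3 units apart under one rule and 6 under the other, which is one reason the choice of method changes the order of merges.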
23. Average (between groups) linkage method (sometimes referred to as
UPGMA)
The distance between two clusters is calculated as the average distance
between all pairs of subjects in the two clusters.
Centroid Method
Here the centroid (mean value for each variable) of each cluster is calculated
and the distance between centroids is used. Clusters whose centroids are
closest together are merged.
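The average linkage and centroid rules can look alike but generally give different numbers. The Python sketch below, again assuming Euclidean distance and hypothetical clusters, computes both for the same pair of clusters.

```python
import math

def average_linkage(cluster_a, cluster_b):
    """Mean distance over all pairs with one member taken from each cluster."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def centroid(cluster):
    """Mean value of each variable across the cases in a cluster."""
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))

def centroid_distance(cluster_a, cluster_b):
    """Distance between the centroids of two clusters."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters, each holding two cases measured on two variables.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

print(average_linkage(cluster_a, cluster_b))
print(centroid_distance(cluster_a, cluster_b))  # 3.0
```

Here the centroids are (0, 1) and (3, 1), so the centroid distance is 3, while the average of the four pairwise distances is larger because two of the pairs lie on a diagonal.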
24. Ward’s Method
▪ In this method all possible pairs of clusters are combined and
the sum of the squared distances within each cluster is
calculated.
▪ This is then summed over all clusters.
▪ The combination that gives the lowest sum of squares is
chosen.
▪ The aim in Ward’s method is to join cases into clusters such
that the variance within a cluster is minimised.
▪ To be more precise, two clusters are merged if this merger
results in the minimum increase in the error sum of squares.
▪ This is the most popular method.
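The merge rule described above can be sketched in Python as follows. The function names and the three clusters are hypothetical; the sketch simply tries every pair of clusters and keeps the pair whose merger adds least to the error sum of squares.

```python
from itertools import combinations

def centroid(cluster):
    """Mean value of each variable across the cases in a cluster."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]

def error_sum_of_squares(cluster):
    """Sum of squared distances from each case to its cluster centroid."""
    centre = centroid(cluster)
    return sum((x - m) ** 2 for case in cluster for x, m in zip(case, centre))

def ward_increase(cluster_a, cluster_b):
    """Increase in the total error sum of squares if the two clusters merge."""
    merged = cluster_a + cluster_b
    return (error_sum_of_squares(merged)
            - error_sum_of_squares(cluster_a)
            - error_sum_of_squares(cluster_b))

def best_ward_merge(clusters):
    """Indices of the pair of clusters whose merger increases the ESS least."""
    pairs = combinations(range(len(clusters)), 2)
    return min(pairs, key=lambda p: ward_increase(clusters[p[0]], clusters[p[1]]))

# Hypothetical clusters of cases measured on two variables.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(1.0, 1.0)],
    [(10.0, 10.0)],
]
print(best_ward_merge(clusters))  # (0, 1)
```

The case at (1, 1) joins the first cluster because that merger adds about 0.67 to the error sum of squares, far less than either merger involving the distant case at (10, 10).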
25. Selecting the optimum number of clusters
▪ Once the cluster analysis has been carried out it is then necessary to
select the ’best’ cluster solution.
▪ Consider the number of clusters together with the within-cluster variances.
26. Dendrogram
[Figure: dendrogram, with labels 1 to 4]
In the dendrogram above, the height of the
dendrogram indicates the order in which the
clusters were joined.
Dendrograms cannot tell you how many clusters
you should have.
27. Data Preparation
• To perform a cluster analysis, generally, the data
should be prepared as follows:
• Any missing value in the data must be removed or
estimated.
• The data must be standardized (z-scores).
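The first preparation step, handling missing values, can be sketched in Python as below. Missing scores are represented by None, and mean imputation is used as one simple way to "estimate" a value; both are assumptions made for illustration, and the three cases are hypothetical.

```python
def drop_incomplete_cases(cases):
    """Remove every case that has a missing score on any variable."""
    return [case for case in cases if all(v is not None for v in case)]

def impute_with_variable_mean(cases):
    """Estimate each missing score with the mean of the observed scores
    on the same variable (one simple option among many)."""
    means = []
    for column in zip(*cases):
        observed = [v for v in column if v is not None]
        means.append(sum(observed) / len(observed))
    return [[means[k] if v is None else v for k, v in enumerate(case)]
            for case in cases]

# Hypothetical cases on two variables; None marks a missing score.
cases = [[1.0, 2.0], [None, 4.0], [3.0, 6.0]]

print(drop_incomplete_cases(cases))      # [[1.0, 2.0], [3.0, 6.0]]
print(impute_with_variable_mean(cases))  # [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
```

Dropping cases shrinks the sample, while imputing keeps every case but fills in values that were never observed, so the choice should be reported with the results.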
28. Limitations of Cluster Analysis
• There are several things to be aware of when conducting
cluster analysis:
– The different methods of clustering usually give very different results.
This occurs because of the different criteria for merging clusters
(including cases). It is important to think carefully about which method
is best for what you are interested in looking at.
– With the exception of single linkage, the results will be affected by
the way in which the variables are ordered.
– The analysis is not stable when cases are dropped: this occurs because
selection of a case (or merger of clusters) depends on similarity of one
case to the cluster.
29. Limitations of Cluster Analysis
• Imagine we wanted to look at clusters of cases
referred for psychiatric treatment.
• We measured each subject on four questionnaires:
Spielberger Trait Anxiety Inventory (STAI), the Beck
Depression Inventory (BDI), a measure of Intrusive
Thoughts and Rumination (IT) and a measure of
Impulsive Thoughts and Actions (Impulse).
• The rationale behind this analysis is that people with
the same disorder should report a similar pattern of
scores across the measures (so the profiles of their
responses should be similar).
30. Video : Hierarchical Clustering : Agglomerative Clustering and
Divisive Clustering
https://www.youtube.com/watch?v=7enWesSofhg
36. Agglomeration schedule: Shows how the clusters are combined at each stage.
Stage 1: Cases 1 and 4 have the smallest distance ("Coefficients" = .168) => first
cluster {1,4}
Stage 2: Cases 10 and 12 have the second smallest distance => second cluster
{10,12}
38. Agglomeration schedule: Shows how the clusters are combined at each stage.
The next part of the table shows the stage at which each cluster first appears.
39. Agglomeration schedule: Shows how the clusters are combined at each stage.
In stage 6, cluster 1 is the cluster that was formed in stage 1...
40. Agglomeration schedule: Shows how the clusters are combined at each stage.
Stage 1: Cases 1 and 4 have the smallest distance ("Coefficients" = .168) => first cluster
{1,4}
First cluster {1,4} is merged with case 13 in stage 6 ("Next Stage") => Cluster {1,4,13}
A value of 0 means that a single case, rather than an existing cluster, is joining for the first time.
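How the "first appears" and "Next Stage" columns relate to the merges can be sketched in Python. This is an illustration, not SPSS's own code: it assumes each cluster is named by its lowest-numbered case, and the merges and coefficients are hypothetical values for a five-case example rather than the table from the slides.

```python
def build_schedule(merges):
    """merges: list of (cluster_1, cluster_2, coefficient) in merge order."""
    formed_at = {}  # cluster name -> stage at which that cluster was formed
    rows = []
    for stage, (c1, c2, coefficient) in enumerate(merges, start=1):
        rows.append({
            "stage": stage,
            "cluster_1": c1,
            "cluster_2": c2,
            "coefficient": coefficient,
            "first_appears_1": formed_at.get(c1, 0),  # 0 = single case
            "first_appears_2": formed_at.get(c2, 0),
            "next_stage": 0,  # stays 0 for the final merge
        })
        # The stage where each input cluster was formed gets this stage
        # recorded as its "next stage".
        for name in (c1, c2):
            previous = formed_at.get(name, 0)
            if previous:
                rows[previous - 1]["next_stage"] = stage
        # Assumption: the merged cluster keeps the lower case number as its name.
        formed_at[min(c1, c2)] = stage
        formed_at.pop(max(c1, c2), None)
    return rows

# Hypothetical merges for five cases.
merges = [(1, 4, 0.2), (2, 5, 0.3), (1, 3, 0.6), (1, 2, 2.0)]
for row in build_schedule(merges):
    print(row)
```

In this example the cluster formed at stage 1 is next used at stage 3, where it is joined by case 3, so row 1 shows a next stage of 3 and row 3 shows 1 and 0 in its "first appears" columns.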
42. ▪ The Coefficients column indicates the distance between the two clusters (or
cases) joined at each stage.
▪ The values here depend on the proximity measure and linkage method used
in the analysis.
▪ For a good cluster solution, you will see a sudden jump in the distance
coefficient as you read down the table.
▪ The stage before the sudden change indicates the optimal stopping point for
merging clusters.
The last three stages of the table correspond to 3 clusters, 2 clusters and 1 cluster.
43. NUMBER OF CLUSTERS
▪ Number of cases: 15
▪ Step of ‘elbow’: 12
▪ Number of clusters: 15 - 12 = 3
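The arithmetic above can be combined with a search for the jump in the coefficients. The Python sketch below treats the "elbow" as the stage just before the largest single increase, which is a simplification of judging the jump by eye. Only the first coefficient (.168) comes from the slides; the rest are hypothetical values made up so that the jump falls after stage 12.

```python
def clusters_from_coefficients(coefficients):
    """Return (elbow_stage, number_of_clusters) from the schedule coefficients."""
    n_cases = len(coefficients) + 1  # n cases are merged in n - 1 stages
    # Increase in the coefficient from each stage to the next one.
    jumps = [coefficients[k + 1] - coefficients[k]
             for k in range(len(coefficients) - 1)]
    # 1-based number of the stage just before the largest increase.
    elbow_stage = jumps.index(max(jumps)) + 1
    return elbow_stage, n_cases - elbow_stage

# Stage 1 (.168) is from the slides; all other values are hypothetical.
coefficients = [0.168, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5,
                0.6, 0.7, 0.8, 0.9, 1.0, 6.0, 7.5]

elbow_stage, n_clusters = clusters_from_coefficients(coefficients)
print(elbow_stage, n_clusters)  # 12 3
```

With 14 coefficients the function infers 15 cases, finds the largest increase between stages 12 and 13, and returns 15 - 12 = 3 clusters.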
46. The dendrogram (or "tree diagram") shows relative similarities between cases.
▪ Notice how the "branches" merge together as you look from left to right in the
dendrogram.
▪ Cases or clusters that are joined by lines "further down" the tree (near the left side
of the dendrogram) are very similar.
47. ▪ Cases or clusters that are joined by lines "further up" the tree (near the right side)
are dissimilar.
▪ Cluster distances are rescaled so that they range from 0 to 25 in this plot.
48. ▪ Cutting the dendrogram with the line shown would identify 3 clusters (GREEN), one for
each point where a branch intersects our line.
▪ By considering different cut points for our line, we can get solutions with different
numbers of clusters.
▪ A good cluster solution is one with small within-cluster distances, but large
between-cluster distances.
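Moving the cut line corresponds to replaying the merges only up to a chosen distance. The Python sketch below assumes the merge history is available as (members, members, distance) tuples in order of increasing distance; the five-case history is hypothetical and uses raw distances, not the 0 to 25 rescaling of the plot.

```python
def cut_tree(n_cases, merges, cut_height):
    """Clusters obtained when every merge above cut_height is ignored."""
    clusters = [{i} for i in range(n_cases)]  # start with one cluster per case
    for left, right, height in merges:
        if height > cut_height:
            break  # merges are ordered, so all later ones are above the line too
        left, right = set(left), set(right)
        clusters = [c for c in clusters if c != left and c != right]
        clusters.append(left | right)
    return sorted(sorted(c) for c in clusters)

# Hypothetical merge history for five cases, ordered by distance.
merges = [
    ([0], [1], 1.0),
    ([2], [3], 1.5),
    ([2, 3], [4], 2.0),
    ([0, 1], [2, 3, 4], 8.0),
]

print(cut_tree(5, merges, cut_height=5.0))  # [[0, 1], [2, 3, 4]]
print(cut_tree(5, merges, cut_height=1.7))  # [[0, 1], [2, 3], [4]]
```

A cut at 5.0 falls in the long gap between the merges at 2.0 and 8.0, which is the kind of place where within-cluster distances are small and between-cluster distances are large.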
49. ▪ Choose the number of clusters at the point with the largest increase in heterogeneity.
[Figure: dendrogram with clusters 1, 2 and 3 marked; axis: standardized distance]
50. ▪ This table shows cluster membership for each case, according to the
number of clusters you requested.
▪ You can attempt to interpret the clusters by observing which cases are
grouped together.
53. ▪ Having eyeballed the dendrogram and decided how many
clusters are present, it is possible to re-run the analysis asking
SPSS to save a new variable in which cluster codes are assigned
to cases (with the researcher specifying the number of clusters
in the data).
▪ For these data, we saw three clear clusters and so we could re-
run the analysis asking for cluster group codings for three
clusters (in fact, I told you to do this as part of the original
analysis).
▪ The output below shows the resulting codes for each case in this
analysis. It’s pretty clear that these codes map exactly onto the
DSM-IV classifications.
55. DR ATHAR KHAN
MBBS, MCPS, DPH, DCPS-HCSM, DCPS-HPE, MBA, PGD-STATISTICS, CCRP
ASSOCIATE PROFESSOR
DEPARTMENT OF COMMUNITY MEDICINE
LIAQUAT COLLEGE OF MEDICINE & DENTISTRY
KARACHI, PAKISTAN
0092-3232135932