Clustering Methods with R

Akira Murakami
Department of English Language and Applied Linguistics
University of Birmingham
a.murakami@bham.ac.uk

Download Necessary Files
http://tinyurl.com/ClusteringWithR

Cluster Analysis
• Cluster analysis finds groups in data.
• Objects in the same cluster are similar to each other.
• Objects in different clusters are dissimilar.
• A variety of algorithms have been proposed.
• Saying “I ran a cluster analysis” does not mean much unless the algorithm is specified.
• It is used in data mining or as a statistical analysis.
• It is an unsupervised machine learning technique.

Cluster Analysis in SLA
• In SLA, clustering has been applied to identify the typology of
learners’
• motivational profiles (Csizér & Dörnyei, 2005),
• ability/aptitude profiles (Rysiewicz, 2008),
• developmental profiles based on international posture, L2
willingness to communicate, and frequency of communication
in L2 (Yashima & Zenuk-Nishide, 2008),
• cognitive and achievement profiles based on L1 achievement,
intelligence, L2 aptitude, and L2 proficiency (Sparks, Patton,
& Ganschow, 2012).

Similarity Measure
• Cluster analysis groups the observations that are
“similar”. But how do we measure similarity?
• Let’s suppose that we are interested in clustering L1 groups according to their accuracy on different linguistic features (i.e., the accuracy profile of L1 groups).
• As the measure of accuracy, we use an index that takes a value between 0 and 1, such as the target-like use (TLU) score.

Mathematical Distance
[Figure: L1 Korean, L1 German, and L1 Japanese placed on a single accuracy scale running from 0.0 to 1.0.]
• Distance between L1 Korean and L1 German = 0.2
• Distance between L1 Korean and L1 Japanese = 0.1

(Dis)Similarity Matrix
             L1 Korean  L1 German  L1 Japanese
L1 Korean    0.0
L1 German    0.2        0.0
L1 Japanese  0.1        0.3        0.0

Distance Measures
• Things are simple in 1D, but get more complicated in 2D or above.
• Different measures of distance
• Euclidean distance
• Manhattan distance
• Maximum distance
• Mahalanobis distance
• Hamming distance
• etc

Euclidean Distance
[Figure: scatter plot of Article Accuracy (x-axis) against Past tense -ed Accuracy (y-axis), both from 0.0 to 1.0, showing L1 German at (0.8, 0.6), L1 Korean at (0.4, 0.8), and L1 Japanese at (0.6, 0.5).]
• L1 Korean to L1 German: $\sqrt{(0.4 - 0.8)^2 + (0.8 - 0.6)^2} \approx 0.45$
• L1 Korean to L1 Japanese: 0.36
• L1 German to L1 Japanese: 0.22

             L1 Korean  L1 German  L1 Japanese
L1 Korean    0.00
L1 German    0.45       0.00
L1 Japanese  0.36       0.22       0.00
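
The pairwise Euclidean distances for the three L1 groups can be checked by hand in R. This is a minimal sketch: the object and function names (`l1`, `euclid`) are illustrative choices, not part of the original materials.

```r
# Accuracy profiles from the slides: (article accuracy, past tense -ed accuracy)
l1 <- rbind(
  Korean   = c(0.4, 0.8),
  German   = c(0.8, 0.6),
  Japanese = c(0.6, 0.5)
)

# Euclidean distance: square root of the summed squared differences
# (euclid is a hypothetical helper defined here for illustration)
euclid <- function(a, b) sqrt(sum((a - b)^2))

round(euclid(l1["Korean", ], l1["German", ]), 2)    # 0.45
round(euclid(l1["Korean", ], l1["Japanese", ]), 2)  # 0.36
round(euclid(l1["German", ], l1["Japanese", ]), 2)  # 0.22
```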

Euclidean Distance (3D)
[Figure: three-dimensional scatter plot whose axes include Article Accuracy and Plural -s Accuracy (each from 0.0 to 1.0), showing L1 German (0.3, 0.6, 0.9), L1 Korean (0.6, 0.9, 0.6), and L1 Japanese (0.9, 0.4, 0.5).]
• L1 German to L1 Korean: 0.52
• L1 Korean to L1 Japanese: 0.59
• L1 German to L1 Japanese: 0.75

             L1 Korean  L1 German  L1 Japanese
L1 Korean    0.00
L1 German    0.52       0.00
L1 Japanese  0.59       0.75       0.00


Manhattan Distance
[Figure: scatter plot with Article Accuracy on the x-axis (0.0 to 1.0), showing L1 German at (0.8, 0.6) and L1 Korean at (0.4, 0.8). The two points differ by 0.4 along one axis and 0.2 along the other.]
→ Distance = 0.4 + 0.2 = 0.6

Manhattan Distance
[Figure: scatter plot with Article Accuracy on the x-axis (0.0 to 1.0), showing three points: (0.1, 0.4), (0.9, 0.3), and (0.6, 0.9).]
• (0.1, 0.4) to (0.6, 0.9)
  Euclidean: 0.71
  Manhattan: 0.5 + 0.5 = 1.00
• (0.1, 0.4) to (0.9, 0.3)
  Euclidean: 0.81
  Manhattan: 0.1 + 0.8 = 0.90
• The two measures rank the two pairs in opposite orders.
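
The reversal in ranking between the Euclidean and Manhattan distances can be reproduced with a few lines of R. This is a sketch only: the variable names `p`, `q`, `r` and the helper functions are hypothetical labels for the three points on the slide.

```r
p <- c(0.1, 0.4)  # reference point
q <- c(0.6, 0.9)
r <- c(0.9, 0.3)

# Hypothetical helpers, defined here for illustration
euclid    <- function(a, b) sqrt(sum((a - b)^2))  # straight-line distance
manhattan <- function(a, b) sum(abs(a - b))       # sum of absolute differences

round(c(euclid(p, q), euclid(p, r)), 2)        # 0.71 0.81
round(c(manhattan(p, q), manhattan(p, r)), 2)  # 1.0 0.9
```

Under the Euclidean distance, `q` is the nearer neighbour of `p`; under the Manhattan distance, `r` is.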

dist()
• In R, the dist() function is used to obtain dissimilarity matrices.
• Practicals
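
As a starting point for the practical, the dissimilarity matrices from the earlier slides can be obtained with `dist()`. The object names below are illustrative assumptions; `dist()` and its `method` argument are part of base R's stats package.

```r
# 2D accuracy profiles: (article accuracy, past tense -ed accuracy)
l1_2d <- rbind(
  Korean   = c(0.4, 0.8),
  German   = c(0.8, 0.6),
  Japanese = c(0.6, 0.5)
)

# 3D accuracy profiles from the "Euclidean Distance (3D)" slide
l1_3d <- rbind(
  German   = c(0.3, 0.6, 0.9),
  Korean   = c(0.6, 0.9, 0.6),
  Japanese = c(0.9, 0.4, 0.5)
)

d_euc <- dist(l1_2d, method = "euclidean")  # the default method
d_man <- dist(l1_2d, method = "manhattan")
d_3d  <- dist(l1_3d)                        # works for any number of columns

round(d_euc, 2)
round(d_man, 2)
round(d_3d, 2)
```

`dist()` returns only the lower triangle; `as.matrix()` converts the result into a full square matrix when that is more convenient.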

Clustering Methods
• Now that we know the concept of similarity, we
move on to the clustering of objects based on the
similarity.
• A number of methods have been proposed for
clustering. We will look at the following two:
• agglomerative hierarchical cluster analysis
• k-means

Agglomerative Hierarchical Cluster Analysis
• In agglomerative hierarchical clustering, observations are clustered in a bottom-up manner.
1. Each observation forms an independent cluster at the beginning.
2. The two clusters that are most similar are merged.
3. Step 2 is repeated until all the observations are in a single cluster.

Linkage Criteria
• How do we calculate the similarity between clusters, each of which includes multiple observations?
• Ward’s criterion (Ward’s method)
• complete-linkage
• single-linkage
• etc.

Ward’s Method
• Ward’s method merges clusters so as to keep the within-cluster variance as small as possible.
• At each iteration, the two clusters whose merger yields the smallest increase in the sum of squared errors are merged.
• Sum of Squared Errors (SSE): the sum of the squared distances between the individual data points and the mean of their cluster.

Ward’s Method
[Figure: scatter plot with Article Accuracy on the x-axis (0.0 to 1.0), showing five points: 1 (0.4, 0.2), 2 (0.2, 0.4), 3 (0.4, 0.8), 4 (0.8, 0.8), 5 (0.9, 0.4). The mean of points 2 and 3, (0.3, 0.6), is marked with an x.]
• Points 2 and 3 each lie 0.22 from their mean (0.3, 0.6), so each squared distance is 0.05.
→ SSE = 0.05 + 0.05 = 0.10
• This procedure is repeated for all of the pairs.

Ward’s Method
[Figure: the same five points, with the mean of cluster {1, 2} at (0.3, 0.3) and the mean of cluster {3, 4} at (0.6, 0.8), each marked with an x.]
• Squared distance of points 1 and 2 from (0.3, 0.3): $\left(\sqrt{0.1^2 + 0.1^2}\right)^2 = 0.02$
• Squared distance of points 3 and 4 from (0.6, 0.8): $0.2^2 = 0.04$
SSE = 0.02 + 0.02 + 0.04 + 0.04 = 0.12

Ward’s Method
[Figure: the same five points, with the mean of the merged cluster {1, 2, 3, 4} at (0.45, 0.55) marked with an x.]
• Squared distances of points 1 to 4 from (0.45, 0.55): 0.125, 0.085, 0.065, 0.185 (shown on the slide as 0.12, 0.08, 0.06, 0.18)
SSE = 0.125 + 0.085 + 0.065 + 0.185 = 0.46

ΔSSE
• SSE before the merger: 0.12
• SSE after the merger: 0.46
• Difference (ΔSSE): 0.46 - 0.12 = 0.34
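
The ΔSSE arithmetic can be verified with a short R sketch. The helper `sse()` and the object names are hypothetical, written for this illustration; only the five data points come from the slides.

```r
# The five data points from the Ward's method slides
pts <- rbind(
  c(0.4, 0.2),  # 1
  c(0.2, 0.4),  # 2
  c(0.4, 0.8),  # 3
  c(0.8, 0.8),  # 4
  c(0.9, 0.4)   # 5
)

# SSE of one cluster: summed squared distances from the cluster mean
sse <- function(m) sum(sweep(m, 2, colMeans(m))^2)

sse_before <- sse(pts[1:2, ]) + sse(pts[3:4, ])  # clusters {1, 2} and {3, 4}
sse_after  <- sse(pts[1:4, ])                    # merged cluster {1, 2, 3, 4}
delta_sse  <- sse_after - sse_before

round(c(sse_before, sse_after, delta_sse), 2)    # 0.12 0.46 0.34
```

Ward's method would compute this increase for every candidate pair of clusters and carry out the merger with the smallest ΔSSE.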

Dendrogram
[Figure: "Cluster Dendrogram" of the five points, produced by hclust (*, "ward.D2") on dd.dist. Leaf order: 1, 2, 5, 3, 4. Height axis from 0.2 to 0.8.]
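
A dendrogram of this kind can be produced with `hclust()`. The sketch below assumes the dendrogram was built from the five example points with Euclidean distances; the slide only names the distance object (`dd.dist`) and the method (`"ward.D2"`), so the data used here are an assumption.

```r
# The five example points (assumed to be the data behind the dendrogram)
pts <- rbind(
  c(0.4, 0.2),  # 1
  c(0.2, 0.4),  # 2
  c(0.4, 0.8),  # 3
  c(0.8, 0.8),  # 4
  c(0.9, 0.4)   # 5
)
rownames(pts) <- 1:5

dd.dist <- dist(pts)                        # Euclidean dissimilarity matrix
hc <- hclust(dd.dist, method = "ward.D2")   # Ward's criterion

hc$merge              # which observations/clusters are merged at each step
round(hc$height, 2)   # merge heights: 0.28 0.40 0.58 0.84
cutree(hc, k = 2)     # cluster membership for a two-cluster solution
```

Calling `plot(hc)` draws the dendrogram. With these data, point 5 joins cluster {3, 4} before that cluster is merged with {1, 2}, which agrees with the leaf order shown on the slide.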

EF-Cambridge Open Language Database (EFCAMDAT)
• Writings submitted at Englishtown, the online school of Education First
• 16 Levels × 8 Units (A1-C2 in CEFR)
• Each student submits one writing per unit
• Teachers’ feedback available on some writings (≈ error tags)
• Available at http://corpus.mml.cam.ac.uk/efcamdat/


Complete Linkage
[Figure: the five points 1 (0.4, 0.2), 2 (0.2, 0.4), 3 (0.4, 0.8), 4 (0.8, 0.8), 5 (0.9, 0.4), with Article Accuracy on the x-axis. The distance between two clusters is taken as the largest distance between any pair of their members; the distance marked is 0.7.]

Single Linkage
[Figure: the same five points. The distance between two clusters is taken as the smallest distance between any pair of their members; the distance marked is 0.4.]
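
The two linkage criteria can be sketched in R as the maximum and minimum of the pairwise distances between two clusters. The choice of clusters {1, 2} and {3, 4} below is an assumption made for illustration; the slides do not state which clusters are being compared.

```r
pts <- rbind(
  c(0.4, 0.2),  # 1
  c(0.2, 0.4),  # 2
  c(0.4, 0.8),  # 3
  c(0.8, 0.8),  # 4
  c(0.9, 0.4)   # 5
)
d <- as.matrix(dist(pts))  # full matrix of pairwise Euclidean distances

cluster_a <- c(1, 2)  # assumed cluster memberships (not given in the slides)
cluster_b <- c(3, 4)

complete <- max(d[cluster_a, cluster_b])  # complete linkage: farthest pair
single   <- min(d[cluster_a, cluster_b])  # single linkage: nearest pair

round(c(complete = complete, single = single), 2)  # 0.72 0.45
```

In `hclust()`, these criteria correspond to `method = "complete"` and `method = "single"`.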

Potential Pitfall of Hierarchical Clustering
• It assumes a hierarchical structure in the clustering.
• Let us say that our data include two L1 groups across three proficiency levels.
• If we group the data into two clusters, the best split may be between the two L1 groups.
• If we group them into three clusters, the best split may be by proficiency level.
• In this case, the three-cluster solution is not nested within the two-cluster solution, and hierarchical clustering may fail to recover one of the two solutions.

k-means Clustering
• k-means clustering does not assume a hierarchical structure of clusters.
• i.e., no parent/child clusters
• Analysts need to specify the number of clusters in advance.

k-means Clustering
[Figure sequence: the five points 1 (0.4, 0.2), 2 (0.2, 0.4), 3 (0.4, 0.8), 4 (0.8, 0.8), 5 (0.9, 0.4), with Article Accuracy on the x-axis, and two centroids (Centroid 1 and Centroid 2) marked with an x. Successive panels show the distances from the points to the two centroids, the centroids in new positions, and the distances to the repositioned centroids.]
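
The iterative procedure illustrated in the figures (assign each point to its nearest centroid, move each centroid to the mean of its points, repeat) can be sketched by hand in R. The starting centroids below are hypothetical, since the slides do not give their coordinates, and the sketch does not handle the case of a cluster becoming empty.

```r
pts <- rbind(
  c(0.4, 0.2),  # 1
  c(0.2, 0.4),  # 2
  c(0.4, 0.8),  # 3
  c(0.8, 0.8),  # 4
  c(0.9, 0.4)   # 5
)

# Hypothetical starting centroids (not taken from the slides)
centroids <- rbind(c(0.2, 0.2), c(0.8, 0.6))

repeat {
  # Distance from every point (rows) to every centroid (columns)
  d <- sapply(seq_len(nrow(centroids)), function(k)
    sqrt(rowSums(sweep(pts, 2, centroids[k, ])^2)))
  # Assign each point to its nearest centroid
  assignment <- max.col(-d, ties.method = "first")
  # Move each centroid to the mean of the points assigned to it
  new_centroids <- t(sapply(seq_len(nrow(centroids)), function(k)
    colMeans(pts[assignment == k, , drop = FALSE])))
  # Stop once the centroids no longer move
  if (isTRUE(all.equal(new_centroids, centroids))) break
  centroids <- new_centroids
}

assignment           # 1 1 2 2 2
round(centroids, 2)  # final centroid coordinates
```

With these starting values the procedure converges after two passes, grouping points 1 and 2 together and points 3, 4, and 5 together.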

k-Means Clustering
• The optimal number of clusters depends on the intended use.
• There is no “correct” or “wrong” choice of the number of clusters.
• Finding the optimal k-means solution is NP-hard.
• The algorithm only approximates the optimal solution.
• Randomness is involved: you may get a different solution each time you run it.
• It assumes convex clusters.
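
In R, k-means is available as `kmeans()` in the stats package. The sketch below shows one common way of dealing with the randomness: fixing the seed and trying several random starts. The seed value and the number of starts are arbitrary choices made for illustration.

```r
pts <- rbind(
  c(0.4, 0.2),  # 1
  c(0.2, 0.4),  # 2
  c(0.4, 0.8),  # 3
  c(0.8, 0.8),  # 4
  c(0.9, 0.4)   # 5
)

set.seed(1)  # arbitrary seed, so that the random starts are repeatable

# centers = number of clusters; nstart = number of random starts,
# of which the solution with the smallest total within-cluster SSE is kept
km <- kmeans(pts, centers = 2, nstart = 25)

km$cluster       # cluster membership of each point
km$centers       # centroid coordinates
km$tot.withinss  # total within-cluster sum of squares
```

Without `set.seed()`, and especially with `nstart = 1`, repeated runs may return different cluster labels or different partitions.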

Concave
[Figure: scatter plot (y-axis y1, 0.0 to 1.0) of data points illustrating concave clusters.]

Within-Learner Centering
• The mean accuracy value of each learner was subtracted from all the data points of that learner.
• For example, let's suppose the mean sentence length (MSL) of Learner A over 10 writings was
• {4.0, 4.2, 4.4, 4.6, 4.8, 5.0, 5.2, 5.4, 5.6, 5.8}
and that of Learner B was
• {8.0, 8.2, 8.4, 8.6, 8.8, 9.0, 9.2, 9.4, 9.6, 9.8}
• The change in MSL is identical for the two learners (+0.2 per writing).
• But the absolute MSL differs widely.

Within-Learner Centering
• The mean value of Learner A (4.90) is subtracted from all the data points of Learner A:
• → {-0.90, -0.70, -0.50, -0.30, -0.10, 0.10, 0.30, 0.50, 0.70, 0.90}.
• Similarly, the mean value of Learner B (8.90) is subtracted from all the data points of Learner B:
• → {-0.90, -0.70, -0.50, -0.30, -0.10, 0.10, 0.30, 0.50, 0.70, 0.90}.
• These two learners are guaranteed to be clustered into the same group, as they have exactly the same set of values after centering.
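
Within-learner centering amounts to subtracting each learner's own mean. A minimal R sketch using the two example learners follows; the object names are illustrative.

```r
# Mean sentence length (MSL) over 10 writings for the two example learners
msl_a <- 4.0 + 0.2 * (0:9)  # 4.0, 4.2, ..., 5.8
msl_b <- 8.0 + 0.2 * (0:9)  # 8.0, 8.2, ..., 9.8

# Subtract each learner's own mean from that learner's data points
centered_a <- msl_a - mean(msl_a)
centered_b <- msl_b - mean(msl_b)

round(centered_a, 2)               # -0.9 -0.7 ... 0.7 0.9
all.equal(centered_a, centered_b)  # TRUE: identical after centering
```

For a learners-by-writings matrix, the same operation could be applied to every row at once, for example with `sweep(x, 1, rowMeans(x))`.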

Cluster Validation/Evaluation
• We got clusters and explored them, but how do we
know how good the clusters are, or whether they
indeed capture signal and not just noise?
• Are the clusters ‘real’?
• Is it the difference in the true learning curve that
the earlier clustering captured or is it just the
random noise?

Two Types of Validation
• External Validation
• Internal Validation

External Validation
• If there is a systematic pattern between clusters and some external criterion, such as the proficiency or L1 of learners, then what the cluster analysis captured is unlikely to be just noise.

Internal Validation
• Measures of goodness of clusters
• silhouette width
• Davies–Bouldin index
• Dunn index
• etc.

Silhouette Width
• Intuitively, the silhouette value is large if within-cluster dissimilarity is small (i.e., learners within each cluster have similar developmental trajectories) and between-cluster dissimilarity is large (i.e., learners in different clusters have different learning curves).
• A silhouette value is given to each data point (i.e., learner), and all the silhouette values are averaged to measure how distinct the clusters of a cluster analysis are.

Silhouette Width
• Let’s say there are three clusters, A through C.
• Let’s further say that learner i is a member of Cluster A.
• Let a(i) be the average distance between learner i and all the other learners that belong to the same cluster.
• We also calculate the average distances
1. between learner i and all the learners that belong to Cluster B
2. between learner i and all the learners that belong to Cluster C
• Let b(i) be the smaller of the two above (1-2).
• s(i) = (b(i) - a(i)) / max(a(i), b(i))

Silhouette Width
[Figure sequence: scatter plot (y-axis y1, 0.0 to 1.0) of data points in three clusters, showing the distances from one data point i to the points in each cluster.]
• Average distance to the other points in its own cluster → Average = 0.022 (the value of a(i))
• Average distance to the points in a second cluster → Average = 0.191
• Average distance to the points in the third cluster → Average = 0.240

Silhouette Width
• a(i) = 0.022
• b(i) = 0.191 (the smaller of the other two)
• s(i) = (b(i) - a(i)) / max(a(i), b(i))
• s(i) = (0.191 - 0.022) / 0.191 ≈ 0.88
• This is repeated for all the data points.
• Goodness of clustering: mean silhouette width across all the data points.
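
The silhouette computation can be written out by hand in R. This is a sketch, not the implementation used in the original analysis: the function `silhouette_widths()` is hypothetical, and it is applied to the five toy points from the earlier slides with an assumed two-cluster solution, because the data behind the silhouette figures are not available.

```r
# Silhouette width s(i) for every data point, given a clustering
silhouette_widths <- function(x, cluster) {
  d <- as.matrix(dist(x))  # pairwise Euclidean distances
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                # exclude the point itself
    if (!any(own)) next            # singleton cluster: s(i) stays 0
    a <- mean(d[i, own])           # a(i): mean distance within own cluster
    others <- setdiff(unique(cluster), cluster[i])
    # b(i): smallest mean distance to the members of any other cluster
    b <- min(vapply(others, function(k) mean(d[i, cluster == k]), numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  s
}

pts <- rbind(c(0.4, 0.2), c(0.2, 0.4), c(0.4, 0.8), c(0.8, 0.8), c(0.9, 0.4))
cl  <- c(1, 1, 2, 2, 2)  # assumed cluster labels for illustration

s <- silhouette_widths(pts, cl)
round(s, 2)  # silhouette width of each point
mean(s)      # goodness of clustering: mean silhouette width
```

The `cluster` package, which ships with standard R installations, provides a `silhouette()` function for the same purpose.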

Bootstrapping
• Now that we have a measure of how good our
clustering is, the next question is whether it is good
enough to be considered non-random.
• We can address this question through the technique
called bootstrapping.
• The idea is similar to the usual hypothesis-testing
procedure.
• We obtain the null distribution of the silhouette value
and see where our value falls.

Bootstrapping
• The more specific procedure is as follows:
1. For each learner, we sample 30 writings (with replacement).
2. We run a k-means cluster analysis on the data obtained in Step 1 and calculate the mean silhouette value.
3. Steps 1 and 2 are repeated, e.g., 10,000 times, resulting in 10,000 mean silhouette values, which we treat as the null distribution.
4. We examine whether the 95% range of the null distribution from Step 3 includes our observed mean silhouette value.

Bootstrapping
• The idea here is that we practically randomize the order of the writings within individual learners and follow the same procedure as our main analysis.
• Since the order of writings is random, there should not be any systematic pattern of development observed.
• The clusters obtained in this manner thus capture noise alone. We calculate the mean silhouette value on the noise-only, random clusters, and obtain its distribution by repeating the whole procedure a large number of times.
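
The bootstrapping procedure can be sketched in R as follows. Everything specific here is an assumption made for illustration: the data are SIMULATED random numbers (not EFCAMDAT data), the numbers of learners and clusters are arbitrary, the helper `mean_silhouette()` is hypothetical, and only 200 replications are run to keep the example fast.

```r
set.seed(1)  # arbitrary seed so that the sketch is repeatable

# SIMULATED data for illustration only: 40 learners x 30 writings
n_learners <- 40
n_writings <- 30
acc <- matrix(runif(n_learners * n_writings), nrow = n_learners)

# Mean silhouette width of a clustering (hypothetical helper)
mean_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) next  # singleton cluster: s(i) stays 0
    a <- mean(d[i, own])
    others <- setdiff(unique(cluster), cluster[i])
    b <- min(vapply(others, function(k) mean(d[i, cluster == k]), numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

k <- 2  # assumed number of clusters
observed <- mean_silhouette(acc, kmeans(acc, centers = k, nstart = 5)$cluster)

n_boot <- 200  # the slides suggest e.g. 10,000; kept small here for speed
null_sil <- replicate(n_boot, {
  # Step 1: resample each learner's writings with replacement
  resampled <- t(apply(acc, 1, sample, replace = TRUE))
  # Step 2: cluster the resampled data and compute the mean silhouette
  mean_silhouette(resampled, kmeans(resampled, centers = k, nstart = 5)$cluster)
})

# Step 4: does the 95% range of the null distribution include the observed value?
null_range <- quantile(null_sil, c(0.025, 0.975))
in_range <- observed >= null_range[1] && observed <= null_range[2]
in_range
```

Because the simulated data contain no developmental pattern, no particular outcome should be read into `in_range` here; with real data, an observed value outside the 95% range would suggest that the clusters capture more than noise.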

langtest.jp
http://langtest.jp

Paper Introducing langtest.jp
http://applij.oxfordjournals.org/content/early/2015/06/24/applin.amv025.abstract
