2. Definitions review
• Cluster: a collection of data objects that are
– similar (or related) to one another within the same group
– dissimilar (or unrelated) to the objects in other groups
• Cluster analysis
– Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
3. Clustering Methods
• Partitioning:
– Unsupervised learning algorithms; construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
– Typical methods: k-means, k-medoids
• Hierarchical:
– Create a hierarchical decomposition of the set of data (or objects) using some criterion
– Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
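The agglomerative (AGNES-style) idea above can be sketched compactly. The sketch below is our own illustration, not from the slides: it uses single-link distance on a tiny hypothetical 1-D data set and repeatedly merges the two closest clusters until k remain.

```java
import java.util.*;

// Minimal AGNES-style (agglomerative, single-link) sketch.
// Data set and class name are hypothetical illustrations.
public class AgnesSketch {

    // Merge the two closest clusters until only k clusters remain.
    public static List<List<Double>> cluster(double[] data, int k) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double d : data) clusters.add(new ArrayList<>(List.of(d)));
        while (clusters.size() > k) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = singleLink(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            // bj > bi, so removing bj does not shift index bi
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        return clusters;
    }

    // Single-link: distance between the closest pair of members.
    static double singleLink(List<Double> a, List<Double> b) {
        double best = Double.MAX_VALUE;
        for (double x : a)
            for (double y : b)
                best = Math.min(best, Math.abs(x - y));
        return best;
    }

    public static void main(String[] args) {
        System.out.println(cluster(new double[]{1, 2, 9, 10, 11}, 2));
    }
}
```

Divisive methods such as DIANA work in the opposite direction, starting from one all-inclusive cluster and splitting it.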
5. Illustration of two clustering techniques
using the RapidMiner tool and Java
• K-means algorithm:
We performed two tests
1. Using a Java program: program parameters
– K = 2;
– Data:
22 21
19 20
18 22
1 3
3 2
6. K-means Clustering
• Input: the number of clusters K and a collection of n instances
• Output: a set of K clusters that minimizes the squared-error criterion
• Method:
– Arbitrarily choose K instances as the initial cluster centers
– Repeat
• (Re)assign each instance to the cluster to which the instance is most similar, based on the mean value of the instances in the cluster
• Update the cluster means (compute the mean value of the instances in each cluster)
– Until no change in the assignment
• Squared-Error Criterion
– E = Σ_{i=1}^{K} Σ_{p∈Ci} |p − mi|²
– where mi is the mean of cluster Ci and p ranges over the points in Ci
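The method above can be sketched in Java. The data and K = 2 match the test case from slide 5; the class and method names are our own illustration, and the initial centers are arbitrarily taken as the first K instances, as the method allows.

```java
import java.util.Arrays;

// Minimal k-means sketch on the 2-D test data from the slides (K = 2).
public class KMeansSketch {

    // Returns the final cluster index (0..k-1) for each point.
    public static int[] cluster(double[][] points, int k, int maxIter) {
        // Arbitrarily choose the first k instances as initial centers.
        double[][] centers = new double[k][2];
        for (int i = 0; i < k; i++) centers[i] = points[i].clone();

        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // (Re)assign each instance to the nearest center.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestD = dist2(points[p], centers[0]);
                for (int c = 1; c < k; c++) {
                    double d = dist2(points[p], centers[c]);
                    if (d < bestD) { bestD = d; best = c; }
                }
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            if (!changed && iter > 0) break;   // no change in the assignment
            // Update cluster means.
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (int p = 0; p < points.length; p++) {
                sum[assign[p]][0] += points[p][0];
                sum[assign[p]][1] += points[p][1];
                count[assign[p]]++;
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0) {
                    centers[c][0] = sum[c][0] / count[c];
                    centers[c][1] = sum[c][1] / count[c];
                }
        }
        return assign;
    }

    // Squared Euclidean distance, the per-point term of the error criterion.
    private static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        double[][] data = { {22, 21}, {19, 20}, {18, 22}, {1, 3}, {3, 2} };
        // With this initialization: prints [0, 0, 0, 1, 1]
        System.out.println(Arrays.toString(cluster(data, 2, 100)));
    }
}
```

The three large-valued points end up in one cluster and the two small-valued points in the other, matching the grouping one would expect by inspection.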
11. K-medoids
• Input: the number of clusters K and a collection of n instances
• Output: a set of K clusters that minimizes the sum of the dissimilarities of all the instances to their nearest medoids
• Method:
– Arbitrarily choose K instances as the initial medoids
– Repeat
• (Re)assign each remaining instance to the cluster with the nearest medoid
• Randomly select a non-medoid instance Or
• Compute the total cost, S, of swapping a medoid Oj with Or
• If S < 0, then swap Oj with Or to form the new set of K medoids
– Until no change
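The medoid-swap idea can be sketched deterministically: instead of the random selection of Or described above, the version below tries every (Oj, Or) pair and accepts a swap whenever its cost change S is negative (a PAM-style variant; class and method names are our own illustration, and the data is the test case from slide 5).

```java
import java.util.Arrays;

// PAM-style k-medoids sketch: exhaustive swaps instead of random selection.
public class KMedoidsSketch {

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Total cost: sum of dissimilarities of instances to their nearest medoid.
    static double cost(double[][] pts, int[] medoids) {
        double total = 0;
        for (double[] p : pts) {
            double best = Double.MAX_VALUE;
            for (int m : medoids) best = Math.min(best, dist(p, pts[m]));
            total += best;
        }
        return total;
    }

    // Returns the indices of the final medoids.
    public static int[] run(double[][] pts, int k) {
        int[] medoids = new int[k];
        for (int i = 0; i < k; i++) medoids[i] = i; // arbitrary initial medoids
        boolean changed = true;
        while (changed) {                           // repeat until no change
            changed = false;
            double current = cost(pts, medoids);
            for (int mi = 0; mi < k; mi++) {
                for (int o = 0; o < pts.length; o++) {
                    if (isMedoid(o, medoids)) continue;
                    int old = medoids[mi];
                    medoids[mi] = o;                        // tentative swap
                    double s = cost(pts, medoids) - current; // cost change S
                    if (s < 0) { current += s; changed = true; }
                    else medoids[mi] = old;                 // undo the swap
                }
            }
        }
        return medoids;
    }

    static boolean isMedoid(int o, int[] meds) {
        for (int m : meds) if (m == o) return true;
        return false;
    }

    public static void main(String[] args) {
        double[][] data = { {22, 21}, {19, 20}, {18, 22}, {1, 3}, {3, 2} };
        System.out.println(Arrays.toString(run(data, 2)));
    }
}
```

Because medoids are actual instances rather than means, a single far-off outlier cannot drag a cluster representative toward it, which is the robustness property noted in the comparison.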
15. Comparison
• Both algorithms produced the same clusters on the test data
• Both require the number of clusters K to be specified as input
• K-medoids is less influenced by outliers in the data
• Both methods assign each instance to exactly one cluster