Prof. Pier Luca Lanzi
Representative-Based Clustering 
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Readings
•  Mining of Massive Datasets (Chapter 7)
•  Data Mining and Analysis (Section 13.3)
Representation-Based (or Point Assignment) Algorithms
•  Given a dataset of N instances and a desired number of clusters k, this class of algorithms generates a partition C of the N instances into k clusters {C1, C2, …, Ck}
•  For each cluster there is a point that summarizes the cluster
•  The common choice is the mean of the points in the cluster,

μi = (1/ni) Σ_{x ∈ Ci} x

where ni = |Ci| and μi is the centroid
Representation-Based (or Point Assignment) Algorithms
•  The goal of the clustering process is to select the best partition according to some scoring function
•  The sum of squared errors (SSE) is the most common scoring function,

SSE(C) = Σ_{i=1..k} Σ_{x ∈ Ci} ||x − μi||²

•  The goal of the clustering process is thus to find

C* = argmin_C SSE(C)

•  Brute-force Approach
§ Generate all the possible clusterings C = {C1, C2, …, Ck} and select the best one. Unfortunately, there are O(k^N / k!) possible partitions
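As a small illustration (not from the original slides), the SSE of a given partition can be computed directly in R from the cluster labels:

# sketch: sum of squared errors (SSE) of a clustering
# X: numeric matrix with one instance per row, cluster: vector of cluster labels
sse <- function(X, cluster) {
  sum(sapply(unique(cluster), function(i) {
    Ci <- X[cluster == i, , drop = FALSE]
    mu <- colMeans(Ci)              # centroid of cluster Ci
    sum(sweep(Ci, 2, mu)^2)         # squared distances of the points from the centroid
  }))
}
# e.g., sse(as.matrix(iris[, 1:4]), kmeans(iris[, 1:4], 3)$cluster) matches the tot.withinss value returned by kmeans()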
k-Means Algorithm
•  Most widely known representative-based algorithm
•  Assumes a Euclidean space but can be easily extended to the
non-Euclidean case
•  Employs a greedy iterative approach that minimizes the SSE
objective. Accordingly, it can converge to a locally optimal instead
of a globally optimal clustering.
1. Initially choose k points that are
likely to be in different clusters;
2. Make these points the centroids of
their clusters;
3. FOR each remaining point p DO
Find the centroid to which p is closest;
Add p to the cluster of that centroid;
Adjust the centroid of that
cluster to account for p;
END;
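The same procedure, in the batch form used by most implementations (assign all points, then recompute the centroids), can be sketched in R as follows; this is an illustrative sketch written for these notes, not the kmeans() function used later:

# k-means sketch: X is a numeric matrix, k the desired number of clusters
kmeans_sketch <- function(X, k, iters = 100) {
  centroids <- X[sample(nrow(X), k), , drop = FALSE]   # pick k points as the initial centroids
  for (it in 1:iters) {
    # assign every point to the cluster of the closest centroid
    d <- sapply(1:k, function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    cluster <- max.col(-d)                             # index of the minimum squared distance
    # recompute each centroid as the mean of the points assigned to it
    for (j in 1:k)
      if (any(cluster == j))
        centroids[j, ] <- colMeans(X[cluster == j, , drop = FALSE])
  }
  list(cluster = cluster, centers = centroids)
}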
Initializing Clusters
•  Solution 1
§ Pick points that are as far away from one another as possible.
•  Variation of solution 1 (see the R sketch below)
Pick the first point at random;
WHILE there are fewer than k points DO
Add the point whose minimum distance
from the selected points is as large as
possible;
END;
•  Solution 2
§ Cluster a sample of the data, perhaps hierarchically, so there
are k clusters. Pick a point from each cluster, perhaps that
point closest to the centroid of the cluster.
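A sketch of the farthest-first variation in R (written for these notes, not part of the original slides); the resulting matrix can be passed to kmeans() through its centers argument:

# farthest-first initialization: pick k well separated starting centroids from X
init_farthest_first <- function(X, k) {
  centers <- X[sample(nrow(X), 1), , drop = FALSE]     # first point at random
  while (nrow(centers) < k) {
    # minimum squared distance of every point from the centroids selected so far
    dmin <- apply(X, 1, function(p) min(colSums((t(centers) - p)^2)))
    centers <- rbind(centers, X[which.max(dmin), , drop = FALSE])
  }
  centers
}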
k-Means Clustering in R
set.seed(1234)
# randomly generated points
x <- rnorm(12, mean=rep(1:3,each=4), sd=0.2)
y <- rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2)
plot(x, y, pch=19, cex=2, col="blue")
# put the points into a data frame and cluster them with k = 3
d <- data.frame(x, y)
km <- kmeans(d, 3)
names(km)
plot(x, y, pch=19, cex=2, col="blue")
par(new=TRUE)
plot(km$centers[,1], km$centers[,2], pch=19, cex=2, col="red")
k-Means Clustering in R
# generate other random centroids to start with
km <- kmeans(d, centers=cbind(runif(3,0,3), runif(3,0,2)))
plot(x, y, pch=19, cex=2, col="blue")
par(new=TRUE)
plot(km$centers[,1], km$centers[,2], pch=19, cex=2, col="red")
Evaluation on k-Means & Number of Clusters
###
### Evaluate clustering in kmeans using elbow/knee analysis
###
library(foreign)
library(GMD)
iris = read.arff("iris.arff")
# init two vectors that will contain the evaluation
# in terms of within and between sum of squares
plot_wss = rep(0,12)
plot_bss = rep(0,12)
# evaluate every clustering
for(i in 1:12)
{
cl <- kmeans(iris[,1:4], i)
plot_wss[i] <- cl$tot.withinss
plot_bss[i] <- cl$betweenss
}
Evaluation on k-Means & Number of Clusters
# plot the results
x = 1:12
plot(x, y=plot_bss, main="Within/Between Cluster Sum-of-square", cex=2,
pch=18, col="blue", xlab="Number of Clusters", ylab="Evaluation",
ylim=c(0,700))
lines(x, plot_bss, col="blue")
par(new=TRUE)
plot(x, y=plot_wss, cex=2, pch=19, col="red", ylab="", xlab="",
ylim=c(0,700))
lines(x, plot_wss, col="red")
Elbow & Knee Analysis
Two different K-means Clusterings
(figure: the same set of points in the x–y plane shown as Original Points, a Sub-optimal Clustering, and an Optimal Clustering)
Importance of Choosing the Initial Centroids
(figures: several k-means runs on the same x–y data, shown as snapshots over successive iterations; depending on the initial centroids the runs converge to different clusterings)
Why Selecting the Best Initial Centroids is Difficult
•  If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small
•  The chance is relatively small when K is large
•  If clusters are the same size, n, then

P = (ways to pick one centroid from each cluster) / (ways to pick K centroids) = (K! n^K) / (Kn)^K = K!/K^K

•  For example, if K = 10, then the probability is 10!/10^10 = 0.00036
•  Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
•  Consider an example of five pairs of clusters
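The figure can be checked directly in R (a quick arithmetic check, not part of the slides):

# chance of drawing one initial centroid from each of K = 10 equally sized clusters
K <- 10
factorial(K) / K^K   # 0.00036288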
10 Clusters Example
(figures: k-means iterations on a dataset of ten clusters arranged in five pairs, starting with two initial centroids in one cluster of each pair of clusters)
10 Clusters Example
(figures: k-means iterations on the same ten-cluster dataset, starting with some pairs of clusters having three initial centroids, while others have only one)
Dealing with the Initial Centroids Issue
•  Multiple runs help, but probability is not on your side
•  Sample and use another clustering method (hierarchical?) to
determine initial centroids
•  Select more than k initial centroids and then select among these
initial centroids
•  Postprocessing
•  Bisecting K-means, not as susceptible to initialization issues
Updating Centers Incrementally
•  In the basic K-means algorithm, centroids are updated after all
points are assigned to a centroid
•  An alternative is to update the centroids after each assignment
(incremental approach)
§ Each assignment updates zero or two centroids
§ More expensive
§ Introduces an order dependency
§ Never get an empty cluster
§ Can use “weights” to change the impact
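A minimal sketch of the incremental update in R (a hypothetical helper written for these notes; it assumes the source cluster keeps at least one point):

# move point p from cluster a to cluster b and update only the two affected centroids
move_point <- function(centroids, counts, p, a, b) {
  centroids[a, ] <- (centroids[a, ] * counts[a] - p) / (counts[a] - 1)   # remove p from cluster a
  counts[a] <- counts[a] - 1
  centroids[b, ] <- (centroids[b, ] * counts[b] + p) / (counts[b] + 1)   # add p to cluster b
  counts[b] <- counts[b] + 1
  list(centroids = centroids, counts = counts)
}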
Pre-processing and Post-processing
•  Pre-processing
§ Normalize the data
§ Eliminate outliers
•  Post-processing
§ Eliminate small clusters that may represent outliers
§ Split ‘loose’ clusters, i.e., clusters with relatively high SSE
§ Merge clusters that are ‘close’ and 
that have relatively low SSE
§ These steps can be used during the clustering process
Bisecting K-means
•  Variant of K-means that can produce 
a partitional or a hierarchical clustering
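The idea can be sketched in R as follows (an illustrative sketch written for these notes; it assumes the cluster to split is the one with the largest SSE and that it contains at least two distinct points):

# bisecting k-means sketch: repeatedly split the cluster with the largest SSE using 2-means
bisecting_kmeans <- function(X, k) {
  cluster <- rep(1, nrow(X))
  while (max(cluster) < k) {
    sse_by_cluster <- sapply(1:max(cluster), function(i) {
      Ci <- X[cluster == i, , drop = FALSE]
      sum(sweep(Ci, 2, colMeans(Ci))^2)
    })
    worst <- which.max(sse_by_cluster)                      # cluster with the largest SSE
    split <- kmeans(X[cluster == worst, , drop = FALSE], 2)
    idx <- which(cluster == worst)
    cluster[idx[split$cluster == 2]] <- max(cluster) + 1    # one half becomes a new cluster
  }
  cluster
}

Keeping the sequence of splits also yields a hierarchy, which is why the method can produce either a partitional or a hierarchical result.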
Bisecting K-means Example
Limitations of K-means
•  K-means has problems when clusters are of differing
§ Sizes
§ Densities
§ Non-globular shapes
•  K-means also has problems when the data contains outliers.
Limitations of K-means: Differing Sizes
(figure: Original Points and the K-means result with 3 clusters)
Limitations of K-means: Differing Density
(figure: Original Points and the K-means result with 3 clusters)
Limitations of K-means: Non-globular Shapes
(figure: Original Points and the K-means result with 2 clusters)
Overcoming K-means Limitations
(figure: Original Points and the K-means Clusters obtained using many clusters)
One solution is to use many clusters: this finds parts of the natural clusters, which then need to be put together.
Overcoming K-means Limitations
(figures: Original Points and the corresponding K-means Clusters for the other two examples)
K-Means Clustering Summary
•  Strength
§ Relatively efficient
§ Often terminates at a local optimum
§ The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
•  Weakness
§ Applicable only when a mean is defined (what about categorical data?)
§ Need to specify k, the number of clusters, in advance
§ Unable to handle noisy data and outliers
§ Not suitable to discover clusters with non-convex shapes
K-Means Clustering Summary
•  Advantages
§ Simple, understandable
§ Items automatically assigned to clusters
•  Disadvantages
§ Must pick the number of clusters beforehand
§ All items forced into a cluster
§ Too sensitive to outliers
Variations of the K-Means Method
•  A few variants of the k-means which differ in
§ Selection of the initial k means
§ Dissimilarity calculations
§ Strategies to calculate cluster means
•  Handling categorical data: k-modes
§ Replacing means of clusters with modes
§ Using new dissimilarity measures 
to deal with categorical objects
§ Using a frequency-based method 
to update modes of clusters
§ A mixture of categorical and numerical data: 
k-prototype method
The BFR Algorithm
The BFR Algorithm
•  BFR [Bradley-Fayyad-Reina] is a variant of k-means designed to 
handle very large (disk-resident) data sets
•  Assumes that clusters are normally distributed around a centroid
in a Euclidean space
•  Standard deviations in different dimensions may vary
•  Clusters are axis-aligned ellipses
•  Efficient way to summarize clusters (want 
memory required O(clusters) and not O(data))
The BFR Algorithm
•  Points are read from disk one chunk at a time (so as to fit into
main memory)
•  Most points from previous memory loads are summarized by
simple statistics
•  To begin, from the initial load we select the initial k centroids by
some sensible approach
§ Take k random points
§ Take a small random sample and cluster optimally
§ Take a sample; pick a random point, and then 
k–1 more points, each as far from the previously selected
points as possible
Three Classes of Points
•  Discard set (DS)
§ Points close enough to a centroid to be summarized
•  Compression set (CS)
§ Groups of points that are close together but not close to any
existing centroid
§ These points are summarized, but not assigned to a cluster
•  Retained set (RS)
§ Isolated points waiting to be assigned to a compression set
The Status of the BFR Algorithm
(figure: a cluster whose points are in the DS, with its centroid; compressed sets whose points are in the CS; and isolated points in the RS)
Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
Summarizing Sets of Points
•  For each cluster, the discard set (DS) is summarized by:
•  The number of points, N
•  The vector SUM, whose component SUM(i) is the sum of the
coordinates of the points in the ith dimension
•  The vector SUMSQ, whose component SUMSQ(i) is the sum of the
squares of the coordinates in the ith dimension
Summarizing Points: Comments
•  2d + 1 values represent any size cluster 
(d is the number of dimensions)
•  Average in each dimension (the centroid) can be calculated as
SUM(i)/N
•  Variance of a cluster’s discard set in dimension i is computed as
(SUMSQ(i)/N) – (SUM(i)/N)²
•  And standard deviation is the square root of that variance
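These summaries are cheap to build and to merge; here is a small illustrative sketch in R (the helper names are made up for this example):

# summarize a set of points (one row per point) with N, SUM and SUMSQ
summarize_cluster <- function(points) {
  list(N = nrow(points), SUM = colSums(points), SUMSQ = colSums(points^2))
}
centroid <- function(s) s$SUM / s$N
variance <- function(s) s$SUMSQ / s$N - (s$SUM / s$N)^2
# merging two summaries only requires adding the corresponding components
merge_summaries <- function(a, b) list(N = a$N + b$N, SUM = a$SUM + b$SUM, SUMSQ = a$SUMSQ + b$SUMSQ)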
Processing Data in the BFR Algorithm
1.  First, all points that are “sufficiently close” to the centroid of a cluster are added to that cluster (by updating its parameters); the point is then discarded
2.  The points that are not “sufficiently close” to any centroid are clustered along with the points in the retained set. Any clustering algorithm can be used in this step, even a hierarchical one.
3.  The miniclusters derived from the new points and the old retained set are merged (e.g., by using the same criteria used for hierarchical clustering)
4.  Any point outside a cluster or a minicluster is dropped.
When the last chunk of data is processed, the remaining miniclusters and the points in the retained set can either be labeled as outliers or be assigned to the nearest centroid (as k-means would do).
Note that for miniclusters we only have N, SUM and SUMSQ, so it is easier to use criteria based on variance and similar statistics. For example, we might combine two clusters if their combined variance is below some threshold.
“Sufficiently Close”
•  Two approaches have been proposed to determine whether a point is
sufficiently close to a cluster
•  Add p to a cluster if
§ It has the centroid closest to p
§ It is also very unlikely that, after all the points have been processed, some
other cluster centroid will be found to be nearer to p
•  We can measure the probability that, if p belongs to a cluster, it would be
found as far as it is from the centroid of that cluster
§ This is where the assumption about the clusters containing normally
distributed points aligned with the axes of the space is used
Mahalanobis Distance
•  It is used to decide whether a point is close enough to a cluster
•  It is computed as the distance between a point and the centroid of a cluster,
normalized by the standard deviation of the cluster in each dimension
•  Given p = (p1, …, pd) and c = (c1, …, cd), the Mahalanobis distance between p
and c is computed as

d(p, c) = sqrt( Σ_{i=1..d} ((pi − ci) / σi)² )

where σi is the standard deviation of the cluster in the ith dimension
•  We assign p to the cluster with the least Mahalanobis distance from p, provided that the
distance is below a certain threshold. A threshold of 4 means that we have
only a chance in a million not to include something that belongs to the cluster
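Using the N/SUM/SUMSQ summaries sketched earlier, the distance can be computed as follows (an illustrative helper written for these notes, not from the slides):

# Mahalanobis-style distance of point p from a cluster summary s (as built by summarize_cluster)
mahalanobis_bfr <- function(p, s) {
  mu <- s$SUM / s$N
  sigma <- sqrt(s$SUMSQ / s$N - mu^2)
  sqrt(sum(((p - mu) / sigma)^2))
}
# assign p to the summary with the smallest distance, provided it is below the threshold (e.g., 4)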
k-Means for Arbitrary Shapes
(the CURE algorithm)
The CURE Algorithm
•  Problem with BFR/k-means:
§ Assumes clusters are normally 
distributed in each dimension
§ And axes are fixed – ellipses at 
an angle are not OK
•  CURE (Clustering Using REpresentatives):
§ Assumes a Euclidean distance
§ Allows clusters to assume any shape
§ Uses a collection of representative 
points to represent clusters
(figure: the globular cluster shapes handled by k-means and BFR, contrasted with arbitrarily shaped clusters)
(figure: salary vs. age scatter plot of engineering (e) and humanities (h) employees, "salary of humanities vs engineering")
Starting CURE – Pass 1 of 2
•  Pick a random sample of points that fit into main memory
•  Cluster sample points to create initial clusters (e.g. using
hierarchical clustering)
•  Pick representative points
§ For each cluster pick k representative points 
(as dispersed as possible)
§ Create synthetic representative points by moving 
the k points toward the centroid of the cluster (e.g. 20%)
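A rough sketch of the representative-point selection for a single cluster in R (written for these notes; alpha is the fraction by which the points are moved toward the centroid, e.g. 0.2):

# pick k dispersed representative points for a cluster and shrink them toward its centroid
cure_representatives <- function(points, k = 4, alpha = 0.2) {
  mu <- colMeans(points)
  reps <- points[which.max(rowSums(sweep(points, 2, mu)^2)), , drop = FALSE]   # farthest point first
  while (nrow(reps) < k) {
    d <- apply(points, 1, function(p) min(apply(reps, 1, function(r) sum((p - r)^2))))
    reps <- rbind(reps, points[which.max(d), , drop = FALSE])                  # farthest from current reps
  }
  reps + alpha * sweep(-reps, 2, mu, FUN = "+")                                # move 100*alpha% toward the centroid
}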
(figures: the salary vs. age scatter plot with the dispersed representative points chosen for each cluster, and the synthetic representative points obtained by moving them toward the cluster centroid)
Starting CURE – Pass 2 of 2
•  Rescan the whole dataset (from secondary memory) and, for each point p,
•  Place p in the “closest cluster”, that is, the cluster that has the
representative closest to p
Expectation Maximization
Expectation-Maximization (EM) Clustering
•  k-means assigns each point to only one cluster (hard assignment)
•  The approach can be extended to consider soft assignment of points to
clusters, so that each point has a probability of belonging to each cluster
•  We assume that each cluster Ci is characterized by a multivariate normal
distribution and thus identified by
§ The mean vector μi
§ The covariance matrix Σi
•  A clustering is identified by a vector of parameters θ defined as

θ = {μ1, Σ1, P(C1), …, μk, Σk, P(Ck)}

where P(Ci) is the prior probability of cluster Ci and the priors sum up to one
Expectation-Maximization (EM) Clustering
•  The goal of maximum likelihood estimation (MLE) is to choose the parameters
θ that maximize the likelihood of the dataset D, that is,

θ* = argmax_θ P(D | θ) = argmax_θ Π_j Σ_i p(xj | Ci) P(Ci)

•  General idea
§ Start with an initial estimate of the parameter vector
§ Iteratively rescore the patterns against the mixture density produced by the parameter vector
§ The rescored patterns are used to update the parameter estimates
§ Patterns belonging to the same cluster are those placed by their scores in the same component
The EM (Expectation Maximization) Algorithm
•  Initially, randomly assign k cluster centers
•  Iteratively refine the clusters based on two steps
•  Expectation step
§ Assign each data point xi to cluster Ck with the following probability

P(Ck | xi) = p(xi | Ck) P(Ck) / Σ_j p(xi | Cj) P(Cj)

where p(xi | Ck) follows the normal distribution N(μk, Σk)
•  This step calculates the probability of cluster membership of xi for each Ck
•  Maximization step
§ The model parameters are re-estimated from the updated probabilities
§ For instance, for the mean,

μk = Σ_i P(Ck | xi) xi / Σ_i P(Ck | xi)
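A compact sketch of these two steps in R, for the one-dimensional case with k univariate Gaussians (an illustration written for these notes, not an implementation taken from any particular package):

# EM sketch for a univariate mixture of k Gaussians
em_1d <- function(x, k = 2, iters = 50) {
  mu <- sample(x, k)                     # initial means picked from the data
  sigma <- rep(sd(x), k)                 # initial standard deviations
  prior <- rep(1 / k, k)                 # initial priors P(Ck)
  for (it in 1:iters) {
    # E-step: responsibility P(Ck | xi) of each cluster for each point
    dens <- sapply(1:k, function(j) prior[j] * dnorm(x, mu[j], sigma[j]))
    resp <- dens / rowSums(dens)
    # M-step: re-estimate priors, means and standard deviations from the responsibilities
    nk    <- colSums(resp)
    prior <- nk / length(x)
    mu    <- sapply(1:k, function(j) sum(resp[, j] * x) / nk[j])
    sigma <- sapply(1:k, function(j) sqrt(sum(resp[, j] * (x - mu[j])^2) / nk[j]))
  }
  list(mean = mu, sd = sigma, prior = prior)
}
# e.g., em_1d(c(rnorm(50, 1, 0.2), rnorm(50, 2, 0.2)), k = 2) should recover means near 1 and 2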

Weitere ähnliche Inhalte

Was ist angesagt?

DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesPier Luca Lanzi
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationPier Luca Lanzi
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationPier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesPier Luca Lanzi
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationPier Luca Lanzi
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesPier Luca Lanzi
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 RegressionPier Luca Lanzi
 
DMTM 2015 - 10 Introduction to Classification
DMTM 2015 - 10 Introduction to ClassificationDMTM 2015 - 10 Introduction to Classification
DMTM 2015 - 10 Introduction to ClassificationPier Luca Lanzi
 
DMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data ExplorationDMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data ExplorationPier Luca Lanzi
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringPier Luca Lanzi
 
2013-1 Machine Learning Lecture 06 - Lucila Ohno-Machado - Ensemble Methods
2013-1 Machine Learning Lecture 06 - Lucila Ohno-Machado - Ensemble Methods2013-1 Machine Learning Lecture 06 - Lucila Ohno-Machado - Ensemble Methods
2013-1 Machine Learning Lecture 06 - Lucila Ohno-Machado - Ensemble MethodsDongseo University
 
Mixed Effects Models - Empirical Logit
Mixed Effects Models - Empirical LogitMixed Effects Models - Empirical Logit
Mixed Effects Models - Empirical LogitScott Fraundorf
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningPier Luca Lanzi
 
Mixed Effects Models - Centering and Transformations
Mixed Effects Models - Centering and TransformationsMixed Effects Models - Centering and Transformations
Mixed Effects Models - Centering and TransformationsScott Fraundorf
 
DMTM 2015 - 17 Text Mining Part 1
DMTM 2015 - 17 Text Mining Part 1DMTM 2015 - 17 Text Mining Part 1
DMTM 2015 - 17 Text Mining Part 1Pier Luca Lanzi
 
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelH2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelSri Ambati
 
Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...Balázs Hidasi
 

Was ist angesagt? (17)

DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision Trees
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
 
DMTM 2015 - 10 Introduction to Classification
DMTM 2015 - 10 Introduction to ClassificationDMTM 2015 - 10 Introduction to Classification
DMTM 2015 - 10 Introduction to Classification
 
DMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data ExplorationDMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data Exploration
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
 
2013-1 Machine Learning Lecture 06 - Lucila Ohno-Machado - Ensemble Methods
2013-1 Machine Learning Lecture 06 - Lucila Ohno-Machado - Ensemble Methods2013-1 Machine Learning Lecture 06 - Lucila Ohno-Machado - Ensemble Methods
2013-1 Machine Learning Lecture 06 - Lucila Ohno-Machado - Ensemble Methods
 
Mixed Effects Models - Empirical Logit
Mixed Effects Models - Empirical LogitMixed Effects Models - Empirical Logit
Mixed Effects Models - Empirical Logit
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
Mixed Effects Models - Centering and Transformations
Mixed Effects Models - Centering and TransformationsMixed Effects Models - Centering and Transformations
Mixed Effects Models - Centering and Transformations
 
DMTM 2015 - 17 Text Mining Part 1
DMTM 2015 - 17 Text Mining Part 1DMTM 2015 - 17 Text Mining Part 1
DMTM 2015 - 17 Text Mining Part 1
 
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelH2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
 
Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...
 

Andere mochten auch

DMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data PreparationDMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data PreparationPier Luca Lanzi
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringPier Luca Lanzi
 
DMTM 2015 - 09 Density Based Clustering
DMTM 2015 - 09 Density Based ClusteringDMTM 2015 - 09 Density Based Clustering
DMTM 2015 - 09 Density Based ClusteringPier Luca Lanzi
 
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other MethodsDMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other MethodsPier Luca Lanzi
 
DMTM 2015 - 19 Graph Mining
DMTM 2015 - 19 Graph MiningDMTM 2015 - 19 Graph Mining
DMTM 2015 - 19 Graph MiningPier Luca Lanzi
 
Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016Pier Luca Lanzi
 
DMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course IntroductionDMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course IntroductionPier Luca Lanzi
 
DMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningDMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningPier Luca Lanzi
 
DMTM 2015 - 03 Data Representation
DMTM 2015 - 03 Data RepresentationDMTM 2015 - 03 Data Representation
DMTM 2015 - 03 Data RepresentationPier Luca Lanzi
 
DMTM 2015 - 18 Text Mining Part 2
DMTM 2015 - 18 Text Mining Part 2DMTM 2015 - 18 Text Mining Part 2
DMTM 2015 - 18 Text Mining Part 2Pier Luca Lanzi
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesPier Luca Lanzi
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesPier Luca Lanzi
 
Idea Generation and Conceptualization
Idea Generation and ConceptualizationIdea Generation and Conceptualization
Idea Generation and ConceptualizationPier Luca Lanzi
 
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014Pier Luca Lanzi
 
Working with Formal Elements
Working with Formal ElementsWorking with Formal Elements
Working with Formal ElementsPier Luca Lanzi
 
Elements for the Theory of Fun
Elements for the Theory of FunElements for the Theory of Fun
Elements for the Theory of FunPier Luca Lanzi
 

Andere mochten auch (20)

DMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data PreparationDMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data Preparation
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
 
DMTM 2015 - 09 Density Based Clustering
DMTM 2015 - 09 Density Based ClusteringDMTM 2015 - 09 Density Based Clustering
DMTM 2015 - 09 Density Based Clustering
 
Course Introduction
Course IntroductionCourse Introduction
Course Introduction
 
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other MethodsDMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
DMTM 2015 - 13 Naive bayes, Nearest Neighbours and Other Methods
 
DMTM 2015 - 19 Graph Mining
DMTM 2015 - 19 Graph MiningDMTM 2015 - 19 Graph Mining
DMTM 2015 - 19 Graph Mining
 
Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016Focus Junior - 14 Maggio 2016
Focus Junior - 14 Maggio 2016
 
DMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course IntroductionDMTM 2015 - 01 Course Introduction
DMTM 2015 - 01 Course Introduction
 
DMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data MiningDMTM 2015 - 02 Data Mining
DMTM 2015 - 02 Data Mining
 
DMTM 2015 - 03 Data Representation
DMTM 2015 - 03 Data RepresentationDMTM 2015 - 03 Data Representation
DMTM 2015 - 03 Data Representation
 
Course Organization
Course OrganizationCourse Organization
Course Organization
 
DMTM 2015 - 18 Text Mining Part 2
DMTM 2015 - 18 Text Mining Part 2DMTM 2015 - 18 Text Mining Part 2
DMTM 2015 - 18 Text Mining Part 2
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association Rules
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 
Idea Generation and Conceptualization
Idea Generation and ConceptualizationIdea Generation and Conceptualization
Idea Generation and Conceptualization
 
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
Introduction to Procedural Content Generation - Codemotion 29 Novembre 2014
 
Working with Formal Elements
Working with Formal ElementsWorking with Formal Elements
Working with Formal Elements
 
The Structure of Games
The Structure of GamesThe Structure of Games
The Structure of Games
 
The Design Document
The Design DocumentThe Design Document
The Design Document
 
Elements for the Theory of Fun
Elements for the Theory of FunElements for the Theory of Fun
Elements for the Theory of Fun
 

Ähnlich wie DMTM 2015 - 08 Representative-Based Clustering

Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017Iwan Sofana
 
Mathematics online: some common algorithms
Mathematics online: some common algorithmsMathematics online: some common algorithms
Mathematics online: some common algorithmsMark Moriarty
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means ClusteringJunghoon Kim
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford MapR Technologies
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
Lecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsLecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsMarina Santini
 
Quarks zk study-club
Quarks zk study-clubQuarks zk study-club
Quarks zk study-clubAlex Pruden
 
13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdfEmanAsem4
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksmourya chandra
 
Kmeans initialization
Kmeans initializationKmeans initialization
Kmeans initializationdjempol
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23Aritra Sarkar
 

Ähnlich wie DMTM 2015 - 08 Representative-Based Clustering (20)

Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017
 
Mathematics online: some common algorithms
Mathematics online: some common algorithmsMathematics online: some common algorithms
Mathematics online: some common algorithms
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means Clustering
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
US learning
US learningUS learning
US learning
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Lecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsLecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest Neighbors
 
Data Mining Lecture_7.pptx
Data Mining Lecture_7.pptxData Mining Lecture_7.pptx
Data Mining Lecture_7.pptx
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
 
Quarks zk study-club
Quarks zk study-clubQuarks zk study-club
Quarks zk study-club
 
6 clustering
6 clustering6 clustering
6 clustering
 
13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf
 
Clustering.pdf
Clustering.pdfClustering.pdf
Clustering.pdf
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networks
 
Kmeans initialization
Kmeans initializationKmeans initialization
Kmeans initialization
 
K-Means Algorithm
K-Means AlgorithmK-Means Algorithm
K-Means Algorithm
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
 
CSC446: Pattern Recognition (LN6)
CSC446: Pattern Recognition (LN6)CSC446: Pattern Recognition (LN6)
CSC446: Pattern Recognition (LN6)
 

Mehr von Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i VideogiochiPier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiPier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomePier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaPier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Pier Luca Lanzi
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationPier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningPier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesPier Luca Lanzi
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringPier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringPier Luca Lanzi
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationPier Luca Lanzi
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionPier Luca Lanzi
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningPier Luca Lanzi
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelinePier Luca Lanzi
 
VDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityVDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityPier Luca Lanzi
 
VDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationVDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationPier Luca Lanzi
 

Mehr von Pier Luca Lanzi (18)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 Introduction
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipeline
 
VDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with UnityVDP2016 - Lecture 15 PCG with Unity
VDP2016 - Lecture 15 PCG with Unity
 
VDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationVDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generation
 

Kürzlich hochgeladen

Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 

Kürzlich hochgeladen (20)

Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 

DMTM 2015 - 08 Representative-Based Clustering

  • 1. Prof. Pier Luca Lanzi Representative-Based Clustering Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
  • 2. Prof. Pier Luca Lanzi Readings •  Mining of Massive Datasets (Chapter 7) •  Data Mining and Analysis (Section 13.3) 2
  • 3. Prof. Pier Luca Lanzi Representation-Based (or Point Assignment) Algorithms •  Given a dataset of N instances, and a desired number of clusters k, this class of algorithms generates a partition C of N in k clusters {C1, C2, …, Ck} •  For each cluster there is a point that summarizes the cluster •  The common choice being the mean of the points in the cluster where ni = |Ci| and μi is the centroid 3
  • 4. Prof. Pier Luca Lanzi Representation-Based (or Point Assignment) Algorithms •  The goal of the clustering process is to select the best partition according to some scoring function •  Sum of squared errors is the most common scoring function •  The goal of the clustering process is thus to find •  Brute-force Approach § Generate all the possible clustering C = {C1, C2, …, Ck} and select the best one. Unfortunately, there are O(kN/k!) possible partitions 4
  • 5. Prof. Pier Luca Lanzi k-Means Algorithm •  Most widely known representative-based algorithm •  Assumes an Euclidean space but can be easily extended to the non-Euclidean case •  Employs a greedy iterative approaches that minimizes the SSE objective. Accordingly it can converge to a local optimal instead of a globally optimal clustering. 5
  • 6. Prof. Pier Luca Lanzi 1. Initially choose k points that are likely to be in different clusters; 2. Make these points the centroids of their clusters; 3. FOR each remaining point p DO Find the centroid to which p is closest; Add p to the cluster of that centroid; Adjust the centroid of that cluster to account for p; END;
  • 22. Prof. Pier Luca Lanzi Initializing Clusters •  Solution 1 § Pick points that are as far away from one another as possible. •  Variation of solution 1 Pick the first point at random; WHILE there are fewer than k points DO Add the point whose minimum distance from the selected points is as large as possible; END; •  Solution 2 § Cluster a sample of the data, perhaps hierarchically, so there are k clusters. Pick a point from each cluster, perhaps that point closest to the centroid of the cluster. 22
  • 23. Prof. Pier Luca Lanzi k-Means Clustering in R set.seed(1234) # random generated points x-rnorm(12, mean=rep(1:3,each=4), sd=0.2) y-rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2) plot(x,y,pch=19,cex=2,col=blue) # distance matrix d - data.frame(x,y) km - kmeans(d, 3) names(km) plot(x,y,pch=19,cex=2,col=blue) par(new=TRUE) plot(km$centers[,1], km$centers[,2], pch=19, cex=2, col=red) 23
  • 24. Prof. Pier Luca Lanzi k-Means Clustering in R # generate other random centroids to start with km - kmeans(d, 3, centers=cbind(runif(3,0,3),runif(3,0,2))) plot(x,y,pch=19,cex=2,col=blue) par(new=TRUE) plot(km$centers[,1], km$centers[,2], pch=19, cex=2, col=red) 24
  • 25. Prof. Pier Luca Lanzi Evaluation on k-Means Number of Clusters ### ### Evaluate clustering in kmeans using elbow/knee analysis ### library(foreign) library(GMD) iris = read.arff(iris.arff) # init two vectors that will contain the evaluation # in terms of within and between sum of squares plot_wss = rep(0,12) plot_bss = rep(0,12) # evaluate every clustering for(i in 1:12) { cl - kmeans(iris[,1:4],i) plot_wss[i] - cl$tot.withinss plot_bss[i] - cl$betweenss; } 25
  • 26. Prof. Pier Luca Lanzi Evaluation on k-Means Number of Clusters # plot the results x = 1:12 plot(x, y=plot_bss, main=Within/Between Cluster Sum-of-square, cex=2, pch=18, col=blue, xlab=Number of Clusters, ylab=Evaluation, ylim=c(0,700)) lines(x, plot_bss, col=blue) par(new=TRUE) plot(x, y=plot_wss, cex=2, pch=19, col=red, ylab=, xlab=, ylim=c(0,700)) lines(x,plot_wss, col=red); 26
  • 27. Prof. Pier Luca Lanzi Elbow Knee Analysis 27
  • 28. Prof. Pier Luca Lanzi Two different K-means Clusterings 28 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Sub-optimal Clustering -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Optimal Clustering Original Points
  • 29. Prof. Pier Luca Lanzi Importance of Choosing the Initial Centroids 29 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 1 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 3 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 4 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 6
  • 30. Prof. Pier Luca Lanzi Importance of Choosing the Initial Centroids 30 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 1 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 3 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 4 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.5 1 1.5 2 2.5 3 x y Iteration 6
  • 31. Prof. Pier Luca Lanzi Importance of Choosing the Initial Centroids 31 [figure: five panels, Iteration 1 through Iteration 5, tracking the centroids in the x-y plane]
  • 32. Prof. Pier Luca Lanzi Importance of Choosing the Initial Centroids 32 [figure: five panels, Iteration 1 through Iteration 5, tracking the centroids in the x-y plane]
  • 33. Prof. Pier Luca Lanzi 33 Why Selecting the Best Initial Centroids is Difficult? •  If there are K ‘real’ clusters, then the chance of selecting one centroid from each cluster is small •  The chance is especially small when K is large •  If the clusters are all the same size, n, the probability is P = (number of ways to pick one centroid per cluster) / (number of ways to pick K centroids) = K! n^K / (Kn)^K = K!/K^K •  For example, if K = 10, then the probability is 10!/10^10 ≈ 0.00036 •  Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t •  Consider an example of five pairs of clusters
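The probability quoted above can be checked directly in R (a one-line sanity check, not part of the original slides):

K <- 10
factorial(K) / K^K   # 0.00036288: chance of drawing one initial centroid from each of the K clusters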
  • 34. Prof. Pier Luca Lanzi 10 Clusters Example 34 [figure: four panels, Iteration 1 through Iteration 4, in the x-y plane] Starting with two initial centroids in one cluster of each pair of clusters
  • 35. Prof. Pier Luca Lanzi 10 Clusters Example 35 [figure: four panels, Iteration 1 through Iteration 4, in the x-y plane] Starting with two initial centroids in one cluster of each pair of clusters
  • 36. Prof. Pier Luca Lanzi 10 Clusters Example 36 Starting with some pairs of clusters having three initial centroids, while others have only one. [figure: four panels, Iteration 1 through Iteration 4, in the x-y plane]
  • 37. Prof. Pier Luca Lanzi 10 Clusters Example 37 Starting with some pairs of clusters having three initial centroids, while others have only one. [figure: four panels, Iteration 1 through Iteration 4, in the x-y plane]
  • 38. Prof. Pier Luca Lanzi 38 Dealing with the Initial Centroids Issue •  Multiple runs help, but probability is not on your side •  Sample and use another clustering method (hierarchical?) to determine the initial centroids •  Select more than k initial centroids and then select among these initial centroids •  Postprocessing •  Bisecting K-means, which is not as susceptible to initialization issues
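For the "multiple runs" option, base R's kmeans() already supports random restarts through its nstart argument; a short example, reusing the data frame d from the earlier slides:

# run k-means 25 times from random initial centroids and keep the best
# solution (the one with the lowest total within-cluster sum of squares)
km <- kmeans(d, centers=3, nstart=25)
km$tot.withinss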
  • 39. Prof. Pier Luca Lanzi 39Updating Centers Incrementally •  In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid •  An alternative is to update the centroids after each assignment (incremental approach) § Each assignment updates zero or two centroids § More expensive § Introduces an order dependency § Never get an empty cluster § Can use “weights” to change the impact
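The incremental update itself only needs the current centroid m, the cluster size n, and the point p being moved; a tiny sketch with illustrative names:

# centroid update when p joins a cluster of n points with centroid m
add_point    <- function(m, n, p) (m * n + p) / (n + 1)
# centroid update when p leaves a cluster of n points (n > 1)
remove_point <- function(m, n, p) (m * n - p) / (n - 1)

Moving a point from one cluster to another therefore touches exactly two centroids, which is why each assignment updates zero or two of them.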
  • 40. Prof. Pier Luca Lanzi 40Pre-processing and Post-processing •  Pre-processing § Normalize the data § Eliminate outliers •  Post-processing § Eliminate small clusters that may represent outliers § Split ‘loose’ clusters, i.e., clusters with relatively high SSE § Merge clusters that are ‘close’ and that have relatively low SSE § These steps can be used during the clustering process
  • 41. Prof. Pier Luca Lanzi Bisecting K-means •  Variant of K-means that can produce a partitional or a hierarchical clustering 41
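The slide does not spell the procedure out, so the sketch below uses the common formulation: repeatedly pick one cluster (here the one with the largest SSE, which is an assumption) and split it in two with standard k-means until k clusters are obtained.

bisecting_kmeans <- function(X, k, trials = 10) {
  cluster <- rep(1, nrow(X))                     # start with a single cluster
  while (max(cluster) < k) {
    # select the cluster with the largest within-cluster sum of squares
    sse <- sapply(1:max(cluster), function(i) {
      Xi <- X[cluster == i, , drop=FALSE]
      sum(scale(Xi, scale=FALSE)^2)
    })
    rows <- which(cluster == which.max(sse))
    # bisect it with 2-means, keeping the best of several trials
    split <- kmeans(X[rows, , drop=FALSE], 2, nstart=trials)
    cluster[rows[split$cluster == 2]] <- max(cluster) + 1
  }
  cluster
}

Keeping the best of several 2-means trials at each split is what makes the bisecting variant less sensitive to the choice of initial centroids.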
  • 42. Prof. Pier Luca Lanzi Bisecting K-means Example 42
  • 43. Prof. Pier Luca Lanzi 43 Limitations of K-means •  K-means has problems when clusters are of differing § Sizes § Densities § Non-globular shapes •  K-means also has problems when the data contains outliers.
  • 44. Prof. Pier Luca Lanzi Limitations of K-means: Differing Sizes 44 Original Points K-means (3 Clusters)
  • 45. Prof. Pier Luca Lanzi Limitations of K-means: Differing Density 45 Original Points K-means (3 Clusters)
  • 46. Prof. Pier Luca Lanzi Limitations of K-means: Non-globular Shapes 46 Original Points K-means (2 Clusters)
  • 47. Prof. Pier Luca Lanzi Overcoming K-means Limitations 47 Original Points K-means Clusters One solution is to use many clusters: k-means then finds parts of the natural clusters, which need to be put back together afterwards.
  • 48. Prof. Pier Luca Lanzi Overcoming K-means Limitations 48 Original Points K-means Clusters
  • 49. Prof. Pier Luca Lanzi Overcoming K-means Limitations 49 Original Points K-means Clusters
  • 50. Prof. Pier Luca Lanzi 50 K-Means Clustering Summary •  Strength § Relatively efficient § Often terminates at a local optimum § The global optimum may be found using techniques such as deterministic annealing and genetic algorithms •  Weakness § Applicable only when a mean is defined (what about categorical data?) § Need to specify k, the number of clusters, in advance § Unable to handle noisy data and outliers § Not suitable for discovering clusters with non-convex shapes
  • 51. Prof. Pier Luca Lanzi 51 K-Means Clustering Summary •  Advantages § Simple, understandable § Items are automatically assigned to clusters •  Disadvantages § The number of clusters must be picked beforehand § All items are forced into a cluster § Too sensitive to outliers
  • 52. Prof. Pier Luca Lanzi 52Variations of the K-Means Method •  A few variants of the k-means which differ in § Selection of the initial k means § Dissimilarity calculations § Strategies to calculate cluster means •  Handling categorical data: k-modes § Replacing means of clusters with modes § Using new dissimilarity measures to deal with categorical objects § Using a frequency-based method to update modes of clusters § A mixture of categorical and numerical data: k-prototype method
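A small sketch of the two ingredients that k-modes swaps in for categorical data: the simple-matching dissimilarity and the frequency-based mode update (the data is assumed to be a data frame of factors; the function names are illustrative):

# number of attributes on which two categorical records disagree
matching_dissimilarity <- function(a, b) sum(a != b)
# per-attribute mode of a cluster, used in place of the mean
column_modes <- function(df) sapply(df, function(col) names(which.max(table(col))))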
  • 54. Prof. Pier Luca Lanzi The BFR Algorithm
  • 55. Prof. Pier Luca Lanzi The BFR Algorithm •  BFR [Bradley-Fayyad-Reina] is a variant of k-means designed to handle very large (disk-resident) data sets •  Assumes that clusters are normally distributed around a centroid in a Euclidean space •  Standard deviations in different dimensions may vary •  Clusters are axis-aligned ellipses •  Provides an efficient way to summarize clusters (the memory required should be O(clusters), not O(data)) 55
  • 56. Prof. Pier Luca Lanzi The BFR Algorithm •  Points are read from disk one chunk at a time (so as to fit into main memory) •  Most points from previous memory loads are summarized by simple statistics •  To begin, from the initial load we select the initial k centroids by some sensible approach § Take k random points § Take a small random sample and cluster it optimally § Take a sample; pick a random point, and then k–1 more points, each as far from the previously selected points as possible 56
  • 57. Prof. Pier Luca Lanzi Three Classes of Points •  Discard set (DS) § Points close enough to a centroid to be summarized •  Compression set (CS) § Groups of points that are close together but not close to any existing centroid § These points are summarized, but not assigned to a cluster •  Retained set (RS) § Isolated points waiting to be assigned to a compression set 57
  • 58. Prof. Pier Luca Lanzi The Status of the BFR Algorithm 58 [figure: a cluster whose points are in the DS, with its centroid; compressed sets whose points are in the CS; isolated points in the RS] Discard set (DS): close enough to a centroid to be summarized. Compression set (CS): summarized, but not assigned to a cluster. Retained set (RS): isolated points.
  • 59. Prof. Pier Luca Lanzi Summarizing Sets of Points •  For each cluster, the discard set (DS) is summarized by: •  The number of points, N •  The vector SUM, whose component SUM(i) is the sum of the coordinates of the points in the ith dimension •  The vector SUMSQ, whose component SUMSQ(i) is the sum of the squares of the coordinates in the ith dimension 59 [figure: a cluster, all of whose points are in the DS, with its centroid]
  • 60. Prof. Pier Luca Lanzi Summarizing Points: Comments •  2d + 1 values represent a cluster of any size (d is the number of dimensions) •  The average in each dimension (the centroid) can be calculated as SUM(i)/N •  The variance of a cluster’s discard set in dimension i is computed as SUMSQ(i)/N – (SUM(i)/N)^2 •  The standard deviation is the square root of that variance 60
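A short R sketch of these statistics and of how centroid and variance are recovered from them (X is the matrix of points in one cluster's discard set; the names are chosen here for illustration):

summarize_ds <- function(X) list(N = nrow(X), SUM = colSums(X), SUMSQ = colSums(X^2))
ds_centroid  <- function(ds) ds$SUM / ds$N
ds_variance  <- function(ds) ds$SUMSQ / ds$N - (ds$SUM / ds$N)^2

Two summarized sets can also be merged by simply adding their N, SUM, and SUMSQ components, which is what makes this representation convenient for BFR.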
  • 61. Prof. Pier Luca Lanzi Processing Data in the BFR Algorithm 1.  First, all points that are “sufficiently close” to the centroid of a cluster are added to that cluster (by updating its parameters); each such point is then discarded 2.  The points that are not “sufficiently close” to any centroid are clustered along with the points in the retained set; any main-memory algorithm, even a hierarchical one, can be used in this step 3.  The miniclusters derived from the new points and the old retained set are then merged (e.g., by using the same criteria used for hierarchical clustering) 4.  Any point outside a cluster or a minicluster is dropped. When the last chunk of data is processed, the remaining miniclusters and the points in the retained set can be labeled as outliers, or alternatively assigned to one of the centroids (as k-means would do). Note that for miniclusters we only have N, SUM and SUMSQ, so it is easier to use criteria based on variance and similar statistics; for instance, we might combine two clusters if their combined variance is below some threshold. 61
  • 62. Prof. Pier Luca Lanzi “Sufficiently Close” •  Two approaches have been proposed to determine whether a point is sufficiently close to a cluster •  Add p to a cluster if § It has the centroid closest to p § It is also very unlikely that, after all the points have been processed, some other cluster centroid will be found to be nearer to p •  We can measure the probability that, if p belongs to a cluster, it would be found as far as it is from the centroid of that cluster § This is where the assumption about the clusters containing normally distributed points aligned with the axes of the space is used 62
  • 63. Prof. Pier Luca Lanzi Mahalanobis Distance •  It is used to decide whether a point is close enough to a cluster •  It is computed as the distance between a point and the centroid of a cluster, normalized by the standard deviation of the cluster in each dimension •  Given p = (p1, …, pd) and c = (c1, …, cd), the Mahalanobis distance between p and c is computed as d(p, c) = sqrt( Σi ((pi – ci)/σi)^2 ), where σi is the standard deviation of the cluster in the ith dimension •  We assign p to the cluster with the least Mahalanobis distance from p, provided that the distance is below a certain threshold. A threshold of 4 means that we have only about a chance in a million of not including a point that truly belongs to the cluster 63
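Continuing the summary-statistics sketch from above, the Mahalanobis distance of a point from a summarized cluster only needs the N, SUM, and SUMSQ values:

ds_mahalanobis <- function(p, ds) {
  centroid <- ds$SUM / ds$N
  sd_i <- sqrt(ds$SUMSQ / ds$N - centroid^2)   # per-dimension standard deviation
  sqrt(sum(((p - centroid) / sd_i)^2))
}
# p would be added to the nearest cluster only if ds_mahalanobis(p, ds) is
# below the chosen threshold (e.g., 4)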
  • 64. Prof. Pier Luca Lanzi k-Means for Arbitrary Shapes (the CURE algorithm)
  • 65. Prof. Pier Luca Lanzi The CURE Algorithm •  Problem with BFR/k-means: § Assumes clusters are normally distributed in each dimension § And axes are fixed – ellipses at an angle are not OK •  CURE (Clustering Using REpresentatives): § Assumes a Euclidean distance § Allows clusters to assume any shape § Uses a collection of representative points to represent clusters 65
  • 66. Prof. Pier Luca Lanzi k-means BFR and these?
  • 67. Prof. Pier Luca Lanzi [figure: scatter plot of salary vs. age, with points labeled e and h; caption: salary of humanities vs engineering]
  • 68. Prof. Pier Luca Lanzi [figure: scatter plot of salary vs. age, with points labeled e and h; caption: salary of humanities vs engineering]
  • 69. Prof. Pier Luca Lanzi Starting CURE – Pass 1 of 2 •  Pick a random sample of points that fit into main memory •  Cluster the sample points to create initial clusters (e.g., using hierarchical clustering) •  Pick representative points § For each cluster pick k representative points (as dispersed as possible) § Create synthetic representative points by moving the k points toward the centroid of the cluster (e.g., by 20%) 69
  • 70. Prof. Pier Luca Lanzi [figure: scatter plot of salary vs. age, with points labeled e and h; caption: salary of humanities vs engineering]
  • 71. Prof. Pier Luca Lanzi [figure: scatter plot of salary vs. age, with points labeled e and h and the synthetic representative points highlighted; caption: salary of humanities vs engineering]
  • 72. Prof. Pier Luca Lanzi Starting CURE – Pass 2 of 2 •  Rescan the whole dataset (from secondary memory) and, for each point p •  Place p in the “closest cluster”, that is, the cluster whose representative point is closest to p 72
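A compact R sketch of the two passes; the sample size, the number of representatives, the shrinking factor, and the random choice of representatives are simplifications made here for illustration (CURE proper picks the most dispersed points of each cluster):

cure_sketch <- function(X, k, sample_size = 500, n_rep = 5, alpha = 0.2) {
  # pass 1: hierarchically cluster a sample that fits in main memory
  s <- X[sample(nrow(X), min(sample_size, nrow(X))), , drop=FALSE]
  labels <- cutree(hclust(dist(s)), k)
  reps <- NULL; rep_label <- NULL
  for (i in 1:k) {
    Ci <- s[labels == i, , drop=FALSE]
    centroid <- colMeans(Ci)
    picked <- Ci[sample(nrow(Ci), min(n_rep, nrow(Ci))), , drop=FALSE]
    # move each representative a fraction alpha toward the cluster centroid
    shrunk <- picked + alpha * (matrix(centroid, nrow(picked), ncol(Ci), byrow=TRUE) - picked)
    reps <- rbind(reps, shrunk); rep_label <- c(rep_label, rep(i, nrow(shrunk)))
  }
  # pass 2: rescan all points, assigning each to the cluster of its closest representative
  apply(X, 1, function(p) {
    rep_label[which.min(apply(reps, 1, function(r) sqrt(sum((p - r)^2))))]
  })
}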
  • 73. Prof. Pier Luca Lanzi Expectation Maximization
  • 74. Prof. Pier Luca Lanzi Expectation-Maximization (EM) Clustering •  k-means assigns each point to only one cluster (hard assignment) •  The approach can be extended to consider soft assignment of points to clusters, so that each point has a probability of belonging to each cluster •  We assume that each cluster Ci is characterized by a multivariate normal distribution and thus identified by § The mean vector μi § The covariance matrix Σi •  A clustering is identified by a vector of parameters θ = {μ1, Σ1, P(C1), …, μk, Σk, P(Ck)}, where the P(Ci) are the prior probabilities of the clusters and sum up to one 74
  • 75. Prof. Pier Luca Lanzi Expectation-Maximization (EM) Clustering •  The goal of maximum likelihood estimation (MLE) is to choose the parameters θ that maximize the likelihood of the data, that is, θ* = arg maxθ P(D | θ), where each point contributes the mixture density Σi p(xj | μi, Σi) P(Ci) •  General idea § Start with an initial estimate of the parameter vector § Iteratively rescore the patterns against the mixture density produced by the parameter vector § The rescored patterns are then used to update the parameter estimates § Patterns end up in the same cluster when their scores place them in the same mixture component 75
  • 76. Prof. Pier Luca Lanzi The EM (Expectation Maximization) Algorithm •  Initially, randomly assign k cluster centers •  Iteratively refine the clusters based on two steps •  Expectation step § Assign each data point xi to cluster Ci with probability P(Ci | xi) = p(xi | Ci) P(Ci) / Σk p(xi | Ck) P(Ck), where p(xi | Ck) follows the normal distribution •  This step calculates the probability of cluster membership of xi for each Ck •  Maximization step § The model parameters are re-estimated from the updated probabilities § For instance, for the mean, μi = Σj P(Ci | xj) xj / Σj P(Ci | xj) 76
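A minimal, one-dimensional EM sketch that makes both steps explicit; it uses scalar Gaussians for brevity rather than the multivariate case on the slide, and all names are illustrative:

em_1d <- function(x, k, iters = 100) {
  n <- length(x)
  mu <- sample(x, k); sigma <- rep(sd(x), k); prior <- rep(1/k, k)
  for (it in 1:iters) {
    # E-step: soft membership P(Ci | xj) for every point and every cluster
    dens <- sapply(1:k, function(i) prior[i] * dnorm(x, mu[i], sigma[i]))
    w <- dens / rowSums(dens)
    # M-step: re-estimate means, standard deviations, and priors from the memberships
    Nk <- colSums(w)
    mu <- colSums(w * x) / Nk
    sigma <- sqrt(colSums(w * (x - matrix(mu, n, k, byrow=TRUE))^2) / Nk)
    prior <- Nk / n
  }
  list(mean = mu, sd = sigma, prior = prior)
}

For instance, em_1d(c(rnorm(100, 0), rnorm(100, 5)), 2) should recover two components with means near 0 and 5, each with prior probability close to 0.5.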