http://publicationslist.org/junio
Data Analysis
Clustering
Prof. Dr. Jose Fernando Rodrigues Junior
ICMC-USP
http://publicationslist.org/junio
What is it about?
Clustering refers to the process of finding groups of points
that are in some way “lumped together”
A modality of unsupervised learning, as we do not know
ahead of time where the clusters are or what they look like – no training!
It exploratorily tries to characterize the structure of a dataset
http://publicationslist.org/junio
But, what is a cluster?
groups of points that are similar
groups of points that are close to each other
groups of points well separated from each other
contiguous regions of high data point density separated by
regions of lower point density
http://publicationslist.org/junio
But, what is a cluster?
Any clusters here? There should not be: the points were generated uniformly
(no two points overlap, yet). Even so, most algorithms would point out some clusters.
It is not that there are clusters there; it is only that we do not have enough points yet.
http://publicationslist.org/junio
But, what is a cluster?
Any clusters here? There should not be: the points were generated uniformly
(no two points overlap, yet). Even so, most algorithms would point out some clusters.
It is not that there are clusters there; it is only that we do not have enough points yet.
The point here is: although one would find clusters, they definitely do not
explain the phenomenon accurately.
http://publicationslist.org/junio
But, what is a cluster?
Yes! Three clusters, I can see them. Distance-based algorithms can do well here.
Easy huh?! No wonder, here we have convex, disjoint, and well-separated groups of points.
Try the next ones!
http://publicationslist.org/junio
But, what is a cluster?
Non-convex clusters – simple distance-based algorithms would have trouble here.
A cluster is convex if the line connecting any two points lies entirely within the cluster
itself.
http://publicationslist.org/junio
But, what is a cluster?
Non-convex clusters – simple distance-based algorithms would have trouble here.
A cluster is convex if the line connecting any two points lies entirely within the cluster
itself.
There are also star-convex clusters: in that case, the line connecting the
spatial center of the cluster to any other point lies entirely within the
cluster.
http://publicationslist.org/junio
But, what is a cluster?
Intersecting clusters – quite a challenge!
http://publicationslist.org/junio
But, what is a cluster?
No general clustering algorithm can solve this. The clustering is given by the global
properties observed in the points – distance- or neighbor-based algorithms would yield a
single cluster.
http://publicationslist.org/junio
But, what is a cluster?
No general clustering algorithm can solve this. The clustering is given by the global
properties observed in the points – distance- or neighbor-based algorithms would yield a
single cluster.
In this case, any algorithm that considers a single point (or a single pair of
points) at a time faces a chicken-and-egg problem: to determine cluster membership,
we need the properties of the whole cluster; but to determine the properties
(vertical, horizontal, and pairwise orthogonal) of the cluster, we must first
assign points to clusters.
http://publicationslist.org/junio
But, what is a cluster?
To handle such situations, we would need to perform some
kind of global structure analysis—a task our minds are
incredibly good at (which is why we tend to think of clusters
this way) but that we have a hard time teaching computers to
do
For problems in two dimensions, digital image processing has
developed methods to recognize and extract certain features
(such as edge detection)
But general clustering methods deal only with local properties
and therefore can’t handle problems such as these
http://publicationslist.org/junio
But, what is a cluster?
If we return to our candidate definitions of cluster, we can
verify that none of them survives the possibilities just
presented – try it!
 groups of points that are similar
 groups of points that are close to each other
 groups of points well separated from each other
 contiguous regions of high data point density separated by regions
of lower point density
http://publicationslist.org/junio
But, what is a cluster?
If we return to our candidate definitions of cluster, we can
verify that none of them survives the possibilities just
presented – try it!
 groups of points that are similar
 groups of points that are close to each other
 groups of points well separated from each other
 contiguous regions of high data point density separated by regions
of lower point density
So this is it.
• There is no mathematical, universal definition of a cluster
• Rather, we have our intuition, which can be quite useful provided we have a
good comprehension of the data properties – structural, statistical, and domain-
related
• Having, as much as possible, well-defined goals is also a requirement
• Just as with any other data analysis approach, do not try to use it as a magic black
box – doing so will fail with high probability!
http://publicationslist.org/junio
Distances
Clustering does not actually require data points to be
embedded into a geometric space: all that is required is a
distance or (equivalently) a similarity measure for any pair of
points
 This makes it possible to perform clustering on a set of
strings, for example
 However, if the data points have properties of a vector space
then we can develop more efficient algorithms that exploit
these properties
http://publicationslist.org/junio
Distances – what are they?
A distance is any function d(x, y) that takes two points and returns
a scalar value that is a measure of how different these points are:
the more different, the larger the distance
From a distance function we can derive a similarity function, for example:
 s(x, y) = 1 − d(x, y), for 0 ≤ d(x, y) ≤ 1
 s(x, y) = 1 / d(x, y)
 s(x, y) = e^(−d(x, y))
For some problems, a particular distance measure will present itself
naturally - if the data points are points in space, then we will most
likely employ the Euclidean distance or a measure similar to it, but
for other problems, we have more freedom to define our own
metric
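For illustration, a minimal Python sketch of the three distance-to-similarity conversions listed above (a sketch only; the function names are ours):

import math

def sim_linear(d):
    return 1.0 - d          # valid when 0 <= d <= 1

def sim_inverse(d):
    return 1.0 / d          # undefined at d == 0; unbounded for small d

def sim_exponential(d):
    return math.exp(-d)     # maps any d >= 0 into (0, 1]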
http://publicationslist.org/junio
Distances – metric distances
 There are certain properties that a distance (or similarity) function
should have. Mathematicians have developed a set of properties
that a function must possess to be considered a metric (or
distance) in a mathematical sense
 d(x, y) ≥ 0
 d(x, y) = 0 if and only if x = y
 d(x, y) = d(y, x)
 d(x, y) + d(y, z) ≥ d(x, z)
 These conditions are not necessarily fulfilled in practice. A funny
example of an asymmetric distance occurs if you ask everyone in a
group of people how much they like every other member of the
group and then use the responses to construct a distance measure:
it is not at all guaranteed that the feelings of person A for person B
are requited by B
http://publicationslist.org/junio
Distances – metric distances
 There are certain properties that a distance (or similarity) function
should have. Mathematicians have developed a set of properties
that a function must possess to be considered a metric (or
distance) in a mathematical sense.
 d(x, y) ≥ 0
 d(x, y) = 0 if and only if x = y
 d(x, y) = d(y, x)
 d(x, y) + d(y, z) ≥ d(x, z)
 These conditions are not necessarily fulfilled in practice. A funny
example of an asymmetric distance occurs if you ask everyone in a
group of people how much they like every other member of the
group and then use the responses to construct a distance measure:
it is not at all guaranteed that the feelings of person A for person B
are requited by B
For technical reasons, the symmetry property is usually highly
desirable. You can always construct a symmetric distance
function from an asymmetric one:
dS(x, y) = (d(x, y) + d(y, x)) / 2
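A minimal sketch of this symmetrization; the "liking" scores are a hypothetical asymmetric example:

def symmetrize(d):
    # average the two directions of an asymmetric distance
    return lambda x, y: 0.5 * (d(x, y) + d(y, x))

# hypothetical asymmetric "liking" scores turned into a distance
likes = {("A", "B"): 0.9, ("B", "A"): 0.3}
d_asym = lambda x, y: 1.0 - likes.get((x, y), 0.0)
d_sym = symmetrize(d_asym)
print(d_sym("A", "B"))      # 0.4 == (0.1 + 0.7) / 2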
http://publicationslist.org/junio
Distances – common distances
 Commonly used distance and similarity measures for numeric data
http://publicationslist.org/junio
Distances – common distances
 The Manhattan, Euclidean, Maximum, and Minkowski distances all have
similar properties; the choice among them may depend on empirical
testing, or on subtle details of the data domain
[Figure: the Minkowski (L_p) metrics and the Maximum (L-infinity) metric]
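As an illustration, a NumPy sketch of this family: p = 1 gives Manhattan, p = 2 gives Euclidean, and the Maximum (L-infinity) metric is the limit as p grows:

import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def maximum(x, y):
    return np.max(np.abs(x - y))

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))   # 7.0, Manhattan
print(minkowski(x, y, 2))   # 5.0, Euclidean
print(maximum(x, y))        # 4.0, Maximum (L-infinity)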
http://publicationslist.org/junio
Distances – correlation-based
 Correlation-based measures: used if the data is numeric but not mixable
(so that it does not make sense to add a random fraction of one data set to
a random fraction of a different data set), as for example, in time series
The dot product of two data points (each normalized to unit length) is the cosine of
the angle that the two vectors make with each other - if they are perfectly aligned,
then the angle is 0 and the cosine (and the correlation) is 1; if they are at right
angles to each other, the cosine is 0
 The only difference between the dot
product and the correlation coefficient is
that for the latter, we first center both
data points by subtracting their respective
means
 By construction, the value of the
(normalized) dot product always falls in the
interval [−1, 1], and so does the correlation
coefficient
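A sketch making the relationship explicit: the correlation coefficient is just the cosine of the centered vectors (function names are ours):

import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation(x, y):
    # same computation after centering both vectors on their means
    return cosine(x - x.mean(), y - y.mean())

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(cosine(x, y))         # 1.0: perfectly aligned
print(correlation(x, y))    # 1.0: perfectly correlated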
http://publicationslist.org/junio
Distances – binary and sparse
If the data is categorical, then we can count the number of features
that do not agree in both data points (i.e., the number of mismatched
features); this is the Hamming distance
As an example, imagine a patient’s health record: each possible
medical condition constitutes a feature, and we want to know
whether the patient has ever suffered from it
In situations where the features are categorical, binary, and sparse
(just a few are On), we may be more interested in matches between
features that are On than in those that are Off; this leads us to the
Jaccard coefficient sJ: the number of matches between features that
are On for both points, divided by the number of features that are
On in at least one of the data points
The Jaccard coefficient is a similarity measure; the corresponding
distance function is the Jaccard distance dJ = 1 − sJ
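A minimal sketch of the Hamming distance and the Jaccard coefficient for binary feature vectors, matching the definitions above:

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))    # mismatched features

def jaccard_similarity(x, y):
    both_on = sum(a and b for a, b in zip(x, y))
    any_on = sum(a or b for a, b in zip(x, y))
    return both_on / any_on if any_on else 1.0  # two all-Off points match

def jaccard_distance(x, y):
    return 1.0 - jaccard_similarity(x, y)

p = [1, 0, 1, 0, 0, 0]
q = [1, 0, 0, 1, 0, 0]
print(hamming(p, q))           # 2
print(jaccard_distance(p, q))  # 1 - 1/3 ≈ 0.667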
http://publicationslist.org/junio
Distances – binary and sparse
If the data is categorical, then we can count the number of features
that do not agree in both data points (i.e., the number of mismatched
features); this is the Hamming distance
As an example, imagine a patient’s health record: each possible
medical condition constitutes a feature, and we want to know
whether the patient has ever suffered from it
In situations where the features are categorical, binary, and sparse
(just a few are On), we may be more interested in matches between
features that are On than in those that are Off; this leads us to the
Jaccard coefficient sJ: the number of matches between features that
are On for both points, divided by the number of features that are
On in at least one of the data points
The Jaccard coefficient is a similarity measure; the corresponding
distance function is the Jaccard distance dJ = 1 − sJ
The Jaccard distance:
As an example, imagine graph data. The similarity of two vertices is given by how
many neighbors they have in common (On) – a setting that is usually sparse, since
only a few vertices are neighbors of any given vertex
http://publicationslist.org/junio
Distances – strings
If we are dealing with many strings that are rather similar to each
other (distorted through typos, for instance), then we can use a more
detailed measure of the difference between them—namely the edit
or Levenshtein distance. The Levenshtein distance is the minimum
number of single-character operations (insertions, deletions, and
substitutions) required to transform one string into the other
Another approach is to find the length of the longest common
subsequence; this metric is often used for gene sequence analysis in
computational biology
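For illustration, a compact dynamic-programming sketch of the Levenshtein distance described above (all three operations cost 1):

def levenshtein(s, t):
    # one row of the dynamic-programming table at a time
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3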
http://publicationslist.org/junio
Distances – strings
If we are dealing with many strings that are rather similar to each
other (distorted through typos, for instance), then we can use a more
detailed measure of the difference between them—namely the edit
or Levenshtein distance. The Levenshtein distance is the minimum
number of single-character operations (insertions, deletions, and
substitutions) required to transform one string into the other
Another approach is to find the length of the longest common
subsequence; this metric is often used for gene sequence analysis in
computational biology
The best distance measure to use does not follow automatically from data type; rather,
it depends on the semantics of the data—or, more precisely, on the semantics that you
care about for your current analysis!
In some cases, a simple metric that only calculates the difference in string length may be
perfectly sufficient. In another case, you might want to use the Hamming distance.
If you really care about the details of otherwise similar strings, the Levenshtein distance
is most appropriate. You might even want to calculate how often each letter appears in a
string and then base your comparison on that.
It all depends on what the data means and on what aspect of it you are interested in at
the moment (which may also change as the analysis progresses).
Similar considerations apply everywhere—there are no “cookbook” rules.
http://publicationslist.org/junio
Clustering methods
Different algorithms are suitable for different kinds of problems—
depending, for example, on the shape and structure of the clusters
Some require vector-like data, whereas others require only a distance
function
Different algorithms tend to be misled by different kinds of pitfalls,
and they all have different performance (i.e., computational
complexity) characteristics
There are three main categories of clustering algorithms: center
seekers, tree builders, and neighborhood growers – I said three main, not
the only three (see “A Survey of Clustering Data Mining Techniques” by Pavel
Berkhin)
http://publicationslist.org/junio
Clustering methods – k-means
One of the most popular clustering methods is the k-means
algorithm; the k-means algorithm requires the number of expected
clusters k as input, and works in an iterative scheme to search for the
correct center of each cluster
The main idea is to calculate the position of each cluster’s center (or
centroid) from the positions of the points belonging to the cluster
and then to assign points to their nearest centroid – this process is
repeated until sufficient convergence is achieved
The algorithm is as follows:
choose initial positions for the cluster centroids
repeat until sufficiently converged:
    for each point:
        calculate its distance from each cluster centroid
        assign the point to the nearest cluster
    recalculate the positions of the cluster centroids
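As a concrete illustration, a minimal NumPy sketch of the loop above (a sketch only: it initializes from random data points, caps the iterations, and does not handle empty clusters; a library implementation would be preferred in practice):

import numpy as np

def kmeans(points, k, max_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # choose k random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):
        # distance of every point to every centroid, shape (n, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # nearest centroid per point
        # recalculate each centroid as its cluster's center of mass
        new_centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centroids, centroids):  # sufficient convergence
            break
        centroids = new_centroids
    return labels, centroids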
http://publicationslist.org/junio
Clustering methods – k-means
 The k-means algorithm is nondeterministic: a different choice of starting
values may result in a different assignment of points to clusters; for this
reason, it is customary to run the k-means algorithm several times and then
compare the results
 If you have previous knowledge of likely positions for the cluster centers,
you can use it to precondition the algorithm; otherwise, choose random
data points as initial values.
 What makes this algorithm efficient is that you don’t have to search the
existing data points to find one that would make a good centroid—instead
you are free to construct a new centroid position; this is usually done by
calculating the cluster’s center of mass:
http://publicationslist.org/junio
Clustering methods – k-means
 The k-means algorithm is nondeterministic: a different choice of starting
values may result in a different assignment of points to clusters; for this
reason, it is customary to run the k-means algorithm several times and then
compare the results
 If you have previous knowledge of likely positions for the cluster centers,
you can use it to precondition the algorithm; otherwise, choose random
data points as initial values.
 What makes this algorithm efficient is that you don’t have to search the
existing data points to find one that would make a good centroid—instead
you are free to construct a new centroid position; this is usually done by
calculating the cluster’s center of mass:
If we are using categorical data, then the k-means algorithm cannot be
used (one cannot calculate a center of mass); in this case we must use
the k-medoids algorithm
The only difference is that instead of calculating a new centroid, it is
necessary to search all the points in the cluster for the data point
that has the smallest average distance to all other points in its cluster
For this reason, the k-medoids algorithm is O(n²), whereas the k-means
algorithm is O(k·n), where k is the number of clusters
For performance, it is possible to run k-medoids on a sample of the
dataset to get an idea of the cluster centers, and then run it on the
entire dataset
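A sketch of the medoid-update step just described, assuming an arbitrary pairwise distance function supplied by the caller (the rest of the loop is as in the k-means sketch above):

import numpy as np

def medoid(cluster_points, distance):
    # the new center must be an actual data point: pick the one with the
    # smallest average distance to all other points in the cluster
    best, best_cost = None, float("inf")
    for candidate in cluster_points:
        cost = np.mean([distance(candidate, p) for p in cluster_points])
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best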
http://publicationslist.org/junio
Clustering methods – k-means
 Despite its cheap-and-cheerful appearance, the k-means algorithm works surprisingly
well. It is pretty fast and relatively robust. Convergence is usually quick. Because the
algorithm is simple and highly intuitive, it is easy to augment or extend it—for example,
to incorporate points with different weights. You might also want to experiment with
different ways to calculate the centroid, possibly using the median position rather than
the mean, and so on.
 In summary:
 The k-means algorithm and its variants work best for globular (at least star-convex) clusters; the
results will be meaningless for clusters with complicated shapes and for nested clusters
 The expected number of clusters is required as an input; if this number is not known, it will be
necessary to repeat the algorithm with different values and compare the results
 The algorithm is iterative and nondeterministic; the specific outcome may depend on the choice of
starting values
 The k-means algorithm requires vector data; use the k-medoids algorithm for categorical data
 The algorithm can be misled if there are clusters of highly different size or different density
 The k-means algorithm is linear in the number of data points; the k-medoids algorithm is quadratic in
the number of points
http://publicationslist.org/junio
Clustering methods – DBSCAN
Neighborhood growers work by connecting points that are
“sufficiently close” to each other to form a cluster and then keep
doing so until all points have been classified
Based on the idea (definition) of a cluster as a region of high density,
and it makes no assumptions about the overall shape of the cluster
More robust than the k-means variants with respect to the structure of
the clusters
http://publicationslist.org/junio
Clustering methods – DBSCAN
The DBSCAN algorithm is an example of a neighborhood grower
It is based on two criteria:
 The minimum density accepted for the points that define the cluster
 The size of the region over which we expect the minimum density to be
verified
 In practice, the algorithm asks for:
 The neighborhood radius r
 The minimum number of points n that we expect to find within the neighborhood of each
point
http://publicationslist.org/junio
Clustering methods – DBSCAN
DBSCAN distinguishes between three types of points: noise, core,
and edge points:
 A noise point is a point that has fewer than n points in its
neighborhood of radius r; such a point does not belong to any
cluster – it is background data
A core point has more than n neighbors
An edge point is a point that has fewer neighbors than required for
a core point but that is itself the neighbor of a core point - the
algorithm discards noise points and concentrates on core points
Whenever the algorithm finds a core point, it assigns a cluster
label to that point and then continues to add all its neighbors,
and their neighbors recursively to the cluster, until all points
have been classified
http://publicationslist.org/junio
Clustering methods – DBSCAN
DBSCAN distinguishes between three types of points: noise, core,
and edge points:
 A noise point is a point that has fewer than n points in its
neighborhood of radius r; such a point does not belong to any
cluster
A core point has more than n neighbors
An edge point is a point that has fewer neighbors than required for
a core point but that is itself the neighbor of a core point - the
algorithm discards noise points and concentrates on core points
Whenever the algorithm finds a core point, it assigns a cluster
label to that point and then continues to add all its neighbors,
and their neighbors recursively to the cluster, until all points
have been classified
Finally, the basic algorithm lends itself to elegant recursive implementations,
but keep in mind that the recursion will not unwind until the current
cluster is complete. This means that, in the worst case (of a single
connected cluster), you will end up putting the entire data set onto the
stack!
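To illustrate, a minimal sketch of DBSCAN that uses an explicit queue instead of recursion, avoiding the stack problem just mentioned (Euclidean neighborhoods over a NumPy array, and a "core if at least n neighbors" cutoff, are assumptions of this sketch):

import numpy as np
from collections import deque

def dbscan(points, r, n):
    # labels: -1 marks noise; clusters are numbered from 0 upward
    N = len(points)
    labels = [None] * N

    def neighbors(i):
        dist = np.linalg.norm(points - points[i], axis=1)
        return [j for j in range(N) if j != i and dist[j] <= r]

    cluster = 0
    for i in range(N):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < n:            # too sparse: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cluster          # a new core point starts a cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:      # former noise becomes an edge point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors(j)) >= n:   # j is itself a core point: expand
                queue.extend(neighbors(j))
        cluster += 1
    return labels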
http://publicationslist.org/junio
Clustering methods – DBSCAN
DBSCAN is sensitive to the choice of parameters
For example, if a data set contains several clusters with widely varying
densities, then a single set of parameters may not be sufficient to
classify all of the clusters
A possible workaround is to use k-means first to identify cluster
candidates, and then to extract statistics that help parameterize
DBSCAN
The computational complexity of DBSCAN is O(n²), which can be
reduced by indexing structures able to quickly find the neighbors
of each point
http://publicationslist.org/junio
Clustering methods – tree builders
Another way to find clusters is by successively combining clusters
that are “close” to each other into a larger cluster until only a single
cluster remains; this approach is known as agglomerative hierarchical
clustering, and it leads to a treelike hierarchy of clusters
The distance between clusters is defined with respect to representative
points within each cluster; the possibilities are:
 Minimum or single link: the two points, one from each cluster, that are
closest to each other; handles thinly connected clusters with complicated
shapes, but it is sensitive to noise
 Maximum or complete link: considers the two points farthest away from each
other; favors compact, globular clusters
 Average: considers the average distance between all pairs of points
 Centroid: considers the centroids of each cluster
 Ward’s method: combines the clusters whose merge is the most coherent; coherence
can be the average distance of all pairs, for example
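As an illustration, these linkage criteria map directly onto SciPy's agglomerative clustering; a minimal sketch (the random toy data is only for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.random((50, 2))               # toy data, for illustration only
Z = linkage(X, method="single")       # also: "complete", "average",
                                      #       "centroid", "ward"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
# dendrogram(Z) draws the treelike hierarchy of clusters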
http://publicationslist.org/junio
Clustering methods – tree builders
The result of hierarchical clustering is not actually a set of clusters;
instead, we obtain a treelike structure that contains the individual
data points at the leaf nodes - this structure can be represented
graphically in a dendrogram
Tree builder algorithms are expensive, on the order of O(n³)
 One outstanding feature of hierarchical clustering is that it does more than
produce a flat list of clusters; it also shows their relationships in an explicit way
 Tree builders can benefit from algorithms that are center seekers or
neighborhood growers
http://publicationslist.org/junio
Pre-processing
The core algorithm for grouping data points into clusters is usually
only part (though the most important one) of the whole strategy
Some data sets may require some cleanup or normalization before
they are suitable for clustering: that’s the first topic in this section
For example, look at the two plots below and answer: which one has
well-defined clusters?
http://publicationslist.org/junio
Pre-processing
For example, look at the two plots below and answer: which one has
well-defined clusters?
 Well, as a matter of fact, both plots show the same dataset, but with different
aspect ratios
 The same applies to datasets that span very different ranges – in such cases,
it is necessary to normalize the data
 Problems like these do not arise with correlation-based distances
http://publicationslist.org/junio
Pre-processing
 The simplest normalization can be achieved by:
x’ = (x – xmin)/(xmax – xmin)
 Or, otherwise, if the data is reasonably Gaussian, it is possible to use the Z-
score normalization:
x’ = (x – xmean)/xStdDev
But first, use an Interquartile Range analysis to get rid of outliers
 Actually, normalization is very sensitive to outliers and distributions that are
too skewed – for these cases, there are many other normalization
techniques, check for instance:
http://stn.spotfire.com/spotfire_client_help/norm/norm_normalizing_columns.htm
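A minimal sketch of the two normalizations above, preceded by the suggested interquartile-range filter (the factor 1.5 is the conventional choice, an assumption here):

import numpy as np

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

def zscore(x):
    return (x - x.mean()) / x.std()

def iqr_filter(x, k=1.5):
    # keep only points within k * IQR of the quartiles
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - k * iqr) & (x <= q3 + k * iqr)]

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])   # 100.0 is an outlier
print(zscore(iqr_filter(x)))                # z-score after removing it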
http://publicationslist.org/junio
Pre-processing
 The simplest normalization can be achieved by:
x’ = (x – xmin)/(xmax-xmin)
 Or, otherwise, if the data is reasonably Gaussian, it is possible to use the Z-
score normalization:
x’ = (x - xmean)/xStdDev
But first, use an Interquartile Range analysis to get rid of outliers
 Actually, normalization is very sensitive to outliers and distributions that are
too skewed – for these cases, there are many other normalization
techniques, check for instance:
http://stn.spotfire.com/spotfire_client_help/norm/norm_normalizing_columns.htm
http://stn.spotfire.com/spotfire_client_help/norm/norm_normalizing_columns.htm
Normalization by Mean
Normalization by Trimmed Mean
Normalization by Percentile
Scale between 0 and 1
Subtract the Mean
Subtract the Median
Normalization by Signed Ratio
Normalization by Log Ratio
Normalization by Log Ratio in Standard Deviation Units
Z-score Calculation
Normalization by Standard Deviation
Also, the Mahalanobis distance is less susceptible to normalization issues
http://publicationslist.org/junio
Post-processing (cluster evaluation)
 It is also necessary to inspect the results of every clustering algorithm in
order to validate and characterize the clusters that have been found
 Given a set of clusters whose centroids are known, we can think of two
metrics:
 Mass: the number of points in the cluster
 Radius: the standard deviation of the distances of all points in relation to
the center of a given cluster; for two dimensions, we would have:
r² = (1/N) ∑ᵢ [(xc – xi)² + (yc – yi)²]
where (xc, yc) is the center of the cluster and N its number of points
 We can also have the density of a cluster given by:
density = mass/radius
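A sketch computing the three quantities for one cluster of 2-D points, treating the radius as the root-mean-square distance to the center, per the definition above:

import numpy as np

def cluster_stats(points):
    center = points.mean(axis=0)                     # (xc, yc)
    mass = len(points)
    # radius: root of the mean squared distance to the center
    radius = np.sqrt(np.mean(np.sum((points - center) ** 2, axis=1)))
    return mass, radius, mass / radius               # density = mass/radius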
http://publicationslist.org/junio
Post-processing (cluster evaluation)
 Besides density, there are:
 Cohesion: the average distance between all points in a cluster; the smaller, the more
compact
 Separation: the average distance between all points in one cluster and all the points in
another cluster – if we know the centroids, we can use them to simplify the computation
 For a set of clusters, we can calculate the average cohesion and separation
for all clusters, and have an idea of the overall quality
 If a data set can be clearly grouped into clusters, then we expect the
distance between the clusters to be large compared to the radii of the
clusters; therefore, we can think of an interesting metric based on cohesion
and separation:
cluster_quality = separation/cohesion
http://publicationslist.org/junio
Post-processing (cluster evaluation)
 One of the most used metrics for clustering is the silhouette coefficient, which for a
single point i is given by:
Si = (bi – ai) / max(ai, bi)
where ai is the average distance from point i to all other points in its cluster (this is
point i’s cohesion), and bi is the smallest average distance from point i to all the points in
each of the other clusters (this is point i’s separation from the closest other cluster)
 The numerator is a measure of the “empty space” between clusters; the
denominator is the larger of the cluster radius and the distance between clusters
 Next, average the silhouette over all points in each cluster – this is the cluster’s
silhouette; average it over all clusters – this is the clustering’s silhouette
 The silhouette coefficient ranges from −1 to 1; negative values indicate that the
cluster radius is greater than the distance between clusters, so that clusters overlap;
this suggests poor clustering. Large values of S suggest good clustering
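For illustration, scikit-learn computes both the per-point coefficients and the overall average; a minimal sketch on random toy data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = rng.random((200, 2))                       # toy data, for illustration
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
s = silhouette_samples(X, labels)              # one coefficient per point
for c in range(3):
    print(c, s[labels == c].mean())            # each cluster's silhouette
print(silhouette_score(X, labels))             # the clustering's silhouette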
http://publicationslist.org/junio
Post-processing (cluster evaluation)
 One of the most used metrics for clustering is the silhouette coefficient, which for a
single point i is given by:
Si = (bi – ai) / max(ai, bi)
where ai is the average distance from point i to all other points in its cluster (this is
point i’s cohesion), and bi is the smallest average distance from point i to all the points in
each of the other clusters (this is point i’s separation from the closest other cluster)
 The numerator is a measure of the “empty space” between clusters; the
denominator is the larger of the cluster radius and the distance between clusters
 Next, average the silhouette over all points in each cluster – this is the cluster’s
silhouette; average it over all clusters – this is the clustering’s silhouette
 The silhouette coefficient ranges from −1 to 1; negative values indicate that the
cluster radius is greater than the distance between clusters, so that clusters overlap;
this suggests poor clustering. Large values of S suggest good clustering
The silhouette can be used to discard background points from the clustering process,
that is, points that markedly exceed the average cohesion within a given cluster.
This process can be applied iteratively – once some points are discarded, the
clustering can be repeated, hopefully producing better results; and so on.
http://publicationslist.org/junio
Post-processing (cluster evaluation)
 The clustering silhouette is very important: it not only tells us the quality of
a clustering, it can also tell us what the correct clustering is; for example,
consider the following dataset:
http://publicationslist.org/junio
Post-processing (cluster evaluation)
 The clustering silhouette is very important: it not only tells us the quality of
a clustering, it can also tell us what the correct clustering is; for example,
consider the following dataset:
Clearly we have clusters, but how many? Visually, we can make out from 6 to 8 clusters,
depending on the observer.
What to do?
http://publicationslist.org/junio
Post-processing (cluster evaluation)
 One way to solve this problem is to use the k-means algorithm and calculate
the silhouette for different numbers of clusters
 In our example, we would get the following curve:
[Plot: silhouette coefficient versus number of clusters, with peaks at 6 and 7]
http://publicationslist.org/junio
Post-processing (cluster evaluation)
 One way to solve this problem is to use the k-means algorithm and calculate
the silhouette for different numbers of clusters
 In our example, we would get the following curve:
[Plot: silhouette coefficient versus number of clusters, with peaks at 6 and 7]
The plot indicates that 6 or 7 clusters are acceptable answers; the next stage is to
consider the data characteristics in order to decide which is the best answer.
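A sketch of this procedure, assuming scikit-learn and a hypothetical file points.txt holding the 2-D points:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.loadtxt("points.txt")    # hypothetical file with the data points
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))   # peaks suggest plausible k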
http://publicationslist.org/junio
Warning
 Just like any other analytical technique, clustering can lead you to
unproductive circumstances (a waste of time) if not used with caution; some
points of concern:
 Most algorithms depend on heuristic parameters, and finding the most appropriate
values may demand hours of work
 Also, the algorithms lend themselves to modifications that, although they may sound
intuitively right, take you nowhere
 It is quite possible that the data has no clusters at all, even though you are looking for
them; this is not such an improbable circumstance, because clustering algorithms are
usually treated as black boxes – be circumspect, and pay attention to the evidence!
 Despite the fact that there are evaluation methods and visualization tools, the
clustering result may still be flawed; remember, there is no formal theory behind the
concept of a cluster
 Finally, this review is addressed mostly to practitioners, not to academic personnel;
for the latter, there are many other aspects that must be considered – for more details,
please check the paper “A Survey of Clustering Data Mining Techniques” by Pavel
Berkhin, among other sources
http://publicationslist.org/junio
References
 Philipp K. Janert, Data Analysis with Open Source Tools,
O’Reilly, 2010.
 Pavel Berkhin, “A Survey of Clustering Data Mining Techniques”,
in Grouping Multidimensional Data, Springer, 2006.
 Wikipedia, http://en.wikipedia.org
 Wolfram MathWorld, http://mathworld.wolfram.com/

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning by Analogy
Machine Learning by AnalogyMachine Learning by Analogy
Machine Learning by AnalogyColleen Farrelly
 
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...Harish Vaidyanathan
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster AnalysisSSA KPI
 
Community Detection
Community Detection Community Detection
Community Detection Kanika Kanwal
 
István Dienes Lecture For Unified Theories 2006
István Dienes Lecture For Unified Theories 2006István Dienes Lecture For Unified Theories 2006
István Dienes Lecture For Unified Theories 2006Istvan Dienes
 
The mathematical and philosophical concept of vector
The mathematical and philosophical concept of vectorThe mathematical and philosophical concept of vector
The mathematical and philosophical concept of vectorGeorge Mpantes
 

Was ist angesagt? (9)

Machine Learning by Analogy
Machine Learning by AnalogyMachine Learning by Analogy
Machine Learning by Analogy
 
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
Community Detection
Community Detection Community Detection
Community Detection
 
A0310112
A0310112A0310112
A0310112
 
17 Statistical Models for Networks
17 Statistical Models for Networks17 Statistical Models for Networks
17 Statistical Models for Networks
 
István Dienes Lecture For Unified Theories 2006
István Dienes Lecture For Unified Theories 2006István Dienes Lecture For Unified Theories 2006
István Dienes Lecture For Unified Theories 2006
 
Bachelor's Thesis
Bachelor's ThesisBachelor's Thesis
Bachelor's Thesis
 
The mathematical and philosophical concept of vector
The mathematical and philosophical concept of vectorThe mathematical and philosophical concept of vector
The mathematical and philosophical concept of vector
 

Ähnlich wie Data Clustering Techniques Explored in Educational Document

International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Read first few slides cluster analysis
Read first few slides cluster analysisRead first few slides cluster analysis
Read first few slides cluster analysisKritika Jain
 
Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01deepti gupta
 
PCA (Principal component analysis)
PCA (Principal component analysis)PCA (Principal component analysis)
PCA (Principal component analysis)Learnbay Datascience
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster AnalysisDerek Kane
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithmLaura Petrosanu
 
Exploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationExploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationChristopher Peter Makris
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningKnoldus Inc.
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)Abhimanyu Dwivedi
 
GIS in Public Health Research: Understanding Spatial Analysis and Interpretin...
GIS in Public Health Research: Understanding Spatial Analysis and Interpretin...GIS in Public Health Research: Understanding Spatial Analysis and Interpretin...
GIS in Public Health Research: Understanding Spatial Analysis and Interpretin...hpaocec
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.pptSamPrem3
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.pptPalaniKumarR2
 
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkkOBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkkshesnasuneer
 

Ähnlich wie Data Clustering Techniques Explored in Educational Document (20)

International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Read first few slides cluster analysis
Read first few slides cluster analysisRead first few slides cluster analysis
Read first few slides cluster analysis
 
Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01
 
Clusteranalysis
Clusteranalysis Clusteranalysis
Clusteranalysis
 
[PPT]
[PPT][PPT]
[PPT]
 
Document 8 1.pdf
Document 8 1.pdfDocument 8 1.pdf
Document 8 1.pdf
 
PCA (Principal component analysis)
PCA (Principal component analysis)PCA (Principal component analysis)
PCA (Principal component analysis)
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Clustering
ClusteringClustering
Clustering
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster Analysis
 
Cluster Analysis.pptx
Cluster Analysis.pptxCluster Analysis.pptx
Cluster Analysis.pptx
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
 
Exploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationExploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image Segmentation
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
GIS in Public Health Research: Understanding Spatial Analysis and Interpretin...
GIS in Public Health Research: Understanding Spatial Analysis and Interpretin...GIS in Public Health Research: Understanding Spatial Analysis and Interpretin...
GIS in Public Health Research: Understanding Spatial Analysis and Interpretin...
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
 
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkkOBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
 

Mehr von Universidade de São Paulo

Introdução às ferramentas de Business Intelligence do ecossistema Hadoop
Introdução às ferramentas de Business Intelligence do ecossistema HadoopIntrodução às ferramentas de Business Intelligence do ecossistema Hadoop
Introdução às ferramentas de Business Intelligence do ecossistema HadoopUniversidade de São Paulo
 
On the Support of a Similarity-Enabled Relational Database Management System ...
On the Support of a Similarity-Enabled Relational Database Management System ...On the Support of a Similarity-Enabled Relational Database Management System ...
On the Support of a Similarity-Enabled Relational Database Management System ...Universidade de São Paulo
 
Effective and Unsupervised Fractal-based Feature Selection for Very Large Dat...
Effective and Unsupervised Fractal-based Feature Selection for Very Large Dat...Effective and Unsupervised Fractal-based Feature Selection for Very Large Dat...
Effective and Unsupervised Fractal-based Feature Selection for Very Large Dat...Universidade de São Paulo
 
Fire Detection on Unconstrained Videos Using Color-Aware Spatial Modeling and...
Fire Detection on Unconstrained Videos Using Color-Aware Spatial Modeling and...Fire Detection on Unconstrained Videos Using Color-Aware Spatial Modeling and...
Fire Detection on Unconstrained Videos Using Color-Aware Spatial Modeling and...Universidade de São Paulo
 
Unveiling smoke in social images with the SmokeBlock approach
Unveiling smoke in social images with the SmokeBlock approachUnveiling smoke in social images with the SmokeBlock approach
Unveiling smoke in social images with the SmokeBlock approachUniversidade de São Paulo
 
Vertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale Graphs
Vertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale GraphsVertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale Graphs
Vertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale GraphsUniversidade de São Paulo
 
Fast Billion-scale Graph Computation Using a Bimodal Block Processing Model
Fast Billion-scale Graph Computation Using a Bimodal Block Processing ModelFast Billion-scale Graph Computation Using a Bimodal Block Processing Model
Fast Billion-scale Graph Computation Using a Bimodal Block Processing ModelUniversidade de São Paulo
 
StructMatrix: large-scale visualization of graphs by means of structure detec...
StructMatrix: large-scale visualization of graphs by means of structure detec...StructMatrix: large-scale visualization of graphs by means of structure detec...
StructMatrix: large-scale visualization of graphs by means of structure detec...Universidade de São Paulo
 
Techniques for effective and efficient fire detection from social media images
Techniques for effective and efficient fire detection from social media imagesTechniques for effective and efficient fire detection from social media images
Techniques for effective and efficient fire detection from social media imagesUniversidade de São Paulo
 
Multimodal graph-based analysis over the DBLP repository: critical discoverie...
Multimodal graph-based analysis over the DBLP repository: critical discoverie...Multimodal graph-based analysis over the DBLP repository: critical discoverie...
Multimodal graph-based analysis over the DBLP repository: critical discoverie...Universidade de São Paulo
 
Supervised-Learning Link Recommendation in the DBLP co-authoring network
Supervised-Learning Link Recommendation in the DBLP co-authoring networkSupervised-Learning Link Recommendation in the DBLP co-authoring network
Supervised-Learning Link Recommendation in the DBLP co-authoring networkUniversidade de São Paulo
 
Reviewing Data Visualization: an Analytical Taxonomical Study
Reviewing Data Visualization: an Analytical Taxonomical StudyReviewing Data Visualization: an Analytical Taxonomical Study
Reviewing Data Visualization: an Analytical Taxonomical StudyUniversidade de São Paulo
 
Complexidade de Algoritmos, Notação assintótica, Algoritmos polinomiais e in...
Complexidade de Algoritmos, Notação assintótica, Algoritmos polinomiais e in...Complexidade de Algoritmos, Notação assintótica, Algoritmos polinomiais e in...
Complexidade de Algoritmos, Notação assintótica, Algoritmos polinomiais e in...Universidade de São Paulo
 
Visualization tree multiple linked analytical decisions
Visualization tree multiple linked analytical decisionsVisualization tree multiple linked analytical decisions
Visualization tree multiple linked analytical decisionsUniversidade de São Paulo
 

Mehr von Universidade de São Paulo (20)

A gentle introduction to Deep Learning
A gentle introduction to Deep LearningA gentle introduction to Deep Learning
A gentle introduction to Deep Learning
 
Computação: carreira e mercado de trabalho
Computação: carreira e mercado de trabalhoComputação: carreira e mercado de trabalho
Computação: carreira e mercado de trabalho
 
Introdução às ferramentas de Business Intelligence do ecossistema Hadoop
Introdução às ferramentas de Business Intelligence do ecossistema HadoopIntrodução às ferramentas de Business Intelligence do ecossistema Hadoop
Introdução às ferramentas de Business Intelligence do ecossistema Hadoop
 
On the Support of a Similarity-Enabled Relational Database Management System ...
On the Support of a Similarity-Enabled Relational Database Management System ...On the Support of a Similarity-Enabled Relational Database Management System ...
On the Support of a Similarity-Enabled Relational Database Management System ...
 
Effective and Unsupervised Fractal-based Feature Selection for Very Large Dat...
Effective and Unsupervised Fractal-based Feature Selection for Very Large Dat...Effective and Unsupervised Fractal-based Feature Selection for Very Large Dat...
Effective and Unsupervised Fractal-based Feature Selection for Very Large Dat...
 
Fire Detection on Unconstrained Videos Using Color-Aware Spatial Modeling and...
Fire Detection on Unconstrained Videos Using Color-Aware Spatial Modeling and...Fire Detection on Unconstrained Videos Using Color-Aware Spatial Modeling and...
Fire Detection on Unconstrained Videos Using Color-Aware Spatial Modeling and...
 
Unveiling smoke in social images with the SmokeBlock approach
Unveiling smoke in social images with the SmokeBlock approachUnveiling smoke in social images with the SmokeBlock approach
Unveiling smoke in social images with the SmokeBlock approach
 
Vertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale Graphs
Vertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale GraphsVertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale Graphs
Vertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale Graphs
 
Fast Billion-scale Graph Computation Using a Bimodal Block Processing Model
Fast Billion-scale Graph Computation Using a Bimodal Block Processing ModelFast Billion-scale Graph Computation Using a Bimodal Block Processing Model
Fast Billion-scale Graph Computation Using a Bimodal Block Processing Model
 
An introduction to MongoDB
An introduction to MongoDBAn introduction to MongoDB
An introduction to MongoDB
 
StructMatrix: large-scale visualization of graphs by means of structure detec...
StructMatrix: large-scale visualization of graphs by means of structure detec...StructMatrix: large-scale visualization of graphs by means of structure detec...
StructMatrix: large-scale visualization of graphs by means of structure detec...
 
Apresentacao vldb
Apresentacao vldbApresentacao vldb
Apresentacao vldb
 
Techniques for effective and efficient fire detection from social media images
Techniques for effective and efficient fire detection from social media imagesTechniques for effective and efficient fire detection from social media images
Techniques for effective and efficient fire detection from social media images
 
Multimodal graph-based analysis over the DBLP repository: critical discoverie...
Multimodal graph-based analysis over the DBLP repository: critical discoverie...Multimodal graph-based analysis over the DBLP repository: critical discoverie...
Multimodal graph-based analysis over the DBLP repository: critical discoverie...
 
Supervised-Learning Link Recommendation in the DBLP co-authoring network
Supervised-Learning Link Recommendation in the DBLP co-authoring networkSupervised-Learning Link Recommendation in the DBLP co-authoring network
Supervised-Learning Link Recommendation in the DBLP co-authoring network
 
Graph-based Relational Data Visualization
Graph-based RelationalData VisualizationGraph-based RelationalData Visualization
Graph-based Relational Data Visualization
 
Reviewing Data Visualization: an Analytical Taxonomical Study
Reviewing Data Visualization: an Analytical Taxonomical StudyReviewing Data Visualization: an Analytical Taxonomical Study
Reviewing Data Visualization: an Analytical Taxonomical Study
 
Complexidade de Algoritmos, Notação assintótica, Algoritmos polinomiais e in...
Complexidade de Algoritmos, Notação assintótica, Algoritmos polinomiais e in...Complexidade de Algoritmos, Notação assintótica, Algoritmos polinomiais e in...
Complexidade de Algoritmos, Notação assintótica, Algoritmos polinomiais e in...
 
Dawarehouse e OLAP
Dawarehouse e OLAPDawarehouse e OLAP
Dawarehouse e OLAP
 
Visualization tree multiple linked analytical decisions
Visualization tree multiple linked analytical decisionsVisualization tree multiple linked analytical decisions
Visualization tree multiple linked analytical decisions
 

Kürzlich hochgeladen

ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvRicaMaeCastro1
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxSayali Powar
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleCeline George
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseCeline George
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
 
Mental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young mindsMental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young mindsPooky Knightsmith
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
this way) but that we have a hard time teaching computers to do. For problems in two dimensions, digital image processing has developed methods to recognize and extract certain features (such as edge detection). But general clustering methods deal only with local properties and therefore cannot handle problems such as these.
  • 13-14. http://publicationslist.org/junio But, what is a cluster?
If we return to our candidate definitions of a cluster, we can verify that none of them survives the possibilities just presented – try it!
 groups of points that are similar
 groups of points that are close to each other
 groups of points well-separated one from each other
 contiguous regions of high data point density separated by regions of lower point density
So this is it:
• There is no mathematical, universal definition of a cluster
• Rather, we have our intuition, which can be quite useful provided we have a good comprehension of the data properties – structural, statistical, and domain-related
• Having well-defined goals, as far as possible, is also a requirement
• Just as with any other data analysis approach, do not use clustering as a magic black box – doing so will fail with high probability!
  • 15. http://publicationslist.org/junio Distances
Clustering does not actually require data points to be embedded in a geometric space: all that is required is a distance or (equivalently) a similarity measure for any pair of points
 This makes it possible to perform clustering on a set of strings, for example
 However, if the data points have the properties of a vector space, then we can develop more efficient algorithms that exploit those properties
  • 16. http://publicationslist.org/junio Distances – what is a distance?
A distance is any function d(x, y) that takes two points and returns a scalar value that measures how different these points are: the more different, the larger the distance
A distance function can be turned into a similarity function, for example:
 s(x, y) = 1 − d(x, y), for 0 ≤ d(x, y) ≤ 1
 s(x, y) = 1/d(x, y)
 s(x, y) = e^(−d(x, y))
For some problems, a particular distance measure presents itself naturally – if the data points are points in space, then we will most likely employ the Euclidean distance or a measure similar to it; for other problems, we have more freedom to define our own metric
  • 17-18. http://publicationslist.org/junio Distances – metric distances
 There are certain properties that a distance (or similarity) function should have. Mathematicians have developed a set of properties that a function must possess to be considered a metric (a distance) in the mathematical sense:
 d(x, y) ≥ 0 (non-negativity)
 d(x, y) = 0 if and only if x = y (identity)
 d(x, y) = d(y, x) (symmetry)
 d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality)
 These conditions are not necessarily fulfilled in practice. A funny example of an asymmetric distance occurs if you ask everyone in a group of people how much they like every other member of the group and then use the responses to construct a distance measure: it is not at all guaranteed that the feelings of person A for person B are requited by B
For technical reasons, the symmetry property is usually highly desirable. You can always construct a symmetric distance function from an asymmetric one: dS(x, y) = (d(x, y) + d(y, x)) / 2
  • 19. http://publicationslist.org/junio Distances – common distances  Commonly used distance and similarity measures for numeric data
  • 20. http://publicationslist.org/junio Distances – common distances
 The Manhattan, Euclidean, Maximum, and Minkowski distances all have similar properties; which one to apply may depend on empirical testing or on subtle details of the data domain:
 Manhattan (L1 metric): d(x, y) = Σi |xi − yi|
 Euclidean (L2 metric): d(x, y) = (Σi (xi − yi)²)^(1/2)
 Maximum (L∞ metric): d(x, y) = maxi |xi − yi|
 Minkowski (Lp metric): d(x, y) = (Σi |xi − yi|^p)^(1/p)
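To make these formulas concrete, here is a minimal Python sketch (not part of the original deck; the function names are mine):

    def minkowski(x, y, p):
        # L_p metric: p = 1 gives Manhattan, p = 2 gives Euclidean
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    def maximum(x, y):
        # L_infinity metric: the largest coordinate-wise difference
        return max(abs(a - b) for a, b in zip(x, y))

    x, y = (1.0, 2.0), (4.0, 6.0)
    print(minkowski(x, y, 1))  # Manhattan: 7.0
    print(minkowski(x, y, 2))  # Euclidean: 5.0
    print(maximum(x, y))       # Maximum: 4.0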
  • 21. http://publicationslist.org/junio Distances – correlation-based
 Correlation-based measures are used if the data is numeric but not mixable (so that it does not make sense to add a random fraction of one data set to a random fraction of a different data set), as, for example, in time series
The normalized dot product of two vectors is the cosine of the angle between them: if they are perfectly aligned, the angle is 0 and the cosine (and the correlation) is 1; if they are at right angles to each other, the cosine is 0
 The only difference between the dot product and the correlation coefficient is that for the latter we first center both data points by subtracting their respective means
 The cosine falls in the interval [−1, 1] (in [0, 1] for vectors with non-negative components), and the correlation coefficient always falls in the interval [−1, 1]
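A small sketch of the difference (illustrative code, not from the deck; it assumes non-zero vectors):

    import math

    def cosine_similarity(x, y):
        # dot product normalized by the vector lengths
        dot = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        return dot / (nx * ny)

    def correlation(x, y):
        # same as the cosine, but after centering each vector on its mean
        mx, my = sum(x) / len(x), sum(y) / len(y)
        return cosine_similarity([a - mx for a in x], [b - my for b in y])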
  • 22-23. http://publicationslist.org/junio Distances – binary and sparse
If the data is categorical, then we can count the number of features that do not agree in both data points (i.e., the number of mismatched features); this is the Hamming distance
As an example, imagine a patient’s health record: each possible medical condition constitutes a feature, and we want to know whether the patient has ever suffered from it
In situations where the features are categorical, binary, and sparse (just a few are On), we may be more interested in matches between features that are On than in those that are Off; this leads us to the Jaccard coefficient sJ: the number of matches between features that are On for both points, divided by the number of features that are On in at least one of the data points
The Jaccard coefficient is a similarity measure; the corresponding distance function is the Jaccard distance dJ = 1 − sJ
As an example for the Jaccard distance, imagine graph data. The similarity of two vertices is given by how many neighbors they have in common (On) – a relation that is usually sparse, as only a few vertices are neighbors of any given vertex
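A sketch of both measures in Python (names and toy data are mine; for the Jaccard case, each point is represented by the set of its On features):

    def hamming(x, y):
        # number of positions at which the two feature vectors disagree
        return sum(1 for a, b in zip(x, y) if a != b)

    def jaccard_distance(x, y):
        # x and y are the sets of features that are On in each point
        on_in_both = len(x & y)
        on_in_either = len(x | y)
        return 1 - on_in_both / on_in_either  # d_J = 1 - s_J

    print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))       # 2
    print(jaccard_distance({"a", "b"}, {"b", "c"}))  # 1 - 1/3 = 0.666...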
  • 24-25. http://publicationslist.org/junio Distances – strings
If we are dealing with many strings that are rather similar to each other (distorted through typos, for instance), then we can use a more detailed measure of the difference between them, namely the edit or Levenshtein distance. The Levenshtein distance is the minimum number of single-character operations (insertions, deletions, and substitutions) required to transform one string into the other
Another approach is to find the length of the longest common subsequence; this metric is often used for gene sequence analysis in computational biology
The best distance measure to use does not follow automatically from the data type; rather, it depends on the semantics of the data – or, more precisely, on the semantics that you care about for your current analysis! In some cases, a simple metric that only calculates the difference in string length may be perfectly sufficient. In another case, you might want to use the Hamming distance. If you really care about the details of otherwise similar strings, the Levenshtein distance is most appropriate. You might even want to calculate how often each letter appears in a string and then base your comparison on that. It all depends on what the data means and on what aspect of it you are interested in at the moment (which may also change as the analysis progresses). Similar considerations apply everywhere – there are no “cookbook” rules.
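A standard dynamic-programming implementation of the Levenshtein distance, included here as an illustrative sketch:

    def levenshtein(s, t):
        # prev[j] holds the distance from the current prefix of s to t[:j]
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            curr = [i]
            for j, ct in enumerate(t, 1):
                cost = 0 if cs == ct else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3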
  • 26. http://publicationslist.org/junio Clustering methods
Different algorithms are suitable for different kinds of problems – depending, for example, on the shape and structure of the clusters
Some require vector-like data, whereas others require only a distance function
Different algorithms tend to be misled by different kinds of pitfalls, and they all have different performance (i.e., computational complexity) characteristics
There are three main categories of clustering algorithms: center seekers, tree builders, and neighborhood growers – three main ones, not only three in total (see Pavel Berkhin’s “Survey of Clustering Data Mining Techniques”)
  • 27. http://publicationslist.org/junio Clustering methods – k-means
One of the most popular clustering methods is the k-means algorithm; it requires the number of expected clusters k as input and works in an iterative scheme to search for the correct center of each cluster
The main idea is to calculate the position of each cluster’s center (or centroid) from the positions of the points belonging to the cluster and then to assign points to their nearest centroid – this process is repeated until sufficient convergence is achieved
The algorithm is as follows:

    choose initial positions for the cluster centroids
    repeat until sufficiently converged:
        for each point:
            calculate its distance from each cluster centroid
            assign the point to the nearest cluster
        recalculate the positions of the cluster centroids
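A minimal, self-contained sketch of this loop in Python (illustrative only; a real analysis would more likely rely on a library implementation such as scikit-learn’s KMeans):

    import random

    def kmeans(points, k, iterations=100):
        # points: list of numeric tuples; start from k random data points
        centroids = random.sample(points, k)
        for _ in range(iterations):
            # assignment step: attach each point to its nearest centroid
            clusters = [[] for _ in range(k)]
            for p in points:
                d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
                clusters[d.index(min(d))].append(p)
            # update step: move each centroid to its cluster's center of mass
            for i, cl in enumerate(clusters):
                if cl:
                    centroids[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
        return centroids, clusters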
  • 28-29. http://publicationslist.org/junio Clustering methods – k-means
 The k-means algorithm is nondeterministic: a different choice of starting values may result in a different assignment of points to clusters; for this reason, it is customary to run the k-means algorithm several times and then compare the results
 If you have previous knowledge of likely positions for the cluster centers, you can use it to precondition the algorithm; otherwise, choose random data points as initial values
 What makes this algorithm efficient is that you don’t have to search the existing data points to find one that would make a good centroid – instead you are free to construct a new centroid position; this is usually done by calculating the cluster’s center of mass: c = (1/|C|) Σ x, for all points x in the cluster C
If the data is categorical, then the k-means algorithm cannot be used (one cannot calculate a center of mass); in this case we must use the k-medoids algorithm
The only difference is that instead of calculating a new centroid, it is necessary to search all the points in the cluster to find the data point that has the smallest average distance to all other points in its cluster
For this reason, the k-medoids algorithm is O(n²), whereas the k-means algorithm is O(k·n), where k is the number of clusters
For performance, it is possible to run k-medoids on a sample of the dataset to get an idea of the cluster centers, and then run it on the entire dataset
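The medoid search itself is short enough to show (a sketch; dist stands for any pairwise distance function, as discussed earlier):

    def medoid(cluster, dist):
        # the member point with the smallest total distance to all others,
        # found by exhaustive search -- this is what makes k-medoids O(n^2)
        return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))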
  • 30. http://publicationslist.org/junio Clustering methods – k-means
 Despite its cheap-and-cheerful appearance, the k-means algorithm works surprisingly well. It is pretty fast and relatively robust, and convergence is usually quick. Because the algorithm is simple and highly intuitive, it is easy to augment or extend it – for example, to incorporate points with different weights. You might also want to experiment with different ways to calculate the centroid, possibly using the median position rather than the mean, and so on.
 In summary:
 The k-means algorithm and its variants work best for globular (at least star-convex) clusters; the results will be meaningless for clusters with complicated shapes and for nested clusters
 The expected number of clusters is required as an input; if this number is not known, it will be necessary to repeat the algorithm with different values and compare the results
 The algorithm is iterative and nondeterministic; the specific outcome may depend on the choice of starting values
 The k-means algorithm requires vector data; use the k-medoids algorithm for categorical data
 The algorithm can be misled if there are clusters of highly different size or density
 The k-means algorithm is linear in the number of data points; the k-medoids algorithm is quadratic in the number of points
  • 31. http://publicationslist.org/junio Clustering methods – DBSCAN
Neighborhood growers work by connecting points that are “sufficiently close” to each other to form a cluster and then keep doing so until all points have been classified
This approach is based on the idea (definition) of a cluster as a region of high density, and it makes no assumptions about the overall shape of the cluster
It is more robust than the k-means variants with respect to the structure of the clusters
  • 32. http://publicationslist.org/junio Clustering methods – DBSCAN
The DBSCAN algorithm is an example of a neighborhood grower
It is based on two quantities:
 The minimum density accepted for the points that define the cluster
 The size of the region over which we expect the minimum density to hold
 In practice, the algorithm asks for:
 The neighborhood radius r
 The minimum number of points n that we expect to find within the neighborhood of each point
  • 33-34. http://publicationslist.org/junio Clustering methods – DBSCAN
DBSCAN distinguishes between three types of points: noise, core, and edge points:
 A noise point is a point that has fewer than n points in its neighborhood of radius r; such a point does not belong to any cluster – it is background data
 A core point has more than n neighbors
 An edge point is a point that has fewer neighbors than required for a core point but that is itself the neighbor of a core point
The algorithm discards noise points and concentrates on core points. Whenever it finds a core point, it assigns a cluster label to that point and then continues to add all its neighbors, and their neighbors, recursively to the cluster, until all points have been classified
Finally, the basic algorithm lends itself to elegant recursive implementations, but keep in mind that the recursion will not unwind until the current cluster is complete. This means that, in the worst case (a single connected cluster), you will end up putting the entire data set onto the stack!
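A compact sketch of the algorithm, using an explicit queue instead of recursion precisely to avoid the stack problem mentioned above (illustrative; a production run of DBSCAN would use an indexed neighbor search):

    def dbscan(points, r, n, dist):
        # returns one label per point: a cluster id, or -1 for noise
        UNSEEN = None
        labels = [UNSEEN] * len(points)

        def neighbors(i):
            return [j for j in range(len(points))
                    if j != i and dist(points[i], points[j]) <= r]

        cluster_id = -1
        for i in range(len(points)):
            if labels[i] is not UNSEEN:
                continue
            nb = neighbors(i)
            if len(nb) < n:
                labels[i] = -1              # noise (may become an edge point later)
                continue
            cluster_id += 1
            labels[i] = cluster_id          # a core point starts a new cluster
            queue = list(nb)
            while queue:                    # explicit queue avoids deep recursion
                j = queue.pop()
                if labels[j] == -1:
                    labels[j] = cluster_id  # former noise becomes an edge point
                if labels[j] is not UNSEEN:
                    continue
                labels[j] = cluster_id
                nbj = neighbors(j)
                if len(nbj) >= n:           # j is a core point too: keep growing
                    queue.extend(nbj)
        return labels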
  • 35. http://publicationslist.org/junio Clustering methods – DBSCAN
DBSCAN is sensitive to the choice of parameters
For example, if a data set contains several clusters with widely varying densities, then a single set of parameters may not be sufficient to classify all of the clusters
A possible workaround is to use k-means first to identify cluster candidates, and then to extract statistics that will help parameterize DBSCAN
The computational complexity of DBSCAN is O(n²), which can be ameliorated by indexing structures able to quickly find the neighbors of each point
  • 36. http://publicationslist.org/junio Clustering methods – tree builders
Another way to find clusters is by successively combining clusters that are “close” to each other into a larger cluster until only a single cluster remains; this approach is known as agglomerative hierarchical clustering, and it leads to a treelike hierarchy of clusters
The distance between clusters is defined with respect to representative points within each cluster; the possibilities are:
 Minimum or single link: the two closest points, one from each cluster; handles thinly connected clusters with complicated shapes, but it is sensitive to noise
 Maximum or complete link: considers the points farthest away from each other; favors compact, globular clusters
 Average: considers the average distance between all pairs of points
 Centroid: considers the centroids of each cluster
 Ward’s method: combines the pair of clusters that yields the most coherent merged cluster; coherence can be the average distance of all pairs, for example
  • 37. http://publicationslist.org/junio Clustering methods – tree builders
The result of hierarchical clustering is not actually a set of clusters; instead, we obtain a treelike structure that contains the individual data points at the leaf nodes – this structure can be represented graphically in a dendrogram
Tree builder algorithms are expensive, on the order of O(n³)
 One outstanding feature of hierarchical clustering is that it does more than produce a flat list of clusters; it also shows their relationships in an explicit way
 Tree builders can benefit from algorithms that are center seekers or neighborhood growers
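Assuming NumPy and SciPy are available, a typical usage sketch looks like this (parameter values are illustrative):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    X = np.random.rand(30, 2)                 # 30 points in the plane
    Z = linkage(X, method="ward")             # or "single", "complete", "average", "centroid"
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
    dendrogram(Z)                             # draws the tree (requires matplotlib)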
  • 38. http://publicationslist.org/junio Pre-processing
The core algorithm for grouping data points into clusters is usually only one part (though the most important one) of the whole strategy
Some data sets may require some cleanup or normalization before they are suitable for clustering: that’s the first topic in this section
For example, look at the two plots below and answer: which one has well-defined clusters?
  • 39. http://publicationslist.org/junio Pre-processing
For example, look at the two plots below and answer: which one has well-defined clusters?
 Well, as a matter of fact, both plots show the same dataset, only drawn with different aspect ratios
 The same applies to datasets whose variables span very different ranges – in such cases, it is necessary to normalize the data
 Problems like these are not observed with correlation-based distances
  • 40-41. http://publicationslist.org/junio Pre-processing
 The simplest normalization can be achieved by: x’ = (x − xmin)/(xmax − xmin)
 Or, if the data is reasonably Gaussian, it is possible to use the Z-score normalization: x’ = (x − xmean)/xstddev; but first, use an interquartile range analysis to get rid of outliers
 Normalization is very sensitive to outliers and to distributions that are too skewed – for these cases, there are many other normalization techniques, for instance (from http://stn.spotfire.com/spotfire_client_help/norm/norm_normalizing_columns.htm): normalization by mean, by trimmed mean, by percentile, scale between 0 and 1, subtract the mean, subtract the median, by signed ratio, by log ratio, by log ratio in standard deviation units, Z-score calculation, and normalization by standard deviation
 Also, the Mahalanobis distance is less susceptible to normalization issues
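Both normalizations in a few lines of Python (a sketch; it assumes the values are not all equal and that outliers have already been handled):

    def minmax(xs):
        # scale to [0, 1]
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) for x in xs]

    def zscore(xs):
        # center on the mean, scale by the standard deviation
        m = sum(xs) / len(xs)
        s = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
        return [(x - m) / s for x in xs]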
  • 42. http://publicationslist.org/junio Post-processing (cluster evaluation)
 It is also necessary to inspect the results of every clustering algorithm in order to validate and characterize the clusters that have been found
 Given a set of clusters whose centroids are known, we can think of two metrics:
 Mass: the number of points in the cluster
 Radius: the standard deviation of the distances of all points to the center of a given cluster; for two dimensions, with (xc, yc) the center of a cluster of mass m, we would have: r² = (1/m) Σi [(xc − xi)² + (yc − yi)²]
 We can also define the density of a cluster: density = mass/radius
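These three quantities can be computed directly from the points of a cluster; a sketch for the two-dimensional case (names are mine):

    def cluster_stats(cluster):
        # cluster: non-empty list of (x, y) points
        m = len(cluster)
        cx = sum(x for x, _ in cluster) / m
        cy = sum(y for _, y in cluster) / m
        r = (sum((cx - x) ** 2 + (cy - y) ** 2 for x, y in cluster) / m) ** 0.5
        return {"mass": m, "radius": r,
                "density": m / r if r else float("inf")}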
  • 43. http://publicationslist.org/junio Post-processing (cluster evaluation)
 Besides density, there are:
 Cohesion: the average distance between all points in a cluster; the smaller, the more compact
 Separation: the average distance between all points in one cluster and all the points in another cluster – if we know the centroids, we can use them to simplify the calculation
 For a set of clusters, we can calculate the average cohesion and separation over all clusters and get an idea of the overall quality
 If a data set can be clearly grouped into clusters, then we expect the distance between the clusters to be large compared to the radii of the clusters; therefore, we can think of an interesting metric based on cohesion and separation: cluster_quality = separation/cohesion
  • 44-45. http://publicationslist.org/junio Post-processing (cluster evaluation)
 One of the most used metrics for clustering is the silhouette coefficient, which for a single point i is given by: Si = (bi − ai) / max(ai, bi), where ai is the average distance from point i to all other points in its cluster (this is point i’s cohesion) and bi is the smallest average distance from point i to all the points in each of the other clusters (this is point i’s separation from the closest other cluster)
 The numerator is a measure of the “empty space” between clusters; the denominator is the larger of the two quantities
 Next, average the silhouette over all points in each cluster – this is the cluster’s silhouette; average it over all clusters – this is the clustering’s silhouette
 The silhouette coefficient ranges from −1 to 1; negative values indicate that the cluster radius is greater than the distance between clusters, so that clusters overlap – this suggests poor clustering. Large values of S suggest good clustering
The silhouette can also be used to toss background points out of the clustering process, that is, points that clearly exceed the average cohesion within a given cluster. This process can be used iteratively – once some points are tossed out, the clustering can be repeated and will hopefully produce better results; and so on.
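A direct transcription of the formula for a single point (a sketch; it assumes every cluster has at least two points):

    def silhouette(i, labels, points, dist):
        # group all other points by their cluster label
        by_cluster = {}
        for j, lab in enumerate(labels):
            if j != i:
                by_cluster.setdefault(lab, []).append(points[j])

        def avg(pts):
            return sum(dist(points[i], q) for q in pts) / len(pts)

        a = avg(by_cluster[labels[i]])  # cohesion of point i
        b = min(avg(pts) for lab, pts in by_cluster.items()
                if lab != labels[i])    # separation from the closest cluster
        return (b - a) / max(a, b)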
  • 46-47. http://publicationslist.org/junio Post-processing (cluster evaluation)
 The clustering silhouette is very important: it not only tells us the quality of a clustering, it can also tell us what the correct clustering is; for example, consider the following dataset:
Clearly we have clusters, but how many? Visually, we can make out from 6 to 8 clusters, depending on the observer. What to do?
  • 48-49. http://publicationslist.org/junio Post-processing (cluster evaluation)
 One way to solve this problem is to use the k-means algorithm and calculate the silhouette for different numbers of clusters
 In our example, we would get a curve with peaks at 6 and 7 clusters
The plot indicates that 6 or 7 clusters are acceptable answers; the next stage is to consider the data characteristics in order to decide which is the best answer.
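Assuming scikit-learn is available, the whole procedure fits in a short loop (a sketch; the range of k values is illustrative):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def silhouette_by_k(X, k_values=range(2, 11)):
        # run k-means for each candidate k and score the resulting labeling
        scores = {}
        for k in k_values:
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
            scores[k] = silhouette_score(X, labels)
        return scores  # plot these values and look for the peak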
  • 50. http://publicationslist.org/junio Warning
 Just like any other analytical technique, clustering can waste your time if not used with caution; some points deserve attention:
 Most algorithms depend on heuristic parameters, and finding the most appropriate values may demand hours
 The algorithms also lend themselves to modifications that, although they may sound intuitively right, take you nowhere
 It is quite possible that, although you are looking for them, the data has no clusters at all; this circumstance is not improbable, because clustering algorithms are usually treated as black boxes – be circumspect, and pay attention to the evidence!
 Despite the existence of evaluation methods and visualization tools, the clustering result may still be flawed; remember, there is no formal theory behind the concept of a cluster
 Finally, this review is mostly addressed to practitioners rather than academics; for the latter, there are many other aspects that must be considered – for more details, please check Pavel Berkhin’s paper “Survey of Clustering Data Mining Techniques”, among other sources
  • 51. http://publicationslist.org/junio References
 Philipp K. Janert, Data Analysis with Open Source Tools, O’Reilly, 2010
 Wikipedia, http://en.wikipedia.org
 Wolfram MathWorld, http://mathworld.wolfram.com/