machine learning - Clustering in R
Clustering Agenda
• Definition of Clustering
• Existing clustering methods
• Clustering examples
• Clustering demonstration
• Clustering validity
Definition
• Clustering can be considered the most important unsupervised learning
technique; like every other problem of this kind, it deals with finding
structure in a collection of unlabeled data.
• Unsupervised: no information is provided to the algorithm on which data
points belong to which clusters.
• Clustering is “the process of organizing objects into groups whose
members are similar in some way”.
• A cluster is therefore a collection of objects which are “similar” between
them and are “dissimilar” to the objects belonging to other clusters.
What Cluster Analysis is not
• Supervised classification
• Have class label information
• Simple segmentation
• Dividing students into different registration groups alphabetically, by last name
• Results of a query
• Groupings are a result of an external specification
Why and Where to use Clustering?
Why?
• Simplifications
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process
Where?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnostics
Applications
• Retail – Group similar customers
• Biology – Group similar plants/animals to study their common behavior
• Financial services – Group similar types of accounts or customers
• Airline – Group similar types of customers to offer different discounts
• Insurance – Group consumers and claims of a similar nature to inform policy decisions
• Government – Group similar areas to announce various subsidies or other benefits
Which method to use?
It depends on the following:
• Type of attributes in data
• Dictates type of similarity
• Scalability to larger dataset
• Ability to work with irregular data
• Time cost
• Complexity
• Data order dependency
• Result presentation
Major existing clustering algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering
K-means Clustering
• Partition clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
K-means Clustering – Details
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, correlation, etc.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points
change clusters’
• Complexity is O( n * K * I * d )
• n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
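The "basic algorithm" the slides refer to fits in a few lines. The sketch below is illustrative Python (the deck's own code is R), assuming Euclidean distance and initial centroids drawn randomly from the data:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: random initial centroids, Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initial centroids chosen randomly
    for _ in range(iters):                     # I iterations (often converges early)
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to the closest centroid
            d = [math.dist(p, c) for c in centroids]
            clusters[d.index(min(d))].append(p)
        new_centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)    # centroid = mean of points in the cluster
        ]
        if new_centroids == centroids:         # stop once assignments stabilise
            break
        centroids = new_centroids
    return centroids, clusters
```

Each iteration costs O(n · K · d) distance computations, which is where the O(n * K * I * d) complexity above comes from.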
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
• For each point, the error is the distance to the nearest cluster
• To get SSE, we square these errors and sum them.
• x is a data point in cluster Ci and mi is the representative point for cluster Ci
• One can show that mi corresponds to the center (mean) of the cluster
• Given two clusterings, we can choose the one with the smallest error
• One easy way to reduce SSE is to increase K, the number of clusters
• A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
 

SSE = Σi=1..K Σx∈Ci dist(mi, x)²
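The SSE defined above can be computed directly from the cluster assignments; a small illustrative Python helper (the function name is my own, not from the deck):

```python
import math

def sse(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(math.dist(p, m) ** 2
               for cluster, m in zip(clusters, centroids)
               for p in cluster)
```

Given two clusterings with the same K, the one with the lower SSE is preferred; remember that simply raising K also lowers SSE, so SSE alone cannot choose K.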
Limitations of K-means
• K-means has problems when clusters are of differing
• Sizes
• Densities
• Non-globular shapes
• K-means has problems when the data contains outliers.
K-means Clustering Algorithm
1. The K-Means algorithm calculates the Arithmetic Mean of each cluster formed in the
dataset.
a) The Arithmetic Mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters, there is only one record.
b) The Arithmetic Mean of a cluster with one record is the set of values that make up that record.
c) For example, suppose the dataset S we are discussing holds, for each USER, the Avg Txn Amount, the Merchant Categories Transacted, and the Age with Citrus, so that a record P in S is represented as P = {Avg TxnAmt, Mer_Cat_Cnt, Age_Citrus}.
d) Then a record containing the measurements of a User (9898084242), would be
represented as 9898084242 = {2000, 3, 6} where 9898084242’s Txn Amount =
2000 rs, Mer Categories = 3 and Age with Citrus = 6 Months.
e) Since there is only one record in each initial cluster, the Arithmetic Mean of a cluster with only the record for 9898084242 as a member = {2000, 3, 6}.
2. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each
record is assigned to the nearest cluster (the cluster which it is most similar to) using a
measure of distance or similarity like the Euclidean Distance Measure or
Manhattan/City-Block Distance Measure.
K-means Clustering Algorithm
3. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates
the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is
the arithmetic mean of all the records in that cluster.
4. For Example, if a cluster contains two records where the record of the set of measurements
for 9898084242 = {2000, 3, 6} and 8652084242 = {1000, 2, 2} then the arithmetic mean
Pmean is represented as Pmean = {Avgmean, Mer Catmean, Agemean}. Avgmean = (2000 + 1000)/2,
Mer Catmean= (3 + 2)/2 and Agemean= (6 + 2)/2. The arithmetic mean of this cluster = {1500,
2.5, 4}. This new arithmetic mean becomes the center of this new cluster. Following the
same procedure, new cluster centers are formed for all the existing clusters.
5. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A
record or data point is assigned to the nearest cluster (the cluster which it is most similar to)
using a measure of distance or similarity
6. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when a new iteration of the K-Means algorithm no longer changes the clusters, i.e., the cluster center or Arithmetic Mean of each cluster is the same as the old cluster center. There are different techniques for determining when stable clusters are formed and the k-means clustering procedure is completed.
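The arithmetic in the steps above can be checked mechanically. This illustrative Python sketch reuses the two example records from the slides (9898084242 = {2000, 3, 6} and 8652084242 = {1000, 2, 2}); the helper names are my own:

```python
import math

def cluster_mean(records):
    """Arithmetic Mean of a cluster, computed attribute by attribute."""
    return tuple(sum(values) / len(records) for values in zip(*records))

def nearest(record, centers, metric=math.dist):
    """Index of the nearest cluster center (Euclidean by default;
    pass a Manhattan metric for City-Block distance)."""
    return min(range(len(centers)), key=lambda i: metric(record, centers[i]))

r1 = (2000, 3, 6)   # Avg TxnAmt, Mer_Cat_Cnt, Age_Citrus for 9898084242
r2 = (1000, 2, 2)   # the same attributes for 8652084242
print(cluster_mean([r1, r2]))   # (1500.0, 2.5, 4.0), matching the slide's {1500, 2.5, 4}

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
print(nearest(r1, [cluster_mean([r1, r2]), (0, 0, 0)], metric=manhattan))  # 0
```

Record r1 is assigned to index 0 because it is far closer to the new cluster mean than to the origin, under either distance measure.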
K-means clustering - demonstration
1) k initial "means" (in this case k=3) are randomly selected from the data set.
2) k clusters are created by associating every observation with the nearest mean.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence is reached.
Steps: K-means Clustering analysis
• It is important to define the problem to be solved beforehand so that the clustering method, variables, and data range can be selected.
• Variable identification
• Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous etc.)
• Conversion of non-numeric variables to numeric form
• Running Descriptive Analysis
• Importing data
• Selecting the variables
• Scaling the variables to common metric
• Deciding on the number of clusters to be created
• Running the analysis
• Interpreting the results
Case : K-Means
• Step 1: Data preparation and Selecting Variables
• Step 2: Scaling data –
ruspini.scaled <- scale(ruspini)
Case : K-Means
• Step 3: Identify Number of Clusters
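No code is shown for this step. A common heuristic is the elbow method: run k-means for a range of K and look for the bend in the within-cluster SSE curve (in R this could plot kmeans(...)$tot.withinss over K; the self-contained sketch below is illustrative Python on synthetic data, not the deck's ruspini case):

```python
import math
import random

def kmeans_sse(points, k, iters=50, seed=1):
    """Run a basic k-means and return the within-cluster SSE of the result."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                      # assign to the nearest centroid
            groups[min(range(k), key=lambda i: math.dist(p, cents[i]))].append(p)
        cents = [tuple(sum(v) / len(g) for v in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return sum(math.dist(p, cents[i]) ** 2
               for i, g in enumerate(groups) for p in g)

# Two well-separated blobs of four points each: the SSE drops sharply from
# K=1 to K=2 and then flattens -- the "elbow" suggests K=2.
data = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (8, 8)]
        for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]
for k in (1, 2, 3):
    print(k, round(kmeans_sse(data, k), 2))
```

The printed curve is what the elbow plot visualises; the same idea underlies kNN distance plots for DBSCAN later in the deck.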
Case : K-Means
• Step 4: K-Means Cluster
km <- kmeans(ruspini.scaled, centers=6, nstart=10)
km
Case : K-Means
• Step 5: Plot Cluster
plot(ruspini.scaled, col=km$cluster)
points(km$centers, pch=3, cex=2) # this adds the centroids
text(km$centers, labels=1:6, pos=2) # this adds the cluster IDs (K=6 here)
Hierarchical Clustering
(Figure: the same points p1–p4 grouped by traditional and non-traditional hierarchical clustering, with the corresponding traditional and non-traditional dendrograms.)
Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as its own cluster (a singleton)
2. Recursively merge the two most appropriate (closest) clusters
3. Stop when k clusters are reached.
Divisive (top down)
1. Start with one big cluster containing all points
2. Recursively divide it into smaller clusters
3. Stop when k clusters are reached.
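The agglomerative steps above can be sketched directly: start with singletons and repeatedly merge the closest pair of clusters. This illustrative Python version uses complete linkage (the same idea as the method="complete" hclust call later in the deck):

```python
import math

def agglomerative(points, k):
    """Bottom-up clustering: merge the closest pair until k clusters remain."""
    clusters = [[p] for p in points]                  # start with singletons

    def linkage(a, b):                                # complete link: farthest pair
        return max(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)                # merge cluster j into cluster i
    return clusters
```

Cutting a dendrogram at k clusters (as rect.hclust does in the R case) corresponds to stopping the merges when k clusters remain.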
Case : Hierarchical Clustering
• Step 1: Get distance between data points
• dist.ruspini <- dist(ruspini.scaled)
Case : Hierarchical Clustering
• Step 2: Create and plot cluster
• hc.ruspini <- hclust(dist.ruspini, method="complete")
• plot(hc.ruspini)
• rect.hclust(hc.ruspini, k=4)
Density Based Clustering : DBSCAN
• DBSCAN is a density-based algorithm.
• Density = number of points within a specified radius (Eps)
• A point is a core point if it has more than a specified number of points
(MinPts) within Eps
• These are points that are at the interior of a cluster
• A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point
• A noise point is any point that is not a core point or a border point.
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
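These two steps, together with the core/border/noise definitions above, translate into a naive O(n²) sketch; illustrative Python, not the (far more efficient) dbscan R package. Note the neighbourhood count here includes the point itself, the usual MinPts convention:

```python
import math

def dbscan(points, eps, min_pts):
    """Naive DBSCAN: find core points, then grow clusters outward from them."""
    n = len(points)
    neigh = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
             for i in range(n)]                       # each point counts as its own neighbour
    core = [i for i in range(n) if len(neigh[i]) >= min_pts]
    labels = [None] * n                               # None = noise (or not yet visited)
    cid = 0
    for i in core:
        if labels[i] is not None:
            continue
        cid += 1                                      # start a new cluster at this core point
        stack = [i]
        while stack:
            p = stack.pop()
            if labels[p] is not None:
                continue
            labels[p] = cid
            if p in core:                             # only core points keep expanding;
                stack.extend(q for q in neigh[p]      # border points get the label but stop
                             if labels[q] is None)
    return labels                                     # None marks the remaining noise points
```

Border points inherit the label of whichever core point reaches them first; everything left unlabeled is noise, which is how the "eliminate noise points" step falls out of the expansion.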
Case: DBSCAN Density based clustering
• Step 1: Use a kNN distance plot to choose the epsilon (eps) value.
library(dbscan)
kNNdistplot(ruspini.scaled, k = 3)
abline(h=.25, col="red")
Case: DBSCAN Density based clustering
• Step 2: Run DBSCAN.
db <- dbscan(ruspini.scaled, eps=.25, minPts=3)
db
Case: DBSCAN Density based clustering
• Step 3: Plot Cluster.
plot(ruspini.scaled, col=db$cluster+1L)
Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good
our model is
• Accuracy, precision, recall
• For cluster analysis, the question is how to evaluate the “goodness” of the
resulting clusters
• Why do we want to evaluate them?
• To avoid finding patterns in noise
• To compare clustering algorithms
• To compare two sets of clusters
• To compare two clusters
Cluster Validation - Different Aspects
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (using only the data itself).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Cluster Validity: Measures
• Different types of numerical measures are applied to judge cluster validity:
• External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy)
• Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., Sum of Squared Error, SSE)
• Relative Index: compares two different clusterings or clusters; often an external or internal index is used for this purpose (e.g., SSE or entropy)
• Sometimes these are referred to as criteria instead of indices; however, ‘criterion’ is sometimes the general strategy and ‘index’ the numerical measure that implements it
Framework for Cluster Validity
• We need a framework to interpret any measure: for example, if our evaluation measure has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity
• The more “atypical” a clustering result is, the more likely it represents valid structure in the data
• We can compare the values of an index obtained from random data or random clusterings to those of the actual clustering result; if the value of the index is unlikely under randomness, the cluster results are valid
• These approaches are more complicated and harder to understand
• For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant
Internal Measures: Cohesion and Separation
• Cluster Cohesion: measures how closely related the objects in a cluster are
• Cohesion is measured by the within-cluster sum of squares (WSS, the SSE):
WSS = Σi Σx∈Ci (x − mi)²
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
• Separation is measured by the between-cluster sum of squares:
BSS = Σi |Ci| (m − mi)²
where |Ci| is the size of cluster i, mi is its centroid, and m is the overall mean.
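The cohesion (WSS) and separation (BSS) measures above can be computed directly, and they satisfy a useful identity: the total sum of squares about the overall mean equals WSS + BSS, so for a fixed data set improving cohesion necessarily improves separation. An illustrative Python sketch (helper names are my own):

```python
import math

def mean(points):
    """Component-wise mean of a set of points."""
    return tuple(sum(v) / len(points) for v in zip(*points))

def wss(clusters):
    """Cohesion: sum over clusters i of sum over x in Ci of |x - mi|^2."""
    return sum(math.dist(x, mean(c)) ** 2 for c in clusters for x in c)

def bss(clusters):
    """Separation: sum over clusters i of |Ci| * |m - mi|^2, m = overall mean."""
    m = mean([x for c in clusters for x in c])
    return sum(len(c) * math.dist(m, mean(c)) ** 2 for c in clusters)
```

For example, with clusters [[(0,0), (0,2)], [(4,0), (4,2)]] the cluster means are (0,1) and (4,1), giving WSS = 4.0, BSS = 16.0, and a total sum of squares of 20.0.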

Weitere ähnliche Inhalte

Was ist angesagt?

Data clustering
Data clustering Data clustering
Data clustering GARIMA SHAKYA
 
Clustering
ClusteringClustering
ClusteringMeme Hei
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmEditor IJCATR
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering AlgorithmLino Possamai
 
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...Edureka!
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A SurveyRaffaele Capaldo
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means ClusteringAnna Fensel
 
K means Clustering
K means ClusteringK means Clustering
K means ClusteringEdureka!
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyijpla
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis Dr Athar Khan
 
Chapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningChapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningHouw Liong The
 
K means clustering
K means clusteringK means clustering
K means clusteringkeshav goyal
 
K mean-clustering
K mean-clusteringK mean-clustering
K mean-clusteringAfzaal Subhani
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsVarad Meru
 

Was ist angesagt? (20)

Data clustering
Data clustering Data clustering
Data clustering
 
Clustering
ClusteringClustering
Clustering
 
08 clustering
08 clustering08 clustering
08 clustering
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
Clustering
ClusteringClustering
Clustering
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering Algorithm
 
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A Survey
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracy
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis
 
Chapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningChapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text mining
 
Kmeans
KmeansKmeans
Kmeans
 
K means clustering
K means clusteringK means clustering
K means clustering
 
K mean-clustering
K mean-clusteringK mean-clustering
K mean-clustering
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 

Ähnlich wie machine learning - Clustering in R

26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017Iwan Sofana
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningPyingkodi Maran
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptxJK970901
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxsandeepsandy494692
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptxNANDHINIS900805
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxnikshaikh786
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSandinoBerutu1
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptImXaib
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdfbintis1
 
K means Clustering - algorithm to cluster n objects
K means Clustering - algorithm to cluster n objectsK means Clustering - algorithm to cluster n objects
K means Clustering - algorithm to cluster n objectsVoidVampire
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)Pravinkumar Landge
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfSowmyaJyothi3
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptionsrefedey275
 
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Maninda Edirisooriya
 

Ähnlich wie machine learning - Clustering in R (20)

26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine Learning
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptx
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
PPT s10-machine vision-s2
PPT s10-machine vision-s2PPT s10-machine vision-s2
PPT s10-machine vision-s2
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdf
 
K means Clustering - algorithm to cluster n objects
K means Clustering - algorithm to cluster n objectsK means Clustering - algorithm to cluster n objects
K means Clustering - algorithm to cluster n objects
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
 
kmean clustering
kmean clusteringkmean clustering
kmean clustering
 

KĂźrzlich hochgeladen

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 

KĂźrzlich hochgeladen (20)

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 

machine learning - Clustering in R

• K-means and its variants
• Hierarchical clustering
• Density-based clustering
K-means Clustering
• Partition clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The number of clusters, K, must be specified
• The basic algorithm is very simple
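As a minimal sketch, R's built-in kmeans() can be run on small synthetic data (invented here for illustration; later slides use the ruspini data set):

```r
set.seed(42)
# Two well-separated synthetic blobs in 2-D (invented data for illustration)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)   # K = 2 must be specified up front
km$centers                                  # one centroid per cluster
table(km$cluster)                           # points per cluster
```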
K-means Clustering – Details
• Initial centroids are often chosen randomly.
• The clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is relaxed to ‘until relatively few points change clusters’.
• Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
Evaluating K-means Clusters
• The most common measure is the Sum of Squared Error (SSE)
• For each point, the error is the distance to the nearest cluster centre
• To get the SSE, we square these errors and sum them:
  SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist(m_i, x)^2
• x is a data point in cluster C_i and m_i is the representative point for cluster C_i
• It can be shown that m_i corresponds to the center (mean) of the cluster
• Given two clusterings, we can choose the one with the smallest error
• One easy way to reduce the SSE is to increase K, the number of clusters
• A good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K
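As a sanity check of this definition, the SSE can be computed by hand and compared with what the built-in kmeans() reports (synthetic data invented for illustration):

```r
# SSE computed from the definition above (kmeans reports the same value as tot.withinss)
sse <- function(x, cluster, centers) {
  sum(sapply(seq_len(nrow(centers)), function(i) {
    pts <- x[cluster == i, , drop = FALSE]
    sum(sweep(pts, 2, centers[i, ])^2)   # squared distances to centroid i
  }))
}
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)         # invented data for illustration
km <- kmeans(x, centers = 2)
all.equal(sse(x, km$cluster, km$centers), km$tot.withinss)  # should agree
```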
Limitations of K-means
• K-means has problems when clusters are of differing
• sizes
• densities
• non-globular shapes
• K-means has problems when the data contains outliers.
K-means Clustering Algorithm
1. The K-Means algorithm calculates the arithmetic mean of each cluster formed in the dataset.
a) The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters, there is only one record.
b) The arithmetic mean of a cluster with one record is the set of values that make up that record.
c) For example, suppose the dataset under discussion holds, for each user, the average transaction amount, the number of merchant categories transacted, and the age with Citrus, so that a record P in the dataset S is represented as P = {Avg_Txn_Amt, Mer_Cat_Cnt, Age_Citrus}.
d) Then a record containing the measurements of a user (9898084242) would be represented as 9898084242 = {2000, 3, 6}, i.e. a transaction amount of Rs 2000, 3 merchant categories, and 6 months with Citrus.
e) Since there is only one record in each initial cluster, the arithmetic mean of a cluster whose only member is the record for 9898084242 is {2000, 3, 6}.
2. Next, K-Means assigns each record in the dataset to exactly one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a distance or similarity measure such as the Euclidean or Manhattan/city-block distance.
K-means Clustering Algorithm
3. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster.
4. For example, if a cluster contains two records, 9898084242 = {2000, 3, 6} and 8652084242 = {1000, 2, 2}, then the arithmetic mean is P_mean = {Avg_mean, Mer_Cat_mean, Age_mean}, where Avg_mean = (2000 + 1000)/2, Mer_Cat_mean = (3 + 2)/2 and Age_mean = (6 + 2)/2. The arithmetic mean of this cluster is therefore {1500, 2.5, 4}. This new arithmetic mean becomes the center of the new cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.
5. K-Means re-assigns each record in the dataset to exactly one of the new clusters. A record is assigned to the nearest cluster (the cluster it is most similar to) using a distance or similarity measure.
6. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is complete. Stable clusters are formed when a new iteration of the algorithm creates no new clusters, i.e. the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when stable clusters have formed and the procedure is complete.
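The centroid update in step 4 can be verified directly in R, using the two example records above:

```r
# Two records from the worked example: {Avg_Txn_Amt, Mer_Cat_Cnt, Age_Citrus}
cluster <- rbind(`9898084242` = c(2000, 3, 6),
                 `8652084242` = c(1000, 2, 2))
colMeans(cluster)   # the new cluster center: 1500 2.5 4
```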
K-means clustering – demonstration
1. k initial "means" (in this case k = 3) are randomly selected from the data set.
2. k clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
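The four steps above can be sketched as a minimal from-scratch implementation in base R (illustrative only; it does not handle empty clusters or multiple restarts):

```r
# Minimal k-means following steps 1-4 above (sketch, not production code)
kmeans_sketch <- function(x, k, iter_max = 100) {
  set.seed(7)                                        # reproducible random start
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # step 1: k random points as means
  cluster <- integer(nrow(x))
  for (it in seq_len(iter_max)) {
    # step 2: associate every observation with the nearest mean
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- max.col(-d)                           # index of the minimum distance
    # step 3: the centroid of each cluster becomes the new mean
    new_centers <- t(sapply(seq_len(k), function(i)
      colMeans(x[cluster == i, , drop = FALSE])))
    # step 4: repeat until convergence (means stop moving)
    if (all(abs(new_centers - centers) < 1e-9)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}
```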
Steps: K-means clustering analysis
• It is important to define the problem to be solved beforehand, so that the clustering method, variables and data range can be selected.
• Variable identification
• Variable categorization (e.g. numeric, categorical, discrete, continuous)
• Conversion of non-numeric variables to numeric form
• Running descriptive analysis
• Importing data
• Selecting the variables
• Scaling the variables to a common metric
• Deciding on the number of clusters to be created
• Running the analysis
• Interpreting the results
Case: K-Means
• Step 1: Data preparation and selecting variables
• Step 2: Scaling the data
ruspini.scaled <- scale(ruspini)
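Putting steps 1–2 together: the ruspini data set used in this case ships with R's cluster package, so a minimal load-and-scale sketch looks like this:

```r
library(cluster)                 # provides the ruspini data set (75 points, 2 variables)
data(ruspini)
ruspini.scaled <- scale(ruspini) # centre each variable to mean 0, sd 1
summary(ruspini.scaled)
```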
Case: K-Means
• Step 3: Identify the number of clusters
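One common way to identify the number of clusters is the "elbow" method: compute the total within-cluster SSE for a range of K and look for the bend in the curve. A sketch (assuming the ruspini data from the cluster package, as above):

```r
library(cluster)
data(ruspini)
ruspini.scaled <- scale(ruspini)
set.seed(42)
wss <- sapply(1:10, function(k)
  kmeans(ruspini.scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster SSE")   # look for the 'elbow' in this curve
```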
Case: K-Means
• Step 4: Run K-Means clustering
km <- kmeans(ruspini.scaled, centers = 6, nstart = 10)
km
Case: K-Means
• Step 5: Plot the clusters
plot(ruspini.scaled, col = km$cluster)
points(km$centers, pch = 3, cex = 2)                    # this adds the centroids
text(km$centers, labels = 1:nrow(km$centers), pos = 2)  # this adds the cluster IDs
Hierarchical Clustering
(figure: example clusterings of points p1–p4, contrasting traditional and non-traditional hierarchical clusterings with their corresponding traditional and non-traditional dendrograms)
Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as its own cluster (a singleton).
2. Recursively merge the two (or more) most appropriate clusters.
3. Stop when k clusters have been formed.
Divisive (top down)
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters have been formed.
Case: Hierarchical Clustering
• Step 1: Compute the distances between data points
dist.ruspini <- dist(ruspini.scaled)
Case: Hierarchical Clustering
• Step 2: Create and plot the clusters
hc.ruspini <- hclust(dist.ruspini, method = "complete")
plot(hc.ruspini)
rect.hclust(hc.ruspini, k = 4)
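Put together, and extended with cutree() to extract the cluster labels from the dendrogram (a sketch assuming the ruspini data from the cluster package):

```r
library(cluster)
data(ruspini)
ruspini.scaled <- scale(ruspini)
dist.ruspini <- dist(ruspini.scaled)              # Euclidean distances by default
hc.ruspini <- hclust(dist.ruspini, method = "complete")
plot(hc.ruspini)                                  # draw the dendrogram
rect.hclust(hc.ruspini, k = 4)                    # outline the 4-cluster cut
clusters <- cutree(hc.ruspini, k = 4)             # cluster label per point
table(clusters)                                   # cluster sizes
```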
Density-Based Clustering: DBSCAN
• DBSCAN is a density-based algorithm.
• Density = number of points within a specified radius (Eps)
• A point is a core point if it has more than a specified number of points (MinPts) within Eps
• These are points at the interior of a cluster
• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
• A noise point is any point that is neither a core point nor a border point.
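The three definitions can be checked by hand in base R by counting neighbours within Eps (invented toy data; here a point counts as core if its Eps-neighbourhood, including the point itself, holds at least MinPts points, a common convention):

```r
# Classify points as core / border / noise from the DBSCAN definitions above
set.seed(3)
x <- rbind(matrix(rnorm(20, 0, 0.2), ncol = 2),   # a tight blob of 10 points
           c(5, 5))                               # plus one far-away outlier
eps <- 0.5; minPts <- 3
d <- as.matrix(dist(x))
n_neighbours <- rowSums(d <= eps)                 # neighbour counts (self included)
core   <- n_neighbours >= minPts                  # dense interior points
border <- !core & apply(d <= eps, 1, function(nb) any(nb & core))
noise  <- !core & !border                         # neither core nor border
```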
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
Case: DBSCAN density-based clustering
• Step 1: Use the kNN distance plot to choose an epsilon (Eps) value.
library(dbscan)
kNNdistplot(ruspini.scaled, k = 3)
abline(h = 0.25, col = "red")
Case: DBSCAN density-based clustering
• Step 2: Run DBSCAN.
db <- dbscan(ruspini.scaled, eps = 0.25, minPts = 3)
db
Case: DBSCAN density-based clustering
• Step 3: Plot the clusters.
plot(ruspini.scaled, col = db$cluster + 1L)   # +1L so noise points (cluster 0) get a colour
Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
• Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
• Why do we want to evaluate them?
• To avoid finding patterns in noise
• To compare clustering algorithms
• To compare two sets of clusters
• To compare two clusters
Cluster Validation – Different Aspects
1. Determining the clustering tendency of a set of data, i.e. distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g. to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (using only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3 and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Cluster Validity: Measures
• Below are the different types of numerical measures applied to judge cluster validity:
• External index: measures the extent to which cluster labels match externally supplied class labels.
• Example: entropy
• Internal index: measures the goodness of a clustering structure without respect to external information.
• Example: Sum of Squared Error (SSE)
• Relative index: compares two different clusterings or clusters.
• Often an external or internal index is used for this function, e.g. SSE or entropy
• These are sometimes referred to as criteria instead of indices
• However, sometimes 'criterion' is the general strategy and 'index' is the numerical measure that implements the criterion.
Framework for Cluster Validity
• We need a framework to interpret any measure.
• For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity.
• The more "atypical" a clustering result is, the more likely it represents valid structure in the data.
• We can compare the values of an index obtained from random data or random clusterings to those of a clustering result.
• If the value of the index is unlikely, then the cluster results are valid.
• These approaches are more complicated and harder to understand.
• For comparing the results of two different sets of cluster analyses, a framework is less necessary.
• However, there is still the question of whether the difference between two index values is significant.
Internal Measures: Cohesion and Separation
• Cluster cohesion: measures how closely related the objects in a cluster are
• Example: SSE
• Cluster separation: measures how distinct or well-separated a cluster is from the other clusters
• Example: squared error
• Cohesion is measured by the within-cluster sum of squares (SSE):
  WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)^2
• Separation is measured by the between-cluster sum of squares:
  BSS = Σ_i |C_i| (m − m_i)^2
• where |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the overall mean of the data
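Both quantities can be computed directly in R; a useful sanity check is that WSS + BSS equals the total sum of squares of the data (synthetic data invented for illustration):

```r
set.seed(9)
x <- matrix(rnorm(60), ncol = 2)                   # invented data for illustration
km <- kmeans(x, centers = 3, nstart = 10)
m <- colMeans(x)                                   # overall mean of the data
# WSS: squared deviations of each point from its own cluster mean
wss <- sum(sapply(1:3, function(i) {
  pts <- x[km$cluster == i, , drop = FALSE]
  sum(sweep(pts, 2, colMeans(pts))^2)
}))
# BSS: cluster sizes times squared deviations of cluster means from the overall mean
bss <- sum(sapply(1:3, function(i)
  sum(km$cluster == i) * sum((m - km$centers[i, ])^2)))
tss <- sum(sweep(x, 2, m)^2)                       # total sum of squares
all.equal(wss + bss, tss)                          # the decomposition holds
```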