Clustering
Binoy B Nair
What is Clustering?
• Cluster: a collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• Cluster analysis
• Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• Urban planning: Identifying groups of houses according to their house type, value, and
geographical location
• Seismology: Observed earthquake epicenters should be clustered along continental faults
What Is a Good Clustering?
• A good clustering method will produce clusters with
• High intra-class similarity
• Low inter-class similarity
• Precise definition of clustering quality is difficult
• Application-dependent
• Ultimately subjective
Requirements for Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal domain knowledge required to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to order of input records
• Robustness with respect to high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate them by some criterion
• Hierarchical: Create a hierarchical decomposition of the set of objects using some
criterion
• Model-based: Hypothesize a model for each cluster and find best fit of models to data
• Density-based: Guided by connectivity and density functions
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of n objects into a set of k
clusters
• Given k, find a partition into k clusters that optimizes the chosen partitioning
criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen, 1967): Each cluster is represented by the center of the
cluster
• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987):
Each cluster is represented by one of the objects in the cluster
Partitional Clustering
• Each instance is placed in exactly one of K nonoverlapping
clusters.
• Since only one set of clusters is output, the user normally has
to input the desired number of clusters K.
Similarity and Dissimilarity Between Objects
• Euclidean distance (p = 2):

  d(i, j) = √( |x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|² )

• Properties of a metric d(i, j):
• d(i, j) ≥ 0
• d(i, i) = 0
• d(i, j) = d(j, i)
• d(i, j) ≤ d(i, k) + d(k, j)
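As a quick illustration, the Euclidean distance above can be computed directly in Python (a minimal sketch; the function name and the test points are ours, not from the slides — the points are two of the subjects used in the worked example later in the deck):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Distance between subjects 1 and 2 of the worked example below
print(round(euclidean((1, 1), (1.5, 2)), 2))   # → 1.12
```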
Squared Error
Objective Function
[Scatter plot illustrating the squared-error objective function omitted.]
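The objective function here is the within-cluster sum of squared errors. The original equation did not survive extraction, so the standard form is reconstructed here:

```latex
E = \sum_{i=1}^{k} \; \sum_{x \in C_i} \left\lVert x - m_i \right\rVert^{2}
```

where m_i denotes the centroid (mean) of cluster C_i. K-means seeks a partition that minimizes E.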
Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the
nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found
above are correct.
5. If none of the N objects changed membership in the last iteration, exit.
Otherwise, go to step 3.
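The five steps above can be sketched in Python (a minimal illustration under the slides' own stopping rule; the function and variable names are ours, and we assume no cluster ever becomes empty, which holds for the data used here — the seven subjects of the worked example that follows):

```python
import math

def kmeans(points, centers, max_iters=100):
    """Plain k-means following the five steps above (Lloyd's algorithm)."""
    for _ in range(max_iters):
        # Step 3: assign each object to the nearest cluster center
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Step 4: re-estimate each center as the mean of its members
        # (assumes no cluster becomes empty, which holds for this data)
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
        # Step 5: exit when no membership (and hence no center) changed
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# The seven subjects from the worked example, clustered with k = 2
data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
labels, centers = kmeans(data, centers=[(1, 1), (5, 7)])
print(labels)   # → [0, 0, 1, 1, 1, 1, 1]
```

Starting from centers (1, 1) and (5, 7), this reproduces the final partition derived step by step in the worked example: subjects {1, 2} in one cluster and {3, 4, 5, 6, 7} in the other.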
K-means Clustering: Steps 1-5
Algorithm: k-means, Distance Metric: Euclidean Distance
[Five scatter plots omitted (axes: expression in condition 1 vs. expression in condition 2, each 0-5): the three cluster centers k1, k2, k3 are initialised, points are assigned to the nearest center, and the centers move at each step until the assignment stabilises.]
Worked-out Example
As a simple illustration of the k-means algorithm, consider the following data set
consisting of the scores of two variables on each of seven individuals:

Subject | Features
1 | (1, 1)
2 | (1.5, 2)
3 | (3, 4)
4 | (5, 7)
5 | (3.5, 5)
6 | (4.5, 5)
7 | (3.5, 4.5)

[Scatter plot of the seven subjects omitted.]
Working
This data set is to be grouped into two clusters, i.e. k = 2. As a first step in finding a
sensible initial partition, let the feature values of the two individuals furthest apart
(using the Euclidean distance measure) define the initial cluster means:

Group | Individual | Mean Vector (centroid)
Group 1 | 1 | (1, 1)
Group 2 | 4 | (5, 7)
Working
• The remaining individuals are now examined in sequence and
allocated to the cluster to which they are closest, in terms of
Euclidean distance to the cluster mean.
• The mean vector is recalculated each time a new member is added.
• This leads to the following series of steps:
Iteration 1: Assign Objects to Closest Clusters
A cluster is defined by its centroid.

Object (Oi) | Features | Centroid 1 (C1) | D(Oi, C1) | Centroid 2 (C2) | D(Oi, C2) | Closest Centroid
1 | (1, 1) | (1, 1) | | (5, 7) | |
2 | (1.5, 2) | (1, 1) | | (5, 7) | |
3 | (3, 4) | (1, 1) | | (5, 7) | |
4 | (5, 7) | (1, 1) | | (5, 7) | |
5 | (3.5, 5) | (1, 1) | | (5, 7) | |
6 | (4.5, 5) | (1, 1) | | (5, 7) | |
7 | (3.5, 4.5) | (1, 1) | | (5, 7) | |
Iteration 1: Assign Objects to Closest Clusters (distances computed)

Object (Oi) | Features | Centroid 1 (C1) | D(Oi, C1) | Centroid 2 (C2) | D(Oi, C2)
1 | (1, 1) | (1, 1) | 0 | (5, 7) | 7.21
2 | (1.5, 2) | (1, 1) | 1.12 | (5, 7) | 6.10
3 | (3, 4) | (1, 1) | 3.61 | (5, 7) | 3.61
4 | (5, 7) | (1, 1) | 7.21 | (5, 7) | 0
5 | (3.5, 5) | (1, 1) | 4.72 | (5, 7) | 2.50
6 | (4.5, 5) | (1, 1) | 5.32 | (5, 7) | 2.06
7 | (3.5, 4.5) | (1, 1) | 4.30 | (5, 7) | 2.92

For example, D(O2, C1) = √((1 − 1.5)² + (1 − 2)²) = √1.25 ≈ 1.12, and so on.
Note that Object 3 is equidistant from the two centroids: √13 ≈ 3.61 to each.
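The distance columns can be regenerated in a few lines of Python (our own sketch, not from the slides; recomputing confirms that D(O3, C1) = √13 ≈ 3.61, not 3.05, so Object 3 is in fact equidistant from the two initial centroids):

```python
import math

# The seven subjects and the two initial centroids (subjects 1 and 4)
objects = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
c1, c2 = (1, 1), (5, 7)

for i, o in enumerate(objects, start=1):
    d1, d2 = math.dist(o, c1), math.dist(o, c2)
    closest = "C1" if d1 <= d2 else "C2"   # ties broken in favour of C1
    print(f"Object {i}: D(O{i},C1) = {d1:.2f}, D(O{i},C2) = {d2:.2f} -> {closest}")
```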
Iteration 1: Assign Objects to Closest Clusters (assignments)

Object (Oi) | Features | Centroid 1 (C1) | D(Oi, C1) | Centroid 2 (C2) | D(Oi, C2) | Closest Centroid
1 | (1, 1) | (1, 1) | 0 | (5, 7) | 7.21 | C1
2 | (1.5, 2) | (1, 1) | 1.12 | (5, 7) | 6.10 | C1
3 | (3, 4) | (1, 1) | 3.61 | (5, 7) | 3.61 | C1
4 | (5, 7) | (1, 1) | 7.21 | (5, 7) | 0 | C2
5 | (3.5, 5) | (1, 1) | 4.72 | (5, 7) | 2.50 | C2
6 | (4.5, 5) | (1, 1) | 5.32 | (5, 7) | 2.06 | C2
7 | (3.5, 4.5) | (1, 1) | 4.30 | (5, 7) | 2.92 | C2

Object 1 is assigned to cluster 1, and so on. Object 3 is equidistant from the two
centroids; the tie is broken in favour of C1.
Recomputing Centroids at the End of Iteration 1
The initial partition has changed; the two clusters at this stage have the following
characteristics:

Cluster | Individuals | New Centroid
Cluster 1 | 1, 2, 3 | C1 = ((1,1) + (1.5,2) + (3,4)) / 3 = (1.8, 2.3)
Cluster 2 | 4, 5, 6, 7 | C2 = ((5,7) + (3.5,5) + (4.5,5) + (3.5,4.5)) / 4 = (4.1, 5.4)
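The centroid update can be checked with a few lines of Python (our own sketch, not from the slides; the helper name is ours):

```python
def centroid(points):
    """Mean vector of a set of 2-D points, rounded to one decimal place."""
    xs, ys = zip(*points)
    return (round(sum(xs) / len(points), 1), round(sum(ys) / len(points), 1))

new_c1 = centroid([(1, 1), (1.5, 2), (3, 4)])                 # members of cluster 1
new_c2 = centroid([(5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)])   # members of cluster 2
print(new_c1, new_c2)   # → (1.8, 2.3) (4.1, 5.4)
```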
Iteration 2: Check if Any Object Has Changed Clusters

Object (Oi) | Features | Centroid 1 (C1) | D(Oi, C1) | Centroid 2 (C2) | D(Oi, C2) | Closest Centroid
1 | (1, 1) | (1.8, 2.3) | | (4.1, 5.4) | |
2 | (1.5, 2) | (1.8, 2.3) | | (4.1, 5.4) | |
3 | (3, 4) | (1.8, 2.3) | | (4.1, 5.4) | |
4 | (5, 7) | (1.8, 2.3) | | (4.1, 5.4) | |
5 | (3.5, 5) | (1.8, 2.3) | | (4.1, 5.4) | |
6 | (4.5, 5) | (1.8, 2.3) | | (4.1, 5.4) | |
7 | (3.5, 4.5) | (1.8, 2.3) | | (4.1, 5.4) | |
Iteration 2: Check if Any Object Has Changed Clusters (distances computed)

Object (Oi) | Features | Centroid 1 (C1) | D(Oi, C1) | Centroid 2 (C2) | D(Oi, C2) | Closest Centroid
1 | (1, 1) | (1.8, 2.3) | 1.53 | (4.1, 5.4) | 5.38 | C1
2 | (1.5, 2) | (1.8, 2.3) | 0.42 | (4.1, 5.4) | 4.28 | C1
3 | (3, 4) | (1.8, 2.3) | 2.08 | (4.1, 5.4) | 1.78 | C2
4 | (5, 7) | (1.8, 2.3) | 5.69 | (4.1, 5.4) | 1.84 | C2
5 | (3.5, 5) | (1.8, 2.3) | 3.19 | (4.1, 5.4) | 0.72 | C2
6 | (4.5, 5) | (1.8, 2.3) | 3.82 | (4.1, 5.4) | 0.57 | C2
7 | (3.5, 4.5) | (1.8, 2.3) | 2.78 | (4.1, 5.4) | 1.08 | C2

Object 3 has changed from cluster 1 to cluster 2.
Recomputing Centroids at the End of Iteration 2
The partition has changed again, with Object 3 relocated to cluster 2; the two clusters
at this stage have the following characteristics:

Cluster | Individuals | New Centroid
Cluster 1 | 1, 2 | C1 = ((1,1) + (1.5,2)) / 2 = (1.3, 1.5)
Cluster 2 | 3, 4, 5, 6, 7 | C2 = ((3,4) + (5,7) + (3.5,5) + (4.5,5) + (3.5,4.5)) / 5 = (3.9, 5.1)
Iteration 3: Check if Any Object Has Changed Clusters

Object (Oi) | Features | Centroid 1 (C1) | D(Oi, C1) | Centroid 2 (C2) | D(Oi, C2) | Closest Centroid
1 | (1, 1) | (1.3, 1.5) | 0.58 | (3.9, 5.1) | 5.02 | C1
2 | (1.5, 2) | (1.3, 1.5) | 0.54 | (3.9, 5.1) | 3.92 | C1
3 | (3, 4) | (1.3, 1.5) | 3.02 | (3.9, 5.1) | 1.42 | C2
4 | (5, 7) | (1.3, 1.5) | 6.63 | (3.9, 5.1) | 2.19 | C2
5 | (3.5, 5) | (1.3, 1.5) | 4.13 | (3.9, 5.1) | 0.41 | C2
6 | (4.5, 5) | (1.3, 1.5) | 4.74 | (3.9, 5.1) | 0.61 | C2
7 | (3.5, 4.5) | (1.3, 1.5) | 3.72 | (3.9, 5.1) | 0.72 | C2

No change in clusters compared to the previous iteration.
Conclusion
• Each individual is now nearer its own cluster mean than the other cluster's, so the
iteration stops and the latest partition is taken as the final cluster solution.
• Hence Objects {1, 2} belong to the first cluster and Objects {3, 4, 5, 6, 7} belong to
the second cluster.
Notes
• The iterative relocation continues until no more relocations occur.
• In this example the no-relocation condition was satisfied after 3 iterations, but this
is not usually the case: depending on the data set, convergence might require
hundreds of iterations.
• It is also possible for the algorithm to keep cycling without settling on a final
solution, for example when ties in distance cause objects to oscillate between clusters.
• In such cases it is a good idea to stop the algorithm after a pre-chosen maximum
number of iterations.
Comments on the K-Means Method
• Strength
• Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may be found using
techniques such as deterministic annealing and genetic algorithms
• Weakness
• Applicable only when a mean is defined, which is a problem for categorical data
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
Summary
• The K-means algorithm is a simple yet popular method for clustering analysis
• Its performance is determined by the initialisation and the choice of distance
measure
• There are several variants of K-means to overcome its weaknesses
• K-Medoids: resistance to noise and/or outliers
• K-Modes: extension to categorical data clustering analysis
• CLARA: extension to deal with large data sets
• Mixture models (EM algorithm): handling uncertainty of clusters
Online tutorial: the K-means function in Matlab
https://www.youtube.com/watch?v=aYzjenNNOcc
References
• http://mnemstudio.org/clustering-k-means-example-1.htm
• Ke Chen, K-means Clustering ,COMP24111-Machine Learning,
University of Manchester, 2016.
• {Insert Reference}/ 10.ppt
• {Insert Reference}/MachinLearning3.ppt
Weitere ähnliche Inhalte

Was ist angesagt?

Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 
17. Java data structures trees representation and traversal
17. Java data structures trees representation and traversal17. Java data structures trees representation and traversal
17. Java data structures trees representation and traversalIntro C# Book
 
Independent component analysis
Independent component analysisIndependent component analysis
Independent component analysisVanessa S
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clusteringishmecse13
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingFraboni Ec
 
Advanced data structures vol. 1
Advanced data structures   vol. 1Advanced data structures   vol. 1
Advanced data structures vol. 1Christalin Nelson
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3Xueping Peng
 
MACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMMACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMPuneet Kulyana
 
k Nearest Neighbor
k Nearest Neighbork Nearest Neighbor
k Nearest Neighborbutest
 
Scalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithmScalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithmNavid Sedighpour
 
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Praxitelis Nikolaos Kouroupetroglou
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with RYanchang Zhao
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisShah Zaib
 
Introduction to random forest and gradient boosting methods a lecture
Introduction to random forest and gradient boosting methods   a lectureIntroduction to random forest and gradient boosting methods   a lecture
Introduction to random forest and gradient boosting methods a lectureShreyas S K
 

Was ist angesagt? (20)

Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
17. Java data structures trees representation and traversal
17. Java data structures trees representation and traversal17. Java data structures trees representation and traversal
17. Java data structures trees representation and traversal
 
AI local search
AI local searchAI local search
AI local search
 
Classification Using Decision tree
Classification Using Decision treeClassification Using Decision tree
Classification Using Decision tree
 
Independent component analysis
Independent component analysisIndependent component analysis
Independent component analysis
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Advanced data structures vol. 1
Advanced data structures   vol. 1Advanced data structures   vol. 1
Advanced data structures vol. 1
 
Disjoint sets union, find
Disjoint sets  union, findDisjoint sets  union, find
Disjoint sets union, find
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3
 
MACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMMACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHM
 
Evaluation metrics: Precision, Recall, F-Measure, ROC
Evaluation metrics: Precision, Recall, F-Measure, ROCEvaluation metrics: Precision, Recall, F-Measure, ROC
Evaluation metrics: Precision, Recall, F-Measure, ROC
 
k Nearest Neighbor
k Nearest Neighbork Nearest Neighbor
k Nearest Neighbor
 
Scalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithmScalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithm
 
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
 
Clustering
ClusteringClustering
Clustering
 
Bayesian networks
Bayesian networksBayesian networks
Bayesian networks
 
Introduction to random forest and gradient boosting methods a lecture
Introduction to random forest and gradient boosting methods   a lectureIntroduction to random forest and gradient boosting methods   a lecture
Introduction to random forest and gradient boosting methods a lecture
 

Ähnlich wie Pattern recognition binoy k means clustering

Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means ClusteringJunghoon Kim
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017Iwan Sofana
 
Business analytics course in delhi
Business analytics course in delhiBusiness analytics course in delhi
Business analytics course in delhibhuvan8999
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhidevipatnala1
 
business analytics course in delhi
business analytics course in delhibusiness analytics course in delhi
business analytics course in delhidevipatnala1
 
Best data science training, best data science training institute in hyderabad.
 Best data science training, best data science training institute in hyderabad. Best data science training, best data science training institute in hyderabad.
Best data science training, best data science training institute in hyderabad.Data Analytics Courses in Pune
 
Data scientist course in hyderabad
Data scientist course in hyderabadData scientist course in hyderabad
Data scientist course in hyderabadprathyusha1234
 
Data scientist training in bangalore
Data scientist training in bangaloreData scientist training in bangalore
Data scientist training in bangaloreprathyusha1234
 
Data science course in chennai (3)
Data science course in chennai (3)Data science course in chennai (3)
Data science course in chennai (3)prathyusha1234
 
data science course in chennai
data science course in chennaidata science course in chennai
data science course in chennaidevipatnala1
 
Best institute for data science in hyderabad
Best institute for data science in hyderabadBest institute for data science in hyderabad
Best institute for data science in hyderabadprathyusha1234
 

Ähnlich wie Pattern recognition binoy k means clustering (20)

Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means Clustering
 
08 clustering
08 clustering08 clustering
08 clustering
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 
Clustering
ClusteringClustering
Clustering
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017
 
Clustering
ClusteringClustering
Clustering
 
Business analytics course in delhi
Business analytics course in delhiBusiness analytics course in delhi
Business analytics course in delhi
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhi
 
business analytics course in delhi
business analytics course in delhibusiness analytics course in delhi
business analytics course in delhi
 
Best data science training, best data science training institute in hyderabad.
 Best data science training, best data science training institute in hyderabad. Best data science training, best data science training institute in hyderabad.
Best data science training, best data science training institute in hyderabad.
 
Data scientist course in hyderabad
Data scientist course in hyderabadData scientist course in hyderabad
Data scientist course in hyderabad
 
Data scientist training in bangalore
Data scientist training in bangaloreData scientist training in bangalore
Data scientist training in bangalore
 
Data science course in chennai (3)
Data science course in chennai (3)Data science course in chennai (3)
Data science course in chennai (3)
 
data science course in chennai
data science course in chennaidata science course in chennai
data science course in chennai
 
Best institute for data science in hyderabad
Best institute for data science in hyderabadBest institute for data science in hyderabad
Best institute for data science in hyderabad
 

Kürzlich hochgeladen

Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxsomshekarkn64
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction managementMariconPadriquez1
 
Piping Basic stress analysis by engineering
Piping Basic stress analysis by engineeringPiping Basic stress analysis by engineering
Piping Basic stress analysis by engineeringJuanCarlosMorales19600
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 

Kürzlich hochgeladen (20)

Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 
Piping Basic stress analysis by engineering
Piping Basic stress analysis by engineeringPiping Basic stress analysis by engineering
Piping Basic stress analysis by engineering
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 

Pattern recognition binoy k means clustering

  • 2. What is Clustering? • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Grouping a set of data objects into clusters • Clustering is unsupervised classification: no predefined classes • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms
  • 3. 3 Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • Urban planning: Identifying groups of houses according to their house type, value, and geographical location • Seismology: Observed earth quake epicenters should be clustered along continent faults
  • 4. 4 What Is a Good Clustering? • A good clustering method will produce clusters with • High intra-class similarity • Low inter-class similarity • Precise definition of clustering quality is difficult • Application-dependent • Ultimately subjective
  • 5. 5 Requirements for Clustering in Data Mining • Scalability • Ability to deal with different types of attributes • Discovery of clusters with arbitrary shape • Minimal domain knowledge required to determine input parameters • Ability to deal with noise and outliers • Insensitivity to order of input records • Robustness wrt high dimensionality • Incorporation of user-specified constraints • Interpretability and usability
  • 6. 6 Major Clustering Approaches • Partitioning: Construct various partitions and then evaluate them by some criterion • Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion • Model-based: Hypothesize a model for each cluster and find best fit of models to data • Density-based: Guided by connectivity and density functions
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimum: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster
Partitional Clustering
• Each instance is placed in exactly one of K non-overlapping clusters.
• Since only one set of clusters is output, the user normally has to input the desired number of clusters K.
Similarity and Dissimilarity Between Objects
• Euclidean distance (the Minkowski distance of order 2), for objects described by p attributes:
  d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
• Properties of a metric d(i, j):
• d(i, j) >= 0
• d(i, i) = 0
• d(i, j) = d(j, i)
• d(i, j) <= d(i, k) + d(k, j)
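The distance and its metric properties can be spot-checked in a few lines of Python (a sketch; the sample points are taken from the worked example later in the deck):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two p-dimensional points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

a, b, c = (1, 1), (5, 7), (2, 5)
print(euclidean(a, b))                                        # 7.211...
print(euclidean(a, b) == euclidean(b, a))                     # symmetry: True
print(euclidean(a, b) <= euclidean(a, c) + euclidean(c, b))   # triangle inequality: True
```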
Squared Error
[Figure: scatter plot illustrating the squared-error objective function]
• k-means seeks the partition that minimises the squared error: the sum, over all clusters, of the squared distances from each object to the centroid of its cluster.
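The objective can be sketched directly (an illustrative helper, not code from the slides; the data, labels, and centroids below are made up for the demonstration):

```python
def sse(points, labels, centroids):
    """Squared-error objective: sum of squared distances from each
    point to the centroid of the cluster it is assigned to."""
    return sum(
        sum((p_d - c_d) ** 2 for p_d, c_d in zip(p, centroids[l]))
        for p, l in zip(points, labels)
    )

points = [(1, 1), (1.5, 2), (3, 4)]
labels = [0, 0, 1]                      # point -> cluster index
centroids = [(1.25, 1.5), (3, 4)]
print(sse(points, labels, centroids))   # 0.3125 + 0.3125 + 0 = 0.625
```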
Algorithm: k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise, go to step 3.
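The five steps above can be sketched in plain Python. This is an illustrative implementation, not the deck's own code; the `init` parameter is an added convenience so a run can be made deterministic:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance (sufficient for nearest-center tests)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, init=None, max_iter=100):
    # Steps 1-2: k is given; initialize centers (randomly if not supplied).
    centroids = list(init) if init is not None else random.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 3: assign each object to its nearest cluster center.
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        # Step 5: exit when no object changed membership.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels, centroids

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
labels, cents = kmeans(pts, 2, init=[(1, 1), (5, 7)])
print(labels)  # [0, 0, 1, 1, 1, 1, 1]
```

Run on the seven points of the worked example below, it reproduces the final partition the slides arrive at.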
K-means Clustering: Steps 1-5
[Figures: five scatter plots (axes 0-5, features "expression in condition 1" vs "expression in condition 2") tracing successive iterations of k-means with the Euclidean distance metric and three cluster centers k1, k2, k3]
Example
As a simple illustration of the k-means algorithm, consider the following data set, consisting of the scores on two variables for each of seven individuals:

Subject  Features
1        (1, 1)
2        (1.5, 2)
3        (3, 4)
4        (5, 7)
5        (3.5, 5)
6        (4.5, 5)
7        (3.5, 4.5)

[Scatter plot of the data]
Working
This data set is to be grouped into two clusters, i.e. k = 2. As a first step in finding a sensible initial partition, let the feature values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means:

Group    Individual  Mean Vector (centroid)
Group 1  1           (1, 1)
Group 2  4           (5, 7)
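The choice of the two furthest-apart individuals can be checked mechanically by examining every pair (a sketch; `math.dist` requires Python 3.8+):

```python
import math
from itertools import combinations

subjects = {1: (1, 1), 2: (1.5, 2), 3: (3, 4), 4: (5, 7),
            5: (3.5, 5), 6: (4.5, 5), 7: (3.5, 4.5)}

# Keep the pair of subjects with the largest Euclidean distance.
pair = max(combinations(subjects, 2),
           key=lambda ij: math.dist(subjects[ij[0]], subjects[ij[1]]))
print(pair, round(math.dist(subjects[pair[0]], subjects[pair[1]]), 2))
# (1, 4) 7.21 -- so (1, 1) and (5, 7) seed the two clusters
```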
Working
• The remaining individuals are now examined in sequence and allocated to the cluster to which they are closest, in terms of Euclidean distance to the cluster mean.
• The mean vector is recalculated each time a new member is added.
• This leads to the following series of steps:
Iteration 1: Assign Objects to Closest Clusters
A cluster is defined by its centroid. Each object is assigned to the nearest centroid; for example, d(O2, C1) = sqrt((1 - 1.5)^2 + (1 - 2)^2) ≈ 1.11. Object 1 is assigned to cluster 1, and so on.

Object (Oi)  Features    D(Oi, C1 = (1, 1))  D(Oi, C2 = (5, 7))  Closest Centroid
1            (1, 1)      0                   7.21                C1
2            (1.5, 2)    1.11                6.10                C1
3            (3, 4)      3.61                3.61                C1 (tie)
4            (5, 7)      7.21                0                   C2
5            (3.5, 5)    4.71                2.50                C2
6            (4.5, 5)    5.31                2.06                C2
7            (3.5, 4.5)  4.30                2.91                C2
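The distance columns of this table can be reproduced in a short loop (a sketch; `math.dist` requires Python 3.8+, and the C1 tie-break for object 3 is made explicit):

```python
import math

objects = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
c1, c2 = (1, 1), (5, 7)

for i, o in enumerate(objects, start=1):
    d1, d2 = math.dist(o, c1), math.dist(o, c2)
    closest = "C1" if d1 <= d2 else "C2"  # ties go to C1
    print(f"Object {i}: D(Oi,C1)={d1:.2f}  D(Oi,C2)={d2:.2f}  -> {closest}")
```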
Recomputing Centroids at the End of Iteration 1
The initial partition has now changed, and the two clusters at this stage have the following characteristics:

Cluster    Individuals  New Centroid
Cluster 1  1, 2, 3      C1 = ((1,1) + (1.5,2) + (3,4)) / 3 ≈ (1.8, 2.3)
Cluster 2  4, 5, 6, 7   C2 = ((5,7) + (3.5,5) + (4.5,5) + (3.5,4.5)) / 4 ≈ (4.1, 5.4)
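The centroid updates are component-wise means, which can be verified directly (a sketch; the exact means are (1.83..., 2.33...) and (4.125, 5.375), which the slide rounds to one decimal place):

```python
cluster1 = [(1, 1), (1.5, 2), (3, 4)]
cluster2 = [(5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]

def centroid(members):
    """Component-wise mean of the member points."""
    return tuple(sum(d) / len(members) for d in zip(*members))

print(centroid(cluster1))  # approx (1.83, 2.33), shown as (1.8, 2.3)
print(centroid(cluster2))  # (4.125, 5.375), shown as (4.1, 5.4)
```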
Iteration 2: Check Whether Any Object Has Changed Clusters

Object (Oi)  Features    D(Oi, C1 = (1.8, 2.3))  D(Oi, C2 = (4.1, 5.4))  Closest Centroid
1            (1, 1)      1.53                    5.38                    C1
2            (1.5, 2)    0.42                    4.28                    C1
3            (3, 4)      2.08                    1.78                    C2
4            (5, 7)      5.69                    1.84                    C2
5            (3.5, 5)    3.19                    0.72                    C2
6            (4.5, 5)    3.82                    0.57                    C2
7            (3.5, 4.5)  2.78                    1.08                    C2

Object 3 has changed cluster from 1 to 2.
Recomputing Centroids at the End of Iteration 2
The partition has changed again, with object 3 relocated to cluster 2, and the two clusters at this stage have the following characteristics:

Cluster    Individuals    New Centroid
Cluster 1  1, 2           C1 = ((1,1) + (1.5,2)) / 2 ≈ (1.3, 1.5)
Cluster 2  3, 4, 5, 6, 7  C2 = ((3,4) + (5,7) + (3.5,5) + (4.5,5) + (3.5,4.5)) / 5 = (3.9, 5.1)
Iteration 3: Check Whether Any Object Has Changed Clusters

Object (Oi)  Features    D(Oi, C1 = (1.3, 1.5))  D(Oi, C2 = (3.9, 5.1))  Closest Centroid
1            (1, 1)      0.58                    5.02                    C1
2            (1.5, 2)    0.54                    3.92                    C1
3            (3, 4)      3.02                    1.42                    C2
4            (5, 7)      6.63                    2.19                    C2
5            (3.5, 5)    4.13                    0.41                    C2
6            (4.5, 5)    4.74                    0.61                    C2
7            (3.5, 4.5)  3.72                    0.72                    C2

No change in clusters compared to the previous iteration.
Conclusion
• Each individual is now nearer its own cluster mean than that of the other cluster, so the iteration stops, and the latest partitioning is taken as the final cluster solution.
• Hence objects {1, 2} belong to the first cluster and objects {3, 4, 5, 6, 7} belong to the second cluster.
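The whole worked example can be replayed end to end. This sketch repeats assignment and centroid update until the partition stabilises, starting from the same seeds (1, 1) and (5, 7); `math.dist` requires Python 3.8+, and the loop assumes no cluster ever empties (true for this data):

```python
import math

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
cents = [(1, 1), (5, 7)]
labels = None

while True:
    # Assign each point to its nearest centroid (ties go to C1).
    new = [0 if math.dist(p, cents[0]) <= math.dist(p, cents[1]) else 1
           for p in pts]
    if new == labels:
        break
    labels = new
    # Recompute each centroid as the mean of its members.
    cents = [tuple(sum(d) / len(grp) for d in zip(*grp))
             for grp in ([p for p, l in zip(pts, labels) if l == k]
                         for k in (0, 1))]

print([i + 1 for i, l in enumerate(labels) if l == 0])  # [1, 2]
print([i + 1 for i, l in enumerate(labels) if l == 1])  # [3, 4, 5, 6, 7]
```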
Notes
• The iterative relocation continues until no more relocations occur.
• In this example the no-relocation condition was satisfied after 3 iterations, but this is not usually the case: depending on the dataset, hundreds of iterations may be required.
• It is also possible that the k-means algorithm never reaches a final solution.
• It is therefore a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
Comments on the k-Means Method
• Strengths
• Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
• Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
• Weaknesses
• Applicable only when a mean is defined (problematic for categorical data).
• The number of clusters, k, must be specified in advance.
• Unable to handle noisy data and outliers.
• Not suitable for discovering clusters with non-convex shapes.
Summary
• The k-means algorithm is a simple yet popular method for clustering analysis.
• Its performance is determined by the initialisation and by an appropriate distance measure.
• There are several variants of k-means that overcome its weaknesses:
• K-medoids: resistance to noise and/or outliers
• K-modes: extension to categorical data clustering analysis
• CLARA: extension to deal with large data sets
• Mixture models (EM algorithm): handling uncertainty of cluster membership
• Online tutorial: the K-means function in Matlab: https://www.youtube.com/watch?v=aYzjenNNOcc
References
• http://mnemstudio.org/clustering-k-means-example-1.htm
• Ke Chen, "K-means Clustering", COMP24111 Machine Learning, University of Manchester, 2016.
• {Insert Reference}/ 10.ppt
• {Insert Reference}/MachinLearning3.ppt