Data Science - Part VII - Cluster Analysis
Presented by: Derek Kane
 Introduction to Clustering Techniques
 K-Means
 Hierarchical Clustering
 Gaussian Mixture Model
 Visualization of Distance Matrix
 Practical Example
 Cluster analysis or clustering is the task of
grouping a set of objects in such a way that
objects in the same group (called a cluster)
are more similar (in some sense or another)
to each other than to those in other groups
(clusters).
 It is a main task of exploratory data mining,
and a common technique for statistical data
analysis, used in many fields, including
machine learning, pattern recognition, image
analysis, information retrieval, and
bioinformatics.
 Cluster analysis itself is not one specific
algorithm, but the general task to be solved.
There are many real world applications of clustering:
 Grouping or Hierarchies of Products
 Recommendation Engines
 Biological Classification
 Typologies
 Crime Analysis
 Medical Imaging
 Market Research
 Social Network Analysis
 Markov Chain Monte Carlo Methods
 According to Vladimir Estivill-Castro, the notion
of a "cluster" cannot be precisely defined, which
is one of the reasons why there are so many
clustering algorithms.
 There is no objectively "correct" clustering
algorithm; as has been noted, "clustering is in
the eye of the beholder."
There are many different types of clustering
models including:
 Connectivity models
 Centroid models
 Distribution models
 Density models
 Group models
 Graph-based models
 A "clustering" is essentially a set of such
clusters, usually containing all objects in the
data set.
 Additionally, it may specify the relationship of
the clusters to each other, for example a
hierarchy of clusters embedded in each other.
Clusterings can be roughly distinguished as:
 hard clustering: each object either belongs to a
cluster or it does not.
 soft clustering (also: fuzzy clustering): each
object belongs to each cluster to a certain
degree (e.g. a likelihood of belonging to the
cluster).
 K-means is one of the simplest unsupervised
learning algorithms that solve the well-known
clustering problem.
The algorithm is composed of the following steps:
 Place K points into the space represented by
the objects that are being clustered. These
points represent initial group centroids.
 Assign each object to the group that has the
closest centroid.
 When all objects have been assigned,
recalculate the positions of the K centroids.
 Repeat Steps 2 and 3 until the centroids no
longer move.
 The k-means algorithm does not necessarily find
the optimal configuration, i.e., the global
minimum of the objective function.
 The algorithm is also significantly sensitive to the
initial randomly selected cluster centers.
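A minimal sketch of these steps in R using the built-in kmeans() function (the data frame name df is a placeholder for any numeric data set):

```r
# Minimal k-means sketch in R; `df` is a placeholder for a numeric data set.
set.seed(42)                     # results depend on the randomly chosen initial centroids
km <- kmeans(df, centers = 4,    # K = 4 initial centroids placed in the data space
             nstart = 25)        # 25 random restarts to reduce the risk of a poor local minimum

km$cluster                       # cluster assignment for each object
km$centers                       # final centroid positions
km$tot.withinss                  # total within-cluster sum of squares (the objective being minimized)
```

Using multiple random starts (nstart) is a common way to mitigate the sensitivity to the initial centers noted above.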
 A scree plot is a graphical display of the
variance of each (cluster) component in
the dataset which is used to determine
how many components should be
retained in order to explain a high
percentage of the variation in the data.
 The proportion of variance explained by each
component is calculated as
$\lambda_i / \sum_{i=1}^{n} \lambda_i$, where
$\lambda_i$ is the i-th eigenvalue and
$\sum_{i=1}^{n} \lambda_i$ is the sum of all of the
eigenvalues.
 The plot shows the variance explained by the
first component and, for each subsequent
component, the additional variance that it adds.
 A scree plot is sometimes referred to as an “elbow” plot.
 In order to identify the optimal number of clusters for further analysis, we need to look
for the “bend” in the graph at the elbow.
# of Clusters should be 3 or 4
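One common way to produce such an elbow plot for k-means is to chart the total within-cluster sum of squares against the number of clusters; a sketch in R, again assuming a numeric data set df:

```r
# Elbow ("scree") plot: total within-cluster sum of squares vs. number of clusters.
set.seed(42)
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
# The "bend" (elbow) in this curve suggests the number of clusters to retain.
```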
 Silhouette refers to a method of interpretation and validation of clusters of data.
 The silhouette technique provides a succinct graphical representation of how well each object
lies within its cluster and was first described by Peter J. Rousseeuw in 1986.
 The goal of a silhouette plot is to identify the point of the highest Average Silhouette
Width (ASW). This is the optimal number of clusters for the analysis.
Cluster = 4
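A sketch of computing the Average Silhouette Width across candidate values of K with the cluster package (silhouette() takes a vector of cluster labels and a distance matrix):

```r
library(cluster)                  # provides silhouette()

# Average Silhouette Width (ASW) for K = 2..10, using k-means assignments.
d   <- dist(df)                   # `df` is again a placeholder numeric data set
asw <- sapply(2:10, function(k) {
  km  <- kmeans(df, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])        # the ASW for this value of K
})

plot(2:10, asw, type = "b",
     xlab = "Number of clusters K", ylab = "Average Silhouette Width")
which.max(asw) + 1                # K with the highest ASW
```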
 The Cluster plot shows the spread of the data within each of the generated clusters.
 When the data has more than 2 dimensions, a Principal Component Analysis can be used to project the data in order to view the clustering.
 Here is an example of K-Means clustering based on the Ruspini data.
Cluster = 4
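A sketch of this kind of cluster plot on the Ruspini data, which ships with the cluster package; clusplot() projects the points onto their first two principal components:

```r
library(cluster)
data(ruspini)                     # 75 points in 2 dimensions with 4 natural groups

set.seed(42)
km <- kmeans(ruspini, centers = 4, nstart = 25)

# Plot the clusters against the first two principal components.
clusplot(ruspini, km$cluster, color = TRUE, shade = TRUE, labels = 4, lines = 0)
```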
 Hierarchical clustering (or hierarchic
clustering) outputs a hierarchy, a
structure that is more informative than
the unstructured set of clusters returned
by flat clustering.
 Hierarchical clustering does not require
us to prespecify the number of clusters,
and most hierarchical algorithms that
have been used in information retrieval
(IR) are deterministic.
 These advantages of hierarchical
clustering come at the cost of lower
efficiency.
Given a set of N items to be clustered, and an N×N
distance (or similarity) matrix, the basic process of
Johnson's (1967) hierarchical clustering is as follows:
 Start by assigning each item to its own cluster,
so that if you have N items, you now have N
clusters, each containing just one item. Let the
distances (similarities) between the clusters equal
the distances (similarities) between the items
they contain.
 Find the closest (most similar) pair of clusters
and merge them into a single cluster, so that
now you have one less cluster.
 Compute distances (similarities) between the
new cluster and each of the old clusters.
 Repeat steps 2 and 3 until all items are clustered
into a single cluster of size N.
Cluster = 4
Step 3 can be done in different ways, which is what
distinguishes single-linkage from complete-linkage and
average-linkage clustering.
 In single-linkage clustering, we consider the distance
between one cluster and another cluster to be equal
to the shortest distance from any member of one
cluster to any member of the other cluster.
 In complete-linkage clustering, we consider the
distance between one cluster and another cluster to
be equal to the greatest distance from any member of
one cluster to any member of the other cluster.
 In average-linkage clustering, we consider the distance
between one cluster and another cluster to be equal
to the average distance from any member of one
cluster to any member of the other cluster.
Single-linkage on density-
based clusters.
 Here is an example of Hierarchical clustering based on the Ruspini data.
Cluster = 4
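A sketch of agglomerative hierarchical clustering on the Ruspini data with hclust(); the method argument selects single-, complete-, or average-linkage:

```r
library(cluster)
data(ruspini)

d  <- dist(ruspini)                      # N x N pairwise distance matrix
hc <- hclust(d, method = "average")      # "single" and "complete" give the other linkage criteria

plot(hc, hang = -1)                      # dendrogram showing the full hierarchy of merges
groups <- cutree(hc, k = 4)              # cut the tree to obtain 4 flat clusters
table(groups)                            # cluster sizes
```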
 Another technique we can use for
clustering involves the application of the
EM algorithm for Gaussian Mixture Models.
 The approach uses the Mclust package in
R, which utilizes the BIC statistic as the
model-selection criterion.
 The goal for identifying the number of
clusters is to maximize the BIC.
 This approach is considered to be a fuzzy
clustering method.
 Here are some diagnostics which are useful to establish the correct fit to the data.
The maximum BIC
indicates that 5 clusters
should be used
 The Mclust classification chart depicts the distribution of the 5 centroids and also the relative
degree of the classification uncertainty.
 The contour plot shows the density of the data points relative to each of the
centroids.
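A sketch of this workflow with the mclust package on the Ruspini data (the number of clusters selected by the maximum BIC will depend on the data and the candidate models):

```r
library(mclust)                     # model-based clustering with Gaussian mixtures and the EM algorithm
library(cluster)
data(ruspini)

fit <- Mclust(ruspini)              # fits mixtures over a range of component counts and keeps the best BIC
summary(fit)

plot(fit, what = "BIC")             # BIC curve; its maximum indicates the number of clusters
plot(fit, what = "classification")  # classification chart with the fitted components
plot(fit, what = "uncertainty")     # relative degree of classification uncertainty
plot(fit, what = "density")         # contour plot of the fitted mixture density
```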
 A similarity matrix is a matrix of scores
that represent the similarity between a
number of data points.
 Each element of the similarity matrix
contains a measure of similarity between
two of the data points.
 A distance matrix is a matrix (two-
dimensional array) containing the
distances, taken pairwise, of a set of
points.
 This matrix will have a size of N×N where
N is the number of points, nodes or
vertices (often in a graph).
Heat map of Similarity Matrix
Ex. Points on a graph
Distance Matrix
Heat map
 Here is an application of the Distance Matrix to the Ruspini dataset.
 Now we will reorder the distance matrix by the kmeans (k=4) algorithm.
 Now we bring in the dissplot to compare the heatmap against expected values.
We are looking for symmetry between the observed colors and the expected
colors.
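A sketch of this visualization using the dissplot() function from the seriation package:

```r
library(cluster)
library(seriation)                  # provides dissplot() for visualizing dissimilarity matrices
data(ruspini)

d <- dist(ruspini)                  # pairwise distance matrix

set.seed(42)
km <- kmeans(ruspini, centers = 4, nstart = 25)

# Reorder the distance matrix by the k-means labels; the observed heat map
# should show dark blocks on the diagonal that agree with the expected
# cluster structure.
dissplot(d, labels = km$cluster)
```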
 This is an example of an incorrect
kmeans specification of k = 3 instead
of k = 4.
 The heatmap shows that within the
1st cluster there appear to be
data points which should be
separated into an additional distinct
cluster.
 The similarity and distance matrices
can be powerful visual aids when
evaluating the fit of clusters.
 The goal of a clustering exercise should
be to identify the appropriate pockets or
groups of data that should be grouped
together.
 Because the structure of the underlying
data is usually unknown, there could be
many different clusters created by
different clustering techniques.
 Therefore, let's use a dataset where it is
obvious what the correct number of
clusters should be.
 This dataset clearly shows that there should be 4 clusters: 2 eyes, 1 nose, and 1 mouth.
 Let's first run a k-means clustering algorithm:
K = 6 Clusters The k-means algorithm produced too many
clusters for this dataset.
 Now let's use the Hierarchical clustering method:
K = 4 Clusters
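A sketch of this side-by-side comparison; face_df is a placeholder for a two-dimensional data set with the obvious 4-cluster structure described above (the original "face" data is not reproduced here):

```r
# `face_df` is a placeholder for a 2-D data set with 4 visually obvious clusters.
set.seed(42)

km <- kmeans(face_df, centers = 6, nstart = 25)               # k-means result with K = 6, as above
hc <- cutree(hclust(dist(face_df), method = "single"), k = 4) # single-linkage recovers odd-shaped groups

par(mfrow = c(1, 2))
plot(face_df, col = km$cluster, pch = 19, main = "k-means (K = 6)")
plot(face_df, col = hc,         pch = 19, main = "Hierarchical (K = 4)")
```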
 Reside in Wayne, Illinois
 Active Semi-Professional Classical Musician
(Bassoon).
 Married my wife on 10/10/10 and we have been
together for 10 years.
 Pet Yorkshire Terrier / Toy Poodle named
Brunzie.
 Pet Maine Coons’ named Maximus Power and
Nemesis Gul du Cat.
 Enjoy Cooking, Hiking, Cycling, Kayaking, and
Astronomy.
 Self-proclaimed Data Nerd and Technology
Lover.
 http://www.stat.washington.edu/mclust/
 http://www.stats.gla.ac.uk/glossary/?q=node/451
 http://www.norusis.com/pdf/SPC_v13.pdf
 http://en.wikipedia.org/wiki/Cluster_analysis
 http://michael.hahsler.net/SMU/EMIS7332/R/chap8.html
 http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
 http://www.autonlab.org/tutorials/gmm14.pdf
 http://en.wikipedia.org/wiki/Distance_matrix