Deepak George
Staff Data Scientist
Unsupervised Learning: Clustering
K-Means, Hierarchical Clustering & DBSCAN
About Me
➢ Data Science Career
▪ General Electric
▪ Accenture Management Consulting
▪ Mu Sigma
➢ Highlights
▪ 1st Prize Best Data Science Project (BAI 5) – IIM Bangalore
▪ Co-author of Markdown Optimization case published at Harvard Business School
▪ Kaggle Bronze medal – Toxic Comment Classification
▪ Kaggle Bronze medal – Coupon Purchase Prediction (Recommender System)
▪ SAS Certified Statistical Business Analyst: Regression and Modeling Credentials
➢ Education
▪ Indian Institute Of Management Bangalore – Business Analytics & Intelligence
▪ College Of Engineering Trivandrum – Computer Science Engineering
➢ Passion
▪ Deep Learning, Photography, Football
➢ Profile
▪ linkedin.com/in/deepakgeorge7/
▪ https://github.com/deepakiim
Deepak George, IIM Bangalore
Agenda
1. Introduction to clustering and unsupervised learning
2. K-Means
3. Divisive and agglomerative clustering (Hierarchical)
4. Density-based clustering (DBSCAN)
5. Recommendations
What is Unsupervised Learning?
Supervised learning:
• Training data is labelled
• Used to predict the label
• Classification and Regression
Unsupervised learning:
• Training data is unlabelled
• Used for finding patterns in the data
• Clustering, Dimensionality reduction, Association Rules
What is Clustering?
What is Norm?
Let p ≥ 1 be a real number. The p-norm (also called the Lp norm) of a vector x = (x1, x2, …, xn) is $\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
• A norm measures the magnitude (size or length) of a vector
• Intuitively, the norm of a vector x measures the distance from the origin to the point x
Geometric Interpretation of L2 Norm
Consider a unit ball centred at the origin.
The Euclidean norm of a vector is simply the factor by which the ball must
be expanded or shrunk so that the tip of the vector lies exactly on its surface
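The p-norm definition can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides; the function name `p_norm` and the example vector are assumptions chosen for the example.

```python
def p_norm(x, p):
    """Return the Lp norm of vector x (hypothetical helper, p >= 1)."""
    if p < 1:
        raise ValueError("p must be >= 1 for a valid norm")
    # Sum the absolute coordinates raised to p, then take the p-th root.
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


x = [3.0, -4.0]
l1 = p_norm(x, 1)  # Manhattan length: |3| + |-4| = 7
l2 = p_norm(x, 2)  # Euclidean length: sqrt(9 + 16) = 5
print(l1, l2)
```

With p = 2 this is the ordinary straight-line distance from the origin to x, which matches the geometric interpretation above.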
Dissimilarity/Proximity Matrix
Euclidean distance
Dissimilarity Matrix
Weighted Dissimilarity Matrix
Data Matrix (n × p)
Dissimilarity Matrix (n × n)
Distance is inversely related to similarity: the larger the distance between two observations, the less similar they are
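As a rough sketch of how an n × p data matrix becomes an n × n dissimilarity matrix, the snippet below uses Euclidean distance. The optional per-feature `weights` argument shows one common form of a weighted dissimilarity; that form, the function names, and the toy data are assumptions, since the slide's own formulas are not available.

```python
import math


def weighted_euclidean(a, b, weights=None):
    """Euclidean distance between rows a and b, with optional feature weights."""
    if weights is None:
        weights = [1.0] * len(a)  # unweighted case
    return math.sqrt(sum(w * (ai - bi) ** 2 for w, ai, bi in zip(weights, a, b)))


def dissimilarity_matrix(data, weights=None):
    """Build the symmetric n x n matrix of pairwise distances."""
    n = len(data)
    return [[weighted_euclidean(data[i], data[j], weights) for j in range(n)]
            for i in range(n)]


data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # n = 3 observations, p = 2 features
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

The diagonal is zero and the matrix is symmetric, so in practice only the lower triangle needs to be stored.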
Types of Clustering Algorithms
1. Combinatorial algorithms
2. Mixture modelling
3. Mode seekers
Combinatorial Algorithm
Combinatorial algorithms directly specify a mathematical loss function and attempt to minimize it through some combinatorial
optimization algorithm. Since the goal is to assign close points to the same cluster, a natural loss function is the within-cluster point scatter W(C).
• C(i) = k is the encoder that we seek; it assigns the i-th observation to the k-th cluster
• W(C): Within Cluster Point Scatter
• B(C): Between Cluster Point Scatter
• T: Total Point Scatter, with T = W(C) + B(C)
Minimizing W(C) is equivalent to maximizing B(C), given that T is
constant for any given data.
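The within-cluster, between-cluster, and total point scatter can be written out explicitly. The block below follows the notation of The Elements of Statistical Learning, which the deck cites elsewhere; treating these as the formulas behind this slide is an assumption. Here d(x_i, x_{i'}) is the dissimilarity between observations i and i', N is the number of observations, and K is the number of clusters.

```latex
\begin{align}
W(C) &= \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'}) \\
B(C) &= \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i') \neq k} d(x_i, x_{i'}) \\
T &= \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} d(x_i, x_{i'}) = W(C) + B(C)
\end{align}
```

Because T does not depend on the assignment C, any assignment that lowers W(C) raises B(C) by the same amount.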
K-Means Visual Explanation
Steps: Random seeds → Assign → Update
K-Means is intended for situations in which all variables are of the quantitative type and squared Euclidean distance is used as the dissimilarity measure.
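The seed, assign, and update steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the deck's implementation: the function names, the choice of picking seeds from the data points, and the toy data are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Random seeds: k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assign: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # no assignment changed, so we have converged
            break
        labels = new_labels
        # Update: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

Each pass can only lower or keep the within-cluster scatter, which is why the loop terminates, although not necessarily at the best possible clustering.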
K-Means Mathematical Explanation
(for special case of K means)
* Source: The Elements of Statistical Learning
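For squared Euclidean distance, the within-cluster point scatter simplifies to a sum of squared distances to cluster means. The block below restates that result in the notation of The Elements of Statistical Learning; presenting it as the content of this slide is an assumption. N_k is the number of observations in cluster k and \bar{x}_k is that cluster's mean vector.

```latex
\begin{align}
W(C) &= \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \lVert x_i - x_{i'} \rVert^2
      = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - \bar{x}_k \rVert^2 \\
C^{*} &= \arg\min_{C} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - \bar{x}_k \rVert^2
\end{align}
```

K-Means searches for C* by alternating between fixing the assignment to compute the means and fixing the means to recompute the assignment.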
K-Means algorithm animation
[Animation: Kmeans_animation.gif; axes X1 and X2]
K-Means starting seed position issue animation
[Animation: Kmeans_starting_issue_animation.gif; axes X1 and X2]
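The starting seed issue can be reproduced on a tiny one-dimensional example. The sketch below is an illustration only; `kmeans_1d`, the data, and both seed choices are hypothetical. It runs the same K-Means loop from two different seed sets and compares the resulting sum of squared distances to the cluster centres.

```python
def kmeans_1d(points, seeds, max_iter=100):
    """K-Means on 1-D data from explicit seeds; returns (centres, sum of squared errors)."""
    centres = [float(s) for s in seeds]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centres]
        for x in points:
            nearest = min(range(len(centres)), key=lambda k: (x - centres[k]) ** 2)
            clusters[nearest].append(x)
        # Empty clusters keep their previous centre.
        new_centres = [sum(c) / len(c) if c else centres[k]
                       for k, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    sse = sum((x - centres[k]) ** 2 for k, c in enumerate(clusters) for x in c)
    return centres, sse


points = [0, 1, 10, 11, 20, 21]  # three obvious pairs

good_centres, good_sse = kmeans_1d(points, seeds=[0, 10, 20])  # one seed per pair
bad_centres, bad_sse = kmeans_1d(points, seeds=[0, 1, 10])     # two seeds in one pair

print(good_centres, good_sse)  # [0.5, 10.5, 20.5] 1.5
print(bad_centres, bad_sse)    # [0.0, 1.0, 15.5] 101.0
```

Both runs converge, but the second one is stuck in a poor local optimum. A common remedy is to run the algorithm from several seed sets and keep the solution with the lowest within-cluster scatter.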
K-Means Clustering
Advantages
• Scales well to large datasets
• Does NOT require explicit assumptions about the data distribution
Disadvantages
• Implicitly assumes clusters are spherical
• Implicitly assumes clusters are approximately equal in size
• Can only use (squared) Euclidean dissimilarity
• K must be chosen in advance, and a wrong K gives misleading clusters
• Does not guarantee a global optimum
• Result can depend on the choice of initial seeds
• Works only with continuous data
Hierarchical Clustering
Agglomerative Clustering:
• Bottom Up
• Each object is initially considered a single-element cluster
• At each step, the two clusters that are the most similar are combined into a new, bigger cluster
• Repeated until all points are members of one single big cluster
Divisive Clustering:
• Top Down
• Initially all objects are assigned to a single cluster
• At each step, the most heterogeneous cluster is divided into two
• Repeated until all objects are in their own cluster
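The bottom-up procedure can be sketched as a short loop that repeatedly merges the two closest clusters. This is a naive illustration, not the deck's code: it assumes Euclidean distance and single linkage (distance between the closest pair of members), and the names `agglomerative` and `single_linkage` are hypothetical.

```python
import math


def single_linkage(c1, c2):
    """Cluster dissimilarity: distance between the closest pair of members (assumed choice)."""
    return min(math.dist(a, b) for a in c1 for b in c2)


def agglomerative(points, n_clusters=1):
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []                       # merge history: (cluster_a, cluster_b, distance)
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest dissimilarity.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

The merge history is what a dendrogram draws: each merge becomes a join at the height of its distance. Stopping at `n_clusters=1` reproduces the full hierarchy.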
Measuring Dissimilarity between two clusters
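The slide's figure is not available here, so the following is an assumption about what it covers: three commonly used ways to measure dissimilarity between two clusters are single, complete, and average linkage. The sketch below implements each with Euclidean distance; the function names and example clusters are hypothetical.

```python
import math


def pairwise_distances(c1, c2):
    """All Euclidean distances between a member of c1 and a member of c2."""
    return [math.dist(a, b) for a in c1 for b in c2]


def single_linkage(c1, c2):
    return min(pairwise_distances(c1, c2))  # closest pair


def complete_linkage(c1, c2):
    return max(pairwise_distances(c1, c2))  # farthest pair


def average_linkage(c1, c2):
    d = pairwise_distances(c1, c2)
    return sum(d) / len(d)                  # mean over all pairs


c1 = [(0, 0), (0, 1)]
c2 = [(0, 3), (0, 5)]
print(single_linkage(c1, c2), complete_linkage(c1, c2), average_linkage(c1, c2))
# prints: 2.0 5.0 3.5
```

Single linkage tends to chain clusters together through close neighbours, while complete linkage favours compact clusters; average linkage sits between the two.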
Hierarchical Clustering Visual Explanation
Hierarchical Clustering Algorithm
* Source: The Elements of Statistical Learning
Hierarchical Clustering algorithm animation
[Animation: Hierarchical_animation.gif; axes X1 and X2]
Hierarchical Clustering
Advantages
• No need to choose K before running the algorithm
• The dendrogram gives visual guidance in choosing K
• Can use any dissimilarity measure
• Works on any kind of data, including categorical and mixed
• Does NOT require explicit assumptions about the data distribution
Disadvantages
• Does not scale well to large datasets
• Does not guarantee a global optimum, because each merge or split is greedy and never revisited
DBSCAN: Density-based spatial clustering of applications with noise
DBSCAN Parameters:
1. MinPts – Minimum number of points that must fall within a point's Epsilon radius for that point to start or extend a cluster
2. Epsilon – Radius of the circle drawn around a point; the points falling inside the circle are its neighbours and, if there are at least MinPts of them, belong to the same cluster
DBSCAN Fundamentals
• Clusters are zones that are sufficiently dense
• Points that lack enough neighbours (i.e. lie in regions that are not dense) do not belong to any cluster and are classified as noise
• DBSCAN can return clusters of any shape
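The two parameters and the noise rule can be seen in a compact sketch of the algorithm. This is a naive O(n²) illustration under stated assumptions, not the deck's code: it uses Euclidean distance, counts a point as its own neighbour, and the names `dbscan`, `region_query`, and `NOISE` are hypothetical.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE      # not dense; may later become a border point
            continue
        cluster_id += 1            # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point: keep growing
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (50, 50) has no neighbours within Epsilon other than itself, so it is labelled as noise rather than forced into a cluster.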
DBSCAN Algorithm
[Figure: points that are density reachable versus not density reachable]
DBSCAN algorithm animation
[Animation: DBSCAN_animation.gif]
DBSCAN Pros & Cons
Advantages
• It can discover any number of clusters; the number does not have to be specified in advance
• It can find clusters of varying shapes and sizes
• It can detect and ignore outliers
Disadvantages
• Assumes that clusters are of uniform density
• Results are sensitive to the Epsilon value
  ▪ Too small a value can eliminate sparse clusters as outliers
  ▪ Too large a value can merge dense clusters together, giving incorrect clusters
General recommendations
Profiling
• Identify the unique properties of each cluster and give it an appropriate label
• Identify which features dominate in which cluster
• Ensure that clusters are well separated and can be explained from a business point of view
Appropriate Dissimilarity measure
• For mixed data (numeric and categorical features), try Gower distance
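A simplified form of Gower distance can be sketched as follows: numeric features contribute their absolute difference divided by the feature's range, categorical features contribute 0 when equal and 1 otherwise, and the contributions are averaged. The function name, the `ranges` convention (None marks a categorical feature), and the example records are assumptions; missing values and feature weights are not handled.

```python
def gower_distance(a, b, ranges):
    """Simplified Gower distance between records a and b.

    ranges[j] is the observed range (max - min) of numeric feature j,
    or None when feature j is categorical.
    """
    total = 0.0
    for aj, bj, rj in zip(a, b, ranges):
        if rj is None:
            total += 0.0 if aj == bj else 1.0              # categorical: simple mismatch
        else:
            total += abs(aj - bj) / rj if rj > 0 else 0.0  # numeric: range-scaled gap
    return total / len(ranges)


# Hypothetical customers: (age, income, channel)
ranges = (40.0, 80000.0, None)
a = (25, 40000, "retail")
b = (35, 60000, "online")
print(gower_distance(a, b, ranges))  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```

Each feature contributes a value between 0 and 1, so numeric and categorical features end up on a comparable scale.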
Feature scaling
• Always scale/normalize the features before running the clustering algorithm, so that no feature dominates the distance simply because of its units
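One common way to scale features is z-score standardization, sketched below with the standard library. The choice of z-scores (rather than, say, min-max scaling), the helper name `standardize`, and the example columns are assumptions for illustration.

```python
import statistics


def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # population standard deviation
    # A constant column carries no information for distances; map it to zeros.
    return [(v - mean) / sd if sd > 0 else 0.0 for v in column]


age = [25, 35, 45]
income = [20000, 40000, 60000]

age_z = standardize(age)
income_z = standardize(income)
print(age_z)
print(income_z)
```

Before scaling, income differences in the tens of thousands would swamp age differences in any Euclidean distance; after scaling, both columns contribute on the same footing.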
Stability check
• Before clustering, split the data into training and test sets
• Run the same final clustering model on both
• If the clustering is stable, you will get similar metrics on both datasets
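One way to read the stability check is to apply the same fitted centroids to both splits and compare a quality metric. The sketch below uses the mean squared distance to the nearest centroid as that metric; the metric, the function name, the hard-coded centroids standing in for a fitted model, and the toy splits are all assumptions.

```python
def mean_sq_dist_to_nearest(points, centroids):
    """Average squared Euclidean distance from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total / len(points)


# Hypothetical centroids, standing in for a model fitted on the training split.
centroids = [(0.0, 1.0), (10.0, 11.0)]

train = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
test = [(0.0, 2.0), (10.0, 10.0)]

train_metric = mean_sq_dist_to_nearest(train, centroids)
test_metric = mean_sq_dist_to_nearest(test, centroids)
relative_gap = abs(train_metric - test_metric) / train_metric

print(train_metric, test_metric, relative_gap)  # 1.0 1.0 0.0
```

A small relative gap suggests the clusters generalize beyond the training split; a large gap suggests the clustering is fitting noise. What counts as "small" is a judgment call for the problem at hand.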
