1. Deepak George
Staff Data Scientist
Unsupervised Learning: Clustering
K-Means, Hierarchical Clustering & DBSCAN
2. ➢ Data Science Career
▪ General Electric
▪ Accenture Management Consulting
▪ Mu Sigma
➢ Highlights
▪ 1st Prize Best Data Science Project (BAI 5) – IIM Bangalore
▪ Co-author of Markdown Optimization case published at Harvard Business School
▪ Kaggle Bronze medal – Toxic Comment Classification
▪ Kaggle Bronze medal - Coupon Purchase Prediction (Recommender System)
▪ SAS Certified Statistical Business Analyst: Regression and Modeling Credentials
➢ Education
▪ Indian Institute Of Management Bangalore - Business Analytics & Intelligence
▪ College Of Engineering Trivandrum - Computer Science Engineering
➢ Passion
▪ Deep Learning, Photography, Football
➢ Profile
▪ linkedin.com/in/deepakgeorge7/
▪ https://github.com/deepakiim
About Me
3. 1. Introduction to clustering and unsupervised learning
2. K-Means
3. Divisive and agglomerative clustering (Hierarchical)
4. Density-based clustering (DBSCAN)
5. Recommendations
Agenda
4. What is Unsupervised Learning?
Supervised Learning
• Training data is labelled
• Used to predict the label
• Classification and Regression
Unsupervised Learning
• Training data is unlabelled
• Used for finding patterns in the data
• Clustering, Dimensionality reduction, Association Rules
6. What is a Norm?
Let p ≥ 1 be a real number. The p-norm (also called the Lp norm) of a vector x = (x1, x2, …, xn) is
‖x‖p = (|x1|^p + |x2|^p + … + |xn|^p)^(1/p)
• Norm measures the magnitude (or size, length) of vector
• On an intuitive level, the norm of a vector x measures the distance from the origin to the point x.
Geometric Interpretation of L2 Norm
Consider a unit ball containing the origin. The Euclidean norm of a vector is simply the factor by which the ball must be expanded or shrunk in order to fit the given vector exactly.
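As a quick illustration of the p-norm formula above, here is a minimal NumPy sketch (my own example, not from the slides; the p_norm helper and the sample vector are assumptions for illustration):

```python
import numpy as np

x = np.array([3.0, 4.0])

# p-norm: ||x||_p = (sum_i |x_i|^p)^(1/p)
def p_norm(x, p):
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(p_norm(x, 1))          # L1 norm -> 7.0
print(p_norm(x, 2))          # L2 (Euclidean) norm -> 5.0
print(np.linalg.norm(x, 2))  # same L2 result from NumPy's built-in norm
```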
9. Combinatorial Algorithm
A combinatorial algorithm directly specifies a mathematical loss function and attempts to minimize it through some combinatorial optimization procedure. Since the goal is to assign close points to the same cluster, a natural loss function is the within-cluster point scatter.
C(i) is the encoder that we seek; it assigns the ith observation to the kth cluster.
Within-cluster point scatter: W(C) = (1/2) Σ_{k=1..K} Σ_{C(i)=k} Σ_{C(i')=k} d(x_i, x_i')
Between-cluster point scatter: B(C) = (1/2) Σ_{k=1..K} Σ_{C(i)=k} Σ_{C(i')≠k} d(x_i, x_i')
Total point scatter: T = (1/2) Σ_i Σ_{i'} d(x_i, x_i') = W(C) + B(C)
Minimizing W(C) is equivalent to maximizing B(C), since T is constant for any given data set.
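To make the scatter definitions concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; the toy data and the cluster assignment are arbitrary) that checks T = W(C) + B(C):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # toy data
C = rng.integers(0, 3, size=len(X))      # an arbitrary assignment C(i) into K=3 clusters

# pairwise squared Euclidean dissimilarities d(x_i, x_i')
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

same = C[:, None] == C[None, :]          # True where i and i' are in the same cluster
T = 0.5 * D.sum()                        # total point scatter
W = 0.5 * D[same].sum()                  # within-cluster point scatter W(C)
B = 0.5 * D[~same].sum()                 # between-cluster point scatter B(C)

print(np.isclose(T, W + B))              # True: minimizing W(C) maximizes B(C)
```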
10. K-Means Visual Explanation
Figure: Random seeds → Assign → Update
K-Means is intended for situations in which all variables are of the quantitative type, and squared Euclidean distance is chosen as the dissimilarity measure.
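A minimal sketch of the seed / assign / update loop described above (my own illustration, not from the slides; the kmeans helper and the toy blobs are assumptions, and no empty-cluster handling is included):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random seeds
    for _ in range(n_iter):
        # Assign: each point goes to its nearest centroid (squared Euclidean distance)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # stop when assignments settle
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
```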
14. K-Means Clustering
Advantages
• Scales well to large datasets
• Does NOT require ANY assumptions about the data distribution
Disadvantages
• Assumes clusters are spherical
• Assumes clusters are approximately equal in size
• Can only use Euclidean dissimilarity
• K must be chosen in advance; the wrong K gives poor clusters (see the elbow sketch below)
• Doesn't guarantee a global optimum
• The result can depend on the choice of initial seeds
• Works only with continuous data
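One common way to mitigate the wrong-K and seed-sensitivity issues is to rerun K-Means with several seeds and inspect the within-cluster sum of squares for each K (the "elbow" heuristic). A minimal scikit-learn sketch, using toy blob data of my own (not from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)  # n_init restarts with different seeds
    print(k, round(km.inertia_, 1))  # within-cluster sum of squares; look for the "elbow"
```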
15. Hierarchical Clustering
Agglomerative Clustering:
• Bottom Up
• Each object is initially considered a single-element cluster
• At each step, the two most similar clusters are combined into a new, bigger cluster
• Repeat until all points are members of one single big cluster (see the code sketch below)
Divisive Clustering:
• Top Down
• Initially all objects are assigned to a single cluster
• At each step, the most heterogeneous cluster is divided into two
• Repeat until every object is in its own cluster
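A minimal SciPy sketch of bottom-up (agglomerative) clustering with Ward linkage, including the dendrogram mentioned on the next slide (my own illustration, not from the slides; the toy data and the cut at 3 clusters are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5])])

Z = linkage(X, method="ward")                    # bottom up: merge the two closest clusters at each step
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the merge tree into 3 flat clusters

dendrogram(Z)                                    # visual guide for choosing K
plt.show()
```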
20. Hierarchical Clustering
Advantages
• No need to choose K before running the algorithm
• Dendrogram will give visual guidance in choosing K
• Can use any dissimilarity measure
• Works on any kind of data including categorical and mixed
• Does NOT require ANY assumptions about data distribution
Disadvantages
• Doesn't scale well to large datasets
• Doesn't guarantee a global optimum
21. Density-based spatial clustering of applications with noise
DBSCAN Parameters:
1. MinPts – minimum number of points required within the epsilon radius to form a dense region (cluster)
2. Epsilon – radius of the neighbourhood drawn around a point; points falling inside this radius are treated as neighbours and can belong to the same cluster
DBSCAN Fundamentals
• Clusters are zones of the data that are sufficiently dense
• Points that lack enough neighbours, i.e. lie in regions that are not dense, do not belong to any cluster and are classified as noise
• DBSCAN can return clusters of any shape
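A minimal scikit-learn sketch showing the two parameters in use (my own illustration, not from the slides; the half-moon data and the eps/min_samples values are assumptions), on a shape K-Means could not recover:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

# eps and min_samples correspond to Epsilon and MinPts above
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster labels; -1 marks points classified as noise
```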
24. DBSCAN Pros & Cons
Advantages
• It can discover any number of clusters
• Clusters of varying shapes and sizes can be obtained using the DBSCAN algorithm
• It can detect and ignore outliers
Disadvantages
• Assumes that clusters are of uniform density
• The result is sensitive to the epsilon value (see the k-distance sketch below)
• Too small a value can result in sparse clusters being eliminated as outliers
• Too large a value can merge distinct dense clusters together, giving incorrect clusters
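A common heuristic for picking epsilon, not covered in the slides, is to plot each point's sorted distance to its MinPts-th nearest neighbour and read eps off the elbow of the curve. A minimal sketch under those assumptions (toy half-moon data, MinPts = 5):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
min_pts = 5

# +1 neighbours because each point counts itself as its own nearest neighbour
nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
dist, _ = nn.kneighbors(X)

plt.plot(np.sort(dist[:, -1]))                  # sorted distance to each point's MinPts-th neighbour
plt.ylabel(f"distance to {min_pts}-th neighbour")
plt.show()                                      # choose eps near the elbow of this curve
```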
25. General recommendations
Profiling
• Identify the unique properties of each cluster and give it an appropriate label
• Identify which features dominate each cluster
• Ensure that clusters are well separated and can be explained from a business point of view
Appropriate Dissimilarity measure
• For mixed data, try Gower distance
Feature scaling
• Always scale/normalize the features before training the clustering algorithm
Stability check
• Before clustering, split the data into training and test sets
• Run the same final clustering model on both (see the sketch below)
• If the clustering is stable, you will get similar cluster profiles and metrics on both datasets
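A minimal sketch tying the scaling and stability recommendations together (my own illustration, not from the slides; the toy data, the silhouette metric, and K=2 are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=[1, 10], size=(100, 2)) for c in ([0, 0], [5, 50])])

X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

scaler = StandardScaler().fit(X_train)  # scale features so no single feature dominates the distance
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(X_train))

# Stability check: apply the same fitted model to both splits and compare a metric
for name, split in [("train", X_train), ("test", X_test)]:
    Xs = scaler.transform(split)
    print(name, round(silhouette_score(Xs, km.predict(Xs)), 3))
```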