mlcourse.ai. Clustering
Yury Kashnitskiy, Dmitry Ignatov
Higher School of Economics
November 16, 2018
(Higher School of Economics) Clustering 16.11.2018 1 / 24
Plan
1 Clustering
    Problem formulation
    Applications
2 Clustering methods
    k-Means
    Hierarchical methods
        Agglomerative clustering
    Density-based methods
Problem formulation
The main task of cluster analysis is to group instances into subgroups (clusters) of
similar ones.
These groups can be
Partitions
Hierarchies
Fuzzy partitions
Biclusters
Mixtures of distributions
Applications
Biology and medicine
    Gene expression analysis
    Tomography clustering
Humanities and social sciences
    Sociology and anthropology
    Psychology
Technical systems
    Telemetry
    Image segmentation
Marketing
    Customer segmentation
    Subgroup behavioral analysis
Text analytics
    News clustering
Social networks
    Community detection
Clustering methods
How to measure dissimilarity of instances
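A set of n instances $x \in \mathbb{R}^m$ is represented as a feature matrix, where $x_i^j$ is the j-th feature of instance $x_i$:
\[
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
\iff
\begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^m \\
x_2^1 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \ddots & \vdots \\
x_n^1 & x_n^2 & \cdots & x_n^m
\end{pmatrix}
\]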
Minkowski distance
\[
d(x, y) = \left( \sum_{i=1}^{m} |x^i - y^i|^p \right)^{1/p}
\]
Cosine distance
\[
d(x, y) = 1 - \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \langle y, y \rangle}}
\]
Hamming distance
\[
d(x, y) = \frac{1}{m} \sum_{i=1}^{m} [x^i \neq y^i]
\]
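The three dissimilarity measures above can be sketched in plain Python as follows. This is a minimal illustration, not code from the course; the function names and the default p = 2 are assumptions, and instances are taken to be equal-length sequences of numbers (or of categorical values for the Hamming distance).

```python
import math


def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is the Manhattan distance, p=2 the Euclidean one."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


def hamming(x, y):
    """Fraction of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)
```

For example, `minkowski([0, 0], [3, 4])` is the Euclidean length of the vector (3, 4), and `hamming([1, 0, 1, 1], [1, 1, 1, 0])` counts two mismatches out of four positions.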
k-Means
k-Means is an iterative algorithm to split data into k clusters.
The mean of the instances in each cluster $C_j$ (called its centroid) is denoted $c_j$ and is defined as
\[
c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i
\]
The objective is the sum of squared distances between each instance and the centroid of the cluster to which it belongs:
\[
J(C) = \sum_{j=1}^{k} \sum_{i \in C_j} d(x_i, c_j)^2
\]
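As a small worked illustration of the centroid and of the objective J(C), here is a sketch that assumes d is the Euclidean distance and that a clustering is given as a list of clusters, each a list of points. The names `centroid` and `objective` and the toy data are made up for the example.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def objective(clusters):
    """Sum over clusters of squared Euclidean distances to the cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total


# Toy data: two clusters of two points each.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
J = objective(clusters)
```

Here the centroids are (1, 0) and (10, 11), every point lies at distance 1 from its centroid, and so J is the sum of four unit squared distances.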
The algorithm
Input: data and the number of clusters k (a hyperparameter)
Output: a partition of the data into k clusters
* * *
1. Initialization: Choose k points as the initial centroids.
2. Update clusters: Given k centroids, each instance is assigned to its nearest
centroid. All instances assigned to a centroid c_j (j = 1, ..., k) form a
cluster C_j.
3. Update centroids: For each cluster C_j, a new centroid is calculated as the
mean of all instances in this cluster.
Steps 2-3 are repeated until convergence.
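The three steps can be sketched as below. This is a minimal illustration under stated assumptions rather than the course's implementation: the distance is Euclidean, the initial centroids are k distinct data points drawn at random (the slides do not fix an initialization scheme), convergence means that the assignment stops changing, and a cluster that becomes empty simply keeps its old centroid. The names `k_means`, `max_iter` and `seed` are hypothetical.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # step 1: k data points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every instance to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in data
        ]
        if new_labels == labels:  # assignment unchanged: converged
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [x for x, label in zip(data, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return labels, centroids


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(data, k=2)
```

On this toy data the two groups are far apart, so the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialization.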
k-Means. Example
Clustering quality and the number of clusters
Elbow method
For each k we can compute the objective J(C) of the resulting clustering; write J(k) for this value.
Then we look for a k beyond which increasing k no longer decreases J “too much”.
Formally, we look for the k that minimizes the following D(k):
\[
D(k) = \frac{|J(k) - J(k+1)|}{|J(k-1) - J(k)|}
\]
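The criterion D(k) can be evaluated as in the sketch below. The objective values in `J` are invented purely for illustration (they are not taken from the slides), and the helper name `elbow_k` is hypothetical. The sketch assumes J is known for consecutive integer values of k and strictly decreases, so that no denominator is zero.

```python
def elbow_k(J):
    """Return the k minimizing D(k), together with all D values.

    J maps consecutive integers k to the objective value J(k).
    D(k) is only defined where both J(k - 1) and J(k + 1) are known.
    """
    ks = sorted(J)
    D = {
        k: abs(J[k] - J[k + 1]) / abs(J[k - 1] - J[k])
        for k in ks[1:-1]
    }
    return min(D, key=D.get), D


# Made-up objective values for k = 1..5 (illustration only).
J = {1: 1000.0, 2: 400.0, 3: 100.0, 4: 80.0, 5: 70.0}
best_k, D = elbow_k(J)
```

With these numbers the drop from k = 3 to k = 4 is small compared with the drop from k = 2 to k = 3, so D(3) is the smallest ratio and k = 3 is selected.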
[Figure: Elbow Method. A scatter plot of the data, and J plotted against k for k = 2, ..., 10.]
Silhouette
The silhouette of an instance $x_i$ in a cluster $C$ is
\[
s(i) = \frac{b_i - a_i}{\max(a_i, b_i)},
\]
where $a_i$ is the mean distance from $x_i$ to all other instances in $C$, and $b_i$
is the mean distance from $x_i$ to the instances of the nearest other cluster (the smallest such mean over all clusters other than $C$).
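A direct computation of the silhouette of a single instance might look like this sketch. It assumes the Euclidean distance, at least two clusters, and the common convention that an instance alone in its cluster gets silhouette 0. For b it takes the smallest mean distance to any other cluster. The function names are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def silhouette(i, data, labels, dist=euclidean):
    """Silhouette of instance i given cluster labels (a list, one label per instance)."""
    own = labels[i]
    same = [
        dist(data[i], x)
        for j, x in enumerate(data)
        if labels[j] == own and j != i
    ]
    if not same:
        return 0.0  # convention for a singleton cluster
    a = sum(same) / len(same)
    # Mean distance to each other cluster; keep the smallest.
    b = min(
        sum(dist(data[i], x) for j, x in enumerate(data) if labels[j] == c)
        / labels.count(c)
        for c in set(labels)
        if c != own
    )
    return (b - a) / max(a, b)


data = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(0, data, [0, 0, 1, 1])  # natural grouping
bad = silhouette(0, data, [0, 1, 0, 1])   # groups mixed up
```

For the natural grouping the first point is much closer to its own cluster than to the other one, so its silhouette is close to 1; for the mixed-up labelling it is negative.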
Silhouette. Acceptable number of clusters
Silhouette. Bad number of clusters
Hierarchical methods
From a feature matrix we can move to a pairwise distance matrix.
\[
\begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^m \\
x_2^1 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \ddots & \vdots \\
x_n^1 & x_n^2 & \cdots & x_n^m
\end{pmatrix}
\Rightarrow
\begin{pmatrix}
d(x_1, x_1) & d(x_1, x_2) & \cdots & d(x_1, x_n) \\
d(x_2, x_1) & \ddots & & d(x_2, x_n) \\
\vdots & & \ddots & \vdots \\
d(x_n, x_1) & d(x_n, x_2) & \cdots & d(x_n, x_n)
\end{pmatrix}
\]
Since $d(x_i, x_i) = 0$ and $d$ is symmetric, it is enough to keep the upper triangle of the distance matrix:
\[
\begin{pmatrix}
0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\
  & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\
  &   & \ddots & \cdots & \cdots \\
  &   &   & 0 & d(x_{n-1}, x_n) \\
  &   &   &   & 0
\end{pmatrix}
\]
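The move from a feature matrix to a pairwise distance matrix can be sketched as follows, assuming the Euclidean distance (`math.dist`, available from Python 3.8). Only the upper triangle is computed and then mirrored, since the diagonal is zero and the matrix is symmetric. The function name is hypothetical.

```python
import math


def pairwise_distances(X):
    """Full n x n matrix of Euclidean distances between the rows of X."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only
            D[i][j] = D[j][i] = math.dist(X[i], X[j])
    return D


X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = pairwise_distances(X)
```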
Agglomerative clustering
Sequential merging of similar clusters
0 Start with each instance in its own cluster
1 Find the two closest clusters
2 Merge them
Repeat steps 1-2 until all instances are in the same cluster
How to define distance between clusters?
Linkage
1 Single Linkage
\[
d(A, B) = \min_{x \in A,\, y \in B} d(x, y)
\]
2 Complete Linkage
\[
d(A, B) = \max_{x \in A,\, y \in B} d(x, y)
\]
3 Average Linkage
\[
d(A, B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d(x_i, y_j)
\]
4 Weighted Average Linkage
Let cluster A be the union of clusters p and q. Then
\[
d(A, B) = \frac{d(p, B) + d(q, B)}{2}
\]
5 Centroid Linkage
\[
d(A, B) = \lVert c_A - c_B \rVert_2
\]
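The linkage criteria that depend only on the members of the two clusters (single, complete, average and centroid) can be sketched as below. Weighted average linkage is left out because it also needs the merge history. The Euclidean distance (`math.dist`) is an assumption, as are the function names; the toy clusters are one-dimensional points wrapped in tuples.

```python
import math


def single_linkage(A, B, d=math.dist):
    """Distance between the closest pair of points, one from each cluster."""
    return min(d(x, y) for x in A for y in B)


def complete_linkage(A, B, d=math.dist):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(d(x, y) for x in A for y in B)


def average_linkage(A, B, d=math.dist):
    """Mean distance over all pairs of points, one from each cluster."""
    return sum(d(x, y) for x in A for y in B) / (len(A) * len(B))


def centroid_linkage(A, B):
    """Euclidean distance between the two cluster centroids."""
    c_a = [sum(column) / len(A) for column in zip(*A)]
    c_b = [sum(column) / len(B) for column in zip(*B)]
    return math.dist(c_a, c_b)


A = [(1.0,), (2.0,), (3.0,)]
B = [(7.0,), (10.0,)]
```

For these two clusters the closest pair is (3, 7) and the farthest pair is (1, 10), so single linkage is smaller than complete linkage, with the average in between.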
Merging clusters can be depicted with a dendrogram.
Let us take a look at a 1D sample: { 1, 2, 3, 7, 10, 12, 25, 29 }
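[Figure: dendrogram of the sample. Horizontal axis: Objects (1, 2, 3, 7, 10, 12, 25, 29); vertical axis: Cluster distances (0 to 25). Clusters A, B and C are marked, together with the distance between clusters A and B.]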
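The merging procedure can be traced on the 1D sample with the naive sketch below. Single linkage is chosen here as an assumption, since the slides do not say which linkage the dendrogram uses, so the merge heights need not match the figure. The function names are hypothetical, and the quadratic search for the closest pair is for clarity, not efficiency.

```python
def agglomerate(points, linkage):
    """Naive agglomerative clustering; returns the list of merges performed.

    Each merge is a tuple (cluster_a, cluster_b, height), where height is the
    linkage distance at which the two clusters were joined.
    """
    clusters = [[p] for p in points]  # step 0: one instance per cluster
    merges = []
    while len(clusters) > 1:
        # Step 1: find the two closest clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        height = linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], height))
        # Step 2: merge them.
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return merges


def single(A, B):
    """Single linkage for 1D points."""
    return min(abs(x - y) for x in A for y in B)


sample = [1, 2, 3, 7, 10, 12, 25, 29]
merges = agglomerate(sample, single)
heights = [h for _, _, h in merges]  # [1, 1, 2, 3, 4, 4, 13]
```

The heights are what a dendrogram would plot on its vertical axis: with single linkage they are the gaps between neighbouring points, taken in increasing order, and the last merge joins {25, 29} to the rest at height 13.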
Density-based methods
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
DBSCAN algorithm
All points can be divided into core points (elements of dense regions), border points and noise
(skipping the formal definitions here).
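Since the formal definitions are skipped, the sketch below uses the usual ones as an assumption. A point is a core point if its eps-neighborhood (including the point itself) contains at least `min_pts` points, a border point if it is not core but lies in the neighborhood of a core point, and noise otherwise. The names `dbscan`, `eps`, `min_pts` and `NOISE` are hypothetical, the distance is Euclidean, and neighborhoods are found by brute force in O(n^2).

```python
import math

NOISE = -1


def dbscan(data, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    n = len(data)
    # Brute-force eps-neighborhoods; each point is its own neighbor.
    neighbors = [
        [j for j in range(n) if math.dist(data[i], data[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue
        # Grow a new cluster from the unvisited core point i.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()  # p is always a core point
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if core[q]:  # border points join but do not expand
                        stack.append(q)
        cluster += 1
    return [NOISE if label is None else label for label in labels]


data = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (5.0, 5.0),
]
labels = dbscan(data, eps=1.5, min_pts=4)
```

With `min_pts=4` and `eps=1.5`, each corner of the two unit squares has all four corners of its square in its neighborhood, so both squares become clusters, while the isolated point (5, 5) is labelled as noise.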
DBSCAN. Example
Hyperparameters: M = 4 (minimum number of points in an Eps-neighborhood), Eps > 0 (neighborhood radius)
DBSCAN. Pros and cons
Pros
+ Can find clusters of any shape
+ Easy to implement
+ Can find noise in data
+ Nice complexity: O(n log n) with a good data structure
(otherwise O(n^2))
Cons
- Parametric (M and Eps must be chosen by the user)
- Doesn’t work well when clusters differ in density
- Depends on the chosen metric
Contacts
Questions
Thanks!
Please ask your questions in the OpenDataScience Slack team.
http://ods.ai
On National Teacher Day, meet the 2024-25 Kenan Fellows
 

mlcourse.ai. Clustering

  • 1. mlcourse.ai. Clustering Yury Kashnitskiy, Dmitry Ignatov Higher School of Economics November 16, 2018 (Higher School of Economics) Clustering 16.11.2018 1 / 24
  • 2. Plan. (1) Clustering: problem formulation; applications. (2) Clustering methods: k-Means; hierarchical methods (agglomerative clustering); density-based methods.
  • 4. Clustering: Problem formulation. The main task of cluster analysis is to group instances into subgroups (clusters) of similar ones. These groups can be: partitions; hierarchies; fuzzy partitions; biclusters; mixtures of distributions.
  • 5. Clustering: Applications. Biology and medicine: gene expression analysis, tomography clustering. Humanities: sociology and anthropology, psychology. Technical systems: telemetry, image segmentation. Marketing: customer segmentation, subgroup behavioral analysis. Text analytics: news clustering. Social networks: community detection.
  • 7. Clustering methods: How to measure dissimilarity of instances. Instances $x \in \mathbb{R}^m$ are represented as rows of a feature matrix: $\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \iff \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^m \\ x_2^1 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \ddots & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^m \end{pmatrix}$. Minkowski distance: $d(x, y) = \left( \sum_{i=1}^{m} |x^i - y^i|^p \right)^{1/p}$. Cosine distance: $d(x, y) = 1 - \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \langle y, y \rangle}}$. Hamming distance: $d(x, y) = \frac{1}{m} \sum_{i=1}^{m} [x^i \neq y^i]$.
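The three dissimilarity measures on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function names are made up here, instances are assumed to be equal-length sequences of numbers, and the cosine distance assumes neither vector is all zeros.

```python
import math


def minkowski(x, y, p=2):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y (both non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def hamming(x, y):
    """Fraction of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)
```

For example, `minkowski([0, 0], [3, 4])` is the Euclidean distance between the two points, and `hamming` is the natural choice when the features are categorical.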
  • 8. Clustering methods: k-Means. k-Means is an iterative algorithm that splits the data into k clusters. The centroid $c_j$ of each cluster $C_j$ is the arithmetic mean of its instances: $c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$. The objective is the sum of squared distances between the instances and the centroids of the clusters to which they belong: $J(C) = \sum_{j=1}^{k} \sum_{i \in C_j} d(x_i, c_j)^2$.
  • 9. Clustering methods: k-Means. The algorithm. Input: the data and k (a hyperparameter). Output: a partition of the data into k clusters. 1. Initialization: choose k points as the initial centroids. 2. Update clusters: given the k centroids, assign each instance to its nearest centroid; all instances assigned to a centroid $c_j$ ($j = 1, \dots, k$) form the cluster $C_j$. 3. Update centroids: for each cluster $C_j$, recompute the centroid as the arithmetic mean of all instances in this cluster. Steps 2-3 are repeated until convergence.
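The three steps above can be sketched as follows. This is a minimal illustration, not the course's reference implementation. It assumes squared Euclidean distance, initial centroids drawn at random from the data, and "convergence" meaning that the assignments stop changing. The names `k_means`, `sq_dist`, `mean_point` and `objective` are chosen here for the example.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_point(points):
    """Arithmetic mean (centroid) of a non-empty list of points."""
    m = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(m))


def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as initial centroids (an assumption).
    centroids = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every instance to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in data
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids


def objective(data, labels, centroids):
    """J(C): sum of squared distances from instances to their centroids."""
    return sum(sq_dist(x, centroids[lab]) for x, lab in zip(data, labels))


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(data, k=2)
J = objective(data, labels, centroids)
```

The result depends on the initialization, so in practice the procedure is usually restarted several times and the partition with the smallest J is kept.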
  • 10-16. Clustering methods: k-Means. Example (figure slides).
  • 17. Clustering methods: Clustering quality and the number of clusters. Elbow method. For each k we can calculate J(C). Then we look for a k such that increasing it further does not decrease J "too much". Formally, we look for the k that minimizes $D(k) = \frac{|J(k) - J(k+1)|}{|J(k-1) - J(k)|}$.
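The criterion D(k) is easy to evaluate once J has been computed for a range of consecutive k. The sketch below assumes the objective values are already available in a dictionary. The numbers are invented purely for illustration and `elbow_scores` is a hypothetical helper name.

```python
def elbow_scores(J):
    """Return {k: D(k)} for every k whose neighbours k-1 and k+1 are in J."""
    scores = {}
    for k in sorted(J):
        if k - 1 in J and k + 1 in J:
            denom = abs(J[k - 1] - J[k])
            if denom > 0:  # skip k where J did not change at all
                scores[k] = abs(J[k] - J[k + 1]) / denom
    return scores


# Made-up objective values J(k), for illustration only.
J = {1: 4000.0, 2: 2500.0, 3: 600.0, 4: 500.0, 5: 450.0, 6: 420.0}
D = elbow_scores(J)
best_k = min(D, key=D.get)  # the k with the sharpest "elbow"
```

Note that D(k) is undefined for the smallest and the largest k in the range, since it needs both neighbours.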
  • 18. Clustering methods: Clustering quality and the number of clusters. Elbow method. [Figure: scatter plot of the data, and a plot titled "Elbow Method" showing J against k for k from 2 to 10.]
  • 19. Clustering methods: Clustering quality and the number of clusters. Silhouette. The silhouette of an instance $x_i$ in a cluster C is $s(i) = \frac{b_i - a_i}{\max(a_i, b_i)}$, where $a_i$ is the mean distance from $x_i$ to all other instances of C, and $b_i$ is the mean distance from $x_i$ to the instances of the nearest other cluster.
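A small sketch of the silhouette of a single instance follows. It takes b as the mean distance to the nearest other cluster, which is the usual convention and an assumption beyond the slide's wording. Euclidean distance, the value 0 for singleton clusters, and the function names are also choices made for this example. At least two clusters are required.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def silhouette(data, labels, i, dist=euclidean):
    """Silhouette s(i) of instance i; labels must contain >= 2 clusters."""
    own = labels[i]
    same = [
        dist(data[i], x)
        for j, (x, lab) in enumerate(zip(data, labels))
        if lab == own and j != i
    ]
    if not same:  # singleton cluster: s(i) = 0 by convention
        return 0.0
    a = sum(same) / len(same)
    # Mean distance to each other cluster; b is the smallest of them.
    b = min(
        sum(dist(data[i], x) for x, lab in zip(data, labels) if lab == other)
        / labels.count(other)
        for other in set(labels)
        if other != own
    )
    return (b - a) / max(a, b)


data = [(0,), (1,), (10,), (11,)]
good = silhouette(data, [0, 0, 1, 1], 0)  # sensible partition
bad = silhouette(data, [0, 1, 0, 1], 0)   # poor partition
```

Values close to 1 indicate that the instance sits well inside its cluster; negative values suggest it would fit better elsewhere. Averaging s(i) over all instances gives a score for the whole partition.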
  • 20. Clustering methods: Silhouette. Acceptable number of clusters (figure slide).
  • 21. Clustering methods: Silhouette. Bad number of clusters (figure slide).
  • 22. Clustering methods: Hierarchical methods. From a feature matrix we can move to a pairwise distance matrix: $\begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^m \\ x_2^1 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \ddots & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^m \end{pmatrix} \Rightarrow \begin{pmatrix} d(x_1, x_1) & d(x_1, x_2) & \cdots & d(x_1, x_n) \\ d(x_2, x_1) & d(x_2, x_2) & \cdots & d(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ d(x_n, x_1) & d(x_n, x_2) & \cdots & d(x_n, x_n) \end{pmatrix}$.
  • 23. Clustering methods: Hierarchical methods (continued). The diagonal of the distance matrix is zero, and it is enough to keep its upper triangle: $\begin{pmatrix} 0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\ & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\ & & \ddots & \cdots & \vdots \\ & & & 0 & d(x_{n-1}, x_n) \\ & & & & 0 \end{pmatrix}$.
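Moving from a feature matrix to a pairwise distance matrix can be sketched as below. The helper name `pairwise_distances` is illustrative. The distance function is passed in, and the example uses the Euclidean distance from the standard library (`math.dist`, Python 3.8+). Only the upper triangle is computed, assuming the distance is symmetric, and it is then mirrored.

```python
import math


def pairwise_distances(X, dist=math.dist):
    """Symmetric n x n matrix of distances between the rows of X."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]  # the diagonal stays at zero
    for i in range(n):
        for j in range(i + 1, n):
            # Compute the upper triangle once and mirror it.
            D[i][j] = D[j][i] = dist(X[i], X[j])
    return D


X = [(0, 0), (3, 4), (6, 8)]
D = pairwise_distances(X)
```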
  • 24. Clustering methods: Hierarchical methods. Agglomerative clustering: sequential merging of similar clusters. 0. Start with each cluster containing a single instance. 1. Find the two closest clusters. 2. Merge them. Repeat steps 1-2 until all instances are in the same cluster. How should the distance between clusters be defined?
  • 25. Clustering methods: Hierarchical methods. Agglomerative clustering: linkage. 1. Single linkage: $d(A, B) = \min_{x \in A,\, y \in B} d(x, y)$. 2. Complete linkage: $d(A, B) = \max_{x \in A,\, y \in B} d(x, y)$.
  • 26. Clustering methods: Hierarchical methods. Agglomerative clustering: linkage (continued). 3. Average linkage: $d(A, B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d(x_i, y_j)$. 4. Weighted average linkage: let cluster A be the union of clusters p and q; then $d(A, B) = \frac{d(p, B) + d(q, B)}{2}$. 5. Centroid linkage: $d(A, B) = \lVert c_A - c_B \rVert_2$.
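The single, complete, average and centroid linkages can be written directly from their definitions. The sketch below assumes clusters are lists of numeric tuples and uses Euclidean distance throughout, and the function names are illustrative. Weighted average linkage is left out because it depends on the merge history (the distances of the two merged parts p and q), not just on the members of A and B.

```python
import math


def single_linkage(A, B, dist=math.dist):
    """Distance between the two closest members of A and B."""
    return min(dist(x, y) for x in A for y in B)


def complete_linkage(A, B, dist=math.dist):
    """Distance between the two farthest members of A and B."""
    return max(dist(x, y) for x in A for y in B)


def average_linkage(A, B, dist=math.dist):
    """Mean of all pairwise distances between members of A and B."""
    return sum(dist(x, y) for x in A for y in B) / (len(A) * len(B))


def centroid(points):
    m = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(m))


def centroid_linkage(A, B):
    """Euclidean distance between the centroids of A and B."""
    return math.dist(centroid(A), centroid(B))


A = [(1,), (2,), (3,)]
B = [(7,), (10,)]
```

On these two small clusters the four linkages give 4, 9, 6.5 and 6.5 respectively. Single linkage tends to produce elongated, chained clusters, while complete linkage favours compact ones.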
  • 27. Clustering methods: Hierarchical methods. Agglomerative clustering. Merging clusters can be depicted with a dendrogram. Let us take a look at a 1D sample: {1, 2, 3, 7, 10, 12, 25, 29}. [Figure: dendrogram with the objects 1, 2, 3, 7, 10, 12, 25, 29 on the horizontal axis ("Objects") and cluster distances from 0 to 25 on the vertical axis; clusters A, B and C are marked, together with the distance between clusters A and B.]
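The merge sequence that a dendrogram draws can be reproduced for the 1D sample above with a naive agglomerative loop. This is a sketch under stated assumptions: the linkage is passed as `min` (single) or `max` (complete), ties are broken by taking the first closest pair found, and the function name is made up. The linkage used for the figure on the slide is not stated, so the merge heights below need not match it.

```python
def agglomerative_1d(values, linkage=min):
    """Return the list of merges as (distance, cluster_a, cluster_b)."""
    clusters = [[v] for v in values]  # step 0: one instance per cluster
    merges = []
    while len(clusters) > 1:
        # Step 1: find the two closest clusters under the chosen linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 2: merge them and record the height of the merge.
        merges.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


sample = [1, 2, 3, 7, 10, 12, 25, 29]
single_heights = [d for d, _, _ in agglomerative_1d(sample, linkage=min)]
complete_heights = [d for d, _, _ in agglomerative_1d(sample, linkage=max)]
```

Each recorded distance is the height at which two branches of the dendrogram join. Cutting the tree at a chosen height yields a flat partition.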
  • 28. Clustering methods: Density-based methods. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
  • 29. Clustering methods: Density-based methods. DBSCAN algorithm. All points can be divided into points of dense regions, border points, and noise (the formal definition is skipped here).
  • 30-33. Clustering methods: Density-based methods. DBSCAN. Example (figure slides). Hyperparameters: M = 4, Eps > 0.
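Since the slides skip the formal definition, here is a compact sketch of DBSCAN under the usual definitions, which are an assumption as far as this deck is concerned: a point is a core point if at least M points (itself included) lie within distance Eps of it, a border point is a non-core point within Eps of a core point, and everything else is noise. The parameter names `eps` and `min_pts` stand for Eps and M, and the function name is illustrative. Neighbourhoods are found by brute force, which gives the O(n^2) variant.

```python
import math

NOISE = -1


def dbscan(data, eps, min_pts, dist=math.dist):
    n = len(data)
    # Brute-force neighbourhoods; each one includes the point itself.
    neighbours = [
        [j for j in range(n) if dist(data[i], data[j]) <= eps] for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Grow a new cluster from an unvisited core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if is_core[q]:  # border points join but do not expand
                        frontier.append(q)
        cluster += 1
    # Points never reached from a core point are noise.
    return [NOISE if lab is None else lab for lab in labels]


data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
labels = dbscan(data, eps=1.5, min_pts=4)
```

With M = 4 as in the example slide, the two dense squares become two clusters and the isolated point is labelled as noise. A spatial index for the neighbourhood queries would replace the brute-force search in a more efficient version.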
  • 34. Clustering methods: Density-based methods. DBSCAN. Pros and cons. Pros: can find clusters of any shape; easy to implement; can find noise in the data; good complexity, $O(n \log n)$ with a suitable data structure (otherwise $O(n^2)$). Cons: parametric (M and Eps must be chosen); does not work well when clusters differ in density; depends on the chosen metric.
  • 35. Contacts. Questions? Thanks! Please ask your questions in the OpenDataScience Slack team: http://ods.ai