SlideShare a Scribd company logo
1 of 22
Download to read offline
Prof. Pier Luca Lanzi
Clustering
Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
Prof. Pier Luca Lanzi
Readings
• Mining of Massive Datasets (Chapter 7, Section 3.5)
2
Prof. Pier Luca Lanzi
Prof. Pier Luca Lanzi
4
Prof. Pier Luca Lanzi
Clustering algorithms group a collection of data points
into “clusters” according to some distance measure
Data points in the same cluster should have
a small distance from one another
Data points in different clusters should be at
a large distance from one another.
Prof. Pier Luca Lanzi
Clustering algorithms group a collection of data
points into “clusters” according to some
distance measure
Data points in the same cluster should have
a small distance from one another
Data points in different clusters should be at
a large distance from one another
Prof. Pier Luca Lanzi
Clustering searches for “natural” grouping/structure in un-labeled data
(Unsupervised Learning)
Prof. Pier Luca Lanzi
What is Cluster Analysis?
• A cluster is a collection of data objects
§Similar to one another within the same cluster
§Dissimilar to the objects in other clusters
• Cluster analysis
§Given a set data points try to understand their structure
§Finds similarities between data according to the characteristics
found in the data
§Groups similar data objects into clusters
§It is unsupervised learning since there is no predefined classes
• Typical applications
§Stand-alone tool to get insight into data
§Preprocessing step for other algorithms
8
Prof. Pier Luca Lanzi
What Is Good Clustering?
• A good clustering consists of high quality clusters with
§High intra-class similarity
§Low inter-class similarity
• The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability
to discover some or all of the hidden patterns
• Evaluation
§Various measure of intra/inter cluster similarity
§Manual inspection
§Benchmarking on existing labels
9
Prof. Pier Luca Lanzi
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a
distance function, typically metric d(i, j)
• There is a separate “quality” function that measures the “goodness” of
a cluster
• The definitions of distance functions are usually very different for
interval-scaled, Boolean, categorical, ordinal ratio, and vector variables
• Weights should be associated with different variables based on
applications and data semantics
• It is hard to define “similar enough” or “good enough” as the answer is
typically highly subjective
10
Prof. Pier Luca Lanzi
Clustering Applications
• Marketing
§Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing
programs
• Land use
§Identification of areas of similar land use in an earth observation
database
• Insurance
§Identifying groups of motor insurance policy holders with a high
average claim cost
• City-planning
§Identifying groups of houses according to their house type, value,
and geographical location
• Earth-quake studies
§Observed earth quake epicenters should be clustered along
continent faults
11
Prof. Pier Luca Lanzi
Clustering Methods
• Hierarchical vs point assignment
• Numeric and/or symbolic data
• Deterministic vs. probabilistic
• Exclusive vs. overlapping
• Hierarchical vs. flat
• Top-down vs. bottom-up
12
Prof. Pier Luca Lanzi
Data Structures
0
d(2,1) 0
d(3,1) d(3,2) 0
: : :
d(n,1) d(n,2) ... ... 0
!
"
#
#
#
#
#
#
$
%
&
&
&
&
&
&
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot	 High	 True No
Overcast	 Hot		 High False Yes
… … … … …
x
11
... x
1f
... x
1p
... ... ... ... ...
x
i1
... x
if
... x
ip
... ... ... ... ...
x
n1
... x
nf
... x
np
!
"
#
#
#
#
#
#
#
#
$
%
&
&
&
&
&
&
&
&
Data Matrix
13
Dis/Similarity Matrix
Prof. Pier Luca Lanzi
Distance Measures
Prof. Pier Luca Lanzi
Distance Measures
• Given a space and a set of points on this space, a distance
measure d(x,y) maps two points x and y to a real number,
and satisfies three axioms
• d(x,y) ≥ 0
• d(x,y) = 0 if and only x=y
• d(x,y) = d(y,x)
• d(x,y) ≤ d(x,z) + d(z,y)
17
Prof. Pier Luca Lanzi
Euclidean Distances 18
There are other distance measures that have been used for Euclidean spa
For any constant r, we can define the Lr-norm to be the distance measur
defined by:
d([x1, x2, . . . , xn], [y1, y2, . . . , yn]) = (
n
i=1
|xi − yi|r
)1/r
The case r = 2 is the usual L2-norm just mentioned. Another common dista
measure is the L1-norm, or Manhattan distance. There, the distance betw
wo points is the sum of the magnitudes of the differences in each dimens
t is called “Manhattan distance” because it is the distance one would have
• Lr-norm
• Euclidean distance (r=2)
• Manhattan distance (r=1)
• L∞-norm
3.5.2 Euclidean Distances
The most familiar distance measure is the one we normally think of as “dis-
ance.” An n-dimensional Euclidean space is one where points are vectors of n
eal numbers. The conventional distance measure in this space, which we shall
efer to as the L2-norm, is defined:
d([x1, x2, . . . , xn], [y1, y2, . . . , yn]) =
n
i=1
(xi − yi)2
That is, we square the distance in each dimension, sum the squares, and take
he positive square root.
It is easy to verify the first three requirements for a distance measure are
atisfied. The Euclidean distance between two points cannot be negative, be-
cause the positive square root is intended. Since all squares of real numbers are
nonnegative, any i such that xi ̸= yi forces the distance to be strictly positive.
On the other hand, if xi = yi for all i, then the distance is clearly 0. Symmetry
ollows because (xi − yi)2
= (yi − xi)2
. The triangle inequality requires a good
deal of algebra to verify. However, it is well understood to be a property of
Prof. Pier Luca Lanzi
Jaccard Distance
• Jaccard distance is defined as
d(x,y) = 1 – SIM(x,y)
• SIM is the Jaccard similarity,
• Which can also be interpreted as the percentage of identical
attributes
19
Prof. Pier Luca Lanzi
Cosine Distance
• The cosine distance between x, y is the angle that the vectors to
those points make
• This angle will be in the range 0 to 180 degrees, regardless of
how many dimensions the space has.
• Example: given x = (1,2,-1) and y = (2,1,1) the angle between the
two vectors is 60
20
Prof. Pier Luca Lanzi
Edit Distance
• The distance between a string x=x1x2…xn and y=y1y2…ym is
the smallest number of insertions and deletions of single
characters that will transform x into y
• Alternatively, the edit distance d(x, y) can be compute as the
longest common subsequence (LCS) of x and y and then,
d(x,y) = |x| + |y| - 2|LCS|
• Example
§The edit distance between x=abcde and y=acfdeg is 3
(delete b, insert f, insert g), the LCS is acde which is coherent
with the previous result
21
Prof. Pier Luca Lanzi
Hamming Distance
• Hamming distance between two vectors is the number of
components in which they differ
• Or equivalently, given the number of variables p, and the number
m of matching components, we define
• Example: the Hamming distance between the vectors 10101 and
11110 is 3/5.
22
Prof. Pier Luca Lanzi
Requisites for Clustering Algorithms
• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input
parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
23
Prof. Pier Luca Lanzi
Curse of Dimensionality
in high dimensions, almost all pairs of points
are equally far away from one another
almost any two vectors are almost
orthogonal

More Related Content

What's hot

DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringPier Luca Lanzi
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 RegressionPier Luca Lanzi
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationPier Luca Lanzi
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringPier Luca Lanzi
 
DMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data ExplorationDMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data ExplorationPier Luca Lanzi
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationPier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningPier Luca Lanzi
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesPier Luca Lanzi
 
DMTM 2015 - 10 Introduction to Classification
DMTM 2015 - 10 Introduction to ClassificationDMTM 2015 - 10 Introduction to Classification
DMTM 2015 - 10 Introduction to ClassificationPier Luca Lanzi
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningPier Luca Lanzi
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationPier Luca Lanzi
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringPier Luca Lanzi
 
DMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data PreparationDMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data PreparationPier Luca Lanzi
 
DMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification ModelsDMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification ModelsPier Luca Lanzi
 
DMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification RulesDMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification RulesPier Luca Lanzi
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)ananth
 
L06 stemmer and edit distance
L06 stemmer and edit distanceL06 stemmer and edit distance
L06 stemmer and edit distanceananth
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLPSatyam Saxena
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesPier Luca Lanzi
 

What's hot (20)

DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
 
DMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data ExplorationDMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data Exploration
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
 
DMTM 2015 - 10 Introduction to Classification
DMTM 2015 - 10 Introduction to ClassificationDMTM 2015 - 10 Introduction to Classification
DMTM 2015 - 10 Introduction to Classification
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based Clustering
 
DMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data PreparationDMTM 2015 - 16 Data Preparation
DMTM 2015 - 16 Data Preparation
 
DMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification ModelsDMTM 2015 - 14 Evaluation of Classification Models
DMTM 2015 - 14 Evaluation of Classification Models
 
DMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification RulesDMTM 2015 - 12 Classification Rules
DMTM 2015 - 12 Classification Rules
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)
 
L06 stemmer and edit distance
L06 stemmer and edit distanceL06 stemmer and edit distance
L06 stemmer and edit distance
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
DMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision TreesDMTM 2015 - 11 Decision Trees
DMTM 2015 - 11 Decision Trees
 

Similar to DMTM Lecture 11 Clustering

Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningKnoldus Inc.
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
 
09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdfArafathJazeeb1
 
similarities-knn-1.ppt
similarities-knn-1.pptsimilarities-knn-1.ppt
similarities-knn-1.pptsatvikpatil5
 
UnSupervised Machincs4811-ch23a-clustering.ppt
UnSupervised Machincs4811-ch23a-clustering.pptUnSupervised Machincs4811-ch23a-clustering.ppt
UnSupervised Machincs4811-ch23a-clustering.pptRamanamurthy Banda
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis Baivab Nag
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSLiemNguyenDuy
 
Chap_10_Object_Recognition.pdf
Chap_10_Object_Recognition.pdfChap_10_Object_Recognition.pdf
Chap_10_Object_Recognition.pdfSsdSsd5
 
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...Adam Fausett
 

Similar to DMTM Lecture 11 Clustering (20)

Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
[PPT]
[PPT][PPT]
[PPT]
 
Cs345 cl
Cs345 clCs345 cl
Cs345 cl
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
PR07.pdf
PR07.pdfPR07.pdf
PR07.pdf
 
09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf09_dm1_knn_2022_23.pdf
09_dm1_knn_2022_23.pdf
 
Data Mining Lecture_5.pptx
Data Mining Lecture_5.pptxData Mining Lecture_5.pptx
Data Mining Lecture_5.pptx
 
Words in space
Words in spaceWords in space
Words in space
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Clique and sting
Clique and stingClique and sting
Clique and sting
 
similarities-knn-1.ppt
similarities-knn-1.pptsimilarities-knn-1.ppt
similarities-knn-1.ppt
 
UnSupervised Machincs4811-ch23a-clustering.ppt
UnSupervised Machincs4811-ch23a-clustering.pptUnSupervised Machincs4811-ch23a-clustering.ppt
UnSupervised Machincs4811-ch23a-clustering.ppt
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
similarities-knn.pptx
similarities-knn.pptxsimilarities-knn.pptx
similarities-knn.pptx
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNS
 
Chap_10_Object_Recognition.pdf
Chap_10_Object_Recognition.pdfChap_10_Object_Recognition.pdf
Chap_10_Object_Recognition.pdf
 
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
 

More from Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i VideogiochiPier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiPier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomePier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaPier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Pier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesPier Luca Lanzi
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsPier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesPier Luca Lanzi
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesPier Luca Lanzi
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationPier Luca Lanzi
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionPier Luca Lanzi
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningPier Luca Lanzi
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelinePier Luca Lanzi
 
VDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationVDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationPier Luca Lanzi
 
VDP2016 - Lecture 13 Data driven game design
VDP2016 - Lecture 13 Data driven game designVDP2016 - Lecture 13 Data driven game design
VDP2016 - Lecture 13 Data driven game designPier Luca Lanzi
 

More from Pier Luca Lanzi (17)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 Introduction
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipeline
 
VDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generationVDP2016 - Lecture 14 Procedural content generation
VDP2016 - Lecture 14 Procedural content generation
 
VDP2016 - Lecture 13 Data driven game design
VDP2016 - Lecture 13 Data driven game designVDP2016 - Lecture 13 Data driven game design
VDP2016 - Lecture 13 Data driven game design
 

Recently uploaded

HED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdfHED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdfMohonDas
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxAditiChauhan701637
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17Celine George
 
CapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapitolTechU
 
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptxClinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptxraviapr7
 
M-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxM-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxDr. Santhosh Kumar. N
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...CaraSkikne1
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...Nguyen Thanh Tu Collection
 
Patterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxPatterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxMYDA ANGELICA SUAN
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxDr. Asif Anas
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17Celine George
 
UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024UKCGE
 
Human-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesHuman-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesMohammad Hassany
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRATanmoy Mishra
 
Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...raviapr7
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxraviapr7
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxEduSkills OECD
 
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfMaximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfTechSoup
 

Recently uploaded (20)

HED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdfHED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdf
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17
 
CapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptx
 
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptxClinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
 
M-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxM-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptx
 
Finals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quizFinals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quiz
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
 
Patterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxPatterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptx
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptx
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17
 
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdfPersonal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
 
UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024
 
Human-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesHuman-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming Classes
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
 
Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptx
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
 
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfMaximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
 

DMTM Lecture 11 Clustering

  • 1. Prof. Pier Luca Lanzi Clustering Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
  • 2. Prof. Pier Luca Lanzi Readings • Mining of Massive Datasets (Chapter 7, Section 3.5) 2
  • 4. Prof. Pier Luca Lanzi 4
  • 5. Prof. Pier Luca Lanzi Clustering algorithms group a collection of data points into “clusters” according to some distance measure Data points in the same cluster should have a small distance from one another Data points in different clusters should be at a large distance from one another.
  • 6. Prof. Pier Luca Lanzi Clustering algorithms group a collection of data points into “clusters” according to some distance measure Data points in the same cluster should have a small distance from one another Data points in different clusters should be at a large distance from one another
  • 7. Prof. Pier Luca Lanzi Clustering searches for “natural” grouping/structure in un-labeled data (Unsupervised Learning)
  • 8. Prof. Pier Luca Lanzi What is Cluster Analysis? • A cluster is a collection of data objects §Similar to one another within the same cluster §Dissimilar to the objects in other clusters • Cluster analysis §Given a set data points try to understand their structure §Finds similarities between data according to the characteristics found in the data §Groups similar data objects into clusters §It is unsupervised learning since there is no predefined classes • Typical applications §Stand-alone tool to get insight into data §Preprocessing step for other algorithms 8
  • 9. Prof. Pier Luca Lanzi What Is Good Clustering? • A good clustering consists of high quality clusters with §High intra-class similarity §Low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns • Evaluation §Various measure of intra/inter cluster similarity §Manual inspection §Benchmarking on existing labels 9
  • 10. Prof. Pier Luca Lanzi Measure the Quality of Clustering • Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically metric d(i, j) • There is a separate “quality” function that measures the “goodness” of a cluster • The definitions of distance functions are usually very different for interval-scaled, Boolean, categorical, ordinal ratio, and vector variables • Weights should be associated with different variables based on applications and data semantics • It is hard to define “similar enough” or “good enough” as the answer is typically highly subjective 10
  • 11. Prof. Pier Luca Lanzi Clustering Applications • Marketing §Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use §Identification of areas of similar land use in an earth observation database • Insurance §Identifying groups of motor insurance policy holders with a high average claim cost • City-planning §Identifying groups of houses according to their house type, value, and geographical location • Earth-quake studies §Observed earth quake epicenters should be clustered along continent faults 11
  • 12. Prof. Pier Luca Lanzi Clustering Methods • Hierarchical vs point assignment • Numeric and/or symbolic data • Deterministic vs. probabilistic • Exclusive vs. overlapping • Hierarchical vs. flat • Top-down vs. bottom-up 12
  • 13. Prof. Pier Luca Lanzi Data Structures 0 d(2,1) 0 d(3,1) d(3,2) 0 : : : d(n,1) d(n,2) ... ... 0 ! " # # # # # # $ % & & & & & & Outlook Temp Humidity Windy Play Sunny Hot High False No Sunny Hot High True No Overcast Hot High False Yes … … … … … x 11 ... x 1f ... x 1p ... ... ... ... ... x i1 ... x if ... x ip ... ... ... ... ... x n1 ... x nf ... x np ! " # # # # # # # # $ % & & & & & & & & Data Matrix 13 Dis/Similarity Matrix
  • 14. Prof. Pier Luca Lanzi Distance Measures
  • 15. Prof. Pier Luca Lanzi Distance Measures • Given a space and a set of points on this space, a distance measure d(x,y) maps two points x and y to a real number, and satisfies three axioms • d(x,y) ≥ 0 • d(x,y) = 0 if and only x=y • d(x,y) = d(y,x) • d(x,y) ≤ d(x,z) + d(z,y) 17
  • 16. Prof. Pier Luca Lanzi Euclidean Distances 18 There are other distance measures that have been used for Euclidean spa For any constant r, we can define the Lr-norm to be the distance measur defined by: d([x1, x2, . . . , xn], [y1, y2, . . . , yn]) = ( n i=1 |xi − yi|r )1/r The case r = 2 is the usual L2-norm just mentioned. Another common dista measure is the L1-norm, or Manhattan distance. There, the distance betw wo points is the sum of the magnitudes of the differences in each dimens t is called “Manhattan distance” because it is the distance one would have • Lr-norm • Euclidean distance (r=2) • Manhattan distance (r=1) • L∞-norm 3.5.2 Euclidean Distances The most familiar distance measure is the one we normally think of as “dis- ance.” An n-dimensional Euclidean space is one where points are vectors of n eal numbers. The conventional distance measure in this space, which we shall efer to as the L2-norm, is defined: d([x1, x2, . . . , xn], [y1, y2, . . . , yn]) = n i=1 (xi − yi)2 That is, we square the distance in each dimension, sum the squares, and take he positive square root. It is easy to verify the first three requirements for a distance measure are atisfied. The Euclidean distance between two points cannot be negative, be- cause the positive square root is intended. Since all squares of real numbers are nonnegative, any i such that xi ̸= yi forces the distance to be strictly positive. On the other hand, if xi = yi for all i, then the distance is clearly 0. Symmetry ollows because (xi − yi)2 = (yi − xi)2 . The triangle inequality requires a good deal of algebra to verify. However, it is well understood to be a property of
  • 17. Prof. Pier Luca Lanzi Jaccard Distance • Jaccard distance is defined as d(x,y) = 1 – SIM(x,y) • SIM is the Jaccard similarity, • Which can also be interpreted as the percentage of identical attributes 19
  • 18. Prof. Pier Luca Lanzi Cosine Distance • The cosine distance between x, y is the angle that the vectors to those points make • This angle will be in the range 0 to 180 degrees, regardless of how many dimensions the space has. • Example: given x = (1,2,-1) and y = (2,1,1) the angle between the two vectors is 60 20
  • 19. Prof. Pier Luca Lanzi Edit Distance • The distance between a string x=x1x2…xn and y=y1y2…ym is the smallest number of insertions and deletions of single characters that will transform x into y • Alternatively, the edit distance d(x, y) can be compute as the longest common subsequence (LCS) of x and y and then, d(x,y) = |x| + |y| - 2|LCS| • Example §The edit distance between x=abcde and y=acfdeg is 3 (delete b, insert f, insert g), the LCS is acde which is coherent with the previous result 21
  • 20. Prof. Pier Luca Lanzi Hamming Distance • Hamming distance between two vectors is the number of components in which they differ • Or equivalently, given the number of variables p, and the number m of matching components, we define • Example: the Hamming distance between the vectors 10101 and 11110 is 3/5. 22
  • 21. Prof. Pier Luca Lanzi Requisites for Clustering Algorithms • Scalability • Ability to deal with different types of attributes • Ability to handle dynamic data • Discovery of clusters with arbitrary shape • Minimal requirements for domain knowledge to determine input parameters • Able to deal with noise and outliers • Insensitive to order of input records • High dimensionality • Incorporation of user-specified constraints • Interpretability and usability 23
  • 22. Prof. Pier Luca Lanzi Curse of Dimensionality in high dimensions, almost all pairs of points are equally far away from one another almost any two vectors are almost orthogonal