K-means and Hierarchical Clustering

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, Andrew W. Moore                                                  Nov 16th, 2001




Some Data

This could easily be modeled by a Gaussian Mixture (with 5 components). But let’s look at a satisfying, friendly and infinitely popular alternative…



 Copyright © 2001, 2004, Andrew W. Moore                          K-means and Hierarchical Clustering: Slide 2




Lossy Compression

Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You’re only allowed to send two bits per point. It’ll have to be a “lossy transmission”.

Loss = Sum Squared Error between decoded coords and original coords.

What encoder/decoder will lose the least information?
Copyright © 2001, 2004, Andrew W. Moore                  K-means and Hierarchical Clustering: Slide 3




Idea One

Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You’re only allowed to send two bits per point. It’ll have to be a “lossy transmission”.

Break into a grid, decode each bit-pair as the middle of each grid-cell:

        00    01
        10    11

Loss = Sum Squared Error between decoded coords and original coords.

What encoder/decoder will lose the least information?

Any Better Ideas?
Copyright © 2001, 2004, Andrew W. Moore                  K-means and Hierarchical Clustering: Slide 4




Idea Two

Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You’re only allowed to send two bits per point. It’ll have to be a “lossy transmission”.

Break into a grid, decode each bit-pair as the centroid of all data in that grid-cell:

        00    01
        10    11

Loss = Sum Squared Error between decoded coords and original coords.

What encoder/decoder will lose the least information?

Any Further Ideas?
Copyright © 2001, 2004, Andrew W. Moore                    K-means and Hierarchical Clustering: Slide 5




 K-means
1. Ask user how many
   clusters they’d like.
     (e.g. k=5)




Copyright © 2001, 2004, Andrew W. Moore                    K-means and Hierarchical Clustering: Slide 6




K-means
1. Ask user how many
   clusters they’d like.
     (e.g. k=5)
2. Randomly guess k
   cluster Center
   locations




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 7




 K-means
1. Ask user how many
   clusters they’d like.
     (e.g. k=5)
2. Randomly guess k
   cluster Center
   locations
3. Each datapoint finds
   out which Center it’s
   closest to. (Thus
   each Center “owns”
   a set of datapoints)




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 8




K-means
1. Ask user how many
   clusters they’d like.
     (e.g. k=5)
2. Randomly guess k
   cluster Center
   locations
3. Each datapoint finds
   out which Center it’s
   closest to.
4. Each Center finds
   the centroid of the
   points it owns




Copyright © 2001, 2004, Andrew W. Moore    K-means and Hierarchical Clustering: Slide 9




 K-means
1. Ask user how many
   clusters they’d like.
     (e.g. k=5)
2. Randomly guess k
   cluster Center
   locations
3. Each datapoint finds
   out which Center it’s
   closest to.
4. Each Center finds
   the centroid of the
   points it owns…
5. …and jumps there
6. …Repeat until
   terminated!

Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 10
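Steps 1–6 above translate almost line for line into code. Here is a minimal sketch in Python/NumPy; the function name `kmeans` and the empty-cluster handling (a deserted Center simply stays put) are illustrative choices, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 2. Randomly guess k cluster Center locations (here: k distinct datapoints).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 3. Each datapoint finds out which Center it's closest to.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        owner = dists.argmin(axis=1)
        # 4./5. Each Center finds the centroid of the points it owns... and jumps there.
        new_centers = np.array([X[owner == j].mean(axis=0) if np.any(owner == j)
                                else centers[j] for j in range(k)])
        # 6. Repeat until terminated (here: until no Center moves).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, owner
```

A call like `centers, owner = kmeans(X, k=5)` reproduces the run illustrated on the following slides.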




K-means Start

Advance apologies: in Black and White this example will deteriorate.

Example generated by Dan Pelleg’s super-duper fast K-means system:
  Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99). (Available on www.autonlab.org/pap.html)



Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 11




 K-means
 continues
    …




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 12




K-means
continues
   …




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 13




K-means
continues
   …




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 14




K-means
continues
   …




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 15




K-means
continues
   …




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 16




K-means
continues
   …




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 17




K-means
continues
   …




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 18




K-means
continues
   …




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 19




 K-means
terminates




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 20




K-means Questions
• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal
  clustering?
• How should we start it?
• How could we automatically choose the
  number of centers?

                         ….we’ll deal with these questions over the next few slides

Copyright © 2001, 2004, Andrew W. Moore                K-means and Hierarchical Clustering: Slide 21




Distortion

Given…
• an encoder function: ENCODE : $\Re^m \rightarrow [1..k]$
• a decoder function: DECODE : $[1..k] \rightarrow \Re^m$

Define…

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathrm{DECODE}[\mathrm{ENCODE}(\mathbf{x}_i)]\right)^2$



Copyright © 2001, 2004, Andrew W. Moore                K-means and Hierarchical Clustering: Slide 22




Distortion

Given…
• an encoder function: ENCODE : $\Re^m \rightarrow [1..k]$
• a decoder function: DECODE : $[1..k] \rightarrow \Re^m$

Define…

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathrm{DECODE}[\mathrm{ENCODE}(\mathbf{x}_i)]\right)^2$

We may as well write $\mathrm{DECODE}[j] = \mathbf{c}_j$, so

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2$




Copyright © 2001, 2004, Andrew W. Moore                   K-means and Hierarchical Clustering: Slide 23
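As a sketch, this Distortion is a one-liner once ENCODE is represented as an array of center indices; the names here match the illustrative `kmeans` sketch from earlier rather than anything prescribed by the slides:

```python
import numpy as np

def distortion(X, centers, encode):
    # Sum Squared Error between each x_i and its decoded center c_ENCODE(x_i).
    return float(((X - centers[encode]) ** 2).sum())
```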




The Minimal Distortion

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2$

What properties must centers c1, c2, …, ck have when distortion is minimized?




Copyright © 2001, 2004, Andrew W. Moore                   K-means and Hierarchical Clustering: Slide 24




The Minimal Distortion (1)

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2$

What properties must centers c1, c2, …, ck have when distortion is minimized?

(1) xi must be encoded by its nearest center. ….why?

$\mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c}_j \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} (\mathbf{x}_i - \mathbf{c}_j)^2$

..at the minimal distortion


Copyright © 2001, 2004, Andrew W. Moore                           K-means and Hierarchical Clustering: Slide 25




The Minimal Distortion (1)

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2$

What properties must centers c1, c2, …, ck have when distortion is minimized?

(1) xi must be encoded by its nearest center. ….why? Otherwise distortion could be reduced by replacing ENCODE[xi] by the nearest center.

$\mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c}_j \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} (\mathbf{x}_i - \mathbf{c}_j)^2$

..at the minimal distortion


Copyright © 2001, 2004, Andrew W. Moore                           K-means and Hierarchical Clustering: Slide 26




The Minimal Distortion (2)

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2$

What properties must centers c1, c2, …, ck have when distortion is minimized?

(2) The partial derivative of Distortion with respect to each center location must be zero.




Copyright © 2001, 2004, Andrew W. Moore                                     K-means and Hierarchical Clustering: Slide 27




(2) The partial derivative of Distortion with respect to each center location must be zero.

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2 = \sum_{j=1}^{k} \sum_{i \in \mathrm{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j)^2$

where OwnedBy(cj) = the set of records owned by Center cj.

$\frac{\partial\,\mathrm{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \mathrm{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j)^2 = -2 \sum_{i \in \mathrm{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j) = 0 \quad \text{(for a minimum)}$


Copyright © 2001, 2004, Andrew W. Moore                                     K-means and Hierarchical Clustering: Slide 28




(2) The partial derivative of Distortion with respect to each center location must be zero.

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2 = \sum_{j=1}^{k} \sum_{i \in \mathrm{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j)^2$

$\frac{\partial\,\mathrm{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \mathrm{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j)^2 = -2 \sum_{i \in \mathrm{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j) = 0 \quad \text{(for a minimum)}$

Thus, at a minimum: $\mathbf{c}_j = \dfrac{1}{|\mathrm{OwnedBy}(\mathbf{c}_j)|} \displaystyle\sum_{i \in \mathrm{OwnedBy}(\mathbf{c}_j)} \mathbf{x}_i$
Copyright © 2001, 2004, Andrew W. Moore                                     K-means and Hierarchical Clustering: Slide 29
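This centroid condition is easy to sanity-check numerically: nudging a Center away from the centroid of the points it owns can only increase its within-cluster squared error. A small illustrative check on synthetic data (all names here are made up for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
owned = rng.normal(size=(50, 2))          # the points owned by one Center
centroid = owned.mean(axis=0)             # c_j from the formula above

def sse(c):
    # This Center's contribution to Distortion.
    return ((owned - c) ** 2).sum()

nudged = centroid + rng.normal(scale=0.1, size=2)
assert sse(centroid) <= sse(nudged)       # the centroid is the minimizer
```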




At the minimum distortion

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2$

What properties must centers c1, c2, …, ck have when distortion is minimized?
(1) xi must be encoded by its nearest center.
(2) Each Center must be at the centroid of points it owns.




Copyright © 2001, 2004, Andrew W. Moore                                     K-means and Hierarchical Clustering: Slide 30




Improving a suboptimal configuration…

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2$

What can be changed about centers c1, c2, …, ck when distortion is not minimized?
(1) Change encoding so that xi is encoded by its nearest center.
(2) Set each Center to the centroid of points it owns.
There’s no point applying either operation twice in succession. But it can be profitable to alternate.
…And that’s K-means!
             Easy to prove this procedure will terminate in a state at which neither (1) nor (2) changes the configuration. Why?
Copyright © 2001, 2004, Andrew W. Moore                K-means and Hierarchical Clustering: Slide 31




Improving a suboptimal configuration…

$\mathrm{Distortion} = \sum_{i=1}^{R} \left(\mathbf{x}_i - \mathbf{c}_{\mathrm{ENCODE}(\mathbf{x}_i)}\right)^2$

What can be changed about centers c1, c2, …, ck when distortion is not minimized?
(1) Change encoding so that xi is encoded by its nearest center.
(2) Set each Center to the centroid of points it owns.
There’s no point applying either operation twice in succession. But it can be profitable to alternate.
…And that’s K-means!
             Easy to prove this procedure will terminate in a state at which neither (1) nor (2) changes the configuration. Why?

There are only a finite number of ways of partitioning R records into k groups. So there are only a finite number of possible configurations in which all Centers are the centroids of the points they own. If the configuration changes on an iteration, it must have improved the distortion. So each time the configuration changes it must go to a configuration it’s never been to before. So if it tried to go on forever, it would eventually run out of configurations.
Copyright © 2001, 2004, Andrew W. Moore                K-means and Hierarchical Clustering: Slide 32




Will we find the optimal
                     configuration?
• Not necessarily.
• Can you invent a configuration that has
  converged, but does not have the minimum
  distortion?




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 33




                Will we find the optimal
                     configuration?
• Not necessarily.
• Can you invent a configuration that has
  converged, but does not have the minimum
  distortion? (Hint: try a fiendish k=3 configuration here…)




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 34




Will we find the optimal
                     configuration?
• Not necessarily.
• Can you invent a configuration that has
  converged, but does not have the minimum
  distortion? (Hint: try a fiendish k=3 configuration here…)




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 35




           Trying to find good optima
• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each
  from a different random start configuration
• Many other ideas floating around.




Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 36




Trying to find good optima

• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each from a different random start configuration
• Many other ideas floating around.

Neat trick:
      Place first center on top of a randomly chosen datapoint.
      Place second center on the datapoint that’s as far away as possible from the first center.
         :
      Place j’th center on the datapoint that’s as far away as possible from the closest of Centers 1 through j−1.
         :

Copyright © 2001, 2004, Andrew W. Moore            K-means and Hierarchical Clustering: Slide 37
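A sketch of the neat trick, often called farthest-first initialization; the function name and the squared-distance convention are assumptions for the sketch, not from the slides:

```python
import numpy as np

def farthest_first_centers(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Place first center on top of a randomly chosen datapoint.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # For every datapoint, squared distance to the closest center chosen so far...
        d = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # ...then place the next center on the datapoint farthest from all of them.
        centers.append(X[d.argmax()])
    return np.array(centers)
```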




Choosing the number of Centers

• A difficult problem
• Most common approach is to try to find the solution that minimizes the Schwarz Criterion (also related to the BIC):

$\mathrm{Distortion} + \lambda\,(\#\mathrm{parameters}) \log R = \mathrm{Distortion} + \lambda m k \log R$

where m = #dimensions, k = #Centers, R = #Records.
Copyright © 2001, 2004, Andrew W. Moore            K-means and Hierarchical Clustering: Slide 38
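A sketch of selecting k with this criterion, reusing the illustrative `kmeans` and `distortion` sketches from earlier. The weight `lam` is a user-chosen constant, and running k-means only once per candidate k is an assumed simplification (in practice you would combine this with the multiple-restart idea above):

```python
import numpy as np

def choose_k(X, k_max, lam=1.0):
    R, m = X.shape
    best = None
    for k in range(1, k_max + 1):
        centers, owner = kmeans(X, k)
        # Schwarz Criterion: Distortion + lambda * m * k * log R
        score = distortion(X, centers, owner) + lam * m * k * np.log(R)
        if best is None or score < best[1]:
            best = (k, score)
    return best[0]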




Common uses of K-means
• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
• Also used for choosing color palettes on old-fashioned graphical display devices!



Copyright © 2001, 2004, Andrew W. Moore    K-means and Hierarchical Clustering: Slide 39
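For instance, the one-dimensional quantization above is just K-means on a column vector; a usage sketch built on the illustrative `kmeans` function from earlier, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
values = np.concatenate([rng.normal(mu, 0.5, 200) for mu in (0.0, 3.0, 8.0)])
centers, bucket = kmeans(values.reshape(-1, 1), k=3)
# Each value is replaced by its bucket's center: k non-uniform buckets.
quantized = centers[bucket].ravel()
```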




           Single Linkage Hierarchical
                    Clustering
                                          1. Say “Every point is its
                                             own cluster”




Copyright © 2001, 2004, Andrew W. Moore    K-means and Hierarchical Clustering: Slide 40




Single Linkage Hierarchical
                    Clustering
                                          1. Say “Every point is its
                                             own cluster”
                                          2. Find “most similar” pair
                                             of clusters




Copyright © 2001, 2004, Andrew W. Moore    K-means and Hierarchical Clustering: Slide 41




           Single Linkage Hierarchical
                    Clustering
                                          1. Say “Every point is its
                                             own cluster”
                                          2. Find “most similar” pair
                                             of clusters
                                          3. Merge it into a parent
                                             cluster




Copyright © 2001, 2004, Andrew W. Moore    K-means and Hierarchical Clustering: Slide 42




Single Linkage Hierarchical
                    Clustering
                                          1. Say “Every point is its
                                             own cluster”
                                          2. Find “most similar” pair
                                             of clusters
                                          3. Merge it into a parent
                                             cluster
                                          4. Repeat




Copyright © 2001, 2004, Andrew W. Moore    K-means and Hierarchical Clustering: Slide 43




           Single Linkage Hierarchical
                    Clustering
                                          1. Say “Every point is its
                                             own cluster”
                                          2. Find “most similar” pair
                                             of clusters
                                          3. Merge it into a parent
                                             cluster
                                          4. Repeat




Copyright © 2001, 2004, Andrew W. Moore    K-means and Hierarchical Clustering: Slide 44




Single Linkage Hierarchical Clustering

1. Say “Every point is its own cluster”
2. Find “most similar” pair of clusters
3. Merge it into a parent cluster
4. Repeat…until you’ve merged the whole dataset into one cluster

How do we define similarity between clusters?
• Minimum distance between points in clusters (in which case we’re simply doing Euclidean Minimum Spanning Trees)
• Maximum distance between points in clusters
• Average distance between points in clusters

You’re left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here)

 Copyright © 2001, 2004, Andrew W. Moore            K-means and Hierarchical Clustering: Slide 45
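For reference, a sketch of this procedure using SciPy; the library is an assumed dependency, not something the slides prescribe:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(3).normal(size=(30, 2))
# 'single' linkage merges the pair of clusters with minimum point-to-point distance.
Z = linkage(X, method='single')
# Cut the hierarchy into k groups (here k=4), as discussed on the next slide.
labels = fcluster(Z, t=4, criterion='maxclust')
# scipy.cluster.hierarchy.dendrogram(Z) would draw the taxonomy of datapoints.
```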




Single Linkage Comments

Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym)

• It’s nice that you get a hierarchy instead of an amorphous collection of groups
• If you want k groups, just cut the (k−1) longest links
• There’s no real statistical or information-theoretic foundation to this. Makes your lecturer feel a bit queasy.

 Copyright © 2001, 2004, Andrew W. Moore            K-means and Hierarchical Clustering: Slide 46




What you should know
• All the details of K-means
• The theory behind K-means as an
  optimization algorithm
• How K-means can get stuck
• The outline of Hierarchical clustering
• Be able to contrast between which problems would be relatively well/poorly suited to K-means vs Gaussian Mixtures vs Hierarchical clustering
Copyright © 2001, 2004, Andrew W. Moore   K-means and Hierarchical Clustering: Slide 47




