CSE 634
Data Mining Concepts &
      Techniques
     Professor Anita Wasilewska
         Stony Brook University




    Cluster Analysis


   Harpreet Singh – 100891995
  Densel Santhmayor – 105229333
  Sudipto Mukherjee – 105303644
References

   Jiawei Han and Micheline Kamber. Data Mining: Concepts and
    Techniques (Chapter 8, Sections 1-4). Morgan Kaufmann, 2002
   Prof. Stanley L. Sclove, Statistics for Information Systems and
    Data Mining, University of Illinois at Chicago
    (http://www.uic.edu/classes/idsc/ids472/clustering.htm)
   G. David Garson, Quantitative Research in Public
    Administration, NC State University
    (http://www2.chass.ncsu.edu/garson/PA765/cluster.htm)
Overview

   What is Clustering/Cluster Analysis?

   Applications of Clustering

   Data Types and Distance Metrics

   Major Clustering Methods
What is Cluster Analysis?

   Cluster: Collection of data objects
        (Intraclass similarity) - Objects are similar to objects in same
         cluster
        (Interclass dissimilarity) - Objects are dissimilar to objects in other
         clusters

   Examples of clusters?
   Cluster Analysis – Statistical method to identify and group sets
    of similar objects into classes
        Good clustering methods produce high quality clusters with high
         intraclass similarity and interclass dissimilarity

   Unlike classification, it is unsupervised learning
What is Cluster Analysis?

   Fields of use
        Data Mining
        Pattern recognition
        Image analysis
        Bioinformatics
        Machine Learning
Overview

   What is Clustering/Cluster Analysis?

   Applications of Clustering

   Data Types and Distance Metrics

   Major Clustering Methods
Applications of Clustering

   Why is clustering useful?
        Can identify dense and sparse patterns, correlation among
         attributes and overall distribution patterns
         Identify outliers, which is useful for detecting anomalies

   Examples:
        Marketing Research: Help marketers to identify and classify
         groups of people based on spending patterns and therefore develop
         more focused campaigns
        Biology: Categorize genes with similar functionality, derive plant
         and animal taxonomies
Applications of Clustering

   More Examples:
       Image processing: Help in identifying borders or recognizing
        different objects in an image
       City Planning: Identify groups of houses and separate them into
        different clusters according to similar characteristics – type, size,
        geographical location
Overview

   What is Clustering/Cluster Analysis?
   Applications of Clustering
   Data Types and Distance Metrics
   Major Clustering Methods
Data Types and Distance Metrics

                                Data Structures
   Data Matrix (object-by-variable structure)
        n records, each with p attributes
        n-by-p matrix structure (two mode)
        x_{ab} – value of the bth attribute for the ath record

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\qquad \text{(rows = records, columns = attributes)}
$$
Data Types and Distance Metrics

                              Data Structures
   Dissimilarity Matrix (object-by-object structure)
        n-by-n table (one mode)
         d(i,j) is the measured difference or dissimilarity between records i
          and j

                        0                         
                        d(2,1)      0             
                                                  
                        d(3,1) d ( 3,2) 0         
                                                  
                           :        :     :       
                       d ( n,1) d ( n,2) ... ... 0
                                                  
Data Types and Distance Metrics

   Interval-Scaled Attributes
   Binary Attributes
   Nominal Attributes
   Ordinal Attributes
   Ratio-Scaled Attributes
   Attributes of Mixed Type
Data Types and Distance Metrics

                        Interval-Scaled Attributes
   Continuous measurements on a roughly linear scale

                                     Example

        Height Scale: ranges over the metre or foot scale. Heights need to
        be standardized, since different scales can be used to express the
        same absolute measurement.

        Weight Scale: ranges over the kilogram or pound scale
        (20kg, 40kg, 60kg, 80kg, 100kg, 120kg).
Data Types and Distance Metrics

                         Interval-Scaled Attributes
   Using Interval-Scaled Values
       Step 1: Standardize the data
            To ensure they all have equal weight
            To match up different scales into a uniform, single scale
            Not always needed! Sometimes we require unequal weights for an
             attribute
       Step 2: Compute dissimilarity between records
            Use Euclidean, Manhattan or Minkowski distance
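A minimal R sketch of both steps (the data values are illustrative; the mean-absolute-deviation form of standardization follows Han & Kamber):

    # Step 1: standardize each attribute using the mean absolute deviation,
    # which is less sensitive to outliers than the standard deviation.
    std <- function(x) (x - mean(x)) / mean(abs(x - mean(x)))

    X <- cbind(height_m  = c(1.60, 1.75, 1.82),   # illustrative records
               weight_kg = c(55, 70, 90))
    Z <- apply(X, 2, std)

    # Step 2: compute pairwise dissimilarities on the standardized data
    # (Euclidean here; see the Minkowski family on the next slide).
    dist(Z, method = "euclidean")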
Data Types and Distance Metrics

                       Interval-Scaled Attributes
   Minkowski distance

$$
d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}
$$
   Euclidean distance
        q=2

   Manhattan distance
        q=1

   What are the shapes of these clusters?
        Spherical in shape.
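As a sketch, the whole family can be computed with one function (function and variable names are our own; R's built-in dist() also supports these metrics):

    # Minkowski distance between records xi and xj;
    # q = 1 gives Manhattan, q = 2 gives Euclidean.
    minkowski <- function(xi, xj, q = 2) {
      sum(abs(xi - xj)^q)^(1/q)
    }

    xi <- c(0, 0); xj <- c(3, 4)
    minkowski(xi, xj, q = 1)   # Manhattan: 7
    minkowski(xi, xj, q = 2)   # Euclidean: 5
    # A weighted variant simply scales each term:
    # sum(w * abs(xi - xj)^q)^(1/q)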
Data Types and Distance Metrics

                         Interval-Scaled Attributes
   Properties of d(i,j)
        d(i,j) >= 0: Distance is non-negative. Why?
        d(i,i) = 0: Distance of an object to itself is 0. Why?
        d(i,j) = d(j,i): Symmetric. Why?
        d(i,j) <= d(i,h) + d(h,j): Triangle Inequality rule

   Weighted distance calculation also simple to compute
Data Types and Distance Metrics

                             Binary Attributes
   Has only two states – 0 or 1
   Compute dissimilarity between records (equal weight)
         Contingency Table:
                                     Object j
                                      1     0
                     Object i    1    a     b
                                 0    c     d

        Symmetric Values: A binary attribute is symmetric if the outcomes
         are both equally important
        Asymmetric Values: A binary attribute is asymmetric if the
         outcomes of the states are not equally important
Data Types and Distance Metrics

                         Binary Attributes
      Simple matching coefficient (Symmetric)

$$
d(i,j) = \frac{b+c}{a+b+c+d}
$$

      Jaccard coefficient (Asymmetric)

$$
d(i,j) = \frac{b+c}{a+b+c}
$$
Data Types and Distance Metrics

   Ex:

    Name       Gender    Fever   Cough     Test-1          Test-2           Test-3          Test-4
    Jack       M         Y       N         P               N                N               N
    Mary       F         Y       N         P               N                P               N
    Jim        M         Y       P         N               N                N               N


         Gender attribute is symmetric
          All others are asymmetric. If Y and P are 1 and N is 0, then

$$
d(\text{jack},\text{mary}) = \frac{0+1}{2+0+1} = 0.33 \qquad
d(\text{jack},\text{jim}) = \frac{1+1}{1+1+1} = 0.67 \qquad
d(\text{jim},\text{mary}) = \frac{1+2}{1+1+2} = 0.75
$$

                 Source: Cluster Analysis by Arthy Krishnamurthy & Jing Tun, Spring 2005
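A small R sketch that reproduces the d(jack, mary) value above (the 0/1 encoding of the six asymmetric attributes follows the table):

    # Count the contingency cells for two binary records, then apply the
    # simple matching (symmetric) or Jaccard (asymmetric) coefficient.
    binary_dissim <- function(i, j, symmetric = TRUE) {
      a <- sum(i == 1 & j == 1)   # both 1
      b <- sum(i == 1 & j == 0)
      c <- sum(i == 0 & j == 1)
      d <- sum(i == 0 & j == 0)   # both 0 (ignored by Jaccard)
      if (symmetric) (b + c) / (a + b + c + d)
      else           (b + c) / (a + b + c)
    }

    # Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P = 1, N = 0:
    jack <- c(1, 0, 1, 0, 0, 0)
    mary <- c(1, 0, 1, 0, 1, 0)
    binary_dissim(jack, mary, symmetric = FALSE)   # (0+1)/(2+0+1) = 0.33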
Data Types and Distance Metrics

                                Nominal Attributes
   Extension of a binary attribute – can have more than two
    states
   Ex: figure_colour is an attribute which has, say, 4 values:
    yellow, green, red and blue
   Let number of values be M
   Compute dissimilarity between two records i and j
        d(i,j) = (p – m) / p
        m -> number of attributes for which i and j have the same value
        p -> total number of attributes
Nominal Attributes
   Can be encoded by using asymmetric binary attributes for
    each of the M values
   For a record with a given value, the binary attribute value
    representing that value is set to 1, while the remaining binary
    values are set to 0
   Ex:
                      Yellow   Green   Red   Blue
           Record 1   0        0       1     0
           Record 2   0        1       0     0
           Record 3   1        0       0     0
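A hedged R sketch of both ideas (the record values are illustrative):

    # Dissimilarity for nominal attributes: fraction of mismatches,
    # d(i,j) = (p - m) / p.
    nominal_dissim <- function(i, j) {
      p <- length(i)
      m <- sum(i == j)   # number of attributes on which i and j agree
      (p - m) / p
    }
    nominal_dissim(c("Red", "square"), c("Green", "square"))   # (2-1)/2 = 0.5

    # One-hot encoding into asymmetric binary attributes, as in the table:
    levs <- c("Yellow", "Green", "Red", "Blue")
    vals <- c("Red", "Green", "Yellow")        # records 1..3
    onehot <- 1 * outer(vals, levs, "==")
    colnames(onehot) <- levs
    onehot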
Data Types and Distance Metrics

                            Ordinal Attributes
   Discrete Ordinal Attributes
        Nominal attributes with values arranged in a meaningful manner

   Continuous Ordinal Attributes
        Continuous data on unknown scale. Ex: the order of ranking in a
         sport (gold, silver, bronze) is more essential than their values
        Relative ranking

   Used to record subjective assessment of certain characteristics
    which cannot be measured objectively
Data Types and Distance Metrics

                                Ordinal Attributes
   Compute dissimilarity between records
       Step 1: Replace each value by its corresponding rank
            Ex: Gold, Silver, Bronze with 1, 2, 3
       Step 2: Map the range of each variable onto [0.0,1.0]
             If the rank of the ith object in the fth ordinal variable is $r_{if}$,
              replace the rank with $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, where $M_f$ is the
              total number of states of the ordinal variable f
       Step 3: Use distance methods for interval-scaled attributes to
        compute the dissimilarity between objects
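The three steps in R (the medal data are illustrative):

    # Step 1: replace ordinal values by their ranks.
    medals <- factor(c("Gold", "Silver", "Silver", "Bronze"),
                     levels = c("Gold", "Silver", "Bronze"), ordered = TRUE)
    r <- as.integer(medals)        # ranks 1, 2, 3
    # Step 2: map onto [0, 1] via z_if = (r_if - 1) / (M_f - 1).
    M <- nlevels(medals)
    z <- (r - 1) / (M - 1)         # 0.0, 0.5, 0.5, 1.0
    # Step 3: treat z as interval-scaled.
    dist(z)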
Data Types and Distance Metrics

                         Ratio-Scaled Attributes
   Makes a positive measurement on a non-linear scale
   Compute dissimilarity between records
       Treat them like interval-scaled attributes. Not a good choice since
        scale might be distorted
       Apply logarithmic transformation and then use interval-scaled
        methods.
       Treat the values as continuous ordinal data and their ranks as
        interval-based
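A one-line illustration of the logarithmic option (values illustrative):

    growth <- c(2, 20, 200, 2000)   # ratio-scaled, e.g. exponential growth
    dist(log10(growth))             # transformed values are equally spaced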
Data Types and Distance Metrics

                        Attributes of mixed types
   Real databases usually contain a number of different types of
    attributes
   Compute dissimilarity between records
        Method 1: Group each type of attribute together and then perform
         separate cluster analysis on each type. Doesn’t generate
         compatible results
        Method 2: Process all types of attributes by using a weighted
         formula to combine all their effects.
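For reference, the weighted formula from the cited textbook (Han & Kamber) combines the per-attribute dissimilarities $d_{ij}^{(f)}$, each computed by the type-appropriate method above, as

$$
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$

where the indicator $\delta_{ij}^{(f)}$ is 0 when attribute f is missing for i or j (or when f is asymmetric binary and both values are 0), and 1 otherwise.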
Overview

   What is Clustering/Cluster Analysis?
   Applications of Clustering
   Data Types and Distance Metrics
   Major Clustering Methods
Clustering Methods

   Partitioning methods
   Hierarchical methods
   Density-based methods
   Grid-based methods
   Model-based methods


   Choice of algorithm depends on type of data available and the
    nature and purpose of the application
Clustering Methods

   Partitioning methods
       Divide the objects into a set of partitions based on some criteria
       Improve the partitions by shifting objects between them for higher
        intraclass similarity, interclass dissimilarity and other such
        criteria
       Two popular heuristic methods
            k-means algorithm
            k-medoids algorithm
Clustering Methods

   Hierarchical methods
       Build up or break down groups of objects in a recursive manner
       Two main approaches
            Agglomerative approach
            Divisive approach




                                  © Wikipedia
Clustering Methods

   Density-based methods
       Grow a given cluster until the density decreases below a certain
        threshold

   Grid-based methods
       Form a grid structure by quantizing the object space into a finite
        number of grid cells

   Model-based methods
       Hypothesize a model and find the best fit of the data to the chosen
        model
Constrained K-means Clustering with
      Background Knowledge
        K. Wagstaff, C. Cardie, S. Rogers, & S. Schroedl




                 Proceedings of the 18th
    International Conference on Machine Learning,
                  2001, pp. 577-584.
       Morgan Kaufmann, San Francisco, CA.
Introduction

   Clustering is an unsupervised method of data analysis
   Data instances grouped according to some notion of similarity
        Multi-attribute based distance function
        Access to only the set of features describing each object
         No information as to where each instance should be placed within
          the partition

    However, there might be background knowledge about the
     domain or data set that could be useful to the algorithm
   In this paper the authors try to integrate this background
    knowledge into clustering algorithms.
K-Means Clustering Algorithm

   K-Means algorithm is a type of partitioning method
   Group instances based on attributes into k groups
        High intra-cluster similarity; Low inter-cluster similarity
         Cluster similarity is measured with respect to the mean value of
          the objects in the cluster.
   How does K-means work ?
        First, select K random instances from the data – initial cluster centers
        Second, each instance is assigned to its closest (most similar) cluster center
        Third, each cluster center is updated to the mean of its constituent
         instances
        Repeat steps two and three till there is no further change in assignment of
         instances to clusters

   How is K selected ?
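A minimal R sketch of the four steps above (stats::kmeans is the production implementation; this version does not guard against empty clusters):

    kmeans_sketch <- function(X, k, iter_max = 100) {
      centers <- X[sample(nrow(X), k), , drop = FALSE]   # Step 1: K random instances
      cl <- integer(nrow(X))
      for (it in seq_len(iter_max)) {
        # distances from every instance to every current center
        D <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k, drop = FALSE]
        new_cl <- max.col(-D)                            # Step 2: closest center
        if (identical(new_cl, cl)) break                 # assignments stable: stop
        cl <- new_cl
        for (c in seq_len(k))                            # Step 3: mean of members
          centers[c, ] <- colMeans(X[cl == c, , drop = FALSE])
      }
      list(cluster = cl, centers = centers)
    }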
K-Means Clustering Algorithm
Constrained K-Means Clustering

   Instance level constraints to express a priori knowledge about
    the instances which should or should not be grouped together
   Two pair-wise constraints
        Must-link: constraints which specify that two instances have to be
         in the same cluster
        Cannot-link: constraints which specify that two instances must
         not be placed in the same cluster
         When using a set of constraints, we have to take the transitive
          closure of the must-link constraints

   Constraints may be derived from
        Partially labeled data
        Background knowledge about the domain or data set
Constrained Algorithm

   First, select K random instances from the data – initial cluster centers
   Second, each instance is assigned to its closest (most similar) cluster
    center such that VIOLATE-CONSTRAINT(I, K, M, C) is false. If no
     such cluster exists, fail
   Third, each cluster center is updated to the mean of its constituent
    instances
   Repeat steps two and three till there is no further change in
    assignment of instances to clusters
   VIOLATE-CONSTRAINT(instance I, cluster K, must-link constraints M,
    cannot-link constraints C)
        For each (i, i=) in M: if i= is not in K, return true.
        For each (i, i≠) in C : if i≠ is in K, return true
        Otherwise return false
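A sketch of the constraint check in R (the names are our own; `cl` holds current cluster ids with NA for unassigned instances, and M, C are two-column matrices of instance-index pairs):

    violates_constraints <- function(i, k, cl, M, C) {
      for (r in seq_len(nrow(M))) {               # must-links involving i:
        if (M[r, 1] == i || M[r, 2] == i) {       # the partner must also be in k
          o <- if (M[r, 1] == i) M[r, 2] else M[r, 1]
          if (!is.na(cl[o]) && cl[o] != k) return(TRUE)
        }
      }
      for (r in seq_len(nrow(C))) {               # cannot-links involving i:
        if (C[r, 1] == i || C[r, 2] == i) {       # the partner must not be in k
          o <- if (C[r, 1] == i) C[r, 2] else C[r, 1]
          if (!is.na(cl[o]) && cl[o] == k) return(TRUE)
        }
      }
      FALSE
    }

In the assignment step, clusters are tried nearest-first and the instance takes the first one for which this check returns FALSE; if every cluster violates a constraint, the algorithm fails.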
Experimental Results on
GPS Lane Finding
   Large database of digital road maps available
        These maps contain only coarse information about the location of
         the road
        By refining maps down to the lane level we can enable a host of
         more sophisticated applications such as lane departure detection

   Collect data about the location of cars as they drive along a
    given road
         Collect data once per second from several drivers using GPS
          receivers affixed to the top of their vehicles
        Each data instance has two features: 1. Distance along the road
         segment and 2. Perpendicular offset from the road centerline
        For evaluation purposes drivers were asked to indicate which lane
         they occupied and any lane changes
GPS Lane Finding

   Cluster data to automatically determine where the individual
    lanes are located
        Based on the observation that drivers tend to drive within lane
         boundaries.
        Domain specific heuristics for generating constraints.
             Trace contiguity means that, in the absence of lane changes, all of the
              points generated from the same vehicle in a single pass over a road
              segment should end up in the same lane.
             Maximum separation refers to a limit on how far apart two points can
              be (perpendicular to the centerline) while still being in the same lane. If
              two points are separated by at least four meters, then we generate a
              constraint that will prevent those two points from being placed in the
              same cluster.


    To better suit the domain, the cluster center representation had to be
     changed.
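Hedged sketches of the two heuristics (function and variable names are our own; `trace` is a per-point vehicle/pass id and `offset` the perpendicular offset in metres):

    # Trace contiguity: consecutive points of the same pass are must-linked.
    trace_contiguity_ml <- function(trace) {
      i <- seq_len(length(trace) - 1)
      cbind(i, i + 1)[trace[i] == trace[i + 1], , drop = FALSE]
    }

    # Maximum separation: points at least 4 m apart (perpendicular to the
    # centerline) receive a cannot-link constraint.
    max_separation_cl <- function(offset, limit = 4) {
      p <- which(abs(outer(offset, offset, "-")) >= limit, arr.ind = TRUE)
      p[p[, 1] < p[, 2], , drop = FALSE]   # keep each pair once
    }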
Performance
      Segment (size)   K-means   COP-Kmeans   Constraints Alone
      1 (699)          49.8      100          36.8
      2 (116)          47.2      100          31.5
      3 (521)          56.5      100          44.2
      4 (526)          49.4      100          47.1
      5 (426)          50.2      100          29.6
      6 (502)          75.0      100          56.3
      7 (623)          73.5      100          57.8
      8 (149)          74.7      100          53.6
      9 (496)          58.6      100          46.8
      10 (634)         50.2      100          63.4
      11 (1160)        56.5      100          72.3
      12 (427)         48.8      96.6         59.2
      13 (587)         69.0      100          51.5
      14 (678)         65.9      100          59.9
      15 (400)         58.8      100          39.7
      16 (115)         64.0      76.6         52.4
      17 (383)         60.8      98.9         51.4
      18 (786)         50.2      100          73.7
      19 (880)         50.4      100          42.1
      20 (570)         50.1      100          38.3
      Average          58.0      98.6         50.4
Conclusion

   Measurable improvement in accuracy
   The use of constraints while clustering means that, unlike the
    regular k-means algorithm, the assignment of instances to
    clusters can be order-sensitive.
        If a poor decision is made early on, the algorithm may later
         encounter an instance i that has no possible valid cluster
        Ideally, the algorithm would be able to backtrack, rearranging
         some of the instances so that i could then be validly assigned to a
         cluster.

   Could be extended to hierarchical algorithms
CSE 634
Data Mining Concepts &
      Techniques
     Professor Anita Wasilewska
         Stony Brook University




Ligand Pose Clustering
Abstract

   Detailed atomic-level structural and energetic information from
    computer calculations is important for understanding how
    compounds interact with a given target and for the discovery
    and design of new drugs. Computational high-throughput
    screening (docking) provides an efficient and practical means
    with which to screen candidate compounds prior to
    experiment. Current scoring functions for docking use
    traditional Molecular Mechanics (MM) terms (Van der Waals
    and Electrostatics).

   To develop and test new scoring functions that include ligand
    desolvation (MM-GBSA), we are building a docking test set
    focused on medicinal chemistry targets. Docked complexes are
    rescored on the receptor coordinates, clustered into diverse
    binding poses and the top five representative poses are
    reported for analysis. Known receptor-ligand complexes are
    retrieved from the protein databank and are used to identify
    novel receptor-ligand complexes of potential drug leads.
References
   Kuntz, I. D. (1992). "Structure-based strategies for drug design and
    discovery." Science 257(5073): 1078-1082.

   Nissink, J. W. M., C. Murray, et al. (2002). "A new test set for
    validating predictions of protein-ligand interaction." Proteins-Structure
    Function and Genetics 49(4): 457-471.

   Mozziconacci, J. C., E. Arnoult, et al. (2005). "Optimization and
    validation of a docking-scoring protocol; Application to virtual
    screening for COX-2 inhibitors." Journal of Medicinal Chemistry 48(4):
    1055-1068.

   Mohan, V., A. C. Gibbs, et al. (2005). "Docking: Successes and
    challenges." Current Pharmaceutical Design 11(3): 323-333.

   Hu, L. G., M. L. Benson, et al. (2005). "Binding MOAD (Mother of All
    Databases)." Proteins-Structure Function and Bioinformatics 60(3):
    333-340.
Docking

   Computational search for the most energetically favorable
    binding pose of a ligand with a receptor.
        Ligand    → small organic molecules
        Receptor → proteins, nucleic acids




     [Images: Receptor (Trypsin), Ligand (Benzamidine), and the docked Complex]
Receptor - Ligand Complex Crystal Structure

[Workflow diagram: the crystal-structure ligand and receptor are prepared
separately. Ligand: add hydrogens -> processed ligand; Gaussian ab initio
charges -> mol2 ligand. Receptor surface: dms molecular surface -> sphgen
docking spheres -> keep max 75 within 8 A -> active site spheres. Receptor:
inspection, Leap, Sander (mbondi radii, disulfide bonds) -> mol2 receptor ->
6-12 LJ GRID -> receptor grid. These feed into DOCK, which produces the
docked Receptor - Ligand Complex.]
Improved Scoring Function (MM-GBSA)




                    R = receptor, L = ligand, RL = receptor-ligand complex

           - MM (molecular mechanics: VDW + Coul)
           - GB (Generalized Born)
           - SA (Solvent Accessible Surface Area)
 *Srinivasan, J. ; et al. J. Am. Chem. Soc. 1998, 120, 9401-9409
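The scoring equation itself appears only as an image on the original slide; the standard MM-GBSA binding estimate it refers to (per the Srinivasan et al. citation) has the form

$$
\Delta G_{\text{bind}} = G_{RL} - G_{R} - G_{L}, \qquad G = E_{MM} + G_{GB} + G_{SA}
$$

with E_MM the molecular-mechanics energy (VDW + Coulomb), G_GB the Generalized Born polar solvation term, and G_SA the nonpolar term from the solvent-accessible surface area.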
Clustering Methods used

   Initially, we clustered on a single dimension, i.e. RMSD: all ligand
    poses within 2 A RMSD of each other were retained.
   Better results were obtained with agglomerative clustering, using the
    R statistical package.

[Plots: 1BCD (Carbonic Anh II/FMS), GBSA Energy (kcal/mol) vs. RMSD (A);
 left panel: RMSD clustering, right panel: agglomerative clustering]
Agglomerative Clustering

   Agglomerative Clustering, each object is initially placed into its
    own group. A threshold distance is selected.


   Compare all pairs of groups and mark the pair that is closest.


   The distance between this closest pair of groups is compared
    to the threshold value.
        If (distance between this closest pair <= threshold distance) then
         merge groups. Repeat.
        Else If (distance between the closest pair > threshold)
         then (clustering is done)
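A minimal sketch with the R statistical package mentioned earlier (the RMSD matrix and the 2 A threshold are illustrative):

    # Pairwise RMSD between three ligand poses (illustrative values).
    pose_rmsd <- as.dist(matrix(c(0.0, 1.2, 3.5,
                                  1.2, 0.0, 3.1,
                                  3.5, 3.1, 0.0), nrow = 3))
    hc <- hclust(pose_rmsd, method = "average")  # repeatedly merge closest groups
    cutree(hc, h = 2.0)   # stop merging where distance exceeds the threshold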
R Project for Statistical
Computing
   R is a free software environment for statistical computing and
    graphics.
   Available at http://www.r-project.org/
   Developed by Statistics Department, University of Auckland
   R 2.2.1 is used in my research
      plotacpclust =
        function(data, xax = 1, yax = 2, hcut, cor = TRUE, clustermethod = "ave",
                 colbacktitle = "#e8c9c1", wcos = 3, Rpowered = FALSE, ...)
      {
        # data: data.frame to analyze
        # xax, yax: factors to select for graphs
        # hcut, clustermethod: parameters for hclust
        require(ade4)
        pcr = princomp(data, cor = cor)
        datac = t((t(data) - pcr$center) / pcr$scale)
        hc = hclust(dist(data), method = clustermethod)
        if (missing(hcut)) hcut = quantile(hc$height, c(0.97))
        def.par <- par(no.readonly = TRUE)
        on.exit(par(def.par))
        mylayout = layout(matrix(c(1,2,3,4,5,1,2,3,4,6,7,7,7,8,9,7,7,7,10,11), ncol = 4),
                          widths = c(4/18, 2/18, 6/18, 6/18),
                          heights = c(lcm(1), 3/6, 1/6, lcm(1), 1/3))
        par(mar = c(0.1, 0.1, 0.1, 0.1))
        par(oma = rep(1, 4))
        ltitle(paste("PCA ", dim(unclass(pcr$loadings))[2], "vars"), cex = 1.6, ypos = 0.7)
        text(x = 0, y = 0.2, pos = 4, cex = 1, labels = deparse(pcr$call), col = "black")
        pcl = unclass(pcr$loadings)
        pclperc = 100 * (pcr$sdev) / sum(pcr$sdev)
        s.corcircle(pcl[, c(xax, yax)], 1, 2,
                    sub = paste("(", xax, "-", yax, ") ",
                                round(sum(pclperc[c(xax, yax)]), 0), "%", sep = ""),
                    possub = "bottomright", csub = 3, clabel = 2)
        wsel = c(xax, yax)
        scatterutil.eigen(pcr$sdev, wsel = wsel, sub = "")
        # (the excerpt on the slide ends here; the remaining panels of the
        # layout are drawn in the omitted remainder of the function)
      }
Clustered Poses




     Peptide ligand bound to GP-41 receptor
RMSD vs. Energy Score Plots

[Plot: 1YDA (Sulfonamide bound to Human Carbonic Anhydrase II):
 GBSA Energy (kcal/mol) vs. RMSD (A)]
RMSD vs. Energy Score Plots

[Plot: 1YDA: DDD energy (kcal/mol) vs. RMSD (A)]
RMSD vs. Energy Score Plots

[Plot: 1BCD (Carbonic Anh II/FMS): GBSA Energy (kcal/mol) vs. RMSD (A)]
RMSD vs. Energy Score Plots

[Plot: 1BCD (Carbonic Anh II/FMS): DDD Energy (kcal/mol) vs. RMSD (A)]
RMSD vs. Energy Score Plots

[Plot: 1EHL: GBSA Energy (kcal/mol) vs. RMSD (A)]
RMSD vs. Energy Score Plots

[Plot: 1DWB: GBSA (kcal/mol) vs. RMSD (A)]
RMSD vs. Energy Score Plots

[Plot: 1ABE: GBSA Energy (kcal/mol) vs. RMSD (A)]
1ABE Clustered Poses
RMSD vs. Energy Score Plots

[Plot: 1EHL: GBSA Score (kcal/mol) vs. RMSD (A)]
Peramivir clustered poses
Peptide mimetic inhibitor HIV-1
Protease

Weitere ähnliche Inhalte

Andere mochten auch

A survey on ant colony clustering papers
A survey on ant colony clustering papersA survey on ant colony clustering papers
A survey on ant colony clustering papersZahra Sadeghi
 
Clustering training
Clustering trainingClustering training
Clustering trainingGabor Veress
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methodsKrish_ver2
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithmhadifar
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based ClusteringSSA KPI
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysissaba khan
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDataminingTools Inc
 
Optics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structureOptics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structureRajesh Piryani
 

Andere mochten auch (14)

A survey on ant colony clustering papers
A survey on ant colony clustering papersA survey on ant colony clustering papers
A survey on ant colony clustering papers
 
Clustering training
Clustering trainingClustering training
Clustering training
 
Clusteryanam
ClusteryanamClusteryanam
Clusteryanam
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Clustering
ClusteringClustering
Clustering
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
Optics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structureOptics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structure
 

Ähnlich wie Clustering

4_22865_IS465_2019_1__2_1_02Data-2.ppt
4_22865_IS465_2019_1__2_1_02Data-2.ppt4_22865_IS465_2019_1__2_1_02Data-2.ppt
4_22865_IS465_2019_1__2_1_02Data-2.pptPaoloOchengco
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster AnalysisSSA KPI
 
3.1 clustering
3.1 clustering3.1 clustering
3.1 clusteringKrish_ver2
 
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kambererror007
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxavinashBajpayee1
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysisAcad
 
similarities-knn-1.ppt
similarities-knn-1.pptsimilarities-knn-1.ppt
similarities-knn-1.pptsatvikpatil5
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionJordan McBain
 
Visual Odomtery(2)
Visual Odomtery(2)Visual Odomtery(2)
Visual Odomtery(2)Ian Sa
 
Machine Learning ebook.pdf
Machine Learning ebook.pdfMachine Learning ebook.pdf
Machine Learning ebook.pdfHODIT12
 

Ähnlich wie Clustering (20)

Dbm630 lecture09
Dbm630 lecture09Dbm630 lecture09
Dbm630 lecture09
 
4_22865_IS465_2019_1__2_1_02Data-2.ppt
4_22865_IS465_2019_1__2_1_02Data-2.ppt4_22865_IS465_2019_1__2_1_02Data-2.ppt
4_22865_IS465_2019_1__2_1_02Data-2.ppt
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
Cluster
ClusterCluster
Cluster
 
3.1 clustering
3.1 clustering3.1 clustering
3.1 clustering
 
Cluster analysis (2)
Cluster analysis (2)Cluster analysis (2)
Cluster analysis (2)
 
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
09Evaluation_Clustering.pdf
09Evaluation_Clustering.pdf09Evaluation_Clustering.pdf
09Evaluation_Clustering.pdf
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptx
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
TunUp final presentation
TunUp final presentationTunUp final presentation
TunUp final presentation
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
PR07.pdf
PR07.pdfPR07.pdf
PR07.pdf
 
similarities-knn-1.ppt
similarities-knn-1.pptsimilarities-knn-1.ppt
similarities-knn-1.ppt
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty Detection
 
Lect4
Lect4Lect4
Lect4
 
Visual Odomtery(2)
Visual Odomtery(2)Visual Odomtery(2)
Visual Odomtery(2)
 
[PPT]
[PPT][PPT]
[PPT]
 
Machine Learning ebook.pdf
Machine Learning ebook.pdfMachine Learning ebook.pdf
Machine Learning ebook.pdf
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A Survey
 

Kürzlich hochgeladen

BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxSayali Powar
 
Satirical Depths - A Study of Gabriel Okara's Poem - 'You Laughed and Laughed...
Satirical Depths - A Study of Gabriel Okara's Poem - 'You Laughed and Laughed...Satirical Depths - A Study of Gabriel Okara's Poem - 'You Laughed and Laughed...
Satirical Depths - A Study of Gabriel Okara's Poem - 'You Laughed and Laughed...HetalPathak10
 
How to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command LineHow to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command LineCeline George
 
Objectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptxObjectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptxMadhavi Dharankar
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseCeline George
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxAnupam32727
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfChristalin Nelson
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroomSamsung Business USA
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...Nguyen Thanh Tu Collection
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 

Kürzlich hochgeladen (20)

BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Satirical Depths - A Study of Gabriel Okara's Poem - 'You Laughed and Laughed...
Satirical Depths - A Study of Gabriel Okara's Poem - 'You Laughed and Laughed...Satirical Depths - A Study of Gabriel Okara's Poem - 'You Laughed and Laughed...
Satirical Depths - A Study of Gabriel Okara's Poem - 'You Laughed and Laughed...
 
How to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command LineHow to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command Line
 
Objectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptxObjectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptx
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 Database
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
 
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Chi-Square Test Non Parametric Test Categorical Variable
Chi-Square Test Non Parametric Test Categorical VariableChi-Square Test Non Parametric Test Categorical Variable
Chi-Square Test Non Parametric Test Categorical Variable
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdf
 
Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 

Clustering

  • 1. CSE 634 Data Mining Concepts & Techniques Professor Anita Wasilewska Stony Brook University Cluster Analysis Harpreet Singh – 100891995 Densel Santhmayor – 105229333 Sudipto Mukherjee – 105303644
  • 2. References  Jiawei Han and Michelle Kamber. Data Mining Concept and Techniques (Chapter 8, Sections 1- 4). Morgan Kaufman, 2002  Prof. Stanley L. Sclove, Statistics for Information Systems and Data Mining, Univerity of Illinois at Chicago (http://www.uic.edu/classes/idsc/ids472/clustering.htm)  G. David Garson, Quantitative Research in Public Administration, NC State University (http://www2.chass.ncsu.edu/garson/PA765/cluster.htm)
  • 3. Overview  What is Clustering/Cluster Analysis?  Applications of Clustering  Data Types and Distance Metrics  Major Clustering Methods
  • 4. What is Cluster Analysis?  Cluster: Collection of data objects  (Intraclass similarity) - Objects are similar to objects in same cluster  (Interclass dissimilarity) - Objects are dissimilar to objects in other clusters  Examples of clusters?  Cluster Analysis – Statistical method to identify and group sets of similar objects into classes  Good clustering methods produce high quality clusters with high intraclass similarity and interclass dissimilarity  Unlike classification, it is unsupervised learning
  • 5. What is Cluster Analysis?  Fields of use  Data Mining  Pattern recognition  Image analysis  Bioinformatics  Machine Learning
  • 6. Overview  What is Clustering/Cluster Analysis?  Applications of Clustering  Data Types and Distance Metrics  Major Clustering Methods
  • 7. Applications of Clustering  Why is clustering useful?  Can identify dense and sparse patterns, correlation among attributes and overall distribution patterns  Identify outliers and thus useful to detect anomalies  Examples:  Marketing Research: Help marketers to identify and classify groups of people based on spending patterns and therefore develop more focused campaigns  Biology: Categorize genes with similar functionality, derive plant and animal taxonomies
  • 8. Applications of Clustering  More Examples:  Image processing: Help in identifying borders or recognizing different objects in an image  City Planning: Identify groups of houses and separate them into different clusters according to similar characteristics – type, size, geographical location
  • 9. Overview  What is Clustering/Cluster Analysis?  Applications of Clustering  Data Types and Distance Metrics  Major Clustering Methods
  • 10. Data Types and Distance Metrics Data Structures  Data Matrix (object-by-variable structure)  n records, each with p attributes  n-by-p matrix structure (two mode)  xab – value for ath record and bth attribute Attributes record 1  x ... x ... x   11 1f 1p   ... ... ... ... ...   ... x  record i  xi1 ... x if ip   ... ... ... ... ...    x ... x ... x  record n  n1 nf np 
  • 11. Data Types and Distance Metrics Data Structures  Dissimilarity Matrix (object-by-object structure)  n-by-n table (one mode)  d(i,j) is the measured difference or dissimilarity between record i and j  0   d(2,1) 0     d(3,1) d ( 3,2) 0     : : :  d ( n,1) d ( n,2) ... ... 0  
  • 12. Data Types and Distance Metrics  Interval-Scaled Attributes  Binary Attributes  Nominal Attributes  Ordinal Attributes  Ratio-Scaled Attributes  Attributes of Mixed Type
  • 13. Data Types and Distance Metrics Interval-Scaled Attributes  Continuous measurements on a roughly linear scale Example Height Scale Weight Scale 1. Scale ranges over the 40kg 80kg 120kg metre or foot scale 20kg 60kg 100kg 2. Need to standardize 1. Scale ranges over the heights as different scale kilogram or pound scale can be used to express same absolute measurement
  • 14. Data Types and Distance Metrics Interval-Scaled Attributes  Using Interval-Scaled Values  Step 1: Standardize the data  To ensure they all have equal weight  To match up different scales into a uniform, single scale  Not always needed! Sometimes we require unequal weights for an attribute  Step 2: Compute dissimilarity between records  Use Euclidean, Manhattan or Minkowski distance
  • 15. Data Types and Distance Metrics Interval-Scaled Attributes  Minkowski distance d (i, j) = q (| x − x |q + | x − x |q +...+ | x − x | q ) i1 j1 i2 j2 ip jp  Euclidean distance  q=2  Manhattan distance  q=1  What are the shapes of these clusters?  Spherical in shape.
  • 16. Data Types and Distance Metrics Interval-Scaled Attributes  Properties of d(i,j)  d(i,j) >= 0: Distance is non-negative. Why?  d(i,i) = 0: Distance of an object to itself is 0. Why?  d(i,j) = d(j,i): Symmetric. Why?  d(i,j) <= d(i,h) + d(h,j): Triangle Inequality rule  Weighted distance calculation also simple to compute
  • 17. Data Types and Distance Metrics Binary Attributes  Has only two states – 0 or 1  Compute dissimilarity between records (equal weightage)  Contingency Table Object j 1 0 1 a b Object i 0 c d  Symmetric Values: A binary attribute is symmetric if the outcomes are both equally important  Asymmetric Values: A binary attribute is asymmetric if the outcomes of the states are not equally important
  • 18. Data Types and Distance Metrics Binary Attributes  Simple matching coefficient (Symmetric) b+c d (i, j ) = a +b+c+d  Jaccard coefficient (Asymmetric) b+c d (i, j ) = a +b+c
  • 19. Data Types and Distance Metrics  Ex: Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4 Jack M Y N P N N N Mary F Y N P N P N Jim M Y P N N N N  Gender attribute is symmetric  All others aren’t. If Y and P are 1 and N is 0, then 0+ 1 d ( jack , mary ) = =0.33 2 +0 +1 1+ 1 d ( jack , jim ) = =0.67 1+ +1 1 1 +2 d ( jim , mary ) = =0.75 1 + +2 1 Cluster Analysis By: Arthy Krishnamurthy & Jing Tun, Spring 2005
  • 20. Data Types and Distance Metrics Nominal Attributes  Extension of a binary attribute – can have more than two states  Ex: figure_colour is a attribute which has, say, 4 values: yellow, green, red and blue  Let number of values be M  Compute dissimilarity between two records i and j  d(i,j) = (p – m) / p  m -> number of attributes for which i and j have the same value  p -> total number of attributes
  • 21. Nominal Attributes  Can be encoded by using asymmetric binary attributes for each of the M values  For a record with a given value, the binary attribute value representing that value is set to 1, while the remaining binary values are set to 0  Ex: Yellow Green Red Blue Record 1 0 0 1 0 Object 1 Object 2 Record 2 0 1 0 0 Record 3 1 0 0 0 Object 3
  • 22. Data Types and Distance Metrics Ordinal Attributes  Discrete Ordinal Attributes  Nominal attributes with values arranged in a meaningful manner  Continuous Ordinal Attributes  Continuous data on unknown scale. Ex: the order of ranking in a sport (gold, silver, bronze) is more essential than their values  Relative ranking  Used to record subjective assessment of certain characteristics which cannot be measured objectively
  • 23. Data Types and Distance Metrics Ordinal Attributes  Compute dissimilarity between records  Step 1: Replace each value by its corresponding rank  Ex: Gold, Silver, Bronze with 1, 2, 3  Step 2: Map the range of each variable onto [0.0,1.0]  If the rank of the ith object in the fth ordinal variable is rif, then replace the rank with zif = (rif – 1) / (Mf – 1) where Mf is the total number of states of the ordinal variable f  Step 3: Use distance methods for interval-scaled attributes to compute the dissimilarity between objects
  • 24. Data Types and Distance Metrics Ratio-Scaled Attributes  Makes a positive measurement on a non-linear scale  Compute dissimilarity between records  Treat them like interval-scaled attributes. Not a good choice since scale might be distorted  Apply logarithmic transformation and then use interval-scaled methods.  Treat the values as continuous ordinal data and their ranks as interval-based
  • 25. Data Types and Distance Metrics Attributes of mixed types  Real databases usually contain a number of different types of attributes  Compute dissimilarity between records  Method 1: Group each type of attribute together and then perform separate cluster analysis on each type. Doesn’t generate compatible results  Method 2: Process all types of attributes by using a weighted formula to combine all their effects.
  • 26. Overview  What is Clustering/Cluster Analysis?  Applications of Clustering  Data Types and Distance Metrics  Major Clustering Methods
  • 27. Clustering Methods  Partitioning methods  Hierarchical methods  Density-based methods  Grid-based methods  Model-based methods  Choice of algorithm depends on type of data available and the nature and purpose of the application
  • 28. Clustering Methods  Partitioning methods  Divide the objects into a set of partitions based on some criteria  Improve the partitions by shifting objects between them for higher intraclass similarity, interclass dissimilarity and other such criteria  Two popular heuristic methods  k-means algorithm  k-medoids algorithm
  • 29. Clustering Methods  Hierarchical methods  Build up or break down groups of objects in a recursive manner  Two main approaches  Agglomerative approach  Divisive approach © Wikipedia
  • 30. Clustering Methods  Density-based methods  Grow a given cluster until the density decreases below a certain threshold  Grid-based methods  Form a grid structure by quantizing the object space into a finite number of grid cells  Model-based methods  Hypothesize a model and find the best fit of the data to the chosen model
  • 31. Constrained K-means Clustering with Background Knowledge K. Wagsta, C. Cardie, S. Rogers, & S. Schroedl Proceedings of 18th International Conference on Machine Learning 2001. (pp. 577-584). Morgan Kaufmann, San Francisco, CA.
• 32. Introduction  Clustering is an unsupervised method of data analysis  Data instances are grouped according to some notion of similarity  Multi-attribute based distance function  Access only to the set of features describing each object  No information as to where each instance should be placed within the partition  However, there might be background knowledge about the domain or data set that could be useful to the algorithm  In this paper the authors try to integrate this background knowledge into clustering algorithms
• 33. K-Means Clustering Algorithm  The K-Means algorithm is a type of partitioning method  Groups instances, based on their attributes, into k groups  High intra-cluster similarity; low inter-cluster similarity  Cluster similarity is measured with regard to the mean value of the objects in the cluster  How does K-means work? (a sketch follows this slide)  First, select K random instances from the data – initial cluster centers  Second, each instance is assigned to its closest (most similar) cluster center  Third, each cluster center is updated to the mean of its constituent instances  Repeat steps two and three until there is no further change in the assignment of instances to clusters  How is K selected?
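A minimal R sketch of the three steps above (our own illustration, not the paper's code; empty clusters and ties are not handled):

# data: numeric matrix, rows = instances; k: number of clusters
kmeans_sketch <- function(data, k, max_iter = 100) {
  # Step 1: pick k random instances as the initial cluster centers
  centers <- data[sample(nrow(data), k), , drop = FALSE]
  assignment <- rep(0L, nrow(data))
  for (iter in 1:max_iter) {
    # Step 2: assign each instance to its closest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, data)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    if (all(new_assignment == assignment)) break   # converged: no change
    assignment <- new_assignment
    # Step 3: move each center to the mean of its constituent instances
    for (j in 1:k)
      centers[j, ] <- colMeans(data[assignment == j, , drop = FALSE])
  }
  list(assignment = assignment, centers = centers)
}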
• 35. Constrained K-Means Clustering  Instance-level constraints to express a priori knowledge about which instances should or should not be grouped together  Two pairwise constraints  Must-link: constraints which specify that two instances have to be in the same cluster  Cannot-link: constraints which specify that two instances must not be placed in the same cluster  When using a set of constraints we have to take its transitive closure  Constraints may be derived from  Partially labeled data  Background knowledge about the domain or data set
• 36. Constrained Algorithm  First, select K random instances from the data – initial cluster centers  Second, each instance is assigned to its closest (most similar) cluster center such that VIOLATE-CONSTRAINT(i, K, M, C) is false. If no such cluster exists, fail  Third, each cluster center is updated to the mean of its constituent instances  Repeat steps two and three until there is no further change in the assignment of instances to clusters  VIOLATE-CONSTRAINT(instance i, cluster K, must-link constraints M, cannot-link constraints C)  For each (i, i_=) in M: if i_= is not in K, return true  For each (i, i_≠) in C: if i_≠ is in K, return true  Otherwise, return false
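The constraint check translates directly into R; a sketch under our own data layout (must and cannot are two-column matrices of instance indices; assignment holds current cluster ids, NA for unassigned instances):

violates_constraint <- function(i, cluster, assignment, must, cannot) {
  # Must-link: a required partner of i already sits in a different cluster
  for (r in seq_len(nrow(must))) {
    if (i == must[r, 1]) i2 <- must[r, 2]
    else if (i == must[r, 2]) i2 <- must[r, 1]
    else next
    if (!is.na(assignment[i2]) && assignment[i2] != cluster) return(TRUE)
  }
  # Cannot-link: a forbidden partner of i already sits in this cluster
  for (r in seq_len(nrow(cannot))) {
    if (i == cannot[r, 1]) i2 <- cannot[r, 2]
    else if (i == cannot[r, 2]) i2 <- cannot[r, 1]
    else next
    if (!is.na(assignment[i2]) && assignment[i2] == cluster) return(TRUE)
  }
  FALSE
}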
• 37. Experimental Results on GPS Lane Finding  A large database of digital road maps is available  These maps contain only coarse information about the location of the road  By refining maps down to the lane level we can enable a host of more sophisticated applications, such as lane departure detection  Collect data about the location of cars as they drive along a given road  Data is collected once per second from several drivers using GPS receivers affixed to the tops of their vehicles  Each data instance has two features: 1. distance along the road segment and 2. perpendicular offset from the road centerline  For evaluation purposes, drivers were asked to indicate which lane they occupied and any lane changes
• 38. GPS Lane Finding  Cluster the data to automatically determine where the individual lanes are located  Based on the observation that drivers tend to drive within lane boundaries  Domain-specific heuristics for generating constraints:  Trace contiguity means that, in the absence of lane changes, all of the points generated from the same vehicle in a single pass over a road segment should end up in the same lane  Maximum separation refers to a limit on how far apart two points can be (perpendicular to the centerline) while still being in the same lane. If two points are separated by at least four meters, we generate a constraint that prevents those two points from being placed in the same cluster (see the sketch below)  To better suit the domain, the cluster center representation had to be changed
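A sketch of the maximum-separation heuristic in R (our own illustration; offset is the perpendicular-offset feature, and the function returns index pairs to treat as cannot-link constraints):

max_separation_constraints <- function(offset, limit = 4) {
  # all pairs whose perpendicular offsets differ by at least `limit` meters
  far <- which(abs(outer(offset, offset, "-")) >= limit, arr.ind = TRUE)
  far[far[, 1] < far[, 2], , drop = FALSE]   # keep each unordered pair once
}

cannot <- max_separation_constraints(c(0.3, 0.8, 4.9, 5.2))  # pairs (1,3), (1,4), (2,3), (2,4)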
• 39. Performance

   Segment (size)   K-means   COP-Kmeans   Constraints Alone
   1  (699)           49.8       100            36.8
   2  (116)           47.2       100            31.5
   3  (521)           56.5       100            44.2
   4  (526)           49.4       100            47.1
   5  (426)           50.2       100            29.6
   6  (502)           75.0       100            56.3
   7  (623)           73.5       100            57.8
   8  (149)           74.7       100            53.6
   9  (496)           58.6       100            46.8
   10 (634)           50.2       100            63.4
   11 (1160)          56.5       100            72.3
   12 (427)           48.8        96.6          59.2
   13 (587)           69.0       100            51.5
   14 (678)           65.9       100            59.9
   15 (400)           58.8       100            39.7
   16 (115)           64.0        76.6          52.4
   17 (383)           60.8        98.9          51.4
   18 (786)           50.2       100            73.7
   19 (880)           50.4       100            42.1
   20 (570)           50.1       100            38.3
   Average            58.0        98.6          50.4
  • 40. Conclusion  Measurable improvement in accuracy  The use of constraints while clustering means that, unlike the regular k-means algorithm, the assignment of instances to clusters can be order-sensitive.  If a poor decision is made early on, the algorithm may later encounter an instance i that has no possible valid cluster  Ideally, the algorithm would be able to backtrack, rearranging some of the instances so that i could then be validly assigned to a cluster.  Could be extended to hierarchical algorithms
  • 41. CSE 634 Data Mining Concepts & Techniques Professor Anita Wasilewska Stony Brook University Ligand Pose Clustering
  • 42. Abstract  Detailed atomic-level structural and energetic information from computer calculations is important for understanding how compounds interact with a given target and for the discovery and design of new drugs. Computational high-throughput screening (docking) provides an efficient and practical means with which to screen candidate compounds prior to experiment. Current scoring functions for docking use traditional Molecular Mechanics (MM) terms (Van der Waals and Electrostatics).  To develop and test new scoring functions that include ligand desolvation (MM-GBSA), we are building a docking test set focused on medicinal chemistry targets. Docked complexes are rescored on the receptor coordinates, clustered into diverse binding poses and the top five representative poses are reported for analysis. Known receptor-ligand complexes are retrieved from the protein databank and are used to identify novel receptor-ligand complexes of potential drug leads.
  • 43. References  Kuntz, I. D. (1992). "Structure-based strategies for drug design and discovery." Science 257(5073): 1078-1082.  Nissink, J. W. M., C. Murray, et al. (2002). "A new test set for validating predictions of protein-ligand interaction." Proteins-Structure Function and Genetics 49(4): 457-471.  Mozziconacci, J. C., E. Arnoult, et al. (2005). "Optimization and validation of a docking-scoring protocol; Application to virtual screening for COX-2 inhibitors." Journal of Medicinal Chemistry 48(4): 1055-1068.  Mohan, V., A. C. Gibbs, et al. (2005). "Docking: Successes and challenges." Current Pharmaceutical Design 11(3): 323-333.  Hu, L. G., M. L. Benson, et al. (2005). "Binding MOAD (Mother of All Databases)." Proteins-Structure Function and Bioinformatics 60(3): 333-340.
• 44. Docking  Computational search for the most energetically favorable binding pose of a ligand with a receptor  Ligand → small organic molecules  Receptor → proteins, nucleic acids [Figure: trypsin receptor + benzamidine ligand → receptor–ligand complex]
• 45. Receptor–Ligand Complex [Flattened workflow diagram; recoverable steps: a crystal structure is split after inspection into ligand and receptor; the receptor is prepared (add hydrogens, convert disulfide bonds, Leap with mbondi radii, Sander) and its molecular surface (dms) feeds sphgen to generate docking spheres (keep max 75 spheres within 8 Å of the active site); GRID builds a Gaussian 6-12 LJ receptor grid; the ligand mol2 is assigned ab initio charges; DOCK combines the receptor grid, ligand mol2, and active-site spheres to produce the docked receptor–ligand complex]
• 46. Improved Scoring Function (MM-GBSA)  R = receptor, L = ligand, RL = receptor–ligand complex  MM (molecular mechanics: VDW + Coul)  GB (Generalized Born)  SA (Solvent Accessible Surface Area) *Srinivasan, J.; et al. J. Am. Chem. Soc. 1998, 120, 9401-9409
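The scoring equation on the original slide is an image; the usual MM-GBSA decomposition it refers to (our reconstruction, following Srinivasan et al.) is

$$ \Delta G_{\text{bind}} \approx G_{RL} - G_{R} - G_{L}, \qquad G = E_{\text{MM}} + G_{\text{GB}} + G_{\text{SA}}, \qquad E_{\text{MM}} = E_{\text{VDW}} + E_{\text{Coul}} $$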
• 47. Clustering Methods used  Initially, we clustered on a single dimension, i.e. RMSD. All ligand poses within 2 Å RMSD of each other were retained  Better results were obtained using agglomerative clustering with the R statistical package [Two plots for 1BCD (Carbonic Anh II/FMS): GBSA energy (kcal/mol) vs. RMSD (Å), comparing RMSD clustering with agglomerative clustering]
• 48. Agglomerative Clustering  In agglomerative clustering, each object is initially placed into its own group, and a threshold distance is selected  Compare all pairs of groups and mark the pair that is closest  Compare the distance between this closest pair of groups to the threshold value  If (distance between this closest pair <= threshold distance) then merge the groups and repeat  Else (distance between the closest pair > threshold) clustering is done (see the hclust example below)
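The same threshold-merge procedure is available through R's built-in hclust; a usage sketch with made-up pose features:

pose_features <- matrix(rnorm(40), ncol = 2)          # illustrative data only
hc <- hclust(dist(pose_features), method = "average")
clusters <- cutree(hc, h = 2.0)                       # stop merging above threshold h
table(clusters)                                       # resulting cluster sizes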
• 49. R Project for Statistical Computing  R is a free software environment for statistical computing and graphics  Available at http://www.r-project.org/  Developed by the Statistics Department, University of Auckland  R 2.2.1 was used in this research

plotacpclust = function(data, xax=1, yax=2, hcut, cor=TRUE,
                        clustermethod="ave", colbacktitle="#e8c9c1",
                        wcos=3, Rpowered=FALSE, ...) {
  # data: data.frame to analyze
  # xax, yax: factors to select for graphs
  # hcut, clustermethod: parameters for hclust
  require(ade4)
  pcr = princomp(data, cor=cor)
  datac = t((t(data) - pcr$center) / pcr$scale)
  hc = hclust(dist(data), method=clustermethod)
  if (missing(hcut)) hcut = quantile(hc$height, c(0.97))
  def.par <- par(no.readonly = TRUE)
  on.exit(par(def.par))
  mylayout = layout(matrix(c(1,2,3,4,5,1,2,3,4,6,7,7,7,8,9,7,7,7,10,11), ncol=4),
                    widths=c(4/18, 2/18, 6/18, 6/18),
                    heights=c(lcm(1), 3/6, 1/6, lcm(1), 1/3))
  par(mar = c(0.1, 0.1, 0.1, 0.1))
  par(oma = rep(1, 4))
  ltitle(paste("PCA ", dim(unclass(pcr$loadings))[2], "vars"), cex=1.6, ypos=0.7)
  text(x=0, y=0.2, pos=4, cex=1, labels=deparse(pcr$call), col="black")
  pcl = unclass(pcr$loadings)
  pclperc = 100 * (pcr$sdev) / sum(pcr$sdev)
  s.corcircle(pcl[, c(xax, yax)], 1, 2,
              sub=paste("(", xax, "-", yax, ") ",
                        round(sum(pclperc[c(xax, yax)]), 0), "%", sep=""),
              possub="bottomright", csub=3, clabel=2)
  wsel = c(xax, yax)
  scatterutil.eigen(pcr$sdev, wsel=wsel, sub="")
  # (remainder of the function is cut off on the original slide)
• 50. Clustered Poses [Figure: peptide ligand bound to the gp41 receptor]
• 51. RMSD vs. Energy Score Plots [Plot: 1YDA (sulfonamide bound to human carbonic anhydrase II); GBSA energy (kcal/mol) vs. RMSD (Å)]
• 52. RMSD vs. Energy Score Plots [Plot: 1YDA; DDD energy (kcal/mol) vs. RMSD (Å)]
• 53. RMSD vs. Energy Score Plots [Plot: 1BCD (Carbonic Anh II/FMS); GBSA energy (kcal/mol) vs. RMSD (Å)]
• 54. RMSD vs. Energy Score Plots [Plot: 1BCD (Carbonic Anh II/FMS); DDD energy (kcal/mol) vs. RMSD (Å)]
• 55. RMSD vs. Energy Score Plots [Plot: 1EHL; GBSA energy (kcal/mol) vs. RMSD (Å)]
• 56. RMSD vs. Energy Score Plots [Plot: 1DWB; GBSA (kcal/mol) vs. RMSD (Å)]
• 57. RMSD vs. Energy Score Plots [Plot: 1ABE; GBSA energy (kcal/mol) vs. RMSD (Å)]
• 59. RMSD vs. Energy Score Plots [Plot: 1EHL; GBSA score (kcal/mol) vs. RMSD (Å)]
• 61. [Figure: peptide mimetic inhibitor bound to HIV-1 protease]

Editor's notes

1. What is a cluster? In conventional terminology and in Data Mining terms. Data objects in a cluster have two properties - intraclass similarity and interclass dissimilarity. These are properties that a clustering tries to improve. Examples of clusters: stars in a galaxy, planets in the solar system. Explain cluster analysis. Why is it unsupervised? Because it does not rely on predefined classes or labeled training examples. It is learning by observation, not learning by examples.
2. How does clustering help us in general and then in specific applications? Biology example: taxonomic systems group organisms according to structural and physiological connections between them.
  3. Image processing – Magic wand
  4. Data Matrix – Examples of attributes are age, gender, race etc
5. Why should we standardize the data of all the attributes? 1a. To ensure that they all have equal weight. 1b. Expressing a variable in smaller units will lead to a larger range for that variable, and thus a larger effect on the resulting clustering structure. How do you standardise the data? Using the mean absolute deviation and the standardized measurement (z-score).
6. Ans. The clusters are usually spherical with about the same density and size (Euclidean and Manhattan distances). The reason is that a bunch of objects clustered together can be thought of as averaging out at a point, and that point is the center of the sphere/circle.
7. Examples of symmetric and asymmetric variables: 1. One example of a symmetric variable is gender – male or female. 2. A test that tells you whether you have a particular disease has two outcomes: a positive, which means you do, and a negative, which means you don't. If we take the positive to be 1 and the negative 0 (because positive is the rarer case), then two variables having 1s are more significant than two variables having 0s.
  8. Nominal values are an extension or generalisation of a binary variable
  9. Examples of discrete ordinal variables are ranks in class, or rank in the army.
10. 1. Example: it follows a formula such as Ae^{Bt} or Ae^{-Bt}. 2. Handle a ratio-scaled variable f having value x_if for object i by using the formula y_if = log(x_if).
  11. Pg 346 of the textbook
  12. The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
13. As we saw in the previous slides, clustering is an unsupervised method of data analysis. The various clustering analyses mentioned group data instances according to some notion of similarity. Similarity between two instances is usually quantified using some function which takes as input the values of attributes describing each object. For example, if we were to partition or cluster students in this class based on age, gender and nationality, we could devise a function which puts two students in the same group if they were born within 12 months of each other and in the same country. This function would produce a distance, and based on that value the algorithm will make a decision as to whether the two students should be in the same group or different groups. If the distance is small, the students will be in the same group. However, it is often the case that the implementer possesses some background knowledge about the domain or the data set that could be useful in clustering the data. For instance, you might have some partially labeled data, a training set. In this paper the authors are using GPS data to refine road maps to the lane level. So in this domain they have access to some background knowledge, e.g. a constraint can be that if two points are separated by more than 4 meters they cannot belong to the same lane and so cannot be in the same group. Traditionally, clustering algorithms have no way to take advantage of this information even when it does exist. This paper tries to integrate background information into clustering algorithms. There might be a question here about why this background information can't be encapsulated as an attribute; I have the same question, maybe the professor could explain. One possible explanation the paper hints at is that this is information about the domain and not specific to any one data instance, and thus cannot be made an attribute value. It is knowledge about why two instances should or should not be grouped together.
14. The inputs for the modified algorithm are different: it takes in a data set, a set of must-link constraints M, and a set of cannot-link constraints C. It returns a partition of the instances of the set that satisfies all specified constraints. The major modification is that, when updating cluster assignments, we ensure that none of the specified constraints are violated. We attempt to assign each point di to its closest cluster Cj. As with the regular k-means algorithm, the modified version starts by selecting k random instances from the data; these become the initial cluster centers. Second, each instance is assigned to its closest cluster center as long as a constraint is not violated. If there is another point d_= that must be assigned to the same cluster as d but is already in some other cluster, or there is another point d_≠ that cannot be grouped with d but is already in C, then d cannot be placed in C. The algorithm continues down the sorted list of clusters until we find one that can legally host d. Constraints are never broken; if a legal cluster cannot be found for d, the empty partition ({}) is returned.
15. The authors tested the modified algorithm on a variety of different data sets. The constraints were generated as follows: for each constraint, we randomly picked two instances from the data set and checked their labels (which are available for evaluation purposes but not visible to the clustering algorithm). If they had the same label, we generated a must-link constraint. Otherwise, we generated a cannot-link constraint. The constraints were thus randomly generated from true data labels. To demonstrate the utility of constrained clustering with real domain knowledge, they applied the modified k-means to the problem of lane finding in GPS data.
16. "Based on the observation that drivers tend to drive within lane boundaries" -- these were obviously not Long Island drivers. To better analyze performance in this domain, the authors modified the cluster center representation. The usual way to compute the center of a cluster is to average all of its constituent points. There are two significant drawbacks of this representation. First, the center of a lane is a point halfway along its extent, which commonly means that points inside the lane but at the far ends of the road appear to be extremely far from the cluster center. Second, applications that make use of the clustering results need more than a single point to define a lane. Consequently, we instead represented each lane cluster with a line segment parallel to the centerline. This more accurately models what we conceptualize as "the center of the lane", provides a better basis for measuring the distance from a point to its lane cluster center, and provides useful output for other applications.
  17. The modified algorithm (third column) consistently outperformed the unconstrained k-means algorithm (first column), attaining 100% accuracy for all but three data sets and averaging 98.6% overall. The unconstrained version of k-means performed much worse, averaging 58.0% accuracy. The final column in Table 2 is a measure of how much is known after generating the constraints and before doing any clustering. It shows that an average accuracy of 50.4% can be achieved using the background information alone. What this demonstrates is that neither general similarity information (k-means clustering) nor domain-specific information (constraints) alone perform very well, but that combining the two sources of information effectively can produce excellent results.
  18. (e.g., it has a cannot-link constraint to at least one item in each of the k clusters). This occasionally occurred in our experiments (for some of the random data orderings).
19. To study the binding and selectivity of the thiadiazoles with MMPs we used a relatively new method to quantify molecular recognition termed MM-GBSA that was championed by researchers in Peter Kollman's group at UCSF and David Case's group at Scripps. This method aims to include important desolvation effects that are expected to be particularly important for our simulations given the fact that the MMPs contain several calcium and two zinc ions. In MM-GBSA, a thermodynamic cycle is employed to represent the molecular recognition event. Experimental binding energies measured in the condensed phase are related to the sum of energetic contributions computed from gas-phase interaction energies and free energy of hydration calculations to account for desolvation. Whereas the extended linear response method approaches ligand binding from the point of view of the ligand and its different environments, the MM-GBSA methods take a different approach and include calculations that consider changes in the total system, not only the ligand.