Birch: An efficient data clustering method for very large databases
  Tian Zhang, Raghu Ramakrishnan, Miron Livny
  CPSC 504
  Presenter: Joel Lanir
  Discussion: Dan Li

Outline
  What is data clustering
  Data clustering applications
  Previous Approaches
  Birch’s Goal
  Clustering Feature
  Birch clustering algorithm
  Clustering example




What is Data Clustering?
  A cluster is a closely-packed group.
  A collection of data objects that are similar to one another and treated collectively as a group.
  Data Clustering is the partitioning of a dataset into clusters

Data Clustering
  Helps understand the natural grouping or structure in a dataset
  Large set of multidimensional data
  Data space is usually not uniformly occupied
  Identify the sparse and crowded places
  Helps visualization




Discussion
  Can you give some examples of very large databases? What applications can you imagine that require such large databases for clustering?
  What special requirements do “large” databases pose on clustering, or more generally on data mining?

Some Clustering applications
  Biology – building groups of genes with related patterns
  Marketing – partitioning the population of consumers into market segments
  Division of WWW pages into genres
  Image segmentation – for object recognition
  Land use – identification of areas of similar land use from satellite images
  Insurance – identifying groups of policy holders with high average claim cost




Data Clustering – previous approaches
  Probability based (machine learning):
    Makes the wrong assumption that the distributions on attributes are independent of each other
    Probability representations of clusters are expensive

Approaches
  Distance based (statistics):
    There must be a distance metric between two items
    Assumes that all data points are in memory and can be scanned frequently
    Ignores the fact that not all data points are equally important
    Close data points are not gathered together
    Inspects all data points on multiple iterations
  These approaches do not deal with dataset and memory size issues!




Clustering parameters
  Centroid – Euclidean center
  Radius – average distance to center
  Diameter – average pairwise distance within a cluster
  Radius and diameter are measures of the tightness of a cluster around its center. We wish to keep these low.

Clustering parameters
  Other measurements (like the Euclidean distance between the centroids of two clusters) measure how far apart two clusters are.
  A good quality clustering will produce high intra-cluster similarity and low inter-cluster similarity
  A good quality clustering can help find hidden patterns
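
The three parameters above can be sketched directly from a list of points. This is a minimal illustration, not code from the talk: the function names are made up here, and "average" is taken in the root-mean-square sense (an assumption; a plain mean of distances would give slightly different numbers).

```python
import math
from itertools import combinations

def centroid(points):
    """Euclidean center: the per-dimension mean of the points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[k] for p in points) / n for k in range(dim)]

def radius(points):
    """Root-mean-square distance from the points to their centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))

def diameter(points):
    """Root-mean-square distance over all pairs of distinct points."""
    n = len(points)
    if n < 2:
        return 0.0
    total = sum(math.dist(p, q) ** 2 for p, q in combinations(points, 2))
    # Each unordered pair stands for two ordered pairs out of n * (n - 1).
    return math.sqrt(2 * total / (n * (n - 1)))

tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
loose = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
print(radius(tight) < radius(loose), diameter(tight) < diameter(loose))
```

A tighter cluster yields a smaller radius and diameter, which is the sense in which the slide says we wish to keep these low.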




Birch’s goals:
  Minimize running time and data scans, thus formulating the problem for large databases
  Clustering decisions made without scanning the whole data
  Exploit the non-uniformity of data – treat dense areas as one, and remove outliers (noise)

Clustering Feature (CF)
  CF is a compact storage for data on points in a cluster
  Has enough information to calculate the intra-cluster distances
  Additivity theorem allows us to merge sub-clusters




Clustering Feature (CF)
  Given N d-dimensional data points in a cluster: {Xi} where i = 1, 2, …, N,
     CF = (N, LS, SS)
  N is the number of data points in the cluster,
  LS is the linear sum of the N data points,
  SS is the square sum of the N data points.

CF Additivity Theorem
  If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint subclusters,
  then the CF entry of the subcluster formed by merging the two disjoint subclusters is:
     CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
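
As a rough sketch of the CF triple and the additivity theorem, the class below stores (N, LS, SS) and merges two entries by component-wise addition. The class and method names are illustrative assumptions, and SS is taken to be the scalar sum of squared norms of the points. Under that assumption the radius follows from the CF alone, since the mean squared distance to the centroid equals SS/N minus the squared norm of LS/N.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class CF:
    n: int       # N: number of points summarized
    ls: tuple    # LS: per-dimension linear sum of the points
    ss: float    # SS: sum of squared norms of the points (assumed scalar)

    @classmethod
    def from_point(cls, x):
        return cls(1, tuple(x), sum(v * v for v in x))

    def __add__(self, other):
        # Additivity theorem: the CF of a merged subcluster is the
        # component-wise sum of the two CFs.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(v / self.n for v in self.ls)

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, computed without the original points.
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

left = CF.from_point((0.0, 0.0)) + CF.from_point((2.0, 0.0))
right = CF.from_point((4.0, 0.0))
merged = left + right
print(merged.n, merged.centroid(), merged.radius())
```

The `max(..., 0.0)` guard only protects against tiny negative values from floating-point rounding.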




CF Tree
  B = Max. no. of CF in a non-leaf node
  L = Max. no. of CF in a leaf node
  T = Max. radius of a sub-cluster

  Root:          [CF1 child1] [CF2 child2] [CF3 child3] … [CFb childb]
  Non-leaf node: [CF1 child1] [CF2 child2] [CF3 child3] … [CFb childb]
  Leaf node:     prev [CF1] [CF2] … [CFL] next
  Leaf node:     prev [CF1] [CF2] … [CFL] next

CF TREE
  T is the threshold for the diameter or radius of each entry in a leaf node
  The tree size is a function of T. The bigger T is, the smaller the tree will be.
  The CF tree is built dynamically as data is scanned.
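
To illustrate why a bigger T gives a smaller tree, the sketch below counts how many leaf entries the same points produce under different thresholds. It is a simplification under stated assumptions: the entries are kept in one flat list rather than a tree, T is interpreted as a maximum radius, a CF is a plain `(n, ls, ss)` tuple, and all function names are hypothetical.

```python
import math

def merge(cf, x):
    """Return the CF (n, ls, ss) obtained by adding point x to cf."""
    n, ls, ss = cf
    return (n + 1, [a + b for a, b in zip(ls, x)], ss + sum(v * v for v in x))

def centroid(cf):
    n, ls, _ = cf
    return [v / n for v in ls]

def radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((v / n) ** 2 for v in ls), 0.0))

def build_entries(points, threshold):
    """Absorb each point into the closest entry if the radius stays <= threshold."""
    entries = []
    for x in points:
        if entries:
            i = min(range(len(entries)),
                    key=lambda j: math.dist(centroid(entries[j]), x))
            candidate = merge(entries[i], x)
            if radius(candidate) <= threshold:
                entries[i] = candidate
                continue
        # No entry can absorb the point: start a new one.
        entries.append((1, list(x), sum(v * v for v in x)))
    return entries

points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
for t in (0.01, 0.5, 10.0):
    print(t, len(build_entries(points, t)))
```

With these sample points the number of entries shrinks as the threshold grows, because each entry is allowed to cover a wider region.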




CF Tree Insertion
  Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
  Modifying the leaf: test whether the closest leaf entry can absorb the new entry without violating the threshold. If it cannot, add a new entry to the leaf; if there is no room for it, split the leaf
  Modifying the path: update CF information up the path.

Birch Clustering Algorithm
  Phase 1: Scan all data and build an initial in-memory CF tree.
  Phase 2: Condense into a desirable range by building a smaller CF tree.
  Phase 3: Global clustering
  Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results
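
The three insertion steps can be sketched as a small recursive routine. This is an illustrative simplification, not the authors' implementation: CFs are plain lists `[n, ls, ss]`, the distance metric is assumed to be Euclidean distance between centroids, T is treated as a maximum radius, a split seeds on the farthest pair of entries, and the prev/next leaf links are left out. All names are hypothetical.

```python
import math

def cf_of(x):
    return [1, list(x), sum(v * v for v in x)]

def cf_add(a, b):
    return [a[0] + b[0], [p + q for p, q in zip(a[1], b[1])], a[2] + b[2]]

def cf_centroid(cf):
    return [v / cf[0] for v in cf[1]]

def cf_radius(cf):
    c = cf_centroid(cf)
    return math.sqrt(max(cf[2] / cf[0] - sum(v * v for v in c), 0.0))

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.cfs = []        # one CF per entry
        self.children = []   # parallel to cfs, used by non-leaf nodes only

def summary(node):
    total = node.cfs[0]
    for cf in node.cfs[1:]:
        total = cf_add(total, cf)
    return total

def closest(node, cf):
    c = cf_centroid(cf)
    return min(range(len(node.cfs)),
               key=lambda i: math.dist(cf_centroid(node.cfs[i]), c))

def split(node):
    """Move the entries nearer to the second seed into a new sibling node."""
    cents = [cf_centroid(c) for c in node.cfs]
    n = len(cents)
    a, b = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: math.dist(cents[p[0]], cents[p[1]]))
    new, kept_cfs, kept_children = Node(node.leaf), [], []
    for i in range(n):
        to_new = i == b or (i != a and math.dist(cents[i], cents[b])
                            < math.dist(cents[i], cents[a]))
        (new.cfs if to_new else kept_cfs).append(node.cfs[i])
        if not node.leaf:
            (new.children if to_new else kept_children).append(node.children[i])
    node.cfs, node.children = kept_cfs, kept_children
    return new

def insert(node, cf, T, B, L):
    """Insert cf below node; return a new sibling if node had to split."""
    if node.leaf:
        if node.cfs:
            i = closest(node, cf)
            merged = cf_add(node.cfs[i], cf)
            if cf_radius(merged) <= T:        # the closest entry absorbs it
                node.cfs[i] = merged
                return None
        node.cfs.append(cf)                   # otherwise add a new entry
        return split(node) if len(node.cfs) > L else None
    i = closest(node, cf)                     # descend to the closest child
    sibling = insert(node.children[i], cf, T, B, L)
    node.cfs[i] = summary(node.children[i])   # update CF on the path
    if sibling is not None:
        node.children.append(sibling)
        node.cfs.append(summary(sibling))
        if len(node.cfs) > B:
            return split(node)
    return None

class CFTree:
    def __init__(self, T, B, L):
        self.T, self.B, self.L = T, B, L
        self.root = Node(leaf=True)

    def insert(self, x):
        sibling = insert(self.root, cf_of(x), self.T, self.B, self.L)
        if sibling is not None:               # root split: tree grows a level
            old = self.root
            self.root = Node(leaf=False)
            self.root.children = [old, sibling]
            self.root.cfs = [summary(old), summary(sibling)]

def leaf_entries(node):
    if node.leaf:
        return list(node.cfs)
    return [cf for child in node.children for cf in leaf_entries(child)]

tree = CFTree(T=0.5, B=2, L=2)
for p in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (9.0, 0.0), (5.1, 5.0)]:
    tree.insert(p)
print(len(leaf_entries(tree.root)))
```

Recomputing `summary` for the visited child keeps the sketch short; a real implementation would more likely add the new CF to the parent entry in place.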




Birch – Phase 1
  Start with an initial threshold and insert points into the tree
  If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting values from the older tree and then the other values
  A good initial threshold is important but hard to figure out
  Outlier removal – remove outliers when rebuilding the tree

Birch – Phase 2
  Optional
  The algorithm used in Phase 3 sometimes performs well only within a certain input size range, so Phase 2 prepares the tree for Phase 3.
  Removes outliers and groups clusters.
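
The Phase 1 loop (insert, run out of memory, raise the threshold, rebuild from the old entries) might look roughly like this. It is a sketch under several assumptions that are not in the slides: memory is modeled as a cap `max_entries` on a flat list of CF entries, the threshold is simply multiplied by a `growth` factor, and outlier removal is omitted. Names are hypothetical.

```python
import math

def cf_of(x):
    return [1, list(x), sum(v * v for v in x)]

def cf_add(a, b):
    return [a[0] + b[0], [p + q for p, q in zip(a[1], b[1])], a[2] + b[2]]

def cf_centroid(cf):
    return [v / cf[0] for v in cf[1]]

def cf_radius(cf):
    c = cf_centroid(cf)
    return math.sqrt(max(cf[2] / cf[0] - sum(v * v for v in c), 0.0))

def absorb(entries, cf, threshold):
    """Merge cf into the closest entry if the radius allows it, else append it."""
    if entries:
        c = cf_centroid(cf)
        i = min(range(len(entries)),
                key=lambda j: math.dist(cf_centroid(entries[j]), c))
        merged = cf_add(entries[i], cf)
        if cf_radius(merged) <= threshold:
            entries[i] = merged
            return
    entries.append(cf)

def phase1(points, threshold, max_entries, growth=2.0):
    if threshold <= 0 or growth <= 1 or max_entries < 1:
        raise ValueError("need threshold > 0, growth > 1, max_entries >= 1")
    entries = []
    for x in points:
        absorb(entries, cf_of(x), threshold)
        while len(entries) > max_entries:     # "out of memory"
            threshold *= growth               # assumed heuristic, not from the talk
            old, entries = entries, []
            for cf in old:                    # rebuild from the old CF entries,
                absorb(entries, cf, threshold)  # no rescan of the raw data
    return entries, threshold

points = [(float(i), 0.0) for i in range(20)]
entries, final_t = phase1(points, threshold=0.1, max_entries=4)
print(len(entries), final_t)
```

The rebuild works on CF entries only, which is why the additivity theorem matters: old subclusters can be merged without going back to the data.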




Birch – Phase 3
  Problems after Phase 1:
    Input order affects results
    Splitting triggered by node size
  Phase 3:
    Cluster all leaf entries on their CF values according to an existing algorithm
    Algorithm used here: agglomerative hierarchical clustering

Birch – Phase 4
  Optional
  Additional scan(s) of the dataset, attaching each item to the centroids found.
  Recalculating the centroids and redistributing the items.
  Always converges
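
Phases 3 and 4 can be sketched together: first merge leaf CF entries agglomeratively, then rescan the points against the resulting centroids. The code is only an illustration under assumptions: the merge criterion is the distance between centroids, the target number of clusters `k` is supplied by the caller, CFs are plain lists `[n, ls, ss]`, and the function names are hypothetical.

```python
import math

def cf_add(a, b):
    return [a[0] + b[0], [p + q for p, q in zip(a[1], b[1])], a[2] + b[2]]

def cf_centroid(cf):
    return [v / cf[0] for v in cf[1]]

def global_clustering(entries, k):
    """Phase 3: repeatedly merge the two CF entries with the closest centroids."""
    clusters = list(entries)
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: math.dist(cf_centroid(clusters[p[0]]),
                                                  cf_centroid(clusters[p[1]])))
        merged = cf_add(clusters[i], clusters[j])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return [cf_centroid(c) for c in clusters]

def assign(points, centroids):
    """Attach each point to the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(centroids[c], x))
            for x in points]

def refine(points, centroids, passes=1):
    """Phase 4: redistribute the points and recalculate the centroids."""
    for _ in range(passes):
        labels = assign(points, centroids)
        updated = []
        for c, old in enumerate(centroids):
            members = [x for x, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if no point was attached to it.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else old)
        centroids = updated
    return centroids, assign(points, centroids)

points = [(0.0, 0.0), (0.2, 0.0), (10.0, 0.0), (10.2, 0.0)]
entries = [[1, list(p), sum(v * v for v in p)] for p in points]
centroids = global_clustering(entries, k=2)
centroids, labels = refine(points, centroids)
print(sorted(centroids), labels)
```

Because Phase 3 works on CF entries rather than raw points, its input is much smaller than the dataset; Phase 4 is the only step here that scans the points again.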




Clustering example
  Pixel classification in images
  From top to bottom:
    BIRCH classification
    Visible wavelength band
    Near-infrared band

Clustering example
  K-means Clustering to 5 classes
  [Figure: scatter plot; axis labels band1, band2, band224]




Conclusions
  Birch performs faster than existing algorithms on large datasets
  Scans whole data only once
  Handles outliers

Discussion
  After reading the two papers on data mining, what do you think are the criteria for judging whether a data mining algorithm is “good”?
    Efficiency?
    I/O cost?
    Memory/disk requirement?
    Stability?
    Immunity to abnormal data?




 Thanks for listening






Birch

  • 1. Outline Birch: An efficient data clustering method for very large databases What is data clustering Data clustering applications Previous Approaches Tian Zhang, Raghu Ramakrishnan, Birch’s Goal Miron Livny Clustering Feature Birch clustering algorithm CPSC 504 Clustering example Presenter: Joel Lanir Discussion: Dan Li What is Data Clustering? Data Clustering Helps understand the natural A cluster is a closely-packed group. grouping or structure in a dataset A collection of data objects that are Large set of multidimensional data similar to one another and treated Data space is usually not uniformly collectively as a group. occupied Identify the sparse and crowded Data Clustering is the partitioning places of a dataset into clusters Helps visualization Discussion Some Clustering applications Can you give some examples for very Biology – building groups of genes with related patterns large databases? What applications Marketing – partition the population of can you imagine that require such consumers to market segments large databases for clustering? Division of WWW pages into genres. Image segmentations – for object What are the special requirements recognition that “large” databases pose on Land use – Identification of areas of similar clustering, or more general on data land use from satellite images mining? Insurance – Identify groups of policy holders with high average claim cost 1
  • 2. Data Clustering – previous approaches
         • Distance based (statistics):
             – There must be a distance metric between two items
             – Assumes that all data points are in memory and can be scanned frequently
             – Ignores the fact that not all data points are equally important
             – Close data points are not gathered together
             – Inspects all data points on multiple iterations

       Approaches
         • Probability based (machine learning):
             – Makes the wrong assumption that distributions on attributes are independent of each other
             – Probability representations of clusters are expensive

       These approaches do not deal with dataset and memory size issues!

       Clustering parameters
         • Centroid – Euclidean center
         • Radius – average distance to center
         • Diameter – average pairwise distance within a cluster
         • Radius and diameter are measures of the tightness of a cluster around its center. We wish to keep these low.

       Clustering parameters
         • Other measurements (like the Euclidean distance between the centroids of two clusters) measure how far apart two clusters are.
         • A good quality clustering will produce high intra-cluster similarity and low inter-cluster similarity.
         • A good quality clustering can help find hidden patterns.

       Birch’s goals:
         • Minimize running time and data scans, thus formulating the problem for large databases
         • Clustering decisions made without scanning the whole data
         • Exploit the non-uniformity of data – treat dense areas as one, and remove outliers (noise)

       Clustering Feature (CF)
         • CF is a compact storage for data on points in a cluster
         • Has enough information to calculate the intra-cluster distances
         • Additivity theorem allows us to merge sub-clusters
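The clustering parameters listed above can be computed directly from the raw points of a cluster. The sketch below is only an illustration: the function names are hypothetical, and it assumes the root-mean-square forms of radius and diameter used in the BIRCH paper rather than plain averages.

```python
import math
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Euclidean center: the per-dimension mean of the points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[k] for p in points) / n for k in range(dim)]


def radius(points):
    """Root-mean-square distance from the points to the centroid (assumed form)."""
    c = centroid(points)
    return math.sqrt(sum(sq_dist(p, c) for p in points) / len(points))


def diameter(points):
    """Root-mean-square pairwise distance within the cluster (assumed form)."""
    n = len(points)
    if n < 2:
        return 0.0
    # combinations() visits each unordered pair once, so the sum is doubled
    # to match the ordered-pair sum divided by N(N-1).
    total = sum(sq_dist(a, b) for a, b in combinations(points, 2))
    return math.sqrt(2.0 * total / (n * (n - 1)))


def centroid_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return math.sqrt(sq_dist(centroid(cluster_a), centroid(cluster_b)))


square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(centroid(square), radius(square), diameter(square))
```

Small radius and diameter indicate a tight cluster, while a large centroid distance indicates two clusters that are far apart.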
  • 3. Clustering Feature (CF)
         • Given N d-dimensional data points in a cluster, {Xi} where i = 1, 2, …, N: CF = (N, LS, SS)
         • N is the number of data points in the cluster,
         • LS is the linear sum of the N data points,
         • SS is the square sum of the N data points.

       CF Additivity Theorem
         • If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint subclusters,
         • then the CF entry of the subcluster formed by merging the two disjoint subclusters is:
           CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

       CF Tree
         • B = Max. no. of CF in a non-leaf node
         • L = Max. no. of CF in a leaf node
         • T = Max. radius of a sub-cluster
         [Figure: CF tree. The root and each non-leaf node hold entries CF1 … CFb with child pointers child1 … childb; each leaf node holds entries CF1 … CFL and is chained to its neighbours by prev/next pointers.]

       CF TREE
         • T is the threshold for the diameter or radius of the sub-clusters in the leaf nodes
         • The tree size is a function of T. The bigger T is, the smaller the tree will be.
         • The CF tree is built dynamically as data is scanned.

       CF Tree Insertion
         • Identifying the appropriate leaf: recursively descending the CF tree and choosing the closest child node according to a chosen distance metric
         • Modifying the leaf: test whether the leaf can absorb the new entry without violating the threshold. If there is no room, split the node
         • Modifying the path: update CF information up the path.

       Birch Clustering Algorithm
         • Phase 1: Scan all data and build an initial in-memory CF tree.
         • Phase 2: Condense into desirable length by building a smaller CF tree.
         • Phase 3: Global clustering
         • Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results
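To make the CF triple and the additivity theorem concrete, here is a minimal sketch. The `CF` class and the `try_absorb` helper are hypothetical names, not part of any library. It assumes SS is stored as a single scalar (the sum of squared norms of the points), and that the threshold T is applied to the radius, computed from the summary as R² = SS/N − ‖LS/N‖².

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CF:
    n: int       # N: number of points summarized
    ls: tuple    # LS: per-dimension linear sum of the points
    ss: float    # SS: sum of squared norms of the points (assumed scalar form)

    @staticmethod
    def of_point(p):
        """CF entry summarizing a single point."""
        return CF(1, tuple(float(x) for x in p), float(sum(x * x for x in p)))

    def __add__(self, other):
        """Additivity theorem: merging two disjoint subclusters adds the triples."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Radius recovered from the summary alone, without the raw points."""
        c2 = sum(x * x for x in self.centroid())
        # max() guards against tiny negative values from floating-point rounding.
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def try_absorb(entry, point, threshold):
    """Return the merged CF if absorbing the point keeps the radius within the
    threshold, otherwise None (the caller would then start a new entry)."""
    merged = entry + CF.of_point(point)
    return merged if merged.radius() <= threshold else None


a = CF.of_point((0, 0)) + CF.of_point((2, 0))
b = CF.of_point((0, 2)) + CF.of_point((2, 2))
print(a + b, (a + b).centroid(), (a + b).radius())
```

Because the centroid and radius depend only on (N, LS, SS), the absorb test needs no access to the original points, which is what lets the tree stay in memory.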
  • 4. Birch – Phase 1
         • Start with an initial threshold and insert points into the tree
         • If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting values from the older tree and then the other values
         • A good initial threshold is important but hard to figure out
         • Outlier removal – when rebuilding the tree, remove outliers

       Birch – Phase 2
         • Optional
         • The algorithm used in Phase 3 sometimes performs well only at a certain input size, so Phase 2 prepares the tree for Phase 3.
         • Removes outliers and groups clusters.

       Birch – Phase 3
         • Problems after Phase 1:
             – Input order affects results
             – Splitting is triggered by node size
         • Phase 3: cluster all leaf nodes on the CF values according to an existing algorithm
         • Algorithm used here: agglomerative hierarchical clustering

       Birch – Phase 4
         • Optional
         • Additional scan(s) of the dataset, attaching each item to the centroids found.
         • Recalculating the centroids and redistributing the items.
         • Always converges

       Clustering example
         • Pixel classification in images
         • From top to bottom: BIRCH classification, visible wavelength band, near-infrared band

       Clustering example
         [Figure: K-means clustering to 5 classes; axes band1 and band2.]
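The Phase 4 description (attach each item to the centroids found, then recalculate the centroids) can be sketched as below. This is an illustration under assumptions: `refine` and its parameters are hypothetical names, the seed centroids are taken as given from Phase 3, and the dataset is a plain in-memory list rather than a scan over a database.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def refine(points, centroids, passes=1):
    """One or more refinement passes over the data.

    Each pass attaches every point to its closest centroid, then recomputes
    each centroid as the mean of the points attached to it. The returned
    labels are the assignments made during the last pass.
    """
    centroids = [tuple(float(x) for x in c) for c in centroids]
    labels = []
    for _ in range(passes):
        # Attach each item to the nearest current centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Recalculate each centroid from its members; a centroid that
        # attracted no points is left unchanged.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                centroids[j] = tuple(
                    sum(m[k] for m in members) / len(members) for k in range(dim)
                )
    return centroids, labels


data = [(0, 0), (0, 1), (10, 10), (10, 11)]
new_centroids, labels = refine(data, [(1, 1), (9, 9)])
print(new_centroids, labels)
```

Each extra pass costs one more scan of the data, which is why the slides mark this phase as optional.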
  • 5. Conclusions
         • Birch performs faster than existing algorithms on large datasets
         • Scans whole data only once
         • Handles outliers

       Discussion
         • After reading the two papers for data mining, what do you think are the criteria for saying that a data mining algorithm is “good”?
         • Efficiency?
         • I/O cost?
         • Memory/disk requirement?
         • Stability?
         • Immunity to abnormal data?

       Thanks for listening