Authors: Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan
Prepared by: Raed T Aldahdooh
 Introduction
 Motivation
 Contributions Of The Paper
 Subspace Clustering
 CLIQUE (Clustering in Quest)
 Performance Experiments
 Conclusions
 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 CLIQUE can be considered both density-based and grid-based.
 It clusters high-dimensional data.
 It automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
 Many irrelevant dimensions may mask clusters.
 Distance measures become meaningless because points tend to be nearly equidistant.
 Clusters may exist only in some subspaces.
 Only the data in one dimension is relatively packed.
 Adding a dimension “stretches” the points across that dimension, moving them further apart.
 Density decreases dramatically.
 Distance measures become meaningless because of this equidistance.
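The drop in density can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that the points are spread uniformly and that every dimension is cut into ξ = 10 intervals; the function name and the figure of one million points are hypothetical and not taken from the slides.

```python
def expected_points_per_unit(n_points: int, xi: int, dims: int) -> float:
    """Average number of points per grid unit when n_points are spread
    uniformly over a grid with xi intervals in each of dims dimensions."""
    return n_points / (xi ** dims)


# One million points, xi = 10 intervals per dimension (illustrative values).
densities = {d: expected_points_per_unit(1_000_000, 10, d) for d in (1, 2, 3, 6, 10)}
for dims, density in densities.items():
    print(f"{dims:2d} dimensions: {density:g} points per unit on average")
```

Under these assumptions the average unit already holds a single point at six dimensions, which is why a fixed density threshold is rarely met in the full space.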
 Methods
◦ Feature transformation: only effective if most dimensions are relevant
 PCA (principal component analysis) and SVD (singular value decomposition) are useful only when features are highly
correlated/redundant
◦ Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
◦ Subspace-clustering: find clusters in all the possible subspaces
 CLIQUE, ProClus, and frequent pattern-based clustering
The need for developing new algorithms
 Effective treatment of high dimensionality:
◦ Information must be extracted effectively from huge amounts of data in databases. In other words, the running time of an algorithm must be predictable and usable on a large database.
 Interpretability of results:
◦ Users expect clustering results on high-dimensional data to be interpretable and comprehensible.
 Scalability and usability:
◦ Many clustering algorithms do not work well on a large database that may contain millions of objects, and clustering on a sample of a given data set may lead to biased results. In other words, the clustering technique should be fast, scale with the number of dimensions and the size of the input, and be insensitive to the order of the input data.
 CLIQUE satisfies the above desiderata (effectiveness, interpretability, scalability, and usability).
 CLIQUE automatically finds subspaces with high-density clusters.
 CLIQUE generates a minimal description for each cluster as a DNF expression.
 Empirical evaluation shows that CLIQUE scales linearly with the number of input records and scales well as the number of dimensions in the data or the dimensionality of the hidden clusters increases.
 A disjunctive normal form (DNF) is a standardization (or normalization) of a logical formula as a disjunction of conjunctive clauses.
 A disjunction of conjunctions where every variable or
its negation is represented once in each conjunction
(a minterm)
◦ each minterm appears only once
Example: the DNF of p ⊕ q is (p ∧ ¬q) ∨ (¬p ∧ q).
 Clusters may exist only in some
subspaces.
 Subspace-clustering: find clusters in
all the subspaces.
 What is (a) a unit, (b) a dense unit, (c) a cluster, and (d) a minimal description of a cluster?
 In Figure 1, the two-dimensional space (age, salary) has been partitioned by a 10×10 grid (ξ = 10).
 A unit is one grid cell, for example u = (30 ≤ age < 35) ∧ (1 ≤ salary < 2).
 A and B are both regions:
 A = (30 ≤ age < 50) ∧ (4 ≤ salary < 8)
 B = (40 ≤ age < 60) ∧ (2 ≤ salary < 6)
 Assuming the dense units have been shaded, A ∪ B is a cluster (A and B are connected regions).
 A ∩ B is not a maximal region.
 The minimal description of the cluster A ∪ B is the DNF expression
((30 ≤ age < 50) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6))
 In Figure 2, assume T = 20% (density threshold: 3 points). If selectivity(u) > T, then u is a dense unit.
 Here, selectivity is the fraction of the total data points contained in the unit.
 No 2-dimensional unit is dense, so there are no clusters in the original data space.
 When the points are projected onto the salary dimension, there are three 1-dimensional dense units and two clusters in the 1-dimensional salary subspace: C = (5 ≤ salary < 7) and D = (2 ≤ salary < 3).
 There is no dense unit, and hence no cluster, in the 1-dimensional age subspace.
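The selectivity test can be sketched in a few lines. The ten (age, salary) points below are made up so that they reproduce the situation described for Figure 2; they are not the data of the actual figure, and the function name and the value ranges (`LOW`, `HIGH`) are assumptions.

```python
from collections import Counter

# Hypothetical (age, salary) points; age in [20, 70), salary in [0, 10).
points = [(22, 5.5), (33, 5.2), (44, 5.8), (56, 6.5), (67, 6.1),
          (61, 6.9), (27, 2.5), (38, 2.1), (49, 2.9), (52, 8.5)]
LOW, HIGH = (20, 0), (70, 10)


def dense_units(points, dims, xi, tau):
    """Return {unit: selectivity} for the units of the subspace `dims`
    whose selectivity (fraction of all points they contain) exceeds tau."""
    counts = Counter()
    for p in points:
        # Index of the interval that holds the point in each chosen dimension.
        unit = tuple(
            min(int((p[d] - LOW[d]) / (HIGH[d] - LOW[d]) * xi), xi - 1)
            for d in dims
        )
        counts[unit] += 1
    n = len(points)
    return {unit: c / n for unit, c in counts.items() if c / n > tau}


print(dense_units(points, (0, 1), xi=10, tau=0.2))  # full 2-dim space: none
print(dense_units(points, (1,), xi=10, tau=0.2))    # salary: three dense units
print(dense_units(points, (0,), xi=10, tau=0.2))    # age: none
```

With these points the salary intervals 2, 5, and 6 are dense, and the adjacent intervals 5 and 6 form one cluster (5 ≤ salary < 7) while interval 2 forms the other.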
 CLIQUE consists of the following three steps:
1) Identification of subspaces that contain clusters.
2) Identification of clusters.
3) Generation of minimal descriptions for the clusters.
 Downward closure (DC) property: if a collection of points forms a cluster in a k-dimensional space, it is also part of a cluster in every (k-1)-dimensional projection of that space.
 Due to the DC property, subspaces are identified iteratively in a bottom-up fashion (from lower- to higher-dimensional subspaces).
 The difficulty in identifying subspaces that contain clusters lies in finding dense units in different subspaces.
 A. Use a bottom-up algorithm to find dense units; it exploits the monotonicity of the clustering criterion with respect to dimensionality to prune the search space.
 Lemma 1 (monotonicity): if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.
 The bottom-up algorithm proceeds as follows:
 Determine the 1-dimensional dense units, then self-join them to get the candidate 2-dimensional units. In general, once the (k-1)-dimensional dense units D_{k-1} are known, self-join D_{k-1} to get the set C_k of candidate k-dimensional units.
 Discard those candidates from C_k that have a (k-1)-dimensional projection not included in D_{k-1}.
 B. Make the bottom-up algorithm faster with MDL-based pruning.
 A. Determination of dense units
◦ Determine the set D_1 of all one-dimensional dense units.
◦ k = 1
◦ While D_k ≠ ∅ do
 k = k + 1
 Determine the set D_k as the set of all the k-dimensional dense units all of whose (k-1)-dimensional projections belong to D_{k-1}.
◦ End while
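The loop above can be sketched as an Apriori-style search. This is a simplified illustration rather than the paper's exact procedure: candidates are formed by taking the union of any two (k-1)-dimensional dense units instead of the ordered self-join, and density is checked by a full scan of the data. The unit representation (a sorted tuple of (dimension, interval) pairs) and all names are assumptions.

```python
from itertools import combinations


def is_dense(unit, cells, tau):
    """A unit is dense if the fraction of points inside it exceeds tau."""
    hits = sum(all(cell[d] == i for d, i in unit) for cell in cells)
    return hits / len(cells) > tau


def find_dense_units(cells, tau):
    """cells: one tuple of interval indices per point.
    Returns {k: set of k-dimensional dense units}."""
    n_dims = len(cells[0])
    level = {((d, cell[d]),) for cell in cells for d in range(n_dims)}
    level = {u for u in level if is_dense(u, cells, tau)}
    dense, k = {}, 1
    while level:
        dense[k] = level
        k += 1
        # Join step: merge pairs of dense (k-1)-dim units into k-dim candidates.
        candidates = set()
        for u, v in combinations(level, 2):
            merged = tuple(sorted(set(u) | set(v)))
            if len(merged) == k and len({d for d, _ in merged}) == k:
                candidates.add(merged)
        # Prune step: every (k-1)-dim projection must itself be dense.
        candidates = {
            c for c in candidates
            if all(tuple(x for x in c if x != drop) in level for drop in c)
        }
        level = {c for c in candidates if is_dense(c, cells, tau)}
    return dense


# Ten hypothetical points, already mapped to interval indices in two dimensions.
example_cells = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 2),
                 (2, 2), (2, 2), (5, 7), (6, 8), (7, 9)]
example_dense = find_dense_units(example_cells, tau=0.2)
print(example_dense)
```

The prune step is where monotonicity pays off: a candidate is counted against the data only if all of its lower-dimensional projections survived the previous level.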
 B. Determination of high-coverage subspaces.
◦ Determine all the subspaces that contain at least one dense unit.
◦ Sort these subspaces in descending order of coverage (the fraction of the points of the original data set that their dense units contain).
◦ Optimize a suitably defined Minimum Description Length criterion function to determine a threshold below which coverage is considered “low”.
◦ Select the subspaces with “high” coverage.
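The coverage ranking can be sketched as follows. Note that the cut-off rule here is a stand-in, not the MDL criterion of the paper: it simply picks the split of the sorted list that minimizes the total absolute deviation from the mean within the two groups. The function names, the unit representation (a tuple of (dimension, interval) pairs), and the coverage values are all hypothetical.

```python
def coverage(cells, dense_units_in_subspace):
    """Fraction of points that fall inside at least one dense unit of a subspace."""
    covered = sum(
        any(all(cell[d] == i for d, i in unit) for unit in dense_units_in_subspace)
        for cell in cells
    )
    return covered / len(cells)


def split_by_coverage(coverages):
    """coverages: {subspace: coverage}. Returns (selected, pruned) subspaces.
    Stand-in for the MDL-based cut: minimize within-group absolute deviation."""
    ranked = sorted(coverages, key=coverages.get, reverse=True)
    values = [coverages[s] for s in ranked]

    def cost(group):
        if not group:
            return 0.0
        mean = sum(group) / len(group)
        return sum(abs(v - mean) for v in group)

    best_cut = min(range(1, len(values) + 1),
                   key=lambda i: cost(values[:i]) + cost(values[i:]))
    return ranked[:best_cut], ranked[best_cut:]


# Hypothetical coverages for five subspaces, keyed by their dimensions.
example_coverages = {(0,): 0.9, (1,): 0.85, (0, 1): 0.8, (2,): 0.1, (1, 2): 0.05}
selected, pruned = split_by_coverage(example_coverages)
print(selected, pruned)
```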
 The input to the cluster-finding step is a set of dense units D, all in the same k-dimensional space.
 Depth-first search algorithm
◦ Use a depth-first search to find the connected components of the graph whose vertices are the dense units, with an edge between two units that share a common face. Start with some unit u in D, assign it the first cluster number, and find all the units it is connected to. If there are still units in D that have not yet been visited, pick one and repeat the procedure.
 For each high-coverage subspace S do
◦ Consider the set E of all the dense units in S.
◦ m = 0
◦ While E ≠ ∅
 m = m + 1
 Select a randomly chosen unit u from E.
 Assign to C_m the unit u and all units of E that are connected to u.
 E = E − C_m
◦ End while
 End for
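The cluster-finding loop amounts to computing connected components. In the sketch below, each dense unit is a tuple of interval indices within one subspace, and two units are taken to be connected when they differ by one interval in exactly one dimension (a shared face); the function name is an illustrative choice.

```python
def find_clusters(dense_units):
    """Group dense units of one subspace into connected components
    using an iterative depth-first search."""
    remaining = set(dense_units)
    clusters = []
    while remaining:
        start = remaining.pop()
        component, stack = {start}, [start]
        while stack:
            unit = stack.pop()
            # Visit the two face-neighbours along every dimension.
            for dim in range(len(unit)):
                for step in (-1, 1):
                    neighbour = unit[:dim] + (unit[dim] + step,) + unit[dim + 1:]
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        component.add(neighbour)
                        stack.append(neighbour)
        clusters.append(component)
    return clusters


# The two-dimensional dense units u21, u22, u32, u33, u83, u93 as index pairs.
D2 = {(2, 1), (2, 2), (3, 2), (3, 3), (8, 3), (9, 3)}
clusters_2d = find_clusters(D2)
print(sorted(sorted(c) for c in clusters_2d))
```

Applied to these six units, the search yields two components, {u21, u22, u32, u33} and {u83, u93}.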
 The input to this step consists of disjoint clusters in a k-dimensional subspace.
 The goal is to generate a minimal description of each cluster in two steps:
◦ Covering with maximal regions.
◦ Minimal cover.
 The CLIQUE Algorithm (cont.)
3. Minimal description of clusters
The minimal description of a cluster C, produced by the last procedure, is the minimum possible union of hyper-rectangular regions.
For example
 A ∪ B is the minimum cluster description of the shaded region.
 C ∪ D ∪ E is a non-minimal cluster description of the same region.
 The CLIQUE Algorithm (cont.)
3. Minimal description of clusters (algorithm)
For each cluster C do
1st stage
• c = 0
• While C ≠ ∅
 c = c + 1
 Choose a dense unit in C
 For i = 1 to l
o Grow the unit in both directions along the i-th dimension, trying to cover as many units in C as possible (units that do not belong to C must not be covered).
 End for
 Define the set I containing all the units covered by the above procedure
 C = C − I
• End while
2nd stage
• Remove every cover whose units are all covered by other covers.
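The two stages can be sketched as a greedy growth followed by removal of redundant covers. This is one possible reading, under stated assumptions: regions are allowed to grow over any unit of the cluster (including already covered ones), the seed is chosen deterministically as the smallest uncovered unit, and a region is stored as a pair of inclusive corner units. All names are illustrative.

```python
from itertools import product


def units_of(region):
    """All units inside a region given as (low_corner, high_corner), inclusive."""
    low, high = region
    return set(product(*(range(a, b + 1) for a, b in zip(low, high))))


def grow_region(seed, cluster):
    """Grow a one-unit box along each dimension in turn, in both directions,
    for as long as every unit of the box belongs to the cluster."""
    low, high = list(seed), list(seed)
    for dim in range(len(seed)):
        for bound, step in ((low, -1), (high, 1)):
            while True:
                bound[dim] += step
                if not units_of((low, high)) <= cluster:
                    bound[dim] -= step  # undo the step that left the cluster
                    break
    return tuple(low), tuple(high)


def minimal_description(cluster):
    cluster = set(cluster)
    uncovered, regions = set(cluster), []
    # 1st stage: greedily cover the cluster with grown regions.
    while uncovered:
        region = grow_region(min(uncovered), cluster)
        regions.append(region)
        uncovered -= units_of(region)
    # 2nd stage: drop regions whose units are all covered by the others.
    for region in sorted(regions, key=lambda r: len(units_of(r))):
        others = [r for r in regions if r != region]
        if others and units_of(region) <= set().union(*(units_of(r) for r in others)):
            regions = others
    return regions


# Example: the units u21, u22, u32, u33 as index pairs.
cover = minimal_description({(2, 1), (2, 2), (3, 2), (3, 3)})
print(cover)
```

For this cluster the first stage produces three overlapping regions and the second stage removes the middle one, leaving two regions of two units each.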
 A two-dimensional grid of edge size ξ is applied to the two-dimensional feature space.
 Two-dimensional and one-dimensional units are defined:
◦ u_i^q denotes the i-th one-dimensional unit along x_q
◦ uij denotes the two-dimensional unit resulting from the Cartesian product of the i-th and j-th intervals along x_1 and x_2, respectively.
 ξ = 10 and τ = 8% (thus, each unit containing more than 5 points is considered to be dense).
 The points in u48 and u58, u75 and u76, u83 and u93 are
collinear.
One-dimensional dense units:
D1 = {u_2^1, u_3^1, u_4^1, u_5^1, u_8^1, u_9^1, u_1^2, u_2^2, u_3^2, u_5^2, u_6^2}
Two-dimensional dense units:
D2 = {u21, u22, u32, u33, u83, u93}
Notes:
• Although each of u48, u75, and u76 contains more than 5 points, they are not included in D2.
• Although it seems unnatural for u83 and u93 to be included in D2, they are included since u_3^2 is dense.
• All subspaces of the two-dimensional space contain clusters.
One-dimensional clusters:
C1 = {u_2^1, u_3^1, u_4^1, u_5^1}
C2 = {u_8^1, u_9^1}
C3 = {u_1^2, u_2^2, u_3^2}
C4 = {u_5^2, u_6^2}
Two-dimensional clusters:
C5 = {u21, u22, u32, u33}
C6 = {u83, u93}
C1 = {(x1): 1 ≤ x1 < 5}
C2 = {(x1): 7 ≤ x1 < 9}
C3 = {(x2): 0 ≤ x2 < 3}
C4 = {(x2): 4 ≤ x2 < 6}
C5 = {(x1, x2): 1 ≤ x1 < 2, 0 ≤ x2 < 2} ∪ {(x1, x2): 2 ≤ x1 < 3, 1 ≤ x2 < 3}
C6 = {(x1, x2): 7 ≤ x1 < 9, 2 ≤ x2 < 3}
Note that C2 and C6 are essentially the same cluster, which is reported twice by the algorithm.
 We now empirically evaluate CLIQUE using synthetic data (generator from M. Zait and H. Messatfa, “A comparative study of clustering methods”).
 The goals of the experiments are to assess the efficiency and accuracy of CLIQUE:
 Efficiency: determine how the running time scales with
◦ Dimensionality of the data space.
◦ Dimensionality of clusters.
◦ Size of data.
 Accuracy: test whether CLIQUE recovers known clusters in some subspaces of a high-dimensional data space.
The test used clusters embedded in 5-dimensional subspaces while varying the dimensionality of the data space from 5 to 50.
CLIQUE was able to recover all clusters in every case.
 Strength
◦ automatically finds subspaces of the highest dimensionality such that
high density clusters exist in those subspaces
◦ insensitive to the order of records in input and does not presume some
canonical data distribution
◦ scales linearly with the size of input and has good scalability as the
number of dimensions in the data increases
 Weakness
◦ The accuracy of the clustering result may be degraded as the price of the method's simplicity
 The problem of high dimensionality is often tackled by requiring the user to specify the subspace for cluster analysis, but user identification of subspaces is quite error-prone. CLIQUE can find clusters embedded in subspaces of high-dimensional data without requiring the user to guess subspaces that might have interesting clusters.
 CLIQUE generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension.
 CLIQUE is insensitive to the order of input records, whereas some clustering algorithms are sensitive to the order of input data.
 Empirical evaluation shows that CLIQUE scales linearly with the size of the input and scales well as the number of dimensions in the data increases.
 CLIQUE can accurately discover clusters embedded in lower-dimensional subspaces.
CLIQUE: Automatic subspace clustering of high dimensional data for data mining applications
