CLIQUE and STING
Dr S.Natarajan
Professor and Key Resource Person
Department of Information Science and
Engineering
PES Institute of Technology
Bengaluru
natarajan@pes.edu
995280225
High-dimensional integration
• High-dimensional integrals in statistics, ML, physics
• Expectations / model averaging
• Marginalization
• Partition function / rank models / parameter learning
• Curse of dimensionality:
• Quadrature involves a weighted sum over an exponential
number of items (e.g., units of volume)
[Figure: an n-dimensional hypercube with side lengths L1, L2, L3, L4, ..., Ln]
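The blow-up is easy to see numerically; a minimal sketch in plain Python (ten quadrature points per axis is an arbitrary illustrative choice):

```python
# A product quadrature rule with k points per axis needs k**n
# volume elements in n dimensions -- exponential in n.
k = 10
for n in (1, 2, 5, 10, 20):
    print(f"n={n:2d}: {k**n:.3e} volume elements")
```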
High Dimensional Indexing Techniques
• Index trees (e.g., X-tree, TV-tree, SS-tree, SR-tree, M-
tree, Hybrid Tree)
– Sequential scan better at high dim. (Dimensionality Curse)
• Dimensionality reduction (e.g., Principal Component
Analysis (PCA)), then build index on reduced space
Datasets
• Synthetic dataset:
– 64-d data, 100,000 points, generates clusters in different
subspaces (cluster sizes and subspace dimensionalities follow
Zipf distribution), contains noise
• Real dataset:
– 64-d data (8×8 color histograms extracted from 70,000
images in Corel collection), available at
http://kdd.ics.uci.edu/databases/CorelFeatures
Preliminaries – Nearest
Neighbor Search
• Given a collection of data points and a query
point in m-dimensional metric space, find the
data point that is closest to the query point
• Variation: k-nearest neighbor
• Relevant to clustering and similarity search
• Applications: Geographical Information
Systems, similarity search in multimedia
databases
NN Search (Cont.)
Source: [2]
Problems with
High Dimensional Data
• A point’s nearest neighbor (NN) loses
meaning
Source: [2]
Problems (Cont.)
• NN query cost degrades – more strong
candidates to compare with
• In as few as 10 dimensions, linear scan
outperforms some multidimensional indexing
structures (e.g., SS-tree, R*-tree, SR-tree)
• Biology and genomic data can have
dimensions in the thousands.
Problems (Cont.)
• The presence of irrelevant attributes
decreases the tendency for clusters to form
• Points in high dimensional space have a high
degree of freedom; they could be so
scattered that they appear uniformly
distributed
Problems (Cont.)
• In which cluster does the query point fall?
The Curse
• Refers to the decrease in performance of query
processing when the dimensionality increases
• The focus of this talk will be on quality issues of
NN search and not on performance issues
• In particular, under certain conditions, the distance
between the nearest point and the query point
approaches the distance between the farthest point
and the query point as dimensionality approaches infinity
Curse (Cont.)
Source: N. Katayama, S. Satoh. Distinctiveness Sensitive Nearest Neighbor Search for Efficient Similarity
Retrieval of Multimedia Information. ICDE Conference, 2001.
Unstable NN-Query
A nearest neighbor query is unstable for a given
ε > 0 if the distance from the query point to
most data points is less than (1 + ε) times the
distance from the query point to its nearest
neighbor
Source: [2]
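A small numerical check of this definition (a NumPy sketch; reading "most" as a simple majority and using i.i.d. uniform data are both assumptions made here for illustration):

```python
import numpy as np

def is_unstable(data, query, eps):
    """A NN query is unstable if 'most' points lie within (1 + eps)
    times the distance from the query to its nearest neighbor."""
    dists = np.linalg.norm(data - query, axis=1)
    frac_close = np.mean(dists <= (1 + eps) * dists.min())
    return frac_close > 0.5  # "most" read as a simple majority

rng = np.random.default_rng(0)
for dim in (2, 20, 200):  # instability sets in as dimensionality grows
    X, q = rng.random((10_000, dim)), rng.random(dim)
    print(dim, is_unstable(X, q, eps=0.5))
```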
Theorem (Cont.)
Source: [2]
Theorem (Cont.)
Source: [1]
Rate of Convergence
• At what dimensionality do NN-queries
become unstable? Not easy to answer, so
experiments were performed on real and
synthetic data.
• If the conditions of the theorem are met,
DMAXm/DMINm should decrease with
increasing dimensionality
Conclusions
• Make sure there is enough contrast between the query
and the data points. If the distance to the NN is not much
different from the average distance, the NN may not
be meaningful
• When evaluating high-dimensional indexing
techniques, one should use data that do not satisfy
Theorem 1 and should compare with linear scan
• Meaningfulness also depends on how you
describe the object that is represented by the
data point (i.e., the feature vector)
Other Issues
• After selecting relevant attributes, the
dimensionality could still be high
• Report cases where the data does not yield any
meaningful nearest neighbor, i.e., indistinctive
nearest neighbors
Sudoku
• How many ways are there to fill a valid sudoku square?
• Sum over 9^81 ≈ 10^77 possible squares (items)
• w(x) = 1 if x is a valid square, w(x) = 0 otherwise
• Accurate solution within seconds:
• 1.634×10^21 vs 6.671×10^21
[Figure: partially filled sudoku grid]
MDL
Minimum Description Length Principle
 Occam’s razor: prefer the simplest hypothesis
 Simplest hypothesis  hypothesis with shortest
description length
 Minimum description length
 Prefer shortest hypothesis
 LC(x) is the description length for message x
under coding scheme C
h_MDL = argmin_{h ∈ H} ( L_{C1}(h) + L_{C2}(D|h) )

where L_{C1}(h) is the number of bits to encode hypothesis h
(the complexity of the model), and L_{C2}(D|h) is the number of
bits to encode data D given h (the number of mistakes).
MDL: Interpretation of –logP(D|H)+K(H)
 Interpreting –logP(D|H)+K(H)
K(H) is the minimum description length of H
–logP(D|H) is the minimum description length of D
(experimental data) given H. That is, if H perfectly
explains D, then P(D|H)=1 and this term is 0. If not
perfect, then it is interpreted as the number of bits
needed to encode errors.
 MDL: Minimum Description Length principle (J.
Rissanen): given data D, the best theory for D is
the theory H which minimizes the sum of
Length of encoding H
Length of encoding D, based on H (encoding errors)
CLIQUE: A Dimension-Growth
Subspace Clustering Method
 First dimension-growth subspace clustering algorithm
 Clustering starts at the single-dimension subspace and
moves upwards towards higher-dimension subspaces
 This algorithm can be viewed as the integration of
density-based and grid-based algorithms
CLIQUE (CLustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
• Automatically identifying subspaces of a high dimensional data
space that allow better clustering than original space
• CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular
units
– A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
– A cluster is a maximal set of connected dense units within a subspace
Definitions That Need to Be Known
 Unit : After forming a grid structure on
the space, each rectangular cell is
called a Unit.
 Dense: A unit is dense, if the fraction of
total data points contained in the
unit exceeds the input model
parameter.
 Cluster: A cluster is defined as a maximal
set of connected dense units.
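As a concrete illustration of these definitions, a minimal sketch (Python with NumPy; ξ and τ follow the parameter names used in these slides, and the tuple representation of a unit is a choice made here for illustration) that finds the dense 1-dimensional units of a dataset:

```python
import numpy as np

def dense_units_1d(data, xi, tau):
    """Partition each dimension into xi equal-length intervals and keep
    the units whose selectivity (fraction of all points that fall in the
    unit) exceeds the density threshold tau."""
    n_points, n_dims = data.shape
    dense = set()
    for dim in range(n_dims):
        lo, hi = data[:, dim].min(), data[:, dim].max()
        idx = ((data[:, dim] - lo) / (hi - lo) * xi).astype(int)
        idx = np.minimum(idx, xi - 1)          # put maximum values in the last bin
        for interval, count in enumerate(np.bincount(idx, minlength=xi)):
            if count / n_points > tau:
                dense.add(((dim, interval),))  # a unit = ((dimension, interval),)
    return dense
```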
Informal problem statement
 Given a large set of multidimensional data points, the
data space is usually not uniformly occupied by the
data points.
 CLIQUE's clustering identifies the sparse and the
"crowded" areas in space (or units), thereby
discovering the overall distribution patterns of the
dataset.
 A unit is dense if the fraction of total data points
contained in it exceeds an input model parameter.
 In CLIQUE, a cluster is defined as a maximal set of
connected dense units.
Formal Problem Statement
 Let A = {A1, A2, . . . , Ad} be a set of bounded, totally
ordered domains and S = A1 × A2 × · · · × Ad a d-
dimensional numerical space.
 We will refer to A1, . . . , Ad as the dimensions
(attributes) of S.
 The input consists of a set of d-dimensional points V =
{v1, v2, . . . , vm}
 where vi = (vi1, vi2, . . . , vid). The j-th component of vi is
drawn from domain Aj.
 The CLIQUE Algorithm (cont.)
3. Minimal description of clusters
The minimal description of a cluster C, produced by the above
procedure, is the minimum possible union of hyperrectangular regions.
For example
• A ∪ B is the minimal cluster description of the shaded region.
• C ∪ D ∪ E is a non-minimal cluster description of the same region.
Clique Working
 2-step process
 1st step – Partitioning the d-dimensional data space
 2nd step – Generating the minimal description of each
cluster.
1st step – Partitioning
 Partitioning is done for each dimension.
Example continue….
continue….
 The subspaces representing these dense units are
intersected to form a candidate search space in which
dense units of higher dimensionality may exist.
 This approach of selecting candidates is quite similar
to the Apriori-Gen process of generating candidates.
 Here it is expected that if something is dense in a
higher-dimensional space, it cannot be sparse in the
lower-dimensional projections.
More formally
 If a k-dimensional unit is dense, then so are its projections
in (k−1)-dimensional space.
 Given a k-dimensional candidate dense unit, if any of its
(k−1)-dimensional projection units is not dense, then the
k-dimensional unit cannot be dense
 So, we can generate candidate dense units in k-dimensional
space from the dense units found in (k−1)-dimensional
space
 The resulting space searched is much smaller than the
original space.
 The dense units are then examined in order to determine
the clusters.
Intersection
Dense units found with respect to age for the dimensions salary
and vacation are intersected in order to provide a candidate
search space for dense units of higher dimensionality.
2nd stage – Minimal Description
 For each cluster, CLIQUE determines the maximal
region that covers the cluster of connected dense units.
 It then determines a minimal cover (logic description)
for each cluster.
Effectiveness of CLIQUE
 CLIQUE automatically finds subspaces of the highest
dimensionality such that high-density clusters exist in
those subspaces.
 It is insensitive to the order of input objects
 It scales linearly with the size of input
 Easily scalable with the number of dimensions in the
data
GRID-BASED CLUSTERING
METHODS
 This is the approach in which we
quantize space into a finite number of
cells that form a grid structure on which
all of the operations for clustering are
performed.
 So, for example assume that we have a
set of records and we want to cluster with
respect to two attributes, then, we divide
the related space (plane), into a grid
structure and then we find the clusters.
[Figure: our "space" is the age–salary plane, partitioned into a grid; age from 20 to 60, salary from 0 to 8 (×10,000)]
Techniques for Grid-Based Clustering
The following are some techniques
that are used to perform Grid-Based
Clustering:
 CLIQUE (CLustering In QUest.)
 STING (STatistical Information Grid.)
 WaveCluster
Looking at CLIQUE as an Example
 CLIQUE is used for the clustering of high-
dimensional data present in large tables.
By high-dimensional data we mean
records that have many attributes.
 CLIQUE identifies the dense units in the
subspaces of high dimensional data
space, and uses these subspaces to
provide more efficient clustering.
How Does CLIQUE Work?
 Let us say that we have a set of records
that we would like to cluster in terms of
n-attributes.
 So, we are dealing with an n-
dimensional space.
 MAJOR STEPS :
 CLIQUE partitions each subspace that has
dimension 1 into the same number of equal
length intervals.
 Using this as basis, it partitions the n-
dimensional data space into non-overlapping
rectangular units.
CLIQUE: Major Steps (Cont.)
 Now CLIQUE’S goal is to identify the dense n-
dimensional units.
 It does this in the following way:
 CLIQUE finds dense units of higher
dimensionality by finding the dense units in the
subspaces.
 So, for example if we are dealing with a 3-
dimensional space, CLIQUE finds the dense
units in the 3 related PLANES (2-dimensional
subspaces.)
 It then intersects the extension of the
subspaces representing the dense units to
form a candidate search space in which dense
units of higher dimensionality would exist.
CLIQUE: Major Steps. (Cont.)
 Each maximal set of connected dense units is
considered a cluster.
 Using this definition, the dense units in the
subspaces are examined in order to find
clusters in the subspaces.
 The information of the subspaces is then used
to find clusters in the n-dimensional space.
 It must be noted that all cluster boundaries are
either horizontal or vertical. This is due to the
nature of the rectangular grid cells.
Example for CLIQUE
 Let us say that we want to cluster a set
of records that have three attributes,
namely, salary, vacation and age.
The data space for this data would
be 3-dimensional.
[Figure: 3-dimensional data space with axes age, salary, and vacation]
Example (Cont.)
 After plotting the data objects,
each dimension, (i.e., salary,
vacation and age) is split into
intervals of equal length.
 Then we form a 3-dimensional grid
on the space, each unit of which
would be a 3-D rectangle.
 Now, our goal is to find the dense
3-D rectangular units.
Example (Cont.)
 To do this, we find the dense units
of the subspaces of this 3-d space.
 So, we find the dense units with
respect to age for salary. This
means that we look at the salary-
age plane and find all the 2-D
rectangular units that are dense.
 We also find the dense 2-D
rectangular units for the vacation-
age plane.
Example 1
[Figure: dense units in the salary–age plane (salary in units of 10,000) and the vacation–age plane (vacation in weeks); age from 20 to 60, salary and vacation from 0 to 7]
Example (Cont.)
 Now let us try to visualize the
dense units of the two planes on the
following 3-d figure :
[Figure: the (age, vacation, salary) space with the dense units of the two planes extended; density threshold τ = 3]
Example (Cont.)
 We can extend the dense areas in the
vacation-age plane inwards.
 We can extend the dense areas in the
salary-age plane upwards.
 The intersection of these two spaces
would give us a candidate search space in
which 3-dimensional dense units exist.
 We then find the dense units in the
salary-vacation plane and we form an
extension of the subspace that represents
these dense units.
Example (Cont.)
 Now, we perform an intersection of
the candidate search space with the
extension of the dense units of the
salary-vacation plane, in order to
get all the 3-d dense units.
 So, What was the main idea?
 We used the dense units in
subspaces in order to find the dense
units in the 3-dimensional space.
 After finding the dense units, it is
very easy to find clusters.
Reflecting upon CLIQUE
 Why does CLIQUE confine its search for
dense units in high dimensions to the
intersection of dense units in subspaces?
 Because the Apriori property employs
prior knowledge of the items in the search
space so that portions of the space can be
pruned.
 The property for CLIQUE says that if a k-
dimensional unit is dense then so are its
projections in the (k-1) dimensional
space.
Strength and Weakness of CLIQUE
 Strength
 It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces.
 It is quite efficient.
 It is insensitive to the order of records in input and
does not presume some canonical data distribution.
 It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases.
 Weakness
 The accuracy of the clustering result may be
degraded at the expense of the simplicity of the
method.
CLIQUE: The Major Steps
• Partition the data space and find the number of points
that lie inside each cell of the partition.
• Identify the subspaces that contain clusters using the
Apriori principle
• Identify clusters:
– Determine dense units in all subspaces of interests
– Determine connected dense units in all subspaces of interests.
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected
dense units for each cluster
– Determination of minimal cover for each cluster
[Figure: dense units in the salary–age and vacation–age planes (axes as in Example 1) and their intersection in the (age, vacation) space; density threshold τ = 3]
Strength and Weakness of
CLIQUE
• Strength
– It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
– It is insensitive to the order of records in input and does not
presume some canonical data distribution
– It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
• Weakness
– The accuracy of the clustering result may be degraded at
the expense of simplicity of the method
Global Dimensionality Reduction (GDR)
[Figure: globally correlated data reduced onto the first Principal Component (PC)]
• works well only when data is globally correlated
• otherwise too many false positives result in high query cost
• solution: find local correlations instead of global correlation
Local Dimensionality Reduction (LDR)
[Figure: GDR vs. LDR — under GDR a single first PC is fitted to all points; under LDR, Cluster1 and Cluster2 are each reduced along their own first PC]
Correlated Cluster
[Figure: a correlated cluster showing the first PC (retained dim.), the second PC (eliminated dim.), the mean of all points in the cluster, and the centroid of the cluster (projection of the mean on the eliminated dim.)]
A set of locally correlated points = <PCs, subspace dim,
centroid, points>
Reconstruction Distance
[Figure: a cluster with its centroid, the first PC (retained dim) and second PC (eliminated dim); the distance between point Q and its projection on the eliminated dim is ReconstructionDistance(Q, S)]
Reconstruction Distance Bound
[Figure: MaxReconDist bands on either side of the first PC (retained dim) along the second PC (eliminated dim)]
ReconDist(P, S) ≤ MaxReconDist, ∀ P in S
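A minimal sketch of how the reconstruction distance could be computed (NumPy; the layout of `pcs`, holding unit-length principal components as rows, is an assumption made here):

```python
import numpy as np

def recon_dist(q, pcs, centroid, n_retained):
    """Reconstruction distance of point q w.r.t. cluster S: the norm of
    q's component along the eliminated PCs, measured from the cluster
    centroid. `pcs` holds the cluster's unit-length PCs as rows."""
    y = pcs @ (q - centroid)               # coordinates in the PC basis
    return np.linalg.norm(y[n_retained:])  # eliminated dimensions only
```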
Other constraints
• Dimensionality bound: A cluster must not retain
any more dimensions than necessary, and subspace
dimensionality ≤ MaxDim
• Size bound: number of points in the cluster ≥
MinSize
Clustering Algorithm
Step 1: Construct Spatial Clusters
• Choose a set of well-
scattered points as
centroids (piercing set)
from a random sample
• Group each point P in the
dataset with its closest
centroid C if Dist(P, C) is
within a distance threshold
Clustering Algorithm
Step 2: Choose PCs for each cluster
• Compute PCs
Clustering Algorithm
Step 3: Compute Subspace Dimensionality
[Plot: fraction of points obeying the reconstruction-distance bound vs. number of dimensions retained (0–16)]
• Assign each point to
cluster that needs
min dim. to
accommodate it
• Subspace dim. for
each cluster is the
min # dims to retain
to keep most points
Clustering Algorithm
Step 4: Recluster points
• Assign each point P to
the cluster S such that
ReconDist(P, S) ≤
MaxReconDist
• If there are multiple such
clusters, assign to the first
cluster (overcomes the
"splitting" problem)
[Figure: reclustering; some clusters become empty]
Clustering algorithm
Step 5: Map points
• Eliminate small
clusters
• Map each point to its
subspace (also store the
reconstruction dist.)
Clustering algorithm
Step 6: Iterate
• Iterate for more clusters as long as new
clusters are being found among the outliers
• Overall Complexity: 3 passes, O(N·D²·K)
Experiments (Part 1)
• Precision Experiments:
– Compare information loss in GDR and LDR for same reduced
dimensionality
– Precision = |Orig. Space Result|/|Reduced Space Result| (for
range queries)
– Note: precision measures efficiency, not answer quality
Datasets
• Synthetic dataset:
– 64-d data, 100,000 points, generates clusters in different
subspaces (cluster sizes and subspace dimensionalities follow
Zipf distribution), contains noise
• Real dataset:
– 64-d data (8×8 color histograms extracted from 70,000
images in Corel collection), available at
http://kdd.ics.uci.edu/databases/CorelFeatures
Precision Experiments (1)
[Charts: sensitivity of precision to skew in cluster size (skew 0–2) and to the number of clusters (1–10), LDR vs. GDR; precision on a 0–1 scale]
Precision Experiments (2)
[Charts: sensitivity of precision to the degree of correlation (0–0.2) and to the reduced dimensionality (7–42), LDR vs. GDR; precision on a 0–1 scale]
Index structure
Root containing pointers to the root of
each cluster index (also stores PCs and
subspace dim.)
[Figure: root node pointing to the index on Cluster 1 through the index on Cluster K, plus the set of outliers (no index: sequential scan)]
Properties: (1) disk based
(2) height ≤ 1 + height(original space index)
(3) almost balanced
Cluster Indices
• For each cluster S, multidimensional index on (d+1)-dimensional space instead
of d-dimensional space:
– NewImage(P,S)[j] = projection of P along the j-th PC for 1 ≤ j ≤ d
= ReconDist(P,S) for j = d+1
• Better estimate:
D(NewImage(P,S), NewImage(Q,S)) ≥ D(Image(P,S), Image(Q,S))
• Correctness: Lower Bounding Lemma
D(NewImage(P,S), NewImage(Q,S)) ≤ D(P,Q)
Effect of Extra Dimension
[Chart: I/O cost (number of random disk accesses, 0–1000) vs. reduced dimensionality (12–34), comparing the d-dim and (d+1)-dim indices]
Outlier Index
• Retain all dimensions
• May build an index, else use sequential scan
(we use sequential scan for our
experiments)
Query Support
• Correctness:
– Query result same as original space index
• Point query, Range Query, k-NN query
– similar to algorithms in multidimensional index structures
– see paper for details
• Dynamic insertions and deletions
– see paper for details
Experiments (Part 2)
• Cost Experiments:
– Compare linear scan, Original Space Index (OSI), GDR and LDR in
terms of I/O and CPU costs. We used the hybrid tree index structure for
OSI, GDR and LDR.
• Cost Formulae:
– Linear Scan: I/O cost (#rand accesses)=file_size/10, CPU cost
– OSI: I/O cost=num index nodes visited, CPU cost
– GDR: I/O cost=index cost+post processing cost (to eliminate false
positives), CPU cost
– LDR: I/O cost=index cost+post processing cost+outlier_file_size/10,
CPU cost
I/O Cost (#random disk accesses)
[Chart: I/O cost comparison — number of random disk accesses (0–3000) vs. reduced dimensionality (7–60) for LDR, GDR, OSI, and linear scan]
CPU Cost (only computation time)
[Chart: CPU cost comparison — CPU time in seconds (0–80) vs. reduced dimensionality (7–42) for LDR, GDR, OSI, and linear scan]
Conclusion
• LDR is a powerful dimensionality reduction technique
for high dimensional data
– reduces dimensionality with lower loss in distance
information compared to GDR
– achieves significantly lower query cost compared to linear
scan, original space index and GDR
• LDR has applications beyond high dimensional indexing
Motivation
 An object typically has dozens of attributes, the
domain for each attribute can be large
 Require the user to specify the subspace for cluster
analysis
 User-identification of subspaces is quite error-
prone.
The Contribution of CLIQUE
 Automatically find subspaces with
high-density clusters in high
dimensional attribute space
Background
 A1, A2, …, Ad are the dimensions of S:
 A = {A1, A2, …, Ad}
 S = A1 × A2 × … × Ad
 units:
 Partition every dimension into ξ intervals of equal
length
 unit u: {u1, u2, …, ud} where ui = [li, hi)
Background (Cont.)
 Selectivity: the fraction of total data points
contained in the unit
 Dense unit: selectivity(u) > τ
 Cluster: a maximal set of connected dense units
Example
[Figure 1]
Background (Cont.)
 region: axis-parallel rectangular set
 R ∩ C = R: R is contained in C
 maximal region: no proper superset of R is contained in C
 minimal description: a non-redundant covering of the
cluster with maximal regions
Example
((30 ≤ age < 50) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6))
[Figure 2]
CLIQUE Algorithm
1. Identification of dense units
2. Identification of clusters.
3. Generation of minimal description
Identification of dense units
 bottom-up algorithm:
 like Apriori algorithm
 Monotonicity:
 If a collection of points S is a cluster in a k-dimensional
space, then S is also part of a cluster in any (k–1)-
dimensional projections of this space.
Algorithm
1. determine 1-dimensional dense units
2. k = 2
3. generate candidate k-dimensional units from
(k-1)-dimensional dense units
4. if candidates are not empty
find dense units
k = k + 1
go to step 3
Algorithm - Candidate Generation
 Self-joining
insert into Ck
select u1.[l1, h1), u1.[l2, h2), …, u1.[lk-1, hk-1), u2.[lk-1, hk-1)
from Dk-1 u1, Dk-1 u2
where u1.a1 = u2.a1, u1.l1 = u2.l1, u1.h1 = u2.h1,
u1.a2 = u2.a2, u1.l2 = u2.l2, u1.h2 = u2.h2, …,
u1.ak-2 = u2.ak-2, u1.lk-2 = u2.lk-2, u1.hk-2 = u2.hk-2,
u1.ak-1 < u2.ak-1
 Pruning: discard candidates with a non-dense (k−1)-dimensional projection
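In Python, the same self-join and prune step might look like this (a sketch; representing a unit as a tuple of (dimension, interval) pairs sorted by dimension is a choice made here for illustration, matching the earlier dense_units_1d sketch):

```python
from itertools import combinations

def candidate_gen(dense_prev):
    """Self-join (k-1)-dimensional dense units that agree on their first
    k-2 (dimension, interval) pairs, then prune candidates that have a
    non-dense (k-1)-dimensional projection (the Apriori step)."""
    dense_set = set(dense_prev)
    candidates = set()
    for u1, u2 in combinations(sorted(dense_prev), 2):
        # join condition: same prefix, last dimensions differ (a_{k-1} < a'_{k-1})
        if u1[:-1] == u2[:-1] and u1[-1][0] < u2[-1][0]:
            cand = u1 + (u2[-1],)
            # prune: every (k-1)-dim projection of cand must be dense
            if all(cand[:i] + cand[i + 1:] in dense_set
                   for i in range(len(cand))):
                candidates.add(cand)
    return candidates
```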
Prune subspaces
 Objective: use only the dense units that lie in
“interesting” subspaces
 MDL principle:
 encode the input data under a given model and
select the encoding that minimizes the code
length.
Prune subspaces (Cont.)
 Group together dense units in the same subspace
 Compute the number of points for each subspace
 Sort subspaces in the descending order of their coverage
 Minimize the total length of the encoding
CL(i) = log₂(μ_I(i)) + Σ_{j ≤ i} log₂(|x_{Sj} − μ_I(i)|)
      + log₂(μ_P(i)) + Σ_{i < j ≤ n} log₂(|x_{Sj} − μ_P(i)|)

where x_{Sj} = Σ_{ui ∈ Sj} count(ui) is the coverage of subspace Sj,
μ_I(i) is the mean coverage of the selected subspaces S1, …, Si, and
μ_P(i) is the mean coverage of the pruned subspaces Si+1, …, Sn.
Prune subspaces (Cont.)
Partitioning of the subspaces into selected and pruned sets
Finding Clusters
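The step depicted here, grouping connected dense units into clusters, can be sketched as a graph search (assumptions: units use the (dimension, interval)-tuple representation from the earlier sketches, and two units are connected if they share a common face):

```python
def find_clusters(dense_units):
    """A cluster is a maximal set of connected dense units: search over
    units of the same subspace, connecting units whose interval indices
    differ by 1 in exactly one dimension (a common face)."""
    def connected(u, v):
        if [d for d, _ in u] != [d for d, _ in v]:
            return False                      # different subspaces
        return sum(abs(a[1] - b[1]) for a, b in zip(u, v)) == 1
    units, clusters, seen = list(dense_units), [], set()
    for u in units:
        if u in seen:
            continue
        cluster, stack = [], [u]
        seen.add(u)
        while stack:
            cur = stack.pop()
            cluster.append(cur)
            for v in units:
                if v not in seen and connected(cur, v):
                    seen.add(v)
                    stack.append(v)
        clusters.append(cluster)
    return clusters
```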
Generating minimal cluster
descriptions
 R is a cover of C
 optimal cover: NP-hard
 solution to the problem:
 greedily cover the cluster by a number of maximal
regions
 discard the redundant regions
Greedy growth
1) begin with an arbitrary dense
unit u ∈ C
2) greedily grow a maximal
region covering u, add it to R
3) repeat 2) until all units u ∈ C are
covered by some maximal
region in R
Minimal Cover
 Remove from the cover the smallest
maximal region which is redundant.
 Repeat the procedure until no maximal
region can be removed.
Performance Experiments
Comparison with BIRCH, DBSCAN
Concludes that CLIQUE performs better than BIRCH and DBSCAN
Real data experimental result
 datasets:
 insurance industry (Insur1, Insur2)
 department store (Store)
 bank (Bank)
 In all cases, we discovered
meaningful clusters
embedded in lower
dimensional subspaces.
Strength
 automatically finds clusters in subspaces
 insensitive to the order of records
 not presume some canonical data
distribution
 scales linearly with the size of input
 tolerant of missing values
Weakness
 depends on some parameters that are hard to
pre-select
 ξ (partition threshold)
 τ (density threshold)
 some potential clusters will be lost in the
dense-unit pruning procedure
 as a result, the accuracy of the algorithm degrades
What or who is STING?
 A singer who was the
lead singer of the
band Police and then
took up a solo career
and won many
Grammys.
 The bite of a scorpion.
 A Statistical
Information Grid
Approach to Spatial
Data Mining.
 All of the above.
What is Spatial Data?
Many definitions according to specific areas
According to GIS
 Spatial data may be thought of as features
located on or referenced to the Earth's
surface, such as roads, streams, political
boundaries, schools, land use classifications,
property ownership parcels, drinking water
intakes, pollution discharge sites - in short,
anything that can be mapped.
 Geographic features are stored as a series of
coordinate values. Each point along a road or
other feature is defined by a positional
coordinate value, such as longitude and
latitude.
 The GIS stores and manages the data not as
a map but as a series of layers or, as they
are sometimes called, themes
When viewed in a GIS, these layers
visually appear as one graphic, but are
actually still independent of each other.
This allows changes to specific themes,
without affecting the others.
Discussion Question 1: So can you define spatial Data Generically????
What are Spatial Databases?
•Spatial database systems aim at storing, retrieving, manipulating, querying, and
analyzing geometric data.
•Special data types are necessary to model geometry and to suitably represent
geometric data in database systems. These data types are usually called spatial
data types, such as point, line, and region but also include more complex types like
partitions and graphs (networks).
•Data Type understanding is a prerequisite for an effective construction of important
components of a spatial database system (like spatial index structures, optimizers
for spatial data, spatial query languages, storage management, and graphical user
interfaces) and for a cooperation with extensible DBMS providing spatial type
extension packages (like spatial data blades and cartridges).
•Excellent tutorial on spatial data and data types available at:
http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
Different Grid Levels during Query
Processing.
Spatial Data Resources
Pennsylvania Spatial Data Access
http://www.pasda.psu.edu/
The Missouri Spatial Data Information Service
http://msdis.missouri.edu/
National Spatial Data Infrastructure
http://www.fgdc.gov/nsdi/nsdi.html
Michigan Department of Natural Resources Online
www.dnr.state.mi.us/spatialdatalibrary/
Georgia Spatial Data Infrastructure Home Page
www.gis.state.ga.us/
Free GIS Data - GIS Data Depot
www.gisdatadepot.com
Spatial Data Mining
 Discovery of interesting characteristics and
patterns that may implicitly exist in spatial
databases.
 Huge amount of data specialized in nature.
 Clustering and region oriented queries are
common problems in this domain.
 We deal with high dimensional data
generally.
 Applications: GIS, Medical Imaging etc.
Problems????????
•Huge Amount of Data Specialized in Nature
•Complexity
•Defining of geometric patterns and region
oriented queries
•Conceptual nature of problem!
•Spatial Data Accessing
STING-An Introduction
•STING is a grid based method to efficiently process many
common region oriented queries on a set of points
•What defines region? You tell me! Essentially it is a set of points
satisfying some criterion
•It is a hierarchical Method. The idea is to capture statistical
information associated with spatial cells in such a manner that
the whole classes of queries can be answered without referring
to the individual objects.
•Complexity is hence even less than O(n); in fact, what do you
think it will be???
•Link to Paper: http://citeseer.nj.nec.com/wang97sting.html
Related Work
Spatial Data Mining
– Generalization Based Knowledge Discovery
– Spatial Data Dominant
– Non-Spatial Data Dominant
– Clustering Based Methods
– CLARANS, BIRCH, DBSCAN
Great comparison of clustering algorithms:
http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
Generalization Based Approaches
 Two types: Spatial Data Dominant and Non-
Spatial Data Dominant
 Both of these require that a generalization
hierarchy is given explicitly by experts or is
somehow generated automatically.
 Quality of mined data depends on the structure of
the hierarchy.
 Computational Complexity O(nlogn)
 So the onus shifted to developing algorithms
which discover characteristics directly from data.
This was the motivation to move to clustering
algorithms
Clustering Based Approaches
 BIRCH: Already covered Remember it??
Complexity??
 The problem with BIRCH is that it does not work
well with clusters which are not spherical.
 DBSCAN: Already covered Remember it??
Complexity??
 The Global Parameter Eps determination in
DBSCAN requires human participation
 When the point set to be clustered is the response
set of objects with some qualifications, then
determination of Eps must be done each time and
cost is hence higher.
Clustering Based Approaches
 CLARANS: Clustering Large Applications based upon
RANdomized Search.
 Although claims have been made of it being linear, it is
essentially quadratic.
 The computational complexity is at least Ω(KN²), where N
is the number of data points and K is the number of
clusters.
 Quality of results cannot be guaranteed when N is large, as
we use randomized search (see "Optimization with Randomized
Search Heuristics: The (A)NFL Theorem, Realistic Scenarios,
and Difficult Functions")
Related Work
 All the approaches described in previous slides
are all query dependent approaches
 The structure of queries influence the structure of
the algorithm and cannot be generalized to all
queries.
 As they scan all the data points the complexity
will at least be O(N)
STING THE OVERVIEW
 Spatial Area is divided into rectangular cells
 Different levels of cells corresponding to different
resolution and these cells have a hierarchical structure.
 Each cell at a higher level is partitioned into number of
cells of the next lower level
 Statistical information of each cell is calculated and stored
beforehand and is used to answer queries
GRID CELL HIERARCHY
Each Cell at (i-1)th level has 4
children at ith level (can be
changed)
The size of leaf cell is dependent
on the density of objects.
Generally it should be from several
dozens to thousands
 For each cell we have attribute-dependent and
attribute-independent parameters
 The attribute-independent parameter is the number of
objects in a cell, n
 For attribute dependent parameters it is assumed
that for each object its attributes have numerical
values.
 For each Numerical attribute we have the
following five parameters
GRID CELL HIERARCHY
GRID CELL HIERARCHY
 m – mean of all values in this cell
 s – standard deviation of all values in
this cell
 min – the minimum value of the
attribute in this cell
 max – the maximum value of the
attribute in this cell
 distribution – the type of distribution
this cell follows. (This is of
enumeration type)
Parameter Generation
•The determination of the dist parameter is as follows
•First, dist is set to the distribution type followed by most points
•An estimate is made of the number of conflicting points confl according to the following
rules:
1) if disti ≠ dist, m ≈ mi and s ≈ si, then confl is increased
by the amount ni.
2) if disti ≠ dist, and either m ≈ mi or s ≈ si is not satisfied, then confl is
set to n.
3) if disti = dist, m ≈ mi and s ≈ si, then confl is not changed
4) if disti = dist, and either m ≈ mi or s ≈ si is not satisfied, then confl is set to n.
Finally, if confl/n is greater than a threshold (say 0.05), then dist is set to NONE; otherwise
the original dist is retained
Parameter Generation

i      1       2       3       4
ni     100     50      60      10
mi     20.1    19.7    21      20.5
si     2.3     2.2     2.4     2.1
mini   4.5     5.5     3.8     7
maxi   36      34      37      40
disti  NORMAL  NORMAL  NORMAL  NONE

The parameters of the current cell are
n = 220
m = 20.27
s = 2.37
min = 3.8
max = 40
dist = NORMAL
This is so because there are 220 data points, out of which 10 do not follow a NORMAL distribution,
so confl/n = 10/220 = 0.045 < 0.05; hence dist is still NORMAL.
The parameters are calculated only once, so the overall computation time is O(N).
But querying requires much less time, as we only scan the K grid cells,
i.e., O(K).
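The pooled-statistics computation behind this example can be sketched as follows (plain Python; note that with full precision s comes out at about 2.35, and the slide's 2.37 results from rounding m to 20.27 before squaring):

```python
import math

def merge_children(children):
    """Parent-cell n, m, s, min, max computed from child cells
    via pooled statistics."""
    n = sum(c["n"] for c in children)
    m = sum(c["n"] * c["m"] for c in children) / n
    # E[X^2] pooled from each child's s_i^2 + m_i^2
    ex2 = sum(c["n"] * (c["s"] ** 2 + c["m"] ** 2) for c in children) / n
    return {"n": n, "m": m, "s": math.sqrt(ex2 - m * m),
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}

children = [
    {"n": 100, "m": 20.1, "s": 2.3, "min": 4.5, "max": 36},
    {"n": 50,  "m": 19.7, "s": 2.2, "min": 5.5, "max": 34},
    {"n": 60,  "m": 21.0, "s": 2.4, "min": 3.8, "max": 37},
    {"n": 10,  "m": 20.5, "s": 2.1, "min": 7.0, "max": 40},
]
print(merge_children(children))  # n=220, m~20.27, s~2.35, min=3.8, max=40
```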
Query Types
 If hierarchical structure cannot
answer a query then can go to
underlying database
 SQL like Language used to describe
queries
 Two types of common queries are
found: one is to find regions
satisfying certain constraints, and the
other takes in a region and returns
some attribute of the region
Query Type: Examples
Algorithm
 Top-down querying. Examine cells at a higher level and determine
if the cell is relevant to the query at some confidence level. This
likelihood can be defined as the proportion of objects in this
cell that satisfy the query conditions. After obtaining the
confidence interval, we label this cell as relevant or not
relevant at the specified confidence level.
 After doing so for the present layer, the process is repeated for the
children cells of the RELEVANT cells in the present layer only!!!
 Procedure continues till the bottom most layer
 Find region formed by relevant cells and return them
 If not satisfactory retrieve those data that fall into the
relevant cells from database and do some further processing.
 After all cells are labeled as relevant or not
relevant, we can easily find all regions that
satisfy the density specified by Breadth First
Search.
 For a relevant cell, we examine cells within a
certain distance d from the center of the
current cell to see if the average density
within this small area is greater than density
specified.
 If yes the cells are put into a queue
 Step 2 and 3 are repeated for all the cells in
the queue except cells previously examined
are omitted.
 When the queue is empty we get one region.
Algorithm
 The distance d = max(l, √(f/(cπ)))
 l, c, f are the side length of a bottom-
layer cell, the specified density, and a
small constant number set by STING
(it does not vary from one query to
another)
 l is usually the dominant term, so we
generally only examine the
neighborhood. Only if the
granularity is very small do we need to
examine every cell within that distance
rather than just the neighborhood.
Algorithm
Example
Given data: houses, where one of the attributes is price
Query: "Find those regions with area at least A where the number of houses per unit
area is at least c and at least β% of the houses have price between a and b with
(1 − α) confidence", where a < b. Here, a could be −∞ and b could be +∞.
This query can be written as
We begin from the top level, working our way down. Assume the dist type is NORMAL.
First we calculate the proportion of houses whose price lies in [a, b].
The probability that the price lies between a and b is computed from
m and s, the mean and standard deviation of all prices.
Example
 Now, assuming house prices are independent, given m and s
the number of houses with price in the range [a, b] has a
binomial distribution with parameters n and p, where n is the
number of houses. Now we consider the following cases
according to n, np and n(1−p):
a) n ≤ 30: the binomial distribution is used to determine the confidence interval of the
number of houses whose prices fall into [a, b]; divide it by n to get the
confidence interval for the proportion.
b) When n > 30, np ≥ 5, and n(1 − p) ≥ 5, the proportion of prices that fall
in [a, b] has a normal distribution. Then the 100(1 − α)%
confidence interval of the proportion is p ± z_{α/2}·√(p(1−p)/n).
c) When n > 30 but np < 5, the Poisson distribution with parameter np
is used for approximation.
d) When n > 30 but n(1−p) < 5, we can calculate the proportion of houses (X)
whose price is not in [a, b] using the Poisson distribution with parameter n(1−p), and
1−X is the proportion of houses whose prices are in [a, b].
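For case (b), the normal-approximation interval can be computed directly (a Python sketch of the standard formula p ± z·√(p(1−p)/n); the sample values are illustrative only):

```python
import math
from statistics import NormalDist

def proportion_ci(n, p, alpha=0.05):
    """Case (b): 100(1 - alpha)% normal-approximation confidence
    interval for a proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # 1.96 for alpha = 0.05
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

print(proportion_ci(n=200, p=0.4))  # ~(0.332, 0.468)
```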
Example
 Once we have the
confidence interval
or the estimated
range [p1, p2], we
can label this cell as
relevant or not
relevant.
 Let S be the area of the
cells at the bottom
layer. If p1 × n < S × c × β%,
we can label the cell as
not relevant;
otherwise as
relevant.
Analysis of STING
 Step one takes constant time
 Step 2 and 3 total time is proportional
to the total number of cells in the
hierarchy.
 Total number of cells is 1.33K, where K
is number of cells at bottom layer.
 In all cases it is found or claimed to be
O(K)
 Discussion Question: what is the
complexity if we need to go to step 7 in
the algorithm??
Quality
 STING, under the following sufficient condition,
guarantees that if a region satisfies the specification of
the query then it is returned.
 Let F be a region. The width of F is defined as the
side length of the maximum square that can fit in F.
Limiting Behavior of STING
 The regions returned by Sting are
an approximation of the result by
DBSCAN. As the granularity
approaches zero the regions
returned by STING approaches
result of DBSCAN.
 So the worst-case complexity is
O(n log n)!!!!!
Performance measure
Case A: Normal distribution
Query in the example answered in 0.2 sec
Structure generation: 9.8 seconds
Case B: None
Query in the example answered in 0.22 sec
Structure generation: 9.7 seconds
Performance measure
 Used a benchmark called SEQUOIA 2000 to
compare STING, DBSCAN, CLARANS
 All the previous algorithms have three phases in
query answering
1. Find Query Response
2. Build auxiliary structure
3. Do clustering
 STING does all of this in one step so is inherently
better.
Discussion Question
 “STING is trivially parallelizable.”
Comment why and what is the
importance of this statement?
References
 STING: A Statistical Information Grid Approach to Spatial Data
Mining. Wei Wang et al.
 Optimization with Randomized Search Heuristics: The (A)NFL
Theorem, Realistic Scenarios, and Difficult Functions. Stefan
Droste et al.
 Efficient and Effective Clustering Method for Spatial Data
Mining. R. Ng et al.
 BIRCH: An Efficient Data Clustering Method for Very Large
Databases. T. Zhang et al.
 Tutorial on spatial data types: http://www.informatik.fernuni-
hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
 An Efficient Approach to Clustering in Large Multimedia
Databases with Noise. A. Hinneburg et al.
 Comparison of clustering algorithms:
http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
Motivation
 All previous clustering algorithms are query dependent
 They are built for one query and are generally of no use for
other queries.
 They need a separate scan for each query.
 So computation is more complex, at least O(n).
 So we need a structure built from the database so that various
queries can be answered without rescanning.
Basics
 Grid-based method – quantizes the object space into a
finite number of cells that form a grid structure on
which all of the operations for clustering are
performed
 Develops a hierarchical structure out of the given data and
answers various queries efficiently.
 Every level of the hierarchy consists of cells
 Answering a query is not O(n), where n is the number
of elements in the database
A hierarchical structure for STING clustering
continue …..
 The root of the hierarchy is at level 1
 A cell in level i corresponds to the union of the areas of
its children at level i + 1
 A cell at a higher level is partitioned to form a number of
cells of the next lower level
 Statistical information of each cell is calculated and
stored beforehand and is used to answer queries
Cell parameters
 Attribute-independent parameter:
n – number of objects (points) in this cell
 Attribute-dependent parameters:
m – mean of all values in this cell
s – standard deviation of all values of the attribute in this
cell
min – the minimum value of the attribute in this cell
max – the maximum value of the attribute in this cell
distribution – the type of distribution that the attribute
values in this cell follow
Parameter Generation
 n, m, s, min, and max of bottom-level cells are
calculated directly from the data
 distribution can be either assigned by the user or can be
obtained by hypothesis tests, e.g., the χ² test
 Parameters of higher-level cells are calculated from the
parameters of lower-level cells.
continue…..
 Let n, m, s, min, max, dist be the parameters of the current cell
 and ni, mi, si, mini, maxi and disti be the parameters of the
corresponding lower-level cells
dist for Parent Cell
 Set dist to the distribution type followed by most points in
this cell
 Now check for conflicting points in the child cells; call it
confl:
1. If disti ≠ dist, mi ≈ m and si ≈ s, then confl is increased by an
amount of ni;
2. If disti ≠ dist, but either mi ≈ m or si ≈ s is not satisfied, then
set confl to n
3. If disti = dist, mi ≈ m and si ≈ s, then confl is increased by 0;
4. If disti = dist, but either mi ≈ m or si ≈ s is not satisfied, then
confl is set to n.
continue…..
 If is greater thana threshold t set dist as NONE.
 Otherwise keeptheoriginal type.
Example:
continue…..
 The parameters for the parent cell would be
n = 220
min = 3.8
m = 20.27
max = 40
s = 2.37
dist = NORMAL
 210 points whose distribution type is NORMAL
 Set dist of the parent as NORMAL
 confl = 10
 confl/n = 10/220 = 0.045 < 0.05, so keep the original.
Query types
 The STING structure is capable of answering various queries
 But if it doesn't, then we always have the underlying
database
 Even if the statistical information is not sufficient to answer
queries, we can still generate a possible set of answers.
Common queries
 Select regions that satisfy certain conditions
Select the maximal regions that have at least 100 houses per unit
area and at least 70% of the house prices are above $400K and
with total area at least 100 units with 90% confidence
SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9
continue….
 Select regions and return some function of the region
Select the range of age of houses in those maximal regions where there
are at least 100 houses per unit area and at least 70% of the houses have
price between $150K and $300K with area at least 100 units in California.
SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California
Algorithm
 With the hierarchical structure of grid cells on hand,
we can use a top-down approach to answer spatial data
mining queries
 For any query, we begin by examining cells on a high-
level layer
 Calculate the likelihood that this cell is relevant to the
query at some confidence level using the parameters of
this cell
 If the distribution type is NONE, we estimate the
likelihood using some distribution-free techniques
instead
continue….
 After we obtain the confidence interval, we label this
cell as relevant or not relevant at the specified
confidence level
 Proceed to the next layer, but only consider the children
of the relevant cells of the upper layer
 We repeat this until we reach the final layer
 Relevant cells of the final layer have enough statistical
information to give a satisfactory result to the query.
 However, for accurate mining we may refer to the data
corresponding to the relevant cells and process it further.
Finding regions
 After we have got all the relevant cells at the final level,
we need to output the regions that satisfy the query
 We can do this using Breadth First Search
Breadth First Search
 We examine cells within a certain distance from
the center of the current cell
 If the average density within this small area is
greater than the density specified, mark this area
 Put the relevant cells just examined in the queue.
 Take an element from the queue and repeat the same
procedure, except that only those relevant cells that
were not examined before are enqueued. When the
queue is empty, we have identified one region.
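A sketch of this region-growing BFS (plain Python; the `neighbors(c)` callable, standing for "cells within distance d of c that pass the average-density check", abstracts the first two steps above):

```python
from collections import deque

def find_regions(relevant, neighbors):
    """Group relevant bottom-layer cells into regions by breadth-first
    search. `neighbors(cell)` yields nearby cells passing the density
    check."""
    seen, regions = set(), []
    for start in relevant:
        if start in seen:
            continue
        region, queue = [], deque([start])
        seen.add(start)
        while queue:
            cell = queue.popleft()
            region.append(cell)
            for nb in neighbors(cell):
                if nb in relevant and nb not in seen:  # skip examined cells
                    seen.add(nb)
                    queue.append(nb)
        regions.append(region)  # queue empty: one region identified
    return regions
```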
Statistical Information Grid-based
Algorithm
1. Determine a layer to begin with.
2. For each cell of this layer, we calculate the confidence interval (or
estimated range) of the probability that this cell is relevant to the query.
3. From the interval calculated above, we label the cell as relevant or not
relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. We go down the hierarchy structure by one level. Go to Step 2 for those
cells that form the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step
7.
7. Retrieve the data that fall into the relevant cells and do further processing.
Return the result that meets the requirement of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the
requirement of the query. Go to Step 9.
9. Stop.
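The layer-by-layer descent of Steps 2–5 can be sketched as a recursion over the cell hierarchy (plain Python; the Cell container and the is_relevant confidence-interval test are hypothetical stand-ins for STING's stored parameters, not part of the original algorithm's code):

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    params: dict                      # n, m, s, min, max, dist
    children: list = field(default_factory=list)

def relevant_leaves(cell, is_relevant):
    """Steps 2-5: label a cell via its confidence interval and descend
    only into the children of relevant cells."""
    if not is_relevant(cell.params):  # Steps 2-3
        return []
    if not cell.children:             # Step 4: bottom layer reached
        return [cell]
    leaves = []                       # Step 5: go down one level
    for child in cell.children:
        leaves.extend(relevant_leaves(child, is_relevant))
    return leaves
```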
Time Analysis:
 Step 1 takes constant time. Steps 2 and 3 require
constant time per cell.
 The total time is less than or equal to the total number
of cells in our hierarchical structure.
 Notice that the total number of cells is 1.33K, where K
is the number of cells at the bottom layer.
 So the overall computation complexity on the grid
hierarchy structure is O(K)
Time Analysis:
 STING goes through the database once to compute the
statistical parameters of the cells
 The time complexity of generating clusters is O(n), where
n is the total number of objects.
 After generating the hierarchical structure, the query
processing time is O(g), where g is the total number of
grid cells at the lowest level, which is usually much
smaller than n.
Comparison
Definitions That Need to Be Known
 Spatial Data:
 Data that have a spatial or location
component.
 These are objects that themselves are located
in physical space.
 Examples: My house, lake Geneva, New York
City, etc.
 Spatial Area:
 The area that encompasses the locations of all
the spatial data is called spatial area.
STING (Introduction)
 STING is used for performing clustering
on spatial data.
 STING uses a hierarchical multi resolution
grid data structure to partition the spatial
area.
 STING's big benefit is that it processes
many common "region oriented" queries
on a set of points efficiently.
 We want to cluster the records that are in
a spatial table in terms of location.
 Placement of a record in a grid cell is
completely determined by its physical
location.
Hierarchical Structure of Each Grid Cell
 The spatial area is divided into
rectangular cells. (Using latitude and
longitude.)
 Each cell forms a hierarchical structure.
 This means that each cell at a higher
level is further partitioned into 4 smaller
cells in the lower level.
 In other words each cell at the ith level
(except the leaves) has 4 children in the
i+1 level.
 The union of the 4 children cells would
give back the parent cell in the level
above them.
Hierarchical Structure of Cells (Cont.)
 The size of the leaf level cells and the
number of layers depends upon how
much granularity the user wants.
 So, Why do we have a hierarchical
structure for cells?
 We have them in order to provide a
better granularity, or higher resolution.
A Hierarchical Structure for Sting Clustering
Statistical Parameters Stored in each
Cell
 For each cell in each layer we have
attribute dependent and attribute
independent parameters.
 Attribute Independent Parameter:
 Count : number of records in this cell.
 Attribute Dependent Parameter:
 (We are assuming that our attribute
values are real numbers.)
Statistical Parameters (Cont.)
 For each attribute of each cell we store
the following parameters:
 M – mean of all values of the attribute in
this cell.
 S – standard deviation of all values of
the attribute in this cell.
 Min – the minimum value for the attribute
in this cell.
 Max – the maximum value for the
attribute in this cell.
 Distribution – the type of distribution that
the attribute values in this cell follow (e.g.,
normal, exponential, etc.). None is assigned
to "Distribution" if the distribution is
unknown.
Storing of Statistical Parameters
 Statistical information regarding the
attributes in each grid cell, for each layer,
is pre-computed and stored before-
hand.
 The statistical parameters for the cells in
the lowest layer are computed directly from
the values that are present in the table.
 The statistical parameters for the cells in
all the other levels are computed from
their respective children cells that are in
the lower level.
How are Queries Processed ?
 STING can answer many queries (especially
region queries) efficiently, because we don't have
to access the full database.
 How are spatial data queries processed?
 We use a top-down approach to answer spatial
data queries.
 Start from a pre-selected layer-typically with a
small number of cells.
 The pre-selected layer does not have to be the
top most layer.
 For each cell in the current layer compute the
confidence interval (or estimated range of
probability) reflecting the cells relevance to the
given query.
Query Processing (Cont.)
 The confidence interval is calculated by
using the statistical parameters of each
cell.
 Remove irrelevant cells from further
consideration.
 When finished with the current layer,
proceed to the next lower level.
 Processing of the next lower level
examines only the remaining relevant
cells.
 Repeat this process until the bottom layer
is reached.
Sample Query Examples
 Assume that the spatial area is the map of the
regions of Long Island, Brooklyn and Queens.
 Our records represent apartments that are
present throughout the above region.
 Query : “ Find all the apartments that are for
rent near Stony Brook University that have a
rent range of: $800 to $1000”
 The above query depends upon the parameter
"near." For our example, near means within 15
miles of Stony Brook University.
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 

Kürzlich hochgeladen (20)

➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 

Clique and sting

techniques, should use data that do not satisfy Theorem 1 and should compare with linear scan • Meaningfulness also depends on how you describe the object that is represented by the data point (i.e., the feature vector)
  • 18. Other Issues • After selecting relevant attributes, the dimensionality could still be high • Report cases where the data do not yield any meaningful nearest neighbor, i.e., indistinctive nearest neighbors
  • 19. Sudoku • How many ways are there to fill a valid Sudoku square? • Sum over 9^81 ≈ 10^77 possible squares (items) • w(x) = 1 if x is a valid square, w(x) = 0 otherwise • Accurate solution within seconds: 1.634×10^21 vs. 6.671×10^21
  • 20. MDL
  • 21. Minimum Description Length Principle • Occam's razor: prefer the simplest hypothesis • Simplest hypothesis = the hypothesis with the shortest description length • Minimum description length: prefer the shortest hypothesis, where L_C(x) is the description length of message x under coding scheme C: h_MDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D|h) ], where L_C1(h) is the number of bits to encode hypothesis h (the complexity of the model) and L_C2(D|h) is the number of bits to encode the data D given h (the number of mistakes)
  • 22. MDL: Interpretation of −log P(D|H) + K(H) • K(H) is the minimum description length of H • −log P(D|H) is the minimum description length of D (the experimental data) given H: if H perfectly explains D, then P(D|H) = 1 and this term is 0; if not, it is interpreted as the number of bits needed to encode the errors • MDL: Minimum Description Length principle (J. Rissanen): given data D, the best theory for D is the theory H which minimizes the sum of the length of encoding H and the length of encoding D based on H (encoding errors)
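As a concrete illustration of this trade-off, the following minimal Python sketch (the function name and the two candidate hypotheses are invented purely for illustration) scores a hypothesis by L(h) + L(D|h), charging −log2 P(x|h) bits per observation:

    import math

    def description_length(model_bits, likelihoods):
        """MDL score: bits to encode the hypothesis plus bits to encode
        the data given it, i.e. -log2 P(D|h) = -sum(log2 P(x|h))."""
        return model_bits + sum(-math.log2(p) for p in likelihoods)

    # Two hypothetical hypotheses for 100 binary observations, 90 of them '1':
    # a simple one predicting P(1) = 0.5 and a fitted one predicting P(1) = 0.9.
    simple = description_length(10, [0.5] * 100)              # 110.0 bits
    fitted = description_length(40, [0.9] * 90 + [0.1] * 10)  # ~86.9 bits
    print(simple, fitted)  # MDL prefers whichever total is smaller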
  • 23. CLIQUE: A Dimension-Growth Subspace Clustering Method • The first dimension-growth subspace clustering algorithm • Clustering starts in single-dimensional subspaces and moves upward toward higher-dimensional subspaces • The algorithm can be viewed as an integration of density-based and grid-based clustering
  • 24. CLIQUE (CLustering In QUEst) • Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98) • Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space • CLIQUE can be considered both density-based and grid-based: it partitions each dimension into the same number of equal-length intervals; it partitions an m-dimensional data space into non-overlapping rectangular units; a unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter; a cluster is a maximal set of connected dense units within a subspace
  • 25. Definitions That Need to Be Known • Unit: after forming a grid structure on the space, each rectangular cell is called a unit • Dense: a unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter • Cluster: a cluster is defined as a maximal set of connected dense units (a sketch of the first pass follows below)
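A minimal sketch of the first pass over the data, finding the dense 1-D units; it assumes (purely for illustration) that every attribute value has been normalized to [0, 1), and the function name is hypothetical:

    from collections import Counter

    def dense_units_1d(points, xi, tau):
        """Return the dense 1-D units as (dimension, interval-index) pairs.
        Each dimension is split into xi equal-length intervals; a unit is
        dense if its selectivity (fraction of all points) exceeds tau."""
        n, d = len(points), len(points[0])
        counts = Counter()
        for p in points:
            for dim in range(d):
                # assumes every attribute value lies in [0, 1)
                counts[(dim, min(int(p[dim] * xi), xi - 1))] += 1
        return {u for u, c in counts.items() if c / n > tau}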
  • 26. Informal problem statement • Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points • CLIQUE's clustering identifies the sparse and the “crowded” areas in space (or units), thereby discovering the overall distribution patterns of the dataset • A unit is dense if the fraction of total data points contained in it exceeds an input model parameter • In CLIQUE, a cluster is defined as a maximal set of connected dense units
  • 27. Formal Problem Statement • Let A = {A1, A2, . . . , Ad} be a set of bounded, totally ordered domains and S = A1 × A2 × · · · × Ad a d-dimensional numerical space • We will refer to A1, . . . , Ad as the dimensions (attributes) of S • The input consists of a set of d-dimensional points V = {v1, v2, . . . , vm}, where vi = (vi1, vi2, . . . , vid) and the jth component of vi is drawn from domain Aj
  • 28. The CLIQUE Algorithm (cont.) 3. Minimal description of clusters • The minimal description of a cluster C, produced by the above procedure, is the minimum possible union of hyperrectangular regions • For example, A ∪ B is the minimum cluster description of the shaded region, while C ∪ D ∪ E is a non-minimal cluster description of the same region
  • 29. CLIQUE Working • A 2-step process • 1st step: partition the d-dimensional data space • 2nd step: generate a minimal description of each cluster
  • 30. 1st step - Partitioning • Partitioning is done for each dimension
  • 32. continue… • The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist • This approach to selecting candidates is quite similar to the Apriori candidate-generation process • The expectation here is that if something is dense in a higher-dimensional space, it cannot be sparse in the lower-dimensional projections
  • 33. More formally • If a k-dimensional unit is dense, then so are its projections in (k−1)-dimensional space • Given a k-dimensional candidate dense unit, if any of its (k−1)-dimensional projection units is not dense, then the k-dimensional unit cannot be dense • So we can generate candidate dense units in k-dimensional space from the dense units found in (k−1)-dimensional space • The resulting search space is much smaller than the original space • The dense units are then examined in order to determine the clusters (a sketch of this candidate generation follows below)
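A sketch of this Apriori-style join-and-prune step, assuming a unit is represented as a tuple of (dimension, interval-index) pairs sorted by dimension (the representation and the function name are illustrative, not from the paper):

    def generate_candidates(dense_prev):
        """Join (k-1)-dimensional dense units that agree on their first
        k-2 (dimension, interval) pairs, then prune every candidate that
        has a non-dense (k-1)-dimensional projection."""
        prev = sorted(dense_prev)
        prev_set = set(prev)
        candidates = set()
        for i, u1 in enumerate(prev):
            for u2 in prev[i + 1:]:
                if u1[:-1] == u2[:-1] and u1[-1][0] < u2[-1][0]:
                    cand = u1 + (u2[-1],)
                    # prune: all (k-1)-dimensional projections must be dense
                    if all(cand[:j] + cand[j + 1:] in prev_set
                           for j in range(len(cand))):
                        candidates.add(cand)
        return candidates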
  • 34. Intersection • Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality
  • 35. 2nd stage - Minimal Description • For each cluster, CLIQUE determines the maximal region that covers the cluster of connected dense units • It then determines a minimal cover (logic description) for each cluster
  • 36. Effectiveness of CLIQUE • CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces • It is insensitive to the order of input objects • It scales linearly with the size of the input • It is easily scalable with the number of dimensions in the data
  • 37. GRID-BASED CLUSTERING METHODS • This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed • So, for example, assume that we have a set of records and we want to cluster with respect to two attributes; then we divide the related space (a plane) into a grid structure and find the clusters
  • 38. [Figure: our “space” is the age-salary plane, with age 20-60 on one axis and salary (×10,000) 0-8 on the other]
  • 39. Techniques for Grid-Based Clustering • The following are some techniques that are used to perform grid-based clustering: • CLIQUE (CLustering In QUEst) • STING (STatistical Information Grid) • WaveCluster
  • 40. Looking at CLIQUE as an Example  CLIQUE is used for the clustering of high- dimensional data present in large tables. By high-dimensional data we mean records that have many attributes.  CLIQUE identifies the dense units in the subspaces of high dimensional data space, and uses these subspaces to provide more efficient clustering.
  • 41. How Does CLIQUE Work? • Let us say that we have a set of records that we would like to cluster in terms of n attributes, so we are dealing with an n-dimensional space • MAJOR STEPS: • CLIQUE partitions each subspace that has dimension 1 into the same number of equal-length intervals • Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units
  • 42. CLIQUE: Major Steps (Cont.) • Now CLIQUE's goal is to identify the dense n-dimensional units • It does this in the following way: CLIQUE finds dense units of higher dimensionality by finding the dense units in the subspaces • So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related PLANES (2-dimensional subspaces) • It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality would exist
  • 43. CLIQUE: Major Steps (Cont.) • Each maximal set of connected dense units is considered a cluster • Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces • The information from the subspaces is then used to find clusters in the n-dimensional space • It must be noted that all cluster boundaries are either horizontal or vertical; this is due to the nature of the rectangular grid cells
  • 44. Example for CLIQUE • Let us say that we want to cluster a set of records that have three attributes, namely salary, vacation and age • The data space for this data would be 3-dimensional (axes: age, salary, vacation)
  • 45. Example (Cont.) • After plotting the data objects, each dimension (i.e., salary, vacation and age) is split into intervals of equal length • Then we form a 3-dimensional grid on the space, each unit of which is a 3-D rectangle • Now our goal is to find the dense 3-D rectangular units
  • 46. Example (Cont.) • To do this, we find the dense units of the subspaces of this 3-d space • So we find the dense units with respect to age for salary; this means that we look at the salary-age plane and find all the 2-D rectangular units that are dense • We also find the dense 2-D rectangular units for the vacation-age plane
  • 47. Example [Figure: dense 2-D units in the salary(×10,000)-age plane and the vacation(week)-age plane, with age ranging from 20 to 60]
  • 48. Example (Cont.) • Now let us try to visualize the dense units of the two planes on a 3-d figure [Figure: 3-D plot of age, vacation and salary with dense regions around ages 30-50; density threshold τ = 3]
  • 49. Example (Cont.)  We can extend the dense areas in the vacation-age plane inwards.  We can extend the dense areas in the salary-age plane upwards.  The intersection of these two spaces would give us a candidate search space in which 3-dimensional dense units exist.  We then find the dense units in the salary-vacation plane and we form an extension of the subspace that represents these dense units.
  • 50. Example (Cont.) • Now we perform an intersection of the candidate search space with the extension of the dense units of the salary-vacation plane, in order to get all the 3-d dense units • So, what was the main idea? We used the dense units in subspaces in order to find the dense units in the 3-dimensional space • After finding the dense units, it is very easy to find clusters
  • 51. Reflecting upon CLIQUE • Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces? • Because the Apriori property employs prior knowledge of the items in the search space so that portions of the space can be pruned • The property for CLIQUE says that if a k-dimensional unit is dense, then so are its projections in the (k−1)-dimensional space
  • 52. Strength and Weakness of CLIQUE • Strength • It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces • It is quite efficient • It is insensitive to the order of records in the input and does not presume some canonical data distribution • It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases • Weakness • The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
  • 53. CLIQUE: The Major Steps • Partition the data space and find the number of points that lie inside each cell of the partition • Identify the subspaces that contain clusters using the Apriori principle • Identify clusters: determine dense units in all subspaces of interest, and determine connected dense units in all subspaces of interest (see the sketch below) • Generate a minimal description for the clusters: determine the maximal regions that cover each cluster of connected dense units, then determine a minimal cover for each cluster
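The "connected dense units" step can be sketched as a breadth-first search, assuming two units are connected when they lie in the same subspace and differ by one in exactly one interval index (the names are illustrative):

    from collections import deque

    def clusters_from_dense_units(dense_units):
        """Group dense units into clusters of face-connected units.
        Each unit is a tuple of (dimension, interval-index) pairs."""
        dense = set(dense_units)
        seen, clusters = set(), []
        for start in dense:
            if start in seen:
                continue
            cluster, queue = [], deque([start])
            seen.add(start)
            while queue:
                u = queue.popleft()
                cluster.append(u)
                for i, (dim, b) in enumerate(u):
                    for nb in (b - 1, b + 1):
                        v = u[:i] + ((dim, nb),) + u[i + 1:]
                        if v in dense and v not in seen:
                            seen.add(v)
                            queue.append(v)
            clusters.append(cluster)
        return clusters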
  • 54. [Figure: the dense units of the salary(×10,000)-age and vacation(week)-age planes and their intersection, as above; τ = 3]
  • 55. Strength and Weakness of CLIQUE • Strength – It automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces – It is insensitive to the order of records in input and does not presume some canonical data distribution – It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases • Weakness – The accuracy of the clustering result may be degraded at the expense of simplicity of the method
  • 56. Global Dimensionality Reduction (GDR) [Figure: data reduced onto its first principal component (PC)] • Works well only when the data is globally correlated • Otherwise too many false positives result in high query cost • Solution: find local correlations instead of a global correlation
  • 57. Local Dimensionality Reduction (LDR) [Figure: GDR projects all points onto one global first PC; LDR finds the first PC of Cluster1 and of Cluster2 separately]
  • 58. Correlated Cluster [Figure: a cluster with its first PC (retained dimension), second PC (eliminated dimension), the mean of all points in the cluster, and the centroid (projection of the mean on the eliminated dimension)] • A set of locally correlated points = <PCs, subspace dimensionality, centroid, points>
  • 59. Reconstruction Distance [Figure: a point Q, its projection on the eliminated dimension, and ReconDist(Q, S) between them]
  • 60. Reconstruction Distance Bound [Figure: MaxReconDist on either side of the retained dimension] • ReconDist(P, S) ≤ MaxReconDist, ∀ P in S (a sketch of the computation follows below)
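In code, the reconstruction distance is just the norm of a point's component along the eliminated PCs; a minimal NumPy sketch under that reading (the names are illustrative):

    import numpy as np

    def reconstruction_distance(q, centroid, pcs, k):
        """Distance lost by projecting q onto the first k PCs of a cluster:
        the norm of q's component along the eliminated directions.
        pcs: rows are unit-length principal components, sorted by variance."""
        centered = np.asarray(q) - centroid
        eliminated = pcs[k:]             # the directions that were dropped
        coords = eliminated @ centered   # q's coordinates along them
        return float(np.linalg.norm(coords))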
  • 61. Other constraints • Dimensionality bound: a cluster must not retain more dimensions than necessary; subspace dimensionality ≤ MaxDim • Size bound: number of points in the cluster ≥ MinSize
  • 62. Clustering Algorithm Step 1: Construct Spatial Clusters • Choose a set of well-scattered points as centroids (a piercing set) from a random sample • Group each point P in the dataset with its closest centroid C if Dist(P, C) is within a distance threshold
  • 63. Clustering Algorithm Step 2: Choose PCs for each cluster • Compute PCs
  • 64. Clustering Algorithm Step 3: Compute Subspace Dimensionality [Plot: fraction of points obeying the reconstruction bound vs. number of dimensions retained (0-16)] • Assign each point to the cluster that needs the fewest dimensions to accommodate it • The subspace dimensionality for each cluster is the minimum number of dimensions to retain so that most points obey the reconstruction bound
  • 65. Clustering Algorithm Step 4: Recluster points • Assign each point P to a cluster S such that ReconDist(P, S) ≤ MaxReconDist • If there are multiple such clusters, assign P to the first one (overcomes the “splitting” problem); empty clusters may result
  • 66. Clustering algorithm Step 5: Map points • Eliminate small clusters • Map each point to its cluster's subspace (also store the reconstruction distance)
  • 67. Clustering algorithm Step 6: Iterate • Iterate for more clusters as long as new clusters are being found among the outliers • Overall complexity: 3 passes, O(N·D²·K)
  • 68. Experiments (Part 1) • Precision Experiments: – Compare information loss in GDR and LDR for same reduced dimensionality – Precision = |Orig. Space Result|/|Reduced Space Result| (for range queries) – Note: precision measures efficiency, not answer quality
  • 69. Datasets • Synthetic dataset: – 64-d data, 100,000 points, generates clusters in different subspaces (cluster sizes and subspace dimensionalities follow Zipf distribution), contains noise • Real dataset: – 64-d data (8X8 color histograms extracted from 70,000 images in Corel collection), available at http://kdd.ics.uci.edu/databases/CorelFeatures
  • 70. Precision Experiments (1) [Plots: sensitivity of precision to skew in cluster size (0-2) and to the number of clusters (1-10), LDR vs. GDR]
  • 71. Precision Experiments (2) [Plots: sensitivity of precision to the degree of correlation (0-0.2) and to the reduced dimensionality (7-42), LDR vs. GDR]
  • 72. Index structure • Root containing pointers to the root of each cluster index (it also stores the PCs and subspace dimensionality) • Index on Cluster 1 … Index on Cluster K • Set of outliers (no index: sequential scan) • Properties: (1) disk based (2) height ≤ 1 + height(original space index) (3) almost balanced
  • 73. Cluster Indices • For each cluster S, build a multidimensional index on a (d+1)-dimensional space instead of the d-dimensional space: NewImage(P,S)[j] = projection of P along the jth PC for 1 ≤ j ≤ d, and = ReconDist(P,S) for j = d+1 • Better estimate: D(NewImage(P,S), NewImage(Q,S)) ≥ D(Image(P,S), Image(Q,S)) • Correctness (Lower Bounding Lemma): D(NewImage(P,S), NewImage(Q,S)) ≤ D(P,Q) (sketch below)
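A companion sketch of the (d+1)-dimensional image described above, again with illustrative names and the same row-per-PC convention as the earlier sketch:

    import numpy as np

    def new_image(p, centroid, pcs, k):
        """(k+1)-dimensional image of p in cluster S: its coordinates along
        the k retained PCs plus its reconstruction distance along the rest.
        pcs: rows are unit-length principal components of the cluster."""
        centered = np.asarray(p) - centroid
        retained = pcs[:k] @ centered
        recon = np.linalg.norm(pcs[k:] @ centered)
        return np.append(retained, recon)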
  • 74. Effect of the Extra Dimension [Plot: I/O cost (number of random disk accesses) vs. reduced dimensionality (12-34), d-dim vs. (d+1)-dim index]
  • 75. Outlier Index • Retain all dimensions • May build an index, else use sequential scan (we use sequential scan for our experiments)
  • 76. Query Support • Correctness: – Query result same as original space index • Point query, Range Query, k-NN query – similar to algorithms in multidimensional index structures – see paper for details • Dynamic insertions and deletions – see paper for details
  • 77. Experiments (Part 2) • Cost Experiments: compare linear scan, Original Space Index (OSI), GDR and LDR in terms of I/O and CPU costs; we used the hybrid tree index structure for OSI, GDR and LDR • Cost formulae: linear scan: I/O cost (#rand accesses) = file_size/10, plus CPU cost; OSI: I/O cost = number of index nodes visited, plus CPU cost; GDR: I/O cost = index cost + post-processing cost (to eliminate false positives), plus CPU cost; LDR: I/O cost = index cost + post-processing cost + outlier_file_size/10, plus CPU cost
  • 78. I/O Cost (number of random disk accesses) [Plot: I/O cost comparison vs. reduced dimensionality (7-60) for LDR, GDR, OSI and linear scan]
  • 79. CPU Cost (computation time only) [Plot: CPU cost (sec) comparison vs. reduced dimensionality (7-42) for LDR, GDR, OSI and linear scan]
  • 80. Conclusion • LDR is a powerful dimensionality reduction technique for high dimensional data – reduces dimensionality with lower loss in distance information compared to GDR – achieves significantly lower query cost compared to linear scan, original space index and GDR • LDR has applications beyond high dimensional indexing
  • 81. Motivation • An object typically has dozens of attributes, and the domain of each attribute can be large • Requiring the user to specify the subspace for cluster analysis is problematic • User identification of subspaces is quite error-prone
  • 82. The Contribution of CLIQUE • Automatically finds subspaces with high-density clusters in a high-dimensional attribute space
  • 83. Background • A = {A1, A2, …, Ad} are the dimensions (attributes) of S = A1 × A2 × … × Ad • Units: partition every dimension into ξ intervals of equal length • A unit u is {u1, u2, …, ud}, where ui = [li, hi)
  • 84. Background (Cont.) • Selectivity: the fraction of total data points contained in the unit • Dense unit: selectivity(u) > τ, the density threshold • Cluster: a maximal set of connected dense units
  • 85. Example [Figure: an example grid with its dense units]
  • 86. Background (Cont.) • Region: an axis-parallel rectangular set • R ∩ C = R: R is contained in C • Maximal region: no proper superset of R is contained in C • Minimal description: a non-redundant covering of the cluster with maximal regions
  • 87. Example • ((30 ≤ age < 50) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6))
  • 88. CLIQUE Algorithm 1. Identification of dense units 2. Identification of clusters 3. Generation of minimal descriptions
  • 89. Identification of dense units • A bottom-up algorithm, like the Apriori algorithm • Monotonicity: if a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k−1)-dimensional projection of this space
  • 90. Algorithm 1. Determine 1-dimensional dense units 2. k = 2 3. Generate candidate k-dimensional units from the (k−1)-dimensional dense units 4. If the candidates are not empty: find the dense units among them, set k = k + 1, and go to step 3
  • 91. Algorithm - Candidate Generation • Self-joining: insert into Ck select u1.[l1, h1), u1.[l2, h2), …, u1.[lk−1, hk−1), u2.[lk−1, hk−1) from Dk−1 u1, Dk−1 u2 where u1.a1 = u2.a1, u1.l1 = u2.l1, u1.h1 = u2.h1, u1.a2 = u2.a2, u1.l2 = u2.l2, u1.h2 = u2.h2, …, u1.ak−2 = u2.ak−2, u1.lk−2 = u2.lk−2, u1.hk−2 = u2.hk−2, u1.ak−1 < u2.ak−1 • Pruning: discard candidates with a non-dense (k−1)-dimensional projection
  • 93. Prune subspaces • Objective: use only the dense units that lie in “interesting” subspaces • MDL principle: encode the input data under a given model and select the encoding that minimizes the code length
  • 94. Prune subspaces (Cont.) • Group together the dense units that lie in the same subspace • Compute the number of points (coverage) of each subspace: x_Sj = Σ_{u ∈ Sj} count(u) • Sort the subspaces in descending order of their coverage • Minimize the total length of the encoding: CL(i) = log2(μ_I(i)) + Σ_{j ≤ i} log2(|x_Sj − μ_I(i)|) + log2(μ_P(i)) + Σ_{j > i} log2(|x_Sj − μ_P(i)|), where I = {S1, …, Si} is the selected set, P = {Si+1, …, Sn} the pruned set, and μ_I(i), μ_P(i) their mean coverages (a sketch follows below)
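A minimal sketch of this pruning step; the guards against taking log2 of zero are an implementation assumption, not part of the formula above:

    import math

    def mdl_cut(coverages):
        """Sort subspace coverages in descending order and pick the cut i
        minimizing CL(i): the code length of the selected set (first i
        subspaces) plus that of the pruned set (the rest), each encoded as
        its mean coverage plus per-subspace deviations from that mean."""
        xs = sorted(coverages, reverse=True)
        n = len(xs)

        def bits(vals):
            mu = sum(vals) / len(vals)
            # max(..., 1) guards log2(0); an implementation assumption
            return math.log2(max(mu, 1)) + sum(
                math.log2(max(abs(x - mu), 1)) for x in vals)

        best = min(range(1, n), key=lambda i: bits(xs[:i]) + bits(xs[i:]))
        return xs[:best], xs[best:]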
  • 95. Prune subspaces (Cont.) [Figure: partitioning of the subspaces into selected and pruned sets]
  • 96. Finding Clusters
  • 97. Generating minimal cluster descriptions • R is a cover of C • The optimal cover is NP-hard • Solution to the problem: greedily cover the cluster with a number of maximal regions, then discard the redundant regions
  • 98. Greedy growth 1) Begin with an arbitrary dense unit u ∈ C 2) Greedily grow a maximal region covering u and add it to R 3) Repeat 2) until all units u ∈ C are covered by some maximal region in R (a sketch follows below)
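A sketch of the greedy growth of one maximal region, using the same (dimension, interval-index) unit representation as the earlier sketches (illustrative, not the paper's implementation):

    from itertools import product

    def grow_maximal_region(seed, dense, xi):
        """Greedily grow an axis-parallel region from a seed dense unit:
        extend one dimension at a time, one interval at a time, while every
        unit covered by the enlarged region is dense.
        seed: tuple of (dim, bin) pairs; dense: set of such tuples;
        xi: number of intervals per dimension."""
        region = {dim: (b, b) for dim, b in seed}
        dims = sorted(region)

        def units(r):
            ranges = [range(r[d][0], r[d][1] + 1) for d in dims]
            return (tuple(zip(dims, bins)) for bins in product(*ranges))

        for d in dims:
            for step in (-1, 1):
                while True:
                    lo, hi = region[d]
                    nlo, nhi = (lo - 1, hi) if step < 0 else (lo, hi + 1)
                    if nlo < 0 or nhi >= xi:
                        break
                    trial = dict(region)
                    trial[d] = (nlo, nhi)
                    if all(u in dense for u in units(trial)):
                        region = trial
                    else:
                        break
        return region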
  • 99. Minimal Cover • Remove from the cover the smallest maximal region that is redundant • Repeat the procedure until no maximal region can be removed
  • 101. Comparison with BIRCH and DBSCAN • Concludes that CLIQUE performs better than BIRCH and DBSCAN
  • 102. Real-data experimental results • Datasets: insurance industry (Insur1, Insur2), department store (Store), bank (Bank) • In all cases, meaningful clusters embedded in lower-dimensional subspaces were discovered
  • 103. Strength • Automatically finds clusters in subspaces • Insensitive to the order of records • Does not presume some canonical data distribution • Scales linearly with the size of the input • Tolerant of missing values
  • 104. Weakness • Depends on parameters that are hard to pre-select: ξ (the partition threshold) and τ (the density threshold) • Some potential clusters may be lost in the dense-unit pruning procedure, so the correctness of the algorithm degrades
  • 105. What or who is STING? • A singer who was the lead singer of the band The Police, then took up a solo career and won many Grammys • The bite of a scorpion • A Statistical Information Grid approach to spatial data mining • All of the above
  • 106. What is Spatial Data? • Many definitions, according to specific areas • According to GIS: spatial data may be thought of as features located on or referenced to the Earth's surface, such as roads, streams, political boundaries, schools, land use classifications, property ownership parcels, drinking water intakes, pollution discharge sites; in short, anything that can be mapped • Geographic features are stored as a series of coordinate values; each point along a road or other feature is defined by a positional coordinate value, such as longitude and latitude • The GIS stores and manages the data not as a map but as a series of layers or, as they are sometimes called, themes. When viewed in a GIS, these layers visually appear as one graphic, but are actually still independent of each other; this allows changes to specific themes without affecting the others • Discussion Question 1: So can you define spatial data generically?
  • 107. •Spatial database systems aim at storing, retrieving, manipulating, querying, and analyzing geometric data. •Special data types are necessary to model geometry and to suitably represent geometric data in database systems. These data types are usually called spatial data types, such as point, line, and region but also include more complex types like partitions and graphs (networks). •Data Type understanding is a prerequisite for an effective construction of important components of a spatial database system (like spatial index structures, optimizers for spatial data, spatial query languages, storage management, and graphical user interfaces) and for a cooperation with extensible DBMS providing spatial type extension packages (like spatial data blades and cartridges). •Excellent tutorial on spatial data and data types available at: http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html What are Spatial Databases?
  • 109. Different Grid Levels during Query Processing.
  • 110. Pennsylvania Spatial Data Access http://www.pasda.psu.edu/ The Missouri Spatial Data Information Service http://msdis.missouri.edu/ National Spatial Data Infrastructure http://www.fgdc.gov/nsdi/nsdi.html Michigan Department of Natural Resources Online www.dnr.state.mi.us/spatialdatalibrary/ Georgia Spatial Data Infrastructure Home Page www.gis.state.ga.us/ Free GIS Data - GIS Data Depot www.gisdatadepot.com Spatial Data Resources
  • 111. Spatial Data Mining  Discovery of interesting characteristics and patterns that may implicitly exist in spatial databases.  Huge amount of data specialized in nature.  Clustering and region oriented queries are common problems in this domain.  We deal with high dimensional data generally.  Applications: GIS, Medical Imaging etc.
  • 112. • Huge amount of data, specialized in nature. Problems? • Complexity • Defining geometric patterns and region-oriented queries • The conceptual nature of the problem • Spatial data access
  • 113. STING - An Introduction • STING is a grid-based method to efficiently process many common region-oriented queries on a set of points • What defines a region? You tell me! Essentially it is a set of points satisfying some criterion • It is a hierarchical method: the idea is to capture statistical information associated with spatial cells in such a manner that whole classes of queries can be answered without referring to the individual objects • Complexity is hence even less than O(n); in fact, what do you think it will be? • Link to paper: http://citeseer.nj.nec.com/wang97sting.html
  • 114. Related Work [Diagram: spatial data mining divides into generalization-based knowledge discovery (spatial-data-dominant vs. non-spatial-data-dominant) and clustering-based methods (CLARANS, BIRCH, DBSCAN)] • A great comparison of clustering algorithms: http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
  • 115. Generalization Based Approaches • Two types: spatial-data-dominant and non-spatial-data-dominant • Both require that a generalization hierarchy is given explicitly by experts or is somehow generated automatically • The quality of the mined data depends on the structure of the hierarchy • Computational complexity: O(n log n) • So the onus shifted to developing algorithms that discover characteristics directly from the data; this was the motivation to move to clustering algorithms
  • 116. Clustering Based Approaches • BIRCH: already covered. Remember it? Complexity? • The problem with BIRCH is that it does not work well with clusters that are not spherical • DBSCAN: already covered. Remember it? Complexity? • Determining the global parameter Eps in DBSCAN requires human participation • When the point set to be clustered is the response set of objects with some qualification, Eps must be determined each time, so the cost is higher
  • 117. Clustering Based Approaches • CLARANS: Clustering Large Applications based upon RANdomized Search • Although claims have been made that it is linear, it is essentially quadratic • The computational complexity is at least Ω(K·N²), where N is the number of data points and K is the number of clusters • The quality of the results cannot be guaranteed when N is large, since randomized search is used (cf. Optimization with Randomized Search Heuristics: The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions)
  • 118. Related Work • All the approaches described in the previous slides are query-dependent approaches • The structure of the queries influences the structure of the algorithm and cannot be generalized to all queries • As they scan all the data points, the complexity will be at least O(N)
  • 119. STING THE OVERVIEW  Spatial Area is divided into rectangular cells  Different levels of cells corresponding to different resolution and these cells have a hierarchical structure.  Each cell at a higher level is partitioned into number of cells of the next lower level  Statistical information of each cell is calculated and stored beforehand and is used to answer queries
  • 120. GRID CELL HIERARCHY • Each cell at the (i−1)th level has 4 children at the ith level (this can be changed) • The size of a leaf cell depends on the density of objects; generally a leaf should hold from several dozen to several thousand objects
  • 121. GRID CELL HIERARCHY (Cont.) • For each cell we have attribute-dependent and attribute-independent parameters • The attribute-independent parameter is n, the number of objects in the cell • For the attribute-dependent parameters it is assumed that each object's attributes have numerical values • For each numerical attribute we have the following five parameters
  • 122. GRID CELL HIERARCHY (Cont.) • m - mean of all values in this cell • s - standard deviation of all values in this cell • min - the minimum value of the attribute in this cell • max - the maximum value of the attribute in this cell • distribution - the type of distribution the attribute values in this cell follow (an enumeration type)
  • 123. Parameter Generation • The dist parameter is determined as follows: first, dist is set to the distribution type followed by most points • An estimate is then made of the number of conflicting points, confl, according to the following rules: 1) if disti ≠ dist, mi ≈ m and si ≈ s, then confl is increased by ni; 2) if disti ≠ dist and either mi ≈ m or si ≈ s fails, then confl is set to n; 3) if disti = dist, mi ≈ m and si ≈ s, then confl is unchanged; 4) if disti = dist and either mi ≈ m or si ≈ s fails, then confl is set to n • Finally, if confl/n is greater than a threshold (say 0.05), dist is set to NONE; otherwise the original dist is retained
  • 124. Parameter Generation

    i:      1       2       3       4
    ni:     100     50      60      10
    mi:     20.1    19.7    21      20.5
    si:     2.3     2.2     2.4     2.1
    mini:   4.5     5.5     3.8     7
    maxi:   36      34      37      40
    disti:  NORMAL  NORMAL  NORMAL  NONE

  The parameters of the current (parent) cell are n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL. This is because only 10 of the 220 data points are not NORMAL, so confl/n = 10/220 = 0.045 < 0.05 and NORMAL is retained. The parameters are calculated only once, so the overall construction time is O(N); answering a query then requires scanning only the K grid cells, i.e., O(K).
  • 125. Query Types • If the hierarchical structure cannot answer a query, we can always go to the underlying database • An SQL-like language is used to describe queries • Two types of common queries: one finds regions satisfying certain constraints, and the other takes in a region and returns some attribute of the region
  • 127. Algorithm • Top-down querying: examine cells at a higher level and determine if each cell is relevant to the query at some confidence level; this likelihood can be defined as the proportion of objects in the cell that satisfy the query conditions • After obtaining the confidence interval, we label the cell relevant or not relevant at the specified confidence level • The process is then repeated for the children cells of the RELEVANT cells of the present layer only!!! • The procedure continues until the bottommost layer • Find the region formed by the relevant cells and return it • If not satisfactory, retrieve the data that fall into the relevant cells from the database and do some further processing
  • 128. • After all cells are labeled relevant or not relevant, we can easily find all regions that satisfy the specified density by breadth-first search • For a relevant cell, we examine the cells within a certain distance d of the center of the current cell to see if the average density within this small area is greater than the density specified • If yes, the cells are put into a queue • Steps 2 and 3 are repeated for all cells in the queue, except that cells previously examined are omitted • When the queue is empty, we have one region (a sketch follows below)
  • 129. • The distance is d = max(l, √(f/(c·π))), where l, c, f are the side length of a bottom-layer cell, the specified density, and a small constant set by STING (it does not vary from one query to another) • l is usually the dominant term, so we generally only examine the neighborhood; only if the granularity is very small do we need to examine every cell within that distance rather than just the neighbors
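The region-growing BFS can be sketched as follows; neighbors() and dense_enough() are hypothetical callbacks standing in for the distance-d neighborhood test and the density test:

    from collections import deque

    def sting_regions(relevant, neighbors, dense_enough):
        """Group relevant bottom-level cells into regions by breadth-first
        search. relevant: set of relevant cells; neighbors(cell) yields the
        cells within distance d of its center; dense_enough(cell) checks
        the specified density."""
        seen, regions = set(), []
        for start in relevant:
            if start in seen or not dense_enough(start):
                continue
            region, queue = [], deque([start])
            seen.add(start)
            while queue:
                c = queue.popleft()
                region.append(c)
                for nb in neighbors(c):
                    if nb in relevant and nb not in seen and dense_enough(nb):
                        seen.add(nb)
                        queue.append(nb)
            regions.append(region)
        return regions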
  • 130. Example • Given data: houses, one attribute being price • Query: “Find those regions with area at least A where the number of houses per unit area is at least c and at least b% of the houses have price between a and b with (1 − α) confidence,” where a < b; here a could be −∞ and b could be +∞ • We begin from the top level and work our way down; assume the dist type is NORMAL • First we calculate the proportion of houses whose price lies in [a, b]: for a NORMAL cell, the probability that the price lies in [a, b] is p = Φ((b − m)/s) − Φ((a − m)/s), where m and s are the mean and standard deviation of all prices in the cell and Φ is the standard normal CDF
  • 131. Example • Since the prices are assumed independent, the number of houses with price in [a, b] has a binomial distribution with parameters n and p, where n is the number of houses; we consider the following cases according to n, np and n(1 − p): a) n ≤ 30: the binomial distribution is used to determine the confidence interval of the number of houses whose prices fall into [a, b], which is divided by n to get the confidence interval for the proportion b) n > 30, np ≥ 5 and n(1 − p) ≥ 5: the proportion of prices falling in [a, b] is approximately normal, and the 100(1 − α)% confidence interval of the proportion is p̂ ± z_{α/2} √(p̂(1 − p̂)/n) c) n > 30 but np < 5: the Poisson distribution with parameter np is used for the approximation d) n > 30 but n(1 − p) < 5: calculate the proportion of houses X whose price is not in [a, b] using the Poisson distribution with parameter n(1 − p); then 1 − X is the proportion of houses whose price is in [a, b]
  • 132. Example • Once we have the confidence interval or the estimated range [p1, p2], we can label the cell as relevant or not relevant • Let S be the area of a cell at the bottom layer: if p1 × n < S × c × b%, we label the cell as not relevant, otherwise as relevant
  • 133. Analysis of STING • Step one takes constant time • For steps 2 and 3, the total time is proportional to the total number of cells in the hierarchy • The total number of cells is 1.33K, where K is the number of cells at the bottom layer • In all cases it is found (or claimed) to be O(K) • Discussion question: what is the complexity if we need to go to step 7 of the algorithm?
  • 134. Quality • STING, under the following sufficient condition, guarantees that if a region satisfies the specification of the query, then it is returned • Let F be a region; the width of F is defined as the side length of the maximum square that can fit in F
  • 135. Limiting Behavior of STING • The regions returned by STING are an approximation of the result of DBSCAN • As the granularity approaches zero, the regions returned by STING approach the result of DBSCAN • So the worst-case complexity is O(n log n)!
  • 136. Performance measure • Case A (distribution: NORMAL): the example query is answered in 0.2 sec; structure generation: 9.8 seconds • Case B (distribution: NONE): the example query is answered in 0.22 sec; structure generation: 9.7 seconds
  • 137. Performance measure • Used the SEQUOIA 2000 benchmark to compare STING, DBSCAN and CLARANS • All the previous algorithms have three phases in query answering: 1. find the query response 2. build an auxiliary structure 3. do the clustering • STING does all of this in one step, so it is inherently better
  • 138. Discussion Question  “STING is trivially parallelizable.” Comment why and what is the importance of this statement?
  • 139. References • Wei Wang et al. STING: A Statistical Information Grid Approach to Spatial Data Mining. • Stefan Droste et al. Optimization with Randomized Search Heuristics: The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions. • R. Ng et al. Efficient and Effective Clustering Methods for Spatial Data Mining. • T. Zhang et al. BIRCH: An Efficient Data Clustering Method for Very Large Databases. • Tutorial on spatial data types: http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html • A. Hinneburg et al. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. • Comparison of clustering algorithms: http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
  • 140. Motivation • All previous clustering algorithms are query dependent • They are built for one query and are generally of no use for other queries • They need a separate scan for each query, so the computation is complex, at least O(n) • So we need a structure built from the database so that various queries can be answered without rescanning
  • 141. Basics • Grid-based method: quantizes the object space into a finite number of cells that form a grid structure on which all of the clustering operations are performed • Develops a hierarchical structure from the given data to answer various queries efficiently • Every level of the hierarchy consists of cells • Answering a query is not O(n), where n is the number of elements in the database
  • 142. A hierarchical structure for STING clustering
  • 143. continue… • The root of the hierarchy is at level 1 • A cell at level i corresponds to the union of the areas of its children at level i + 1 • A cell at a higher level is partitioned to form a number of cells at the next lower level • Statistical information for each cell is calculated and stored beforehand and is used to answer queries
  • 144. Cell parameters • Attribute-independent parameter: n, the number of objects (points) in the cell • Attribute-dependent parameters: m - mean of all values in the cell; s - standard deviation of all values of the attribute in the cell; min - the minimum value of the attribute in the cell; max - the maximum value of the attribute in the cell; distribution - the type of distribution the attribute values in the cell follow
  • 145. Parameter Generation • n, m, s, min, and max of bottom-level cells are calculated directly from the data • The distribution can either be assigned by the user or obtained by hypothesis tests, e.g., the χ² test • The parameters of higher-level cells are calculated from the parameters of their lower-level cells
  • 146. continue… • Let n, m, s, min, max, dist be the parameters of the current cell, and ni, mi, si, mini, maxi, disti the parameters of the corresponding lower-level cells (a sketch of this aggregation follows below)
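The numerical parameters aggregate without touching the raw points, combining variances through second moments; a minimal sketch with illustrative names:

    import math

    def merge_cell_stats(children):
        """Compute a parent cell's (n, m, s, min, max) from its children's
        statistics alone. children: list of dicts with keys n, m, s, min,
        max describing the lower-level cells."""
        n = sum(c["n"] for c in children)
        m = sum(c["n"] * c["m"] for c in children) / n
        # combine variances via second moments: E[X^2] = s^2 + m^2
        second = sum(c["n"] * (c["s"] ** 2 + c["m"] ** 2) for c in children) / n
        s = math.sqrt(max(second - m * m, 0.0))
        return {"n": n, "m": m, "s": s,
                "min": min(c["min"] for c in children),
                "max": max(c["max"] for c in children)}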
  • 147. dist for the Parent Cell • Set dist to the distribution type followed by most points in this cell • Now check for conflicting points in the child cells; call this count confl: 1. If disti ≠ dist, mi ≈ m and si ≈ s, then confl is increased by an amount ni; 2. If disti ≠ dist, but either mi ≈ m or si ≈ s is not satisfied, then confl is set to n; 3. If disti = dist, mi ≈ m and si ≈ s, then confl is increased by 0; 4. If disti = dist, but either mi ≈ m or si ≈ s is not satisfied, then confl is set to n
  • 148. continue… • If confl/n is greater than a threshold t, set dist to NONE • Otherwise keep the original type • Example (using the table from slide 124):
  • 149. continue… • The parameters of the parent cell would be n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL • 210 points have distribution type NORMAL, so the parent's dist is set to NORMAL • confl = 10, and confl/n = 10/220 = 0.045 < 0.05, so the original type is kept (a sketch of this rule set follows below)
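A sketch of the dist rules above; the slides compare mi ≈ m and si ≈ s without specifying a tolerance, so the relative tolerance used here is purely an assumption:

    def parent_dist(children, m, s, threshold=0.05, rel_tol=0.15):
        """Determine the parent cell's dist: vote by points for the majority
        type, count conflicting points by the four rules, and fall back to
        NONE when confl/n exceeds the threshold. children: (n_i, m_i, s_i,
        dist_i) tuples; m, s: the parent's aggregated mean and std dev.
        rel_tol is an assumed tolerance for the m_i ~ m and s_i ~ s tests."""
        def near(a, b):
            return abs(a - b) <= rel_tol * max(abs(b), 1e-12)
        n = sum(c[0] for c in children)
        votes = {}
        for ni, mi, si, di in children:
            votes[di] = votes.get(di, 0) + ni
        dist = max(votes, key=votes.get)
        confl = 0
        for ni, mi, si, di in children:
            close = near(mi, m) and near(si, s)
            if di != dist and close:
                confl += ni      # rule 1
            elif di != dist or not close:
                confl = n        # rules 2 and 4
                break
            # rule 3: di == dist and close -> confl unchanged
        return "NONE" if confl / n > threshold else dist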
  • 150. Query types  The STING structure is capable of answering various queries  But if it cannot, we always have the underlying database  Even if the statistical information is not sufficient to answer a query exactly, we can still generate a possible set of answers.
  • 151. Common queries  Select regions that satisfy certain conditions. Select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices above $400K, with total area at least 100 units, with 90% confidence: SELECT REGION FROM house-map WHERE DENSITY IN (100, ∞) AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1) AND AREA (100, ∞) AND WITH CONFIDENCE 0.9
  • 152. continue….  Select regions and return some function of the region. Select the range of age of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have price between $150K and $300K, with area at least 100 units, in California: SELECT RANGE(age) FROM house-map WHERE DENSITY IN (100, ∞) AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1) AND AREA (100, ∞) AND LOCATION California
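To make the first query concrete, the following sketch tests one cell against the density and price conditions using only its stored statistics. The helper name, the dict representation, and the use of a simple normal-tail point estimate in place of the paper's confidence-interval test are all assumptions of this sketch:

    from statistics import NormalDist

    def cell_may_satisfy(cell, cell_area, min_density=100.0,
                         price_cut=400_000.0, min_frac=0.7):
        # Density test uses only the cell's count and area.
        if cell["n"] / cell_area < min_density:
            return False
        # The tail estimate assumes dist is NORMAL; otherwise fall
        # back to distribution-free techniques (returned as None here).
        if cell["dist"] != "NORMAL" or cell["s"] <= 0:
            return None
        # Estimate P(price > price_cut) under N(m, s^2).
        frac_above = 1.0 - NormalDist(cell["m"], cell["s"]).cdf(price_cut)
        return frac_above >= min_frac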
  • 153. Algorithm  With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries  For any query, we begin by examining cells at a high-level layer  We calculate the likelihood that a cell is relevant to the query at some confidence level, using the parameters of that cell  If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead
  • 154. continue….  After we obtain the confidence interval, we label the cell as relevant or not relevant at the specified confidence level  Proceed to the next layer, but only consider the children of the relevant cells of the upper layer  We repeat this until we reach the final layer  Relevant cells of the final layer have enough statistical information to give a satisfactory result to the query.  However, for accurate mining we may refer to the data corresponding to the relevant cells and process it further.
  • 155. Finding regions  After we have all the relevant cells at the final level, we need to output the regions that satisfy the query  We can do this using breadth-first search
  • 156. Breadth First Search  We examine cells within a certain distance from the center of the current cell  If the average density within this small area is greater than the density specified, mark this area  Put the relevant cells just examined in the queue.  Take an element from the queue and repeat the same procedure, except that only those relevant cells that have not been examined before are enqueued. When the queue is empty, we have identified one region.
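A compact sketch of this region-growing BFS over the relevant bottom-level cells; the neighbors and density_ok helpers stand in for the distance and density checks described above and are assumptions of this sketch:

    from collections import deque

    def grow_region(start, relevant, neighbors, density_ok):
        # relevant: set of relevant bottom-level cells; neighbors(c)
        # yields cells within the examination distance of c;
        # density_ok(c) checks the average density around c.
        region, queue, seen = [], deque([start]), {start}
        while queue:
            cell = queue.popleft()
            if not density_ok(cell):
                continue          # do not expand through sparse areas
            region.append(cell)
            for nb in neighbors(cell):
                if nb in relevant and nb not in seen:
                    seen.add(nb)  # enqueue each relevant cell only once
                    queue.append(nb)
        return region             # queue empty: one region identified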
  • 157. Statistical Information Grid-based Algorithm 1. Determine a layer to begin with. 2. For each cell of this layer, we calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query. 3. From the interval calculated above, we label the cell as relevant or not relevant. 4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5. 5. We go down the hierarchy structure by one level. Go to Step 2 for those cells that form the relevant cells of the higher-level layer. 6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7. 7. Retrieve the data that fall into the relevant cells and do further processing. Return the result that meets the requirement of the query. Go to Step 9. 8. Find the regions of relevant cells. Return those regions that meet the requirement of the query. Go to Step 9. 9. Stop.
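The following sketch captures Steps 1-5 of this loop (Steps 6-9 then post-process the returned cells); the layer and relevance interfaces are assumptions of this sketch:

    def sting_query(layers, children, is_relevant):
        # layers: list of cell layers, starting layer first, bottom
        # last; children(c) yields a cell's four children; is_relevant
        # applies the confidence-interval test of Steps 2-3.
        candidates = layers[0]                                # Step 1
        for depth in range(len(layers)):
            relevant = [c for c in candidates if is_relevant(c)]
            if depth == len(layers) - 1:                      # Step 4
                return relevant           # bottom reached: Steps 6-9
            # Step 5: descend only into children of relevant cells.
            candidates = [kid for c in relevant for kid in children(c)]
        return []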
  • 158. Time Analysis:  Step 1 takes constant time; Steps 2 and 3 require constant time per cell.  The total time is therefore at most proportional to the total number of cells in our hierarchical structure.  Since each layer has 1/4 as many cells as the layer below it, the total number of cells is K(1 + 1/4 + 1/16 + …) ≈ 1.33K, where K is the number of cells at the bottom layer.  So the overall computation complexity on the grid hierarchy structure is O(K)
  • 159. Time Analysis:  STING goes through the database once to compute the statistical parameters of the cells  The time complexity of generating clusters is O(n), where n is the total number of objects.  After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
  • 161. Definitions That Need to Be Known  Spatial Data:  Data that have a spatial or location component.  These are objects that are themselves located in physical space.  Examples: my house, Lake Geneva, New York City, etc.  Spatial Area:  The area that encompasses the locations of all the spatial data is called the spatial area.
  • 162. STING (Introduction)  STING is used for performing clustering on spatial data.  STING uses a hierarchical multi-resolution grid data structure to partition the spatial area.  STING's big benefit is that it processes many common “region oriented” queries on a set of points efficiently.  We want to cluster the records that are in a spatial table in terms of location.  Placement of a record in a grid cell is completely determined by its physical location.
  • 163. Hierarchical Structure of Each Grid Cell  The spatial area is divided into rectangular cells (using latitude and longitude).  Each cell forms a hierarchical structure.  This means that each cell at a higher level is further partitioned into 4 smaller cells at the lower level.  In other words, each cell at the ith level (except the leaves) has 4 children at the (i+1)th level.  The union of the 4 children cells gives back the parent cell in the level above them.
  • 164. Hierarchical Structure of Cells (Cont.)  The size of the leaf-level cells and the number of layers depend upon how much granularity the user wants.  So, why do we have a hierarchical structure for cells?  We have them in order to provide better granularity, or higher resolution.
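Because every cell splits into exactly 4 children, locating the cell that contains a given point at any level is simple arithmetic. A sketch, assuming an axis-aligned bounding box (x0, y0, width, height) for the spatial area:

    def cell_at_level(x, y, level, x0, y0, width, height):
        # Level 1 is the root (a single cell); each level quarters its
        # parent, so level i has a 2**(i-1) by 2**(i-1) grid of cells.
        side = 2 ** (level - 1)            # cells per axis at this level
        col = min(int((x - x0) / width * side), side - 1)
        row = min(int((y - y0) / height * side), side - 1)
        return level, row, col

The four children of cell (level, r, c) are then the cells (level+1, 2r, 2c) through (level+1, 2r+1, 2c+1).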
  • 165. A Hierarchical Structure for STING Clustering
  • 166. Statistical Parameters Stored in Each Cell  For each cell in each layer we have attribute-dependent and attribute-independent parameters.  Attribute-Independent Parameter:  Count: number of records in this cell.  Attribute-Dependent Parameters:  (We are assuming that our attribute values are real numbers.)
  • 167. Statistical Parameters (Cont.)  For each attribute of each cell we store the following parameters:  M - mean of all values of the attribute in this cell.  S - standard deviation of all values of the attribute in this cell.  Min - the minimum value of the attribute in this cell.  Max - the maximum value of the attribute in this cell.  Distribution - the type of distribution that the attribute values in this cell follow (e.g. normal, exponential, etc.). None is assigned to “Distribution” if the distribution type is unknown.
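As a sketch of how these parameters might be held in memory (the class and field names are this sketch's assumptions, not the paper's):

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class AttributeStats:
        m: float     # mean of the attribute's values in the cell
        s: float     # standard deviation of those values
        min: float   # smallest value of the attribute in the cell
        max: float   # largest value of the attribute in the cell
        dist: str    # e.g. "NORMAL" or "EXPONENTIAL"; "NONE" if unknown

    @dataclass
    class GridCell:
        count: int                                 # attribute-independent
        stats: Dict[str, AttributeStats] = field(default_factory=dict)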
  • 168. Storing of Statistical Parameters  Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.  The statistical parameters for the cells in the lowest layer are computed directly from the values that are present in the table.  The statistical parameters for the cells in all the other levels are computed from their respective children cells in the lower level.
  • 169. How are Queries Processed?  STING can answer many queries (especially region queries) efficiently, because we don't have to access the full database.  How are spatial data queries processed?  We use a top-down approach to answer spatial data queries.  Start from a pre-selected layer, typically one with a small number of cells.  The pre-selected layer does not have to be the topmost layer.  For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.
  • 170. Query Processing (Cont.)  The confidence interval is calculated by using the statistical parameters of each cell.  Remove irrelevant cells from further consideration.  When finished with the current layer, proceed to the next lower level.  Processing of the next lower level examines only the remaining relevant cells.  Repeat this process until the bottom layer is reached.
  • 171. Sample Query Examples  Assume that the spatial area is the map of the regions of Long Island, Brooklyn, and Queens.  Our records represent apartments that are present throughout the above region.  Query: “Find all the apartments that are for rent near Stony Brook University that have a rent range of $800 to $1000.”  The above query depends upon the parameter “near.” For our example, near means within 15 miles of Stony Brook University.