Clustering Tendency
Assistant Professor :
Dr. Mohammad Javad Fadaeieslam
By :
Amir Shokri – Farshad Asgharzade – Sajad Dehghan – Amin Nazari
Amirsh.nll@gmail.com
Introduction
Problem with clustering algorithms :
Most clustering algorithms impose a clustering structure on a data set X even though
the vectors of X do not exhibit such a structure.
Solution :
Before we apply any clustering algorithm to X, it must first be verified that X
possesses a clustering structure.
clustering tendency : The problem of determining the presence or the absence of a
clustering structure in X.
• Clustering tendency methods have been applied in various application areas. However,
most of these methods are suitable only for l = 2. In the sequel, we discuss the problem
in the general l ≥ 2 case.
Focus : methods that are suitable for detecting compact clusters (if any).
Clustering Tendency
• Clustering tendency is heavily based on hypothesis testing.
• Specifically, it is based on testing the randomness (null) hypothesis (H0) against the
clustering hypothesis (H2) and the regularity hypothesis (H1).
Randomness hypothesis (H0) : the vectors of X are randomly distributed, according to the
uniform distribution, in the sampling window of X.
Regularity hypothesis (H1) : the vectors of X are regularly spaced in the sampling window.
This implies that they are not too close to each other.
Clustering hypothesis (H2) : the vectors of X form clusters.
• P(q|H0), P(q|H1), and P(q|H2) are estimated via Monte Carlo simulations.
• If the randomness or the regularity hypothesis is accepted, methods alternative to
clustering analysis should be used for the interpretation of the data set X.
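Since p(q|H0) rarely has a closed form, these pdfs are estimated by simulation. The following minimal Python sketch (ours, not from the slides) shows one way to estimate acceptance/rejection thresholds for a generic statistic under H0; the unit-hypercube sampling window and the user-supplied `test_statistic` callable are assumptions of the sketch:

```python
import numpy as np

def monte_carlo_null_quantiles(test_statistic, n_points, dim,
                               n_sim=1000, alpha=0.05, rng=None):
    """Estimate the lower/upper alpha-quantiles of a statistic q under H0,
    i.e., when the data are uniform over the (assumed unit hypercube) window."""
    rng = np.random.default_rng(rng)
    q_values = np.empty(n_sim)
    for i in range(n_sim):
        X0 = rng.uniform(size=(n_points, dim))  # one random (H0) data set
        q_values[i] = test_statistic(X0)
    return np.quantile(q_values, alpha), np.quantile(q_values, 1 - alpha)
```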
Clustering Tendency(cont.)
• There are two key points that have an important influence on the performance of many
statistical tests used in clustering tendency:
1) dimensionality of the data
2) sampling window
Problem with the sampling window : in practice, we do not know the sampling window.
 Ways to overcome this situation :
1) use a periodic extension of the sampling window
2) use a sampling frame
sampling frame : consider the data in a smaller area inside the sampling window.
• With a sampling frame, we overcome the boundary effects by considering, for the estimation of
statistical properties, points that lie outside the sampling frame but inside the sampling
window.
Sampling Window
• A method for estimating the sampling window is to use the convex hull of the vectors in X.
Problems : the distributions for the tests, derived using this sampling window :
1) depend on the specific data at hand.
2) entail a high computational cost for computing the convex hull of X.
An alternative : define the sampling window as the hypersphere centered at the mean point
of X and including half of its vectors.
• To assess test statistics, q, suitable for the detection of clustering tendency, we must be
able to generate :
1) clustered data
2) regularly spaced data
(these are used to estimate P(q|H2) and P(q|H1), respectively).
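As a small illustration of the hypersphere alternative mentioned above, the following Python sketch (ours, not from the slides) estimates such a window as the hypersphere centered at the mean of X whose radius is the median distance to the mean, so that it contains half of the vectors:

```python
import numpy as np

def hypersphere_sampling_window(X):
    """Return the center (mean of X) and a radius equal to the median
    distance to the mean, so the hypersphere contains half of the vectors."""
    center = X.mean(axis=0)
    radius = np.median(np.linalg.norm(X - center, axis=1))
    return center, radius
```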
Generation of clustered data
• A well-known procedure for generating (compact) clustered data is the Neyman–
Scott procedure :
1) assumes that the sampling window is known
2) The number of points in each cluster follows the Poisson distribution
 Required inputs :
1. the total number of points N of the set
2. the intensity of the Poisson process
3. a spread parameter that controls the spread of each cluster around its center
Generation of clustered data(cont.)
 STEPS :
I. randomly insert a point 𝒚𝒊 in the sampling window, following the uniform
distribution
II. This point serves as the center of the ith cluster, and we determine its number of
vectors, 𝒏𝒊, using the Poisson distribution.
III. the n_i points around y_i are generated according to the normal distribution with mean
y_i and covariance matrix δ²I.
• If a point turns out to be outside the sampling window, we ignore it and another one is
generated.
• This procedure is repeated until N points have been inserted in the sampling window.
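An illustrative Python sketch of the procedure, assuming the sampling window is the unit hypercube and that the spread parameter delta is the standard deviation of the per-cluster normal distribution (parameter names are ours):

```python
import numpy as np

def neyman_scott(N, intensity, delta, dim=2, rng=None):
    """Generate N clustered points in the unit hypercube (assumed window):
    cluster centers are uniform, cluster sizes are Poisson(intensity), and
    members are normal around the center with covariance delta**2 * I."""
    rng = np.random.default_rng(rng)
    points = []
    while len(points) < N:
        center = rng.uniform(size=dim)           # step I: cluster center y_i
        n_i = rng.poisson(intensity)             # step II: cluster size n_i
        for _ in range(n_i):
            if len(points) >= N:
                break
            while True:                          # step III: normal spread;
                p = rng.normal(loc=center, scale=delta, size=dim)
                if np.all((p >= 0.0) & (p <= 1.0)):  # points falling outside
                    points.append(p)                 # the window are redrawn
                    break
    return np.asarray(points)
```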
Generation of regularly spaced data
• Perhaps the simplest way to produce regularly spaced points is to define a lattice in the
convex hull of X and to place the vectors at its vertices.
An alternative procedure, known as simple sequential inhibition (SSI), is the following :
I. The points 𝒚𝒊 are inserted in the sampling window one at a time.
II. For each point we define a hypersphere of radius r centered at 𝒚𝒊.
III. The next point can be placed anywhere in the sampling window in such a way that its
hypersphere does not intersect with any of the hyperspheres defined by the previously
inserted points.
The procedure stops when :
 a predetermined number of points have been inserted in the sampling window, or
 no more points can be inserted in the sampling window after, say, a few thousand
trials.
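A minimal Python sketch of SSI under the same unit-hypercube assumption; `max_trials` (our choice of name) implements the "few thousand trials" stopping rule:

```python
import numpy as np

def simple_sequential_inhibition(N, r, dim=2, max_trials=5000, rng=None):
    """Place up to N points uniformly in the unit hypercube so that no two
    points are closer than 2r (their radius-r hyperspheres do not intersect).
    Stops early after max_trials consecutive rejected proposals."""
    rng = np.random.default_rng(rng)
    points, trials = [], 0
    while len(points) < N and trials < max_trials:
        candidate = rng.uniform(size=dim)
        if all(np.linalg.norm(candidate - p) >= 2 * r for p in points):
            points.append(candidate)
            trials = 0
        else:
            trials += 1
    return np.asarray(points)
```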
Generation of regularly spaced data(cont.)
packing density : A measure of the degree of fulfillment of the sampling window
• which is defined as : ρ = (L/V) · V_r
 L/V is the average number of points per unit volume
 V_r is the volume of a hypersphere of radius r, which can be written as V_r = A·r^l
• where A is the volume of the l-dimensional hypersphere with unit radius, which is given
by :
A = π^(l/2) / Γ(l/2 + 1)
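For illustration, a small Python sketch (ours, not from the slides) that evaluates A and the packing density ρ, assuming the sampling-window volume is known (unit volume by default):

```python
from math import gamma, pi

def unit_hypersphere_volume(l):
    """A: volume of the l-dimensional hypersphere with unit radius."""
    return pi ** (l / 2) / gamma(l / 2 + 1)

def packing_density(n_points, l, r, window_volume=1.0):
    """rho = (L/V) * V_r, with L/V the number of points per unit volume of
    the sampling window and V_r = A * r**l the radius-r hypersphere volume."""
    V_r = unit_hypersphere_volume(l) * r ** l
    return (n_points / window_volume) * V_r
```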
Example
 (a) and (b) : Clustered data sets produced by the Neyman–Scott process
 (c) : Regularly spaced data produced by the SSI model
Tests for Spatial Randomness
• Several tests for spatial randomness have been proposed in the literature. All of them assume
knowledge of the sampling window :
The scan test
The quadrat analysis
The second moment structure
The interpoint distances
• These provide tests for clustering tendency that have been used extensively when l = 2.
• Below, we discuss three methods for determining clustering tendency that are well suited for
the general l ≥ 2 case. All these methods require knowledge of the sampling window :
1) Tests Based on Structural Graphs
2) Tests Based on Nearest Neighbor Distances
3) A Sparse Decomposition Technique
1) Tests Based on Structural Graphs
• based on the idea of the minimum spanning tree (MST)
Steps :
I. determine the convex region where the vectors of X lie.
II. generate M vectors that are uniformly distributed over a region that approximates
the convex region found before (usually M = N). These vectors constitute the set X′.
III. find the MST of X ∪ X′ and determine the number of edges, q, that connect
vectors of X with vectors of X′.
If X contains clusters, then we expect q to be small.
small values of q indicate the presence of clusters.
large values of q indicate a regular arrangement of the vectors of X.
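A possible Python sketch of step III using SciPy's MST routine; it assumes the set X′ has already been generated (e.g., uniformly over the approximating region):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_cross_edges(X, X_prime):
    """Build the MST of X ∪ X' and count the edges q whose two endpoints
    belong to different sets (one from X, one from X')."""
    Z = np.vstack([X, X_prime])
    labels = np.array([0] * len(X) + [1] * len(X_prime))  # 0: X, 1: X'
    mst = minimum_spanning_tree(squareform(pdist(Z)))     # sparse result
    rows, cols = mst.nonzero()
    return int(np.sum(labels[rows] != labels[cols]))
```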
1) Tests Based on Structural Graphs(cont.)
• mean value of q and the variance of q under the null (randomness) hypothesis,
conditioned on e, are derived:
• E[q|H0] = 2MN / (M + N)
• var[q|e, H0] = (2MN / (L(L − 1))) · [ (2MN − L)/L + ((e − L + 2) / ((L − 2)(L − 3))) · (L(L − 1) − 4MN + 2) ]
• where L = M + N and e is the number of pairs of MST edges that share a node.
• if M, N → ∞ and M/N stays away from 0 and ∞, the pdf of the statistic q′ below is
approximately the standard normal distribution.
1) Tests Based on Structural Graphs(cont.)
Formula :
q′ = (q − E[q|H0]) / √var[q|e, H0]
if q′ is less than the ρ-percentile of the standard normal distribution:
 reject H0 at significance level ρ
This test exhibits high power against clustering tendency and little power
against regularity.
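Putting the formulas of the last two slides together, a Python sketch of the test itself (the function name and the default ρ = 0.05 are ours):

```python
import numpy as np
from scipy.stats import norm

def structural_graph_test(q, e, M, N, rho=0.05):
    """Normalize the cross-edge count q under H0 and reject randomness in
    favor of clustering if q' falls below the rho-percentile of N(0, 1)."""
    L = M + N
    mean_q = 2.0 * M * N / L
    var_q = (2.0 * M * N / (L * (L - 1.0))) * (
        (2.0 * M * N - L) / L
        + (e - L + 2.0) / ((L - 2.0) * (L - 3.0))
        * (L * (L - 1.0) - 4.0 * M * N + 2.0)
    )
    q_prime = (q - mean_q) / np.sqrt(var_q)
    return q_prime, bool(q_prime < norm.ppf(rho))
```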
2) Tests Based on Nearest Neighbor Distances
• The tests rely on the distances between the vectors of X and a number of vectors
which are randomly placed in the sampling window.
• Two tests of this kind are :
1) The Hopkins test
This statistic compares the nearest neighbor distribution of the points in X_1
with that of the points in X′.
2) The Cox–Lewis test
It follows the setup of the previous test with the exception that 𝑿 𝟏 need not
be defined.
2_1)The Hopkins Test
Definitions:
X′ = {y_i, i = 1, . . . , M}, M << N : a set of vectors that are randomly distributed in the
sampling window, following the uniform distribution.
X_1 ⊂ X : a set of M randomly chosen vectors of X.
d_j : the distance from y_j ∈ X′ to its closest vector in X_1, denoted by x_j.
 δ_j : the distance from x_j to its closest vector in X_1 − {x_j}.
• The Hopkins statistic involves the lth powers of 𝒅𝒋 and 𝜹𝒋 and it is defined as:
h = Σ_{j=1}^{M} d_j^l / ( Σ_{j=1}^{M} d_j^l + Σ_{j=1}^{M} δ_j^l )
2_1)The Hopkins Test (cont.)
Values of h :
Large values : large values of h indicate the presence of a clustering structure in X.
Small values : small values of h indicate the presence of regularly spaced points.
h = ½ : a value around 1/2 is an indication that the vectors of X are randomly
distributed over the sampling window.
• if the generated vectors are distributed according to a Poisson random process and all
nearest neighbor distances are statistically independent:
h (under 𝐻0) follows a beta distribution, with (M, M) parameters
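A Python sketch of the Hopkins statistic following the definitions above; the sampling window is approximated here by the bounding box of X, which is an assumption of the sketch rather than part of the test:

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, M=None, rng=None):
    """Hopkins statistic h with the definitions above: d_j from a uniform
    window point y_j to its nearest vector in X1, delta_j from that vector
    x_j to its nearest vector in X1 - {x_j}."""
    rng = np.random.default_rng(rng)
    N, l = X.shape
    M = M or max(2, N // 10)                      # M << N
    lo, hi = X.min(axis=0), X.max(axis=0)         # bounding-box window (assumed)
    Y = rng.uniform(lo, hi, size=(M, l))          # X': uniform in the window
    X1 = X[rng.choice(N, size=M, replace=False)]  # X1: M random vectors of X
    tree = cKDTree(X1)
    d, nn = tree.query(Y, k=1)                    # d_j and the index of x_j
    delta = tree.query(X1[nn], k=2)[0][:, 1]      # nearest in X1 excluding x_j
    return np.sum(d ** l) / (np.sum(d ** l) + np.sum(delta ** l))
```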
2_2) The Cox–Lewis test
Definitions:
x_j : for each y_j ∈ X′, its closest vector in X
x_i : the vector closest to x_j in X − {x_j}
d_j : the distance between y_j and x_j
δ_j : the distance between x_j and x_i
2_2) The Cox–Lewis test(cont.)
• We consider all y_j's for which 2d_j/δ_j is greater than or equal to one; let M′ be the
number of such y_j's.
• Finally, we define the statistic :
R = (1/M′) Σ_{j=1}^{M′} R_j
Values of R :
Small values : indicate the presence of a clustering structure in X
large values : indicate a regular structure in X.
R values around the mean : indicate that the vectors of X are randomly arranged
in the sampling window.
Hopkins vs Cox–Lewis
1) The Hopkins test
This test exhibits high power against regularity for a hypercubic sampling
window and periodic boundaries, for l= 2, . . . , 5.
However, its power is limited against clustering tendency.
2) The Cox–Lewis test
This test is less intuitive than the previous one. It was first proposed for the
two-dimensional case and it has been extended to the general l ≥ 2 dimensional
case.
This test exhibits inferior performance compared with the Hopkins test against
the clustering alternative.
 However, this is not the case against the regularity hypothesis.
3) A Sparse Decomposition Technique
• This technique begins with the data set X and sequentially removes vectors from it until no vectors are
left.
• A sequential decomposition D of X is a partition of X into L1, . . . , Lk sets, such that the order of their
formation matters. Li’s are also called decomposition layers.
I. Let MST(X) denote the minimum spanning tree of X and let S(X) be the set derived from X
according to the following procedure. Initially, S(X) = ∅.
II. move an end point x of the longest edge, e, of MST(X) to S(X).
III. mark this point and all points that lie at a distance less than or equal to b from x, where b is the
length of e.
IV. determine the unmarked point y ∈ X that lies closest to S(X) and move it to S(X).
V. mark all the unmarked vectors that lie at a distance no greater than b from y.
VI. apply the same procedure for all the unmarked vectors of X.
VII. The procedure terminates : when all vectors are marked.
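A Python sketch of the peeling procedure above (our implementation of the listed steps; ties and duplicate points are not handled specially):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def sparse_layer(X):
    """One peeling step: boolean mask of the points moved to S(X)."""
    n = len(X)
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).tocoo()
    k = int(np.argmax(mst.data))                 # longest MST edge e
    b = mst.data[k]                              # its length b
    in_S = np.zeros(n, dtype=bool)
    marked = np.zeros(n, dtype=bool)

    def move(i):                                 # move point i to S(X) and
        in_S[i] = marked[i] = True               # mark its b-neighborhood
        marked[D[i] <= b] = True

    move(mst.row[k])                             # an end point of the longest edge
    while not marked.all():
        cand = np.where(~marked)[0]
        d_to_S = D[np.ix_(cand, np.where(in_S)[0])].min(axis=1)
        move(cand[np.argmin(d_to_S)])            # closest unmarked point to S(X)
    return in_S

def sparse_decomposition(X):
    """Peel layers L_i = S(R^{i-1}(X)) until no points remain."""
    layers, remaining = [], np.asarray(X)
    while len(remaining) > 0:
        mask = sparse_layer(remaining) if len(remaining) > 1 else np.array([True])
        layers.append(remaining[mask])
        remaining = remaining[~mask]
    return layers
```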
3) A Sparse Decomposition Technique(cont.)
• Let us define R(X) ≡ X − S(X). Setting X = R^0(X), we define :
L_i = S(R^{i−1}(X)) , i = 1, 2, …, k
• where k is the smallest integer such that R^k(X) = ∅.
• The index i denotes the so-called decomposition layer. Intuitively speaking, the procedure
sequentially “peels” X until all of its vectors have been removed.
• The information that becomes available to us after the application of the decomposition
procedure is :
(a) the number of decomposition layers k
(b) the decomposition layers Li
(c) the cardinality, li, of the Li decomposition layer, i =1, . . . , k
(d) the sequence of the longest MST edges used in deriving the decomposition layers
3) A Sparse Decomposition Technique(cont.)
• The decomposition procedure gives different results when :
 the vectors of X are clustered
 the vectors of X are regularly spaced or randomly distributed in the sampling window
• Based on this observation we may define statistical indices utilizing the information
associated with this decomposition procedure.
• For example, it is expected that the number of decomposition layers, k, is smaller for
random data than it is for clustered data. Also, it is smaller for regularly spaced data
than for random data .
Other tests
• Several tests exist that rely on the preceding information. One such statistic that
exhibits good performance is the so-called P statistic, which is defined as follows:
P = Π_{i=1}^{k} l_i / (n_i − l_i)
• where n_i is the number of points in R^{i−1}(X). In words, each factor of P is the ratio of
the removed to the remaining points at each decomposition stage.
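An illustrative Python computation of P from the decomposition layers; note that the last layer removes all remaining points, so this sketch skips the degenerate final factor (an assumption on our part):

```python
import numpy as np

def p_statistic(layers):
    """P = product over stages of l_i / (n_i - l_i), where l_i = |L_i| and
    n_i is the number of points still present at stage i."""
    sizes = [len(L) for L in layers]
    n = np.cumsum(sizes[::-1])[::-1]      # n_i = l_i + l_{i+1} + ... + l_k
    P = 1.0
    for l_i, n_i in zip(sizes, n):
        if n_i - l_i > 0:                 # the final layer removes everything;
            P *= l_i / (n_i - l_i)        # its degenerate factor is skipped here
    return P
```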
• Finally, tests for clustering tendency for the cases in which ordinal proximity
matrices are in use have also been proposed.
Most of them are based on graph theory concepts.
Conclusion
The basic steps of the clustering tendency philosophy are :
I. Definition of a test statistic q suitable for the detection of clustering tendency.
II. Estimation of the pdf of q under the null (H0) hypothesis, p(q|H0).
III. Estimation of p(q|H1) and p(q|H2) (they are necessary for measuring the power of q
(the probability of making a correct decision when H0 is rejected) against the
regularity and the clustering tendency hypotheses).
IV. Evaluation of q for the data set at hand, X, and examination of whether it lies in the
critical interval of p(q|H0), which corresponds to a predetermined significance level ρ.