INTERNSHIP REPORT
RANDOM DISTORTION TESTS APPLIED IN A
SPECTRAL CLUSTERING CONTEXT
October 19, 2016
Hamza Ameur
Institut Mines-Telecom - Telecom Bretagne
Department of Signal and Communications
Contents
1 Introduction
2 Random distortion testing
   2.1 Introduction
   2.2 Notation
   2.3 Theorem
3 Spectral Clustering
   3.1 Introduction
   3.2 Definitions
   3.3 Graph Laplacian matrices
   3.4 NCut and RatioCut problems
       3.4.1 RatioCut
       3.4.2 NCut
   3.5 Random distortion testing p-value kernel
   3.6 Experiments
References
Chapter 1
Introduction
Spectral clustering algorithms are widely used in machine learning and data partitioning contexts. They consist in building a graph whose vertices represent the data and then partitioning it into k logically separated clusters. By "logically" we mean that the elements of each cluster must be highly connected within the cluster and well separated from the rest of the graph.
The quality of partitioning is based on a judicious choice of the similarity function between
the vertices of the graph. A natural way is to connect all the vertices of the graph and then
weight the edges using a kernel function. The most popular function used in this regard is
the Gaussian kernel.
Recently at Telecom Bretagne, Dominique Pastor and Quang-Thang Nguyen have developed
in [1] a new theory called Random Distortion Testing (RDT). The objective of this theory is to
guarantee any level of false-alarm probability while detecting distortion in disturbed data
using threshold-based tests. In this internship, we will show how this theoretical framework
leads to choosing the similarity function equal to the p-value of random distortion tests.
We will study the partitioning performance of this kernel function compared to the Gaussian
kernel in a context of disturbed data.
ACKNOWLEDGMENT
I am very grateful to Dominique Pastor and Redouane Lguensat, who supervised my one-month internship. Their help with the difficulties I encountered during my work was invaluable.
Chapter 2
Random distortion testing
2.1 INTRODUCTION
In signal processing contexts, we generally observe the signal of interest in additive independent Gaussian noise. In [1] the observed data is supposed to be a vector Y = Θ + X where Θ is a random signal of interest with unknown distribution and X is an additive independent Gaussian noise with positive definite covariance matrix C.
In most signal processing problems we either try to estimate one or more parameters of a certain signal of interest (e.g. its mean or its variance) or to decide whether this signal follows a certain model, as is the case in [1]. In the latter case, threshold-based tests are commonly used to decide between the null hypothesis, which is that the signal actually follows the known model, and the alternative hypothesis, which is that the observed signal is an anomaly. The false-alarm probability is one of the quantities we try to control in such tests. It is defined as the probability of deciding that there is an anomaly whereas the null hypothesis is actually true. The RDT theoretical framework enables us to guarantee any level of this probability regardless of the unknown distribution of the signal of interest. Hence the strength of this framework.
2.2 NOTATION
As we mentioned above, we suppose in [1] that the observed data is a vector Y = Θ + X such that Θ is a random vector of ℝ^d with unknown distribution encoding the signal of interest, and X is an additive independent Gaussian noise with zero mean and positive definite covariance matrix C. Let θ₀ denote a known deterministic model of the signal of interest.
Let H₀ denote the null hypothesis and H₁ the alternative one. We have:

\[ \begin{cases} H_0 : \|\Theta - \theta_0\| \le \tau \\ H_1 : \|\Theta - \theta_0\| > \tau \end{cases} \tag{2.1} \]

where ‖·‖ denotes the Mahalanobis norm defined with respect to the covariance matrix C as:

\[ \forall X \in \mathbb{R}^d : \quad \|X\| = \sqrt{X^T C^{-1} X} \tag{2.2} \]

τ is called the tolerance; it is the maximum value of the distortion ‖Θ − θ₀‖ that we tolerate. By "tolerate" we mean that we decide that Θ still follows the deterministic model θ₀ if the distortion ‖Θ − θ₀‖ is below this value. Otherwise we decide that there is an anomaly.

Finally, given any η ≥ 0, a threshold-based test with threshold height η on the Mahalanobis distance to θ₀ is any measurable map T_η of ℝ^d into {0,1} such that:

\[ \forall y \in \mathbb{R}^d : \quad T_\eta(y) = \begin{cases} 1 & \text{if } \|y - \theta_0\| > \eta \\ 0 & \text{if } \|y - \theta_0\| \le \eta \end{cases} \tag{2.3} \]
2.3 THEOREM
In most signal processing problems, we cannot actually decide whether the distortion ‖Θ − θ₀‖ satisfies H₀ or H₁ with respect to the true value of ‖Θ − θ₀‖. In fact, the observed data Y is a corrupted version of Θ due to the noise X. In [1] it is proven that, for any γ and τ both greater than 0, the false-alarm probability of the threshold-based test T_{λ_γ(τ)} does not exceed γ. Here λ_γ(τ) denotes the unique solution of the equation:

\[ 1 - R(\tau, \lambda_\gamma(\tau)) = \gamma \tag{2.4} \]

where the function R is defined for any ρ and η both greater than 0 as:

\[ R(\rho, \eta) = F_{\chi^2_d(\rho^2)}(\eta^2) \]

where F_{χ²_d(ρ²)} denotes the cumulative distribution function of the noncentral chi-squared distribution with d degrees of freedom and noncentrality parameter ρ². In section 3.5, we will define the p-value of a random distortion test and study its performance as a kernel function in spectral clustering contexts.
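As an illustration of equation 2.4, the threshold λ_γ(τ) can be computed numerically from the quantile function of the noncentral chi-squared distribution. The sketch below is not part of the original report (whose experiments rely on MATLAB scripts in the attachments); it is a minimal Python/SciPy version with the hypothetical function name rdt_threshold, assuming θ₀ = 0 and C = I_d so that the Mahalanobis norm reduces to the Euclidean norm.

import numpy as np
from scipy.stats import ncx2

def rdt_threshold(tau, gamma, d):
    """Solve 1 - R(tau, lam) = gamma for lam, where
    R(rho, eta) = F_{chi2_d(rho^2)}(eta^2) is the noncentral chi-squared CDF.
    Equivalently, lam^2 is the (1 - gamma)-quantile of chi2_d(tau^2)."""
    return np.sqrt(ncx2.ppf(1.0 - gamma, df=d, nc=tau ** 2))

# Monte Carlo check of the false-alarm guarantee (theta_0 = 0, C = I_d):
d, tau, gamma = 2, 1.0, 0.05
lam = rdt_threshold(tau, gamma, d)
rng = np.random.default_rng(0)
theta = tau * np.array([1.0, 0.0])                # signal exactly at the tolerance boundary
y = theta + rng.standard_normal((100000, d))      # Y = Theta + X with X ~ N(0, I_d)
p_fa = np.mean(np.linalg.norm(y, axis=1) > lam)
print(lam, p_fa)   # empirical false-alarm rate, close to gamma and not exceeding it under H0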
Chapter 3
Spectral Clustering
3.1 INTRODUCTION
Data partitioning aims at deciding which sets of data have "similar" properties. These sets are commonly called clusters. One of the most widely used algorithms in data partitioning is k-means clustering. Given a data set {Y₁, Y₂, ..., Y_n}, the objective of this algorithm is to partition it into k (k ≤ n) sets called clusters, {C₁, C₂, ..., C_k}, so as to minimize the quantity:

\[ \sum_{i=1}^{k} \sum_{Y_j \in C_i} \| Y_j - m_i \|^2 \]

where m_i denotes the mean of the cluster C_i. For more information about how to implement k-means clustering see [2]; a minimal implementation is sketched below.
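As a minimal sketch (not taken from [2] or from the report; plain NumPy, with the hypothetical function name kmeans), one Lloyd-type iteration scheme for minimizing the objective above could look as follows.

import numpy as np

def kmeans(Y, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternately assign points to the nearest
    mean and recompute the means, which decreases the k-means objective
    sum_i sum_{Y_j in C_i} ||Y_j - m_i||^2."""
    rng = np.random.default_rng(seed)
    means = Y[rng.choice(len(Y), size=k, replace=False)].astype(float)  # initial means
    for _ in range(n_iter):
        # assignment step: index of the closest mean for every point
        labels = np.argmin(((Y[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: recompute each mean (keep the old one if a cluster empties)
        for i in range(k):
            if np.any(labels == i):
                means[i] = Y[labels == i].mean(axis=0)
    return labels, means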
The spectral clustering approach consists in building a graph where the vertices encode the data and the edges encode the "degree of connection" between these vertices. We generally weight the edges using a certain similarity matrix S where s_ij ≥ 0 represents a certain notion of similarity between the i-th and the j-th elements of the observed data. We then partition the graph into k clusters of vertices. In each cluster, the vertices must be highly connected within the cluster and well separated from the rest of the graph. Since spectral clustering algorithms are used in graph partitioning contexts, spectral clustering is closely related to graph theory. In the next section, we will introduce some notions of graph theory that we find necessary to approach spectral clustering algorithms. Then in section 3.3 we will define the graph Laplacian matrices, which are the main tools of these algorithms. We will see in section 3.4 how these matrices are used to approximate the optimal solutions of the classical NCut and RatioCut problems. Finally, we will define the RDT p-value kernel in section 3.5, discuss its partitioning performance compared to the Gaussian kernel in section 3.6, and conclude.
3.2 DEFINITIONS
Definition 3.2.1 (Graph). A graph is generally defined by a couple G = (V,E). V is called the
set of vertices or nodes and E the set of edges which is a subset of V ×V where A ×B denotes the
Cartesian product defined as: A ×B = {(a,b) | a ∈ A and b ∈ B}. Two vertices x and y are
connected if (x, y) ∈ E.
Definition 3.2.2 (Directed and undirected graphs). A graph G = (V,E) is said to be undi-
rected if for all (x, y) in V ×V , (x, y) ∈ E if and only if (y,x) ∈ E. Otherwise, the graph is said to
be directed. The direction doesn’t have any effect on the connection between the vertices of an
undirected graph.
Definition 3.2.3 (Weighted and unweighted graphs - The degree of a vertex). We call weighted
a graph where we assign to each element (x, y) ∈ E a real number called the weight of the edge
(x, y). When an undirected graph is weighted, the edges (x, y) and (y,x) must be equally-
weighted. We call degree of the vertex i the quantity:
\[ d_i := \sum_{\substack{j \in V \\ (i,j) \in E}} s_{ij} \tag{3.1} \]

where s_ij denotes the weight of the edge (i, j).
Definition 3.2.4 (Simple graphs and multigraphs). A simple graph is a graph that contains
no loops. A loop is an edge that connects a vertex to itself. A graph where the loops are allowed
is called a multigraph.
Definition 3.2.5 (Connected components). Two vertices x and y are linked if there is a "path" of edges that connects them. More formally, there is a positive integer n and a sequence of vertices (x_k)_{k∈{0,1,...,n}} such that x₀ = x, x_n = y and, for all k ∈ {0,1,...,n−1}, x_k and x_{k+1} are connected.
Denote by A a subset of V. A is said to be a connected component if it verifies these two conditions:
i. ∀(x, y) ∈ A × A : x and y are linked by a path
ii. ∀(x, y) ∈ A × (V \ A) : x and y are not linked
Figure 3.1: Graphical representation of a simple, undirected and weighted graph. Here the graph
contains 6 vertices labeled by their indexes {0,1,··· ,5} and 5 weighted edges. As we can find a path
between any couple of vertices, this graph contains only one connected component.
Definition 3.2.6 (Similarity and degree matrices). One of the commonly used ways to represent a graph is the similarity matrix. It is a square matrix S where the entry s_ij is equal to the weight of the edge connecting the i-th and the j-th vertices if they are connected. Otherwise, the entry s_ij is null. The degree matrix is defined as D := diag(d₁, ..., d_n) where d_i denotes the degree of the vertex i.
Example : the similarity and degree matrices corresponding to the graph represented in
figure 3.1 above:
\[ S = \begin{pmatrix}
0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 3.5 & 0 & 0 & 0 \\
2 & 3.5 & 0 & 0.1 & 0 & 0 \\
0 & 0 & 0.1 & 0 & 10 & 3 \\
0 & 0 & 0 & 10 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 & 0
\end{pmatrix}
\quad \text{and} \quad
D = \begin{pmatrix}
2 & 0 & 0 & 0 & 0 & 0 \\
0 & 3.5 & 0 & 0 & 0 & 0 \\
0 & 0 & 5.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 13.1 & 0 & 0 \\
0 & 0 & 0 & 0 & 10 & 0 \\
0 & 0 & 0 & 0 & 0 & 3
\end{pmatrix} \]
In the following, let G = (V, E) denote a simple, undirected and weighted graph where all the edges have non-negative weights. Thus, the matrix S is symmetric and its diagonal elements are null. Let n = |V| denote the number of its vertices. Figure 3.1 shows an example of such a graph; a small numerical sketch of its similarity and degree matrices is given below.
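As a small numerical illustration (not in the original report; plain NumPy, using the edge weights read off figure 3.1), the similarity and degree matrices of the example graph can be built as follows.

import numpy as np

# Undirected weighted edges of the example graph: (i, j, weight)
edges = [(0, 2, 2.0), (1, 2, 3.5), (2, 3, 0.1), (3, 4, 10.0), (3, 5, 3.0)]

n = 6
S = np.zeros((n, n))
for i, j, w in edges:
    S[i, j] = S[j, i] = w        # symmetric: the graph is undirected

D = np.diag(S.sum(axis=1))       # d_i = sum_j s_ij
print(np.diag(D))                # -> [ 2.   3.5  5.6 13.1 10.   3. ]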
3.3 GRAPH LAPLACIAN MATRICES
The graph Laplacian matrices are along with similarity matrices the key elements of spectral
clustering and spectral graph theory (see [5] for an overview of this theory). The three main
types are the unnormalized (it is also called standard), the symmetric and the random walk
normalized graph Laplacians.
Let D and S respectively denote the degree and similarity matrices of the graph G.
Unnormalized graph Laplacian: L := D − S, with form

\[ L = \begin{pmatrix} d_1 & -s_{12} & \cdots & -s_{1n} \\ -s_{21} & d_2 & \cdots & -s_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -s_{n1} & -s_{n2} & \cdots & d_n \end{pmatrix} \]

Symmetric normalized graph Laplacian: L_sym := D^{-1/2} L D^{-1/2} = I_n − D^{-1/2} S D^{-1/2}, with form

\[ L_{sym} = \begin{pmatrix} 1 & \frac{-s_{12}}{\sqrt{d_1 d_2}} & \cdots & \frac{-s_{1n}}{\sqrt{d_1 d_n}} \\ \frac{-s_{21}}{\sqrt{d_2 d_1}} & 1 & \cdots & \frac{-s_{2n}}{\sqrt{d_2 d_n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{-s_{n1}}{\sqrt{d_n d_1}} & \frac{-s_{n2}}{\sqrt{d_n d_2}} & \cdots & 1 \end{pmatrix} \]

Random walk normalized graph Laplacian: L_rw := D^{-1} L = I_n − D^{-1} S, with form

\[ L_{rw} = \begin{pmatrix} 1 & \frac{-s_{12}}{d_1} & \cdots & \frac{-s_{1n}}{d_1} \\ \frac{-s_{21}}{d_2} & 1 & \cdots & \frac{-s_{2n}}{d_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{-s_{n1}}{d_n} & \frac{-s_{n2}}{d_n} & \cdots & 1 \end{pmatrix} \]

Table 3.1: Definition and form of the graph Laplacian matrices.
L_sym and L_rw are said to be normalized because the entries L_ij of the unnormalized graph Laplacian are divided either by the degree of the i-th vertex (L_rw) or by the square root of the product of the latter and the degree of the j-th vertex (L_sym) (see table 3.1 above).
Note that the forms given in table 3.1 are not necessarily the same for all types of graphs. First, recall that G is a simple graph. The weights s_ii are therefore null and do not appear in the diagonal entries. Second, if there is an isolated vertex i, all the terms s_ij and s_ji with j ∈ {1, ..., n} would be null and so would be d_i. In that case, D^{-1} would be substituted by the generalized inverse of D to avoid divisions by zero while computing L_sym and L_rw; the i-th row would then be null in both L_sym and L_rw, and so would be the i-th column of L_sym. A short numerical sketch computing the three Laplacians is given below.
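As a minimal sketch (not part of the report; plain NumPy, and assuming every vertex has a non-zero degree so that D is invertible), the three Laplacians of table 3.1 can be computed from a similarity matrix S as follows.

import numpy as np

def graph_laplacians(S):
    """Return (L, L_sym, L_rw) for a symmetric similarity matrix S with
    non-negative weights and no isolated vertex (all degrees > 0)."""
    d = S.sum(axis=1)                    # vertex degrees
    D = np.diag(d)
    L = D - S                            # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt  # = I - D^{-1/2} S D^{-1/2}
    L_rw = np.diag(1.0 / d) @ L          # = I - D^{-1} S
    return L, L_sym, L_rw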
Now that we have defined graph Laplacian matrices, it is time to show some of their
mathematical properties that will be helpful to understand spectral clustering algorithms.
Proposition 3.3.1 (Properties of L).

i. For all f = (f₁, ..., f_n)^T ∈ ℝ^n:
\[ f^T L f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij} (f_i - f_j)^2 . \]

ii. L is a symmetric positive semi-definite matrix with n real non-negative eigenvalues.

iii. 0 is an eigenvalue of L with the constant one vector 1 as a corresponding eigenvector.

iv. The number of connected components in the graph G equals the multiplicity of the eigenvalue 0. Let A₁, ..., A_k denote the connected components of the graph G. Then the eigenspace of 0 is spanned by the indicator vectors of these subsets 1_{A₁}, ..., 1_{A_k}:
\[ 1_{A_j}(i) = \begin{cases} 1 & \text{if } i \in A_j \\ 0 & \text{if } i \notin A_j \end{cases} \]

Proof.

i. Let f = (f₁, ..., f_n)^T be a vector of ℝ^n. Then:
\[ f^T L f = f^T D f - f^T S f = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} s_{ij} f_i f_j = \frac{1}{2} \left( \sum_{i=1}^{n} d_i f_i^2 - 2 \sum_{i,j=1}^{n} s_{ij} f_i f_j + \sum_{j=1}^{n} d_j f_j^2 \right) \]
According to the definition of d_i, we have d_i = Σ_{j=1}^{n} s_ij = Σ_{j=1}^{n} s_ji because the graph is undirected and therefore its similarity matrix S is symmetric. Thus:
\[ f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} s_{ij} \left( f_i^2 - 2 f_i f_j + f_j^2 \right) = \frac{1}{2} \sum_{i,j=1}^{n} s_{ij} (f_i - f_j)^2 \]

ii. L = D − S. G is undirected, so S is symmetric. D is diagonal and therefore symmetric. Then L is symmetric as a difference of two symmetric matrices. We have supposed that all the weights are non-negative, so it follows from i. that L is positive semi-definite and therefore has n real non-negative eigenvalues.

iii. The sum of each row of L is null by the definition of the degree of a vertex. Thus, 0 is an eigenvalue of L and the constant one vector 1 is a corresponding eigenvector.

iv. See section 3.1 in [4].
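Property iv can be checked numerically. The following sketch is not from the report; it uses NumPy on a hypothetical two-component toy graph (the example graph of figure 3.1 plus an isolated pair of extra vertices) and counts the near-zero eigenvalues of L.

import numpy as np

# Two-component toy graph: the 6-vertex example graph of figure 3.1
# plus a separate pair of vertices {6, 7} connected only to each other.
n = 8
S = np.zeros((n, n))
for i, j, w in [(0, 2, 2.0), (1, 2, 3.5), (2, 3, 0.1),
                (3, 4, 10.0), (3, 5, 3.0), (6, 7, 1.0)]:
    S[i, j] = S[j, i] = w

L = np.diag(S.sum(axis=1)) - S
eigvals = np.linalg.eigvalsh(L)                  # real eigenvalues, ascending
n_components = np.sum(np.abs(eigvals) < 1e-10)
print(n_components)                              # -> 2, one zero eigenvalue per component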
Proposition 3.3.2 (Properties of L_sym and L_rw).

i. For all f = (f₁, ..., f_n)^T ∈ ℝ^n:
\[ f^T L_{sym} f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2 \]

ii. L_sym is symmetric positive semi-definite with n real non-negative eigenvalues.

iii. λ is an eigenvalue of L_rw with eigenvector u if and only if λ is an eigenvalue of L_sym with eigenvector w = D^{1/2} u.

iv. λ is an eigenvalue of L_rw with eigenvector u if and only if λ and u solve the generalized eigenproblem Lu = λDu.

v. 0 is an eigenvalue of L_rw with eigenvector the constant one vector 1, and an eigenvalue of L_sym with eigenvector D^{1/2} 1.

vi. The number of connected components of the graph G is equal to the multiplicity of the eigenvalue 0 of L_rw and L_sym. Let A₁, ..., A_k denote the connected components of the graph G. Then the eigenspace of 0 of L_rw is spanned by the indicator vectors of these subsets 1_{A₁}, ..., 1_{A_k}. For L_sym, the eigenspace of 0 is spanned by the vectors D^{1/2} 1_{A₁}, ..., D^{1/2} 1_{A_k}.

Proof. Here we suppose that all the vertices have non-zero degrees.

i. Let f = (f₁, ..., f_n)^T be a vector of ℝ^n. Then:
\[ f^T L_{sym} f = f^T D^{-1/2} L D^{-1/2} f = (D^{-1/2} f)^T L (D^{-1/2} f) = g^T L g \quad \text{where } g = D^{-1/2} f \]
Since g = (f₁/√d₁, ..., f_n/√d_n)^T, the statement follows from i. of proposition 3.3.1.

ii. L_sym^T = (D^{-1/2} L D^{-1/2})^T = D^{-1/2} L^T D^{-1/2} = D^{-1/2} L D^{-1/2} = L_sym. Then L_sym is symmetric (hence the name of this graph Laplacian). It follows from i. that L_sym is positive semi-definite.

iii. Let λ be an eigenvalue of L_rw with eigenvector u. Then L_rw u = λu and D^{-1} L u = λu. Thus D^{-1/2} L u = λ D^{1/2} u and D^{-1/2} L D^{-1/2} D^{1/2} u = λ D^{1/2} u. Therefore L_sym D^{1/2} u = λ D^{1/2} u and λ is an eigenvalue of L_sym with eigenvector w = D^{1/2} u. Conversely, if λ is an eigenvalue of L_sym with eigenvector w, then L_sym w = λw. Multiplying by D^{-1/2} from the left gives D^{-1} L D^{-1/2} w = λ D^{-1/2} w, and therefore L_rw u = λu with u = D^{-1/2} w.

iv. λ is an eigenvalue of L_rw with eigenvector u if and only if D^{-1} L u = λu, if and only if Lu = λDu.

v. It follows from proposition 3.3.1 that L1 = 0, hence D^{-1} L 1 = 0 and L_rw 1 = 0. The second part of the statement follows from iii.

vi. It follows from iv. of proposition 3.3.1 and the previous statements of proposition 3.3.2.
3.4 NCUT AND RATIOCUT PROBLEMS
Given a graph G = (V,E) and a positive integer k, the NCut and RatioCut problems aim
at finding a partition A1,··· , Ak of the set V . A good graph partitioning must satisfy two
conditions. The first one is maximizing the within-cluster connectivity and the second is
minimizing the connectivity between different clusters. Sections 5 and 8.5 in [4] were very
useful for us to understand the approach of these two graph partitioning problems.
Given two subsets A and B of V, let Ā, |A|, vol(A) and W(A,B) denote the following quantities:

\[ \bar{A} := \text{the complement of } A \]
\[ |A| := \text{the number of vertices in } A \]
\[ \mathrm{vol}(A) := \sum_{i \in A} d_i \]
\[ W(A,B) := \sum_{i \in A, \, j \in B} s_{ij} \]

The quantities |A| and vol(A) are two ways to measure the size of the subset A, whereas the quantity W(A,B) measures the connectivity between the two subsets A and B. A simple way to find a suitable partition is to find the subsets A₁, ..., A_k that minimize Cut(A₁, ..., A_k) defined as:

\[ \mathrm{Cut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A_i}) \tag{3.2} \]
Recall that in this section the graph G is still supposed to be undirected, simple and with non-negative edge weights. Thus, minimizing the non-negative quantity Cut(A₁, ..., A_k) amounts to minimizing the connectivity between the clusters A_i and the rest of the graph. As for the factor 1/2, it is introduced to avoid counting each edge twice, since G is undirected.
However, this method is not really appealing because it doesn’t impose any condition
upon the size of the clusters. For example when k = 2, it often leads to separating one vertex
from the rest of the graph (more specifically the vertex that has the lowest degree).
The NCut and RatioCut problems (introduced respectively by Shi and Malik in [3] and by Hagen and Kahng in [6]) measure the size of a subset by its volume and by its number of vertices respectively, in order to overcome the problem of "badly-sized" clusters. In the next two subsections, we will define these problems, see how the graph Laplacians are used to relax them, and introduce the spectral clustering algorithms that correspond to each of them.
3.4.1 RatioCut
Let us keep the same notation as above. The RatioCut is defined with respect to the partition A₁, ..., A_k as:

\[ \mathrm{RatioCut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{|A_i|} = \sum_{i=1}^{k} \frac{\mathrm{Cut}(A_i, \bar{A_i})}{|A_i|} \tag{3.3} \]

The objective of this problem is to find the partition that minimizes the quantity (3.3). Here, it is clear that a subset with too few vertices would raise the RatioCut; that is how the RatioCut avoids the problem of badly-sized subsets. The only issue is that the RatioCut minimization problem is NP-hard, hence the necessity of a relaxation method.
Case k = 2
Let us start with the case k = 2, which is easy to express and understand. Here we try to find a subset A that minimizes RatioCut(A, Ā) = Cut(A, Ā)/|A| + Cut(A, Ā)/|Ā|. Let A denote a subset of V and f a vector of ℝ^n defined as f = (f₁, ..., f_n)^T such that:

\[ f_i = \begin{cases} \sqrt{|\bar{A}| / |A|} & \text{if } i \in A \\ -\sqrt{|A| / |\bar{A}|} & \text{if } i \notin A \end{cases} \tag{3.4} \]
According to proposition 3.3.1:

\[ f^T L f = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} s_{ij}(f_i - f_j)^2 = \frac{1}{2} \sum_{i \in A, \, j \in \bar{A}} s_{ij}(f_i - f_j)^2 + \frac{1}{2} \sum_{i \in \bar{A}, \, j \in A} s_{ij}(f_i - f_j)^2 \]

In fact, the sums over i ∈ A, j ∈ A and over i ∈ Ā, j ∈ Ā are null because the entries f_i and f_j are equal in both sums. Then:

\[ f^T L f = \frac{1}{2} \sum_{i \in A, \, j \in \bar{A}} s_{ij} \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2 + \frac{1}{2} \sum_{i \in \bar{A}, \, j \in A} s_{ij} \left( -\sqrt{\frac{|A|}{|\bar{A}|}} - \sqrt{\frac{|\bar{A}|}{|A|}} \right)^2 = \mathrm{Cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right) \]

Thus:

\[ f^T L f = \mathrm{Cut}(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right) = |V| \, \mathrm{RatioCut}(A, \bar{A}) \tag{3.5} \]
Let ⟨·,·⟩ denote the canonical inner product of ℝ^n and ‖·‖₂ the Euclidean norm. We can notice that:

\[ \langle f, 1 \rangle = \sum_{i=1}^{n} f_i = \sum_{i \in A} f_i + \sum_{i \in \bar{A}} f_i = |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = 0, \]

so f is orthogonal to 1. We can also notice that:

\[ \|f\|_2^2 = \sum_{i=1}^{n} f_i^2 = \sum_{i \in A} f_i^2 + \sum_{i \in \bar{A}} f_i^2 = |A| \frac{|\bar{A}|}{|A|} + |\bar{A}| \frac{|A|}{|\bar{A}|} = n. \]

Then the RatioCut minimization problem in the case k = 2 can be expressed as finding:

\[ \min_{\substack{A \subset V \\ f \perp 1, \ \|f\|_2 = \sqrt{n}}} f^T L f \]

where f can only take entries f_i as defined in equation 3.4, which is a discrete optimization problem since the entries f_i are only allowed to take two different values. The relaxation of this problem consists in allowing these entries to take any real value. The relaxed RatioCut minimization problem can therefore be written as follows:

\[ \min_{\substack{f \in \mathbb{R}^n \\ f \perp 1, \ \|f\|_2 = \sqrt{n}}} f^T L f \tag{3.6} \]

which can be easily solved using basic linear algebra. In fact, the solution of this relaxed problem is the eigenvector corresponding to the second smallest eigenvalue of the standard graph Laplacian matrix (remember that the smallest is zero, with eigenvector 1). This can be proven using the Rayleigh-Ritz theorem. Thus, this eigenvector can be used to approximate the RatioCut minimizing solution. Note that the latter is not necessarily the optimal solution. In fact, the entries of the optimal solution must be as defined in equation 3.4, whereas the solution of the relaxed problem is an arbitrary real-valued vector. We should therefore find a way to transform it into a discrete indicator vector so as to classify the vertices accordingly.
A simple way consists in using the sign of the entry f_i of the solution vector to decide whether the vertex i is in A or not. But most spectral clustering algorithms use k-means clustering instead; the sign-based heuristic is too simple, mainly when we have to deal with more than two clusters. We will see below how to relax a k ≥ 2 RatioCut minimization problem using the properties of the standard graph Laplacian L.
General case k ≥ 2
Let A₁, ..., A_k denote a partition of V and h₁, ..., h_k denote k vectors of ℝ^n defined as h_j = (h_{1j}, ..., h_{nj})^T where:

\[ h_{ij} = \begin{cases} \dfrac{1}{\sqrt{|A_j|}} & \text{if } i \in A_j \\[2mm] 0 & \text{if } i \notin A_j \end{cases} \tag{3.7} \]

Let H denote the n-by-k matrix whose entries are equal to h_{ij}. Since A₁, ..., A_k is supposed to be a partition of V, each row of H contains only one non-zero entry, namely the j-th one if i ∈ A_j. The factor 1/√|A_j| is used in order to normalize the indicator vectors so as to have ⟨h_i, h_j⟩ = δ_{ij}, where δ_{ij} denotes the Kronecker delta, which is equal to one if i = j and null otherwise. We can notice that H^T H = I_k. We have:

\[ h_l^T L h_l = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} s_{ij}(h_{il} - h_{jl})^2 = \frac{1}{2} \sum_{i \in A_l, \, j \in \bar{A_l}} s_{ij}(h_{il} - h_{jl})^2 + \frac{1}{2} \sum_{i \in \bar{A_l}, \, j \in A_l} s_{ij}(h_{il} - h_{jl})^2 \]

Thus:

\[ h_l^T L h_l = \frac{1}{2} \sum_{i \in A_l, \, j \in \bar{A_l}} s_{ij} \left( \frac{1}{\sqrt{|A_l|}} \right)^2 + \frac{1}{2} \sum_{i \in \bar{A_l}, \, j \in A_l} s_{ij} \left( -\frac{1}{\sqrt{|A_l|}} \right)^2 = \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|} \]

Let E_l denote the l-th element of the canonical basis of ℝ^k. We then have:

\[ (H^T L H)_{ll} = E_l^T H^T L H E_l = (H E_l)^T L (H E_l) = h_l^T L h_l \]

Thus (H^T L H)_{ll} = h_l^T L h_l = Cut(A_l, Ā_l)/|A_l|. We finally obtain:

\[ \mathrm{RatioCut}(A_1, \ldots, A_k) = \sum_{l=1}^{k} \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|} = \sum_{l=1}^{k} (H^T L H)_{ll} = \mathrm{Tr}(H^T L H) \tag{3.8} \]

where Tr denotes the trace of a matrix. According to equation 3.8, the k ≥ 2 RatioCut minimization problem can be rewritten as follows:

\[ \min_{\substack{A_1, \ldots, A_k \subset V \\ H^T H = I_k}} \mathrm{Tr}(H^T L H) \tag{3.9} \]
such that H can only take entries h_{ij} as defined in 3.7. As in the case k = 2, this is an NP-hard discrete optimization problem. To relax it, we allow the entries h_{ij} of the matrix H to take any real value. Thus, the relaxed problem can be rewritten as follows:

\[ \min_{\substack{H \in \mathbb{R}^{n \times k} \\ H^T H = I_k}} \mathrm{Tr}(H^T L H) \tag{3.10} \]

According to the Rayleigh-Ritz theorem, the solution of the relaxed problem is the matrix U containing the k first eigenvectors of the standard graph Laplacian. By "k first eigenvectors" we mean the k first column vectors of the orthogonal matrix P such that:

\[ L = P \, \mathrm{diag}(\lambda_1, \ldots, \lambda_n) \, P^T \quad \text{with} \quad \lambda_1 \le \cdots \le \lambda_n \]

Again, the solution of the relaxed problem does not necessarily contain discrete indicator vectors like the ones defined in 3.7, hence the necessity of a clustering algorithm. Most spectral clustering algorithms apply k-means clustering to the rows of U. Let Y₁, ..., Y_n denote the n rows of U. The outputs are clusters C₁, ..., C_k, and we classify the vertices by choosing i ∈ A_j if Y_i ∈ C_j. In fact, two vertices i and j belong to the same cluster if and only if the corresponding rows Y_i and Y_j of the matrix U belong to the same cluster as well. To understand this equivalence, consider the simple case where the solution U is optimal. Then its column vectors would be similar to the indicator vectors defined in 3.7. It is clear that two vertices i and j belong to the subset A_l if and only if h_{il} = h_{jl}, and therefore the i-th and j-th rows are equal as well.
The spectral clustering algorithm corresponding to the RatioCut minimization problem is the unnormalized spectral clustering described below:

Algorithm 1 Unnormalized spectral clustering
1: Input: n vectors X₁, ..., X_n of ℝ^d encoding the data and the number of clusters k to construct
2: Output: Clusters A₁, ..., A_k
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the standard graph Laplacian L
5: Compute v₁, ..., v_k, the k first eigenvectors of L
6: U ← the matrix containing v₁, ..., v_k as columns
7: for i = 1, ..., n do
8:     Y_i ← the i-th row of U
9: Cluster the vectors Y₁, ..., Y_n of ℝ^k into clusters C₁, ..., C_k
10: A_j = {X_i | Y_i ∈ C_j}
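As a concrete sketch of Algorithm 1 (this is not the report's MATLAB script USCmain.m; it is a hypothetical Python/NumPy transcription assuming a fully connected Gaussian similarity graph and scikit-learn's KMeans for the last step), the unnormalized spectral clustering could be written as follows.

import numpy as np
from sklearn.cluster import KMeans  # k-means used in the final step of Algorithm 1

def unnormalized_spectral_clustering(X, k, sigma=1.0):
    """Algorithm 1 with a fully connected Gaussian similarity graph."""
    # Step 3: similarity matrix (Gaussian kernel, zero diagonal for a simple graph)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    # Step 4: unnormalized Laplacian L = D - S
    L = np.diag(S.sum(axis=1)) - S
    # Steps 5-6: the k eigenvectors associated with the k smallest eigenvalues
    _, eigvecs = np.linalg.eigh(L)        # eigh returns eigenvalues in ascending order
    U = eigvecs[:, :k]
    # Steps 7-10: cluster the rows of U with k-means
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)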
3.4.2 NCut
As in the case of the RatioCut, we keep the notation of section 3.4. The NCut is then defined with respect to the partition A₁, ..., A_k as:

\[ \mathrm{NCut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{\mathrm{vol}(A_i)} = \sum_{i=1}^{k} \frac{\mathrm{Cut}(A_i, \bar{A_i})}{\mathrm{vol}(A_i)} \tag{3.11} \]

The NCut problem measures the size of a subset A ⊂ V by the quantity vol(A). Recall that vol(A) is equal to the sum of the degrees of the vertices of A. It is clear that a small value of vol(A_i) will raise the value of the NCut; this is how the NCut problem avoids badly-sized subsets (see section 3.4 above). The advantage of using vol(A) instead of |A| is that vol(A) measures how strong the connectivity between the vertices of A is, whereas |A| does not give any information about this connectivity. Another advantage of the NCut over the RatioCut is that minimizing the NCut tends to minimize the probability that a random walker goes from one cluster to another; see proposition 5 in section 6 of [4] for more explanations.
As in the case of the RatioCut problem, the objective here is to minimize the quantity (3.11), which is an NP-hard problem. We will see below how the graph Laplacian matrices are used to relax it.
Case k = 2
The objective here is to find a subset A ⊂ V that minimizes NCut(A, Ā) = Cut(A, Ā)/vol(A) + Cut(A, Ā)/vol(Ā). Given a subset A ⊂ V, let f = (f₁, ..., f_n)^T be a vector of ℝ^n such that:

\[ f_i = \begin{cases} \sqrt{\mathrm{vol}(\bar{A}) / \mathrm{vol}(A)} & \text{if } i \in A \\ -\sqrt{\mathrm{vol}(A) / \mathrm{vol}(\bar{A})} & \text{if } i \notin A \end{cases} \tag{3.12} \]

As we did in the RatioCut case, we can check that:

\[ f^T L f = \mathrm{vol}(V) \, \mathrm{NCut}(A, \bar{A}), \qquad f^T D f = \mathrm{vol}(V), \qquad (D f)^T 1 = 0 \]

Then we can write the NCut minimization problem:

\[ \min_{\substack{A \subset V \\ D f \perp 1, \ f^T D f = \mathrm{vol}(V)}} f^T L f \tag{3.13} \]
such that f can only take entries f_i as defined in 3.12. Again, we allow the entries f_i to take any real value in order to relax the NP-hard NCut minimization problem. The relaxed problem can therefore be written as follows:

\[ \min_{\substack{f \in \mathbb{R}^n \\ D f \perp 1, \ f^T D f = \mathrm{vol}(V)}} f^T L f \tag{3.14} \]

It can also be rewritten in an equivalent way using the substitution g := D^{1/2} f:

\[ \min_{\substack{g \in \mathbb{R}^n \\ g \perp D^{1/2} 1, \ \|g\|_2^2 = \mathrm{vol}(V)}} g^T D^{-1/2} L D^{-1/2} g = \min_{\substack{g \in \mathbb{R}^n \\ g \perp D^{1/2} 1, \ \|g\|_2^2 = \mathrm{vol}(V)}} g^T L_{sym} g \tag{3.15} \]

According to propositions 3.3.1 and 3.3.2, D^{1/2} 1 is the first eigenvector of L_sym. Then, using the Rayleigh-Ritz theorem, we can conclude that the solution g of the relaxed problem is equal to the second eigenvector of L_sym. If we re-substitute f = D^{-1/2} g, we can notice, using proposition 3.3.2, that f is equal to the second eigenvector of L_rw, or equivalently of the generalized eigenproblem L f = λ D f.
After finding the solution of the relaxed problem, we carry on the classification process the same way we did in the case of the RatioCut minimization problem.
General case k ≥ 2
As in the case of the RatioCut, given a partition A₁, ..., A_k of the set of vertices V, we define the vectors h₁, ..., h_k of ℝ^n where h_j = (h_{1j}, ..., h_{nj})^T such that:

\[ h_{ij} = \begin{cases} \dfrac{1}{\sqrt{\mathrm{vol}(A_j)}} & \text{if } i \in A_j \\[2mm] 0 & \text{if } i \notin A_j \end{cases} \tag{3.16} \]

Again let H denote the n-by-k matrix containing the vectors h₁, ..., h_k as columns. We can check that (H^T L H)_{ll} = h_l^T L h_l = Cut(A_l, Ā_l)/vol(A_l) and therefore:

\[ \mathrm{NCut}(A_1, \ldots, A_k) = \sum_{l=1}^{k} \frac{\mathrm{Cut}(A_l, \bar{A_l})}{\mathrm{vol}(A_l)} = \sum_{l=1}^{k} (H^T L H)_{ll} = \mathrm{Tr}(H^T L H) \tag{3.17} \]

We can also see that H^T D H = I_k. Therefore the NCut minimization problem can be re-expressed as:

\[ \min_{\substack{A_1, \ldots, A_k \subset V \\ H^T D H = I_k}} \mathrm{Tr}(H^T L H) \tag{3.18} \]

where H can only take the entries h_{ij} as defined in 3.16. Again, the relaxation consists in allowing these entries to take any real value. The relaxed problem can then be rewritten as follows:

\[ \min_{\substack{H \in \mathbb{R}^{n \times k} \\ H^T D H = I_k}} \mathrm{Tr}(H^T L H) \tag{3.19} \]

or, after the substitution F := D^{1/2} H:

\[ \min_{\substack{F \in \mathbb{R}^{n \times k} \\ F^T F = I_k}} \mathrm{Tr}(F^T D^{-1/2} L D^{-1/2} F) = \min_{\substack{F \in \mathbb{R}^{n \times k} \\ F^T F = I_k}} \mathrm{Tr}(F^T L_{sym} F) \tag{3.20} \]

As in the case of the RatioCut problem, the Rayleigh-Ritz theorem allows us to conclude that the solution of the relaxed problem is given by the matrix F containing the first k eigenvectors of the symmetric normalized graph Laplacian L_sym. If we re-substitute H = D^{-1/2} F, we can notice, using proposition 3.3.2, that H is equal to the matrix containing the k first eigenvectors of L_rw, or equivalently of the generalized eigenproblem L f = λ D f.
The spectral clustering algorithms corresponding to the NCut minimization problem are the random walk normalized spectral clustering introduced by Shi and Malik in [3] and the symmetric normalized spectral clustering introduced by Ng, Jordan and Weiss in [7].

Algorithm 2 Random walk normalized spectral clustering (Shi and Malik [3])
1: Input: n vectors X₁, ..., X_n of ℝ^d encoding the data and the number of clusters k to construct
2: Output: Clusters A₁, ..., A_k
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the standard graph Laplacian L
5: Compute v₁, ..., v_k, the k first eigenvectors of the generalized eigenproblem Lu = λDu (which are also the k first eigenvectors of L_rw)
6: U ← the matrix containing v₁, ..., v_k as columns
7: for i = 1, ..., n do
8:     Y_i ← the i-th row of U
9: Cluster the vectors Y₁, ..., Y_n of ℝ^k into clusters C₁, ..., C_k
10: A_j = {X_i | Y_i ∈ C_j}

Algorithm 3 Symmetric normalized spectral clustering (Ng, Jordan and Weiss [7])
1: Input: n vectors X₁, ..., X_n of ℝ^d encoding the data and the number of clusters k to construct
2: Output: Clusters A₁, ..., A_k
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the symmetric normalized graph Laplacian L_sym
5: Compute v₁, ..., v_k, the k first eigenvectors of L_sym
6: U ← the matrix containing v₁, ..., v_k as columns
7: Compute T, the n-by-k matrix obtained by normalizing the rows of U:
8: for i = 1, ..., n and j = 1, ..., k do
9:     t_{ij} ← u_{ij} / sqrt(Σ_{p=1}^{k} u_{ip}²)
10: for i = 1, ..., n do Y_i ← the i-th row of T
11: Cluster the vectors Y₁, ..., Y_n of ℝ^k into clusters C₁, ..., C_k
12: A_j = {X_i | Y_i ∈ C_j}
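Similarly, a hypothetical Python/NumPy transcription of Algorithm 3 (again not the report's SymSCmain.m script; a fully connected Gaussian similarity graph and scikit-learn's KMeans are assumed) could look like this.

import numpy as np
from sklearn.cluster import KMeans

def symmetric_normalized_spectral_clustering(X, k, sigma=1.0):
    """Algorithm 3 (Ng, Jordan and Weiss) with a fully connected Gaussian graph."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ S @ D_inv_sqrt   # L_sym = I - D^{-1/2} S D^{-1/2}
    _, eigvecs = np.linalg.eigh(L_sym)                      # ascending eigenvalues
    U = eigvecs[:, :k]
    T = U / np.linalg.norm(U, axis=1, keepdims=True)        # row normalization (step 9)
    return KMeans(n_clusters=k, n_init=10).fit_predict(T)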
Note that the normalization step in the symmetric normalized spectral clustering aims at avoiding certain computational problems related to the finite numerical precision of the computations. Note also that, in their experiments, the authors of [3] and [7] used the fully connected graph with the Gaussian kernel as a similarity graph. We will do the same in the next section, because the two other options (see section 3.5) require a choice of additional parameters; this choice is generally based on certain heuristics, which might have a negative impact on the experiments.
3.5 RANDOM DISTORTION TESTING p-VALUE KERNEL
In all spectral clustering algorithms, the first step consists in building a similarity graph. Recall that data partitioning is based on the notion of similarity: a cluster must contain data that are "similar" within the cluster and "dissimilar" to the rest of the data set. A natural way to model the similarity within the data consists in building a graph whose vertices encode the data. The weights of the connections are then chosen to be proportional to the degree of "similarity" between the vertices. A kernel function (also called similarity function) is a map from the set of edges E to the set of real numbers; it assigns to each pair of connected vertices a real number called the weight. In practice, the data set is supposed to contain vectors of ℝ^d. Thus we can say that two vectors are similar if the distance between them (e.g. the Euclidean distance) is "small". Therefore, the smaller the distance between two vertices, the greater the weight of the edge connecting them.
The most popular kernel function is the Gaussian kernel, which is defined with respect to a parameter σ and two d-dimensional vectors x_i and x_j as:

\[ s_{ij} := \exp\left( - \frac{\| x_i - x_j \|_2^2}{2 \sigma^2} \right) \tag{3.21} \]

The Gaussian kernel is obviously a positive, decreasing function of the Euclidean distance. The parameter σ plays the role of a standard deviation, hence its notation. We can also notice that, like a probability, the Gaussian kernel takes values between 0 and 1.
Given a set of d-dimensional vectors and a kernel function, there are different ways to build a similarity graph (a short sketch of these constructions is given below):
• The fully connected graph: we connect all the vectors and then weight the edges using the kernel function.
• The k-nearest neighbors graph: this method consists in connecting each vector to its k nearest vectors in terms of the Euclidean distance. We then weight the edges using the kernel function.
• The ε-neighborhood graph: we connect the vectors x_i and x_j if ‖x_i − x_j‖₂ ≤ ε and then we weight the edges using the kernel function.
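The sketch below is not from the report; it is plain NumPy with the hypothetical helper name similarity_matrix, and it illustrates the fully connected and ε-neighborhood constructions for an arbitrary kernel function.

import numpy as np

def similarity_matrix(X, kernel, eps=None):
    """Build a similarity matrix from data X (n x d) and a kernel function.
    kernel(xi, xj) returns the weight of edge (i, j).  With eps=None the graph
    is fully connected; otherwise only pairs with ||xi - xj||_2 <= eps are kept
    (the epsilon-neighborhood graph)."""
    n = len(X)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(X[i] - X[j])
            if eps is None or dist <= eps:
                S[i, j] = S[j, i] = kernel(X[i], X[j])   # undirected: symmetric weights
    return S

# Example: fully connected graph with the Gaussian kernel of equation 3.21
sigma = 1.0
gaussian = lambda xi, xj: np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))
# S = similarity_matrix(X, gaussian)           # X is an (n, d) data array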
In this internship we study the impact on the partitioning performance of choosing the kernel function equal to the p-value of random distortion testing. First of all, let us give a formal definition of the p-value of a random distortion test.

Definition 3.5.1 (p-value of a random distortion test). The p-value of a random distortion test is defined with respect to its tolerance τ, its corresponding deterministic model θ₀, its corresponding noise covariance matrix C and a vector y of ℝ^d as:

\[ \hat{\gamma}(y) = \inf\{ \gamma \in (0,1) : T_{\lambda_\gamma(\tau)}(y) = 1 \} = \inf\{ \gamma \in (0,1) : \lambda_\gamma(\tau) < \| y - \theta_0 \| \} \]

Like the Gaussian kernel, the RDT p-value takes values between 0 and 1 and is thus homogeneous to a probability. It can be seen as the optimal value of the false-alarm probability. In fact, given a fixed value of the tolerance τ, it is proven in [1] that λ_γ(τ) is a strictly decreasing function of γ. Hence the p-value is optimal in the sense that it is neither so large that false alarms are always raised, nor so small that true outliers are missed. Given a vector y and a tolerance τ, it follows from the definition of the p-value that a small value of the latter corresponds to a high value of λ_{γ̂(y)}(τ) and therefore to a high value of the distortion ‖y − θ₀‖, and vice versa.
Now let us consider the case where θ₀ = 0_{ℝ^d}, the zero vector. Then, given two vectors y_i and y_j of ℝ^d, a small value of γ̂(y_i − y_j) corresponds to a high value of ‖y_i − y_j‖ and vice versa. This is the first reason why we thought of using the p-value as a similarity (kernel) function. Another reason is that when d = 2, τ = 0 and the covariance matrix C = I₂, the p-value and the Gaussian kernel are equal. In fact, since λ_γ(τ) is a strictly decreasing function of γ, γ̂(y) satisfies:

\[ \lambda_{\hat{\gamma}(y)}(\tau) = \| y - \theta_0 \| \tag{3.22} \]

It then follows from equation 2.4 (see section 2.3) that:

\[ \hat{\gamma}(y) = 1 - R(\tau, \lambda_{\hat{\gamma}(y)}(\tau)) = 1 - R(\tau, \| y - \theta_0 \|) = Q_{d/2}(\tau, \| y - \theta_0 \|) \tag{3.23} \]

where Q_M denotes the Marcum Q-function, defined with respect to two real numbers x and y as:

\[ Q_M(x, y) = \exp\left( -\frac{x^2 + y^2}{2} \right) \sum_{k=1-M}^{\infty} \left( \frac{x}{y} \right)^k I_k(x y) \tag{3.24} \]

where I_k denotes the modified Bessel function of the first kind of order k. Thus, it follows from 3.23 and 3.24 that in the case τ = 0, d = 2 and C = I₂:

\[ \| y - \theta_0 \| = \sqrt{(y - \theta_0)^T I_2 (y - \theta_0)} = \sqrt{(y - \theta_0)^T (y - \theta_0)} = \| y - \theta_0 \|_2 \]

and:

\[ \hat{\gamma}(y) = Q_1(0, \| y - \theta_0 \|_2) = \exp\left( -\frac{\| y - \theta_0 \|_2^2}{2} \right) \sum_{k=0}^{\infty} \left( \frac{0}{\| y - \theta_0 \|_2} \right)^k I_k(0) = \exp\left( -\frac{\| y - \theta_0 \|_2^2}{2} \right) \tag{3.25} \]

It follows from these calculations that the p-value can be considered as a generalization of the Gaussian kernel.
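As a sketch of how this kernel can be evaluated (not from the report; Python/SciPy, assuming θ₀ = 0 and C = σ² I_d so that the Mahalanobis norm reduces to a scaled Euclidean norm), equation 3.23 can be computed through the noncentral chi-squared survival function, since Q_{d/2}(τ, x) = 1 − F_{χ²_d(τ²)}(x²).

import numpy as np
from scipy.stats import ncx2, chi2

def rdt_pvalue_kernel(yi, yj, tau, sigma=1.0):
    """RDT p-value similarity between two vectors, with theta_0 = 0 and
    C = sigma^2 * I_d, so the Mahalanobis norm is ||yi - yj||_2 / sigma.
    Returns gamma_hat(yi - yj) = 1 - F_{chi2_d(tau^2)}(||yi - yj||^2)."""
    d = len(yi)
    maha = np.linalg.norm(yi - yj) / sigma
    if tau == 0.0:
        return chi2.sf(maha ** 2, df=d)              # central case, as in equation 3.25
    return ncx2.sf(maha ** 2, df=d, nc=tau ** 2)

# Sanity check of equation 3.25: for d = 2, tau = 0, sigma = 1 the p-value
# coincides with the Gaussian kernel exp(-||yi - yj||^2 / 2).
yi, yj = np.array([0.3, -1.2]), np.array([1.0, 0.5])
print(rdt_pvalue_kernel(yi, yj, tau=0.0),
      np.exp(-np.sum((yi - yj) ** 2) / 2.0))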
3.6 EXPERIMENTS
In this section we compare the clustering performance of the p-value and the Gaussian kernel in the context of disturbed data. In order to visualize the clustering results, we only consider bi-dimensional and tri-dimensional data. We still suppose in this section that the data set contains vectors Y = Θ + X of ℝ^d, where Θ denotes the vector of interest and X ∼ N(0, σ² I_d) denotes the noise vector.
We will start with the unnormalized spectral clustering where the results are very satisfying.
Figure 3.2: Two circles with additive noise drawn from the normal distribution. A good clustering
must separate the two circles. The number of points in each circle is equal to 50.
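As a rough sketch of how such a test set can be generated (not the report's script; the radii and noise level below are assumed for illustration, since the report only states the number of points per circle and the Gaussian noise model), two noisy concentric circles can be drawn as follows.

import numpy as np

def two_noisy_circles(n_per_circle=50, radii=(1.0, 3.0), noise_std=0.2, seed=0):
    """Sample n_per_circle points on each of two concentric circles and add
    Gaussian noise N(0, noise_std^2 I_2), in the spirit of figure 3.2."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for label, r in enumerate(radii):
        angles = rng.uniform(0.0, 2.0 * np.pi, n_per_circle)
        circle = r * np.column_stack([np.cos(angles), np.sin(angles)])
        points.append(circle + noise_std * rng.standard_normal((n_per_circle, 2)))
        labels.append(np.full(n_per_circle, label))
    return np.vstack(points), np.concatenate(labels)

X, true_labels = two_noisy_circles()   # ready for the clustering sketches above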
Figure 3.3: On the left, the resulting clustering using the RDT p-value kernel. In the middle and on the right, the resulting clustering of the Gaussian kernel with the parameter σ respectively equal to the noise standard deviation and to twice that value.
As we can see in figure 3.3, the performances of the Gaussian kernel and of the RDT p-value are both satisfying. In fact, the number of points in each circle is large enough to lead to good clustering results.
Figure 3.4: The figure on the left shows the resulting clustering using the RDT p-value, whereas the two others show the clustering results of the Gaussian kernel (16 points in each circle).
In figure 3.4 above we see the impact of reducing the number of points on the clustering performance. Here, the RDT p-value leads to far better results than the Gaussian kernel does. In figure 3.5 below, we try the two similarity functions on tri-dimensional data. Here, a good clustering must separate the two helices. Again we notice that the RDT p-value leads to interesting results.
Figure 3.5: The figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel.
The script we used to obtain all the figures above is available in the attachments under the name "USCmain.m". Put a breakpoint between the two blocks before running the script, and try changing N, the number of points (line 81), to see its impact on the clustering results.
In the following we will see that the symmetric and random walk normalized spectral clustering algorithms do not always lead to good clustering results. But as we will notice in the figures, there is an interval of tolerance values over which the RDT p-value kernel leads to the correct clusters.
Figure 3.6: This figure represents the original data set to partition using the symmetric normalized spectral clustering algorithm. The standard deviation of the noise is equal to 0.5.
After running the symmetric normalized spectral clustering algorithm several times with different initial conditions, it seems that it does not lead to satisfying results, in particular with the Gaussian similarity function. However, choosing an appropriate value of the tolerance τ makes the RDT p-value kernel work very well (see figures 3.7 and 3.8 below). As we can see in figure 3.7, the symmetric normalized spectral clustering did not manage to reach the correct clusters for a tolerance between 0 and 1.33. However, for a tolerance between 1.7778 and 4, the results are very satisfying (figure 3.8). To obtain figures 3.7 and 3.8, run the file "SymSCmain.m" in the attachments.
As for the random walk normalized spectral clustering, the results were very similar (see figures 3.9, 3.10 and 3.11 below): they are not satisfying for 0 < τ < 1.1111 and 3.3333 < τ < 5 (figures 3.9 and 3.11), whereas a tolerance 1.6667 < τ < 2.7778 leads to the correct clusters (figure 3.10).
CONCLUSION
The experiments show that choosing the p-value of random distortion tests as a similarity function in spectral clustering algorithms leads to promising results. In fact, the results were very satisfying when we used the unnormalized spectral clustering algorithm to partition bi-dimensional and tri-dimensional data. In the normalized spectral clustering cases, the experiments showed that we must find an appropriate tolerance to reach satisfying results. We have also proved that in some cases the p-value and the Gaussian kernel are equal.
Finding the appropriate value of the tolerance can be a difficult task. This is also true for the parameter σ of the Gaussian kernel. But in general, such parameters are chosen so as to optimize certain criteria.
REFERENCES
[1] D. Pastor and Q.-T. Nguyen, "Random Distortion Testing and Optimality of Thresholding Tests", IEEE Transactions on Signal Processing, Vol. 61, No. 16, August 15, 2013.
[2] D. MacKay, "Chapter 20. An Example Inference Task: Clustering", Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003, pp. 284-292.
[3] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, August 2000.
[4] U. von Luxburg, "A Tutorial on Spectral Clustering", Statistics and Computing, 17 (4), 2007.
[5] F. R. K. Chung, Spectral Graph Theory, Vol. 92 of the CBMS Regional Conference Series in Mathematics, 1997.
[6] L. Hagen and A. B. Kahng, "New spectral methods for ratio cut partitioning and clustering", IEEE Trans. Computer-Aided Design, 11(9), 1992, pp. 1074-1085.
[7] A. Y. Ng, M. I. Jordan and Y. Weiss, "On spectral clustering: analysis and an algorithm", in T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, 2002.
Figure 3.7: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 0 < τ < 1.33.

Figure 3.8: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 1.7778 < τ < 4.

Figure 3.9: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 0 < τ < 1.1111.

Figure 3.10: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 1.6667 < τ < 2.7778.

Figure 3.11: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 3.3333 < τ < 5.
Page 31 of 31

Weitere ähnliche Inhalte

Was ist angesagt?

A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...IJDKP
 
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...ijfls
 
Tensor representations in signal processing and machine learning (tutorial ta...
Tensor representations in signal processing and machine learning (tutorial ta...Tensor representations in signal processing and machine learning (tutorial ta...
Tensor representations in signal processing and machine learning (tutorial ta...Tatsuya Yokota
 
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...ijdpsjournal
 
Second or fourth-order finite difference operators, which one is most effective?
Second or fourth-order finite difference operators, which one is most effective?Second or fourth-order finite difference operators, which one is most effective?
Second or fourth-order finite difference operators, which one is most effective?Premier Publishers
 
Solving the Poisson Equation
Solving the Poisson EquationSolving the Poisson Equation
Solving the Poisson EquationShahzaib Malik
 
International journal of engineering issues vol 2015 - no 2 - paper5
International journal of engineering issues   vol 2015 - no 2 - paper5International journal of engineering issues   vol 2015 - no 2 - paper5
International journal of engineering issues vol 2015 - no 2 - paper5sophiabelthome
 
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...IJERA Editor
 
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...Adam Fausett
 
Trust Region Algorithm - Bachelor Dissertation
Trust Region Algorithm - Bachelor DissertationTrust Region Algorithm - Bachelor Dissertation
Trust Region Algorithm - Bachelor DissertationChristian Adom
 
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...ijsrd.com
 
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...sipij
 
Ch2 probability and random variables pg 81
Ch2 probability and random variables pg 81Ch2 probability and random variables pg 81
Ch2 probability and random variables pg 81Prateek Omer
 
Icitam2019 2020 book_chapter
Icitam2019 2020 book_chapterIcitam2019 2020 book_chapter
Icitam2019 2020 book_chapterBan Bang
 
Robust Fuzzy Data Clustering In An Ordinal Scale Based On A Similarity Measure
Robust Fuzzy Data Clustering In An Ordinal Scale Based On A Similarity MeasureRobust Fuzzy Data Clustering In An Ordinal Scale Based On A Similarity Measure
Robust Fuzzy Data Clustering In An Ordinal Scale Based On A Similarity MeasureIJRES Journal
 
hankel_norm approximation_fir_ ijc
hankel_norm approximation_fir_ ijchankel_norm approximation_fir_ ijc
hankel_norm approximation_fir_ ijcVasilis Tsoulkas
 

Was ist angesagt? (19)

Numerical Solution and Stability Analysis of Huxley Equation
Numerical Solution and Stability Analysis of Huxley EquationNumerical Solution and Stability Analysis of Huxley Equation
Numerical Solution and Stability Analysis of Huxley Equation
 
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
 
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
 
Tensor representations in signal processing and machine learning (tutorial ta...
Tensor representations in signal processing and machine learning (tutorial ta...Tensor representations in signal processing and machine learning (tutorial ta...
Tensor representations in signal processing and machine learning (tutorial ta...
 
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
 
Second or fourth-order finite difference operators, which one is most effective?
Second or fourth-order finite difference operators, which one is most effective?Second or fourth-order finite difference operators, which one is most effective?
Second or fourth-order finite difference operators, which one is most effective?
 
Solving the Poisson Equation
Solving the Poisson EquationSolving the Poisson Equation
Solving the Poisson Equation
 
Intro
IntroIntro
Intro
 
International journal of engineering issues vol 2015 - no 2 - paper5
International journal of engineering issues   vol 2015 - no 2 - paper5International journal of engineering issues   vol 2015 - no 2 - paper5
International journal of engineering issues vol 2015 - no 2 - paper5
 
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
 
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
 
Trust Region Algorithm - Bachelor Dissertation
Trust Region Algorithm - Bachelor DissertationTrust Region Algorithm - Bachelor Dissertation
Trust Region Algorithm - Bachelor Dissertation
 
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
 
NFSFIXES
NFSFIXESNFSFIXES
NFSFIXES
 
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
 
Ch2 probability and random variables pg 81
certain model, as is the case in [1]. In the latter case, threshold-based tests are commonly used to decide between the null hypothesis, namely that the signal actually follows the known model, and the alternative hypothesis, namely that the observed signal is an anomaly. The false-alarm probability is one of the quantities we try to control in such tests. It is defined as the probability of deciding that there is an anomaly whereas the null hypothesis actually holds. The RDT theoretical framework enables us to guarantee any level of this probability regardless of the unknown distribution of the signal of interest. Hence the strength of this framework.

2.2 NOTATION

As mentioned above, we suppose in [1] that the observed data is a vector Y = Θ + X such that Θ is a random vector of R^d with unknown distribution encoding the signal of interest, whereas X is an additive independent Gaussian noise with zero mean and positive definite covariance matrix C. Let θ0 denote a known deterministic model of the signal of interest.
Let H0 denote the null hypothesis and H1 the alternative one. We have:

$$
\begin{cases}
H_0 : \|\Theta - \theta_0\| \le \tau \\
H_1 : \|\Theta - \theta_0\| > \tau
\end{cases}
\tag{2.1}
$$

where ‖·‖ denotes the Mahalanobis norm defined with respect to the covariance matrix C as:

$$
\forall x \in \mathbb{R}^d : \quad \|x\| = \sqrt{x^T C^{-1} x}
\tag{2.2}
$$

τ is called the tolerance; it is the maximum value of the distortion ‖Θ − θ0‖ that we tolerate. By "tolerate" we mean that we decide that Θ still follows the deterministic model θ0 if the distortion ‖Θ − θ0‖ is below this value. Otherwise we decide that there is an anomaly. Finally, given any η ≥ 0, a threshold-based test with threshold height η on the Mahalanobis distance to θ0 is any measurable map Tη of R^d into {0,1} such that:

$$
\forall y \in \mathbb{R}^d : \quad T_\eta(y) =
\begin{cases}
1 & \text{if } \|y - \theta_0\| > \eta \\
0 & \text{if } \|y - \theta_0\| \le \eta
\end{cases}
\tag{2.3}
$$

2.3 THEOREM

In most signal processing problems, we cannot decide whether the distortion ‖Θ − θ0‖ satisfies H0 or H1 on the basis of its true value. In fact, the observed data Y might be a corrupted version of Θ due to the noise X. In [1] it is proven that for any γ and τ both greater than 0, the false-alarm probability of the threshold-based test T_{λγ(τ)} is at most γ. Here λγ(τ) denotes the unique solution of the equation:

$$
1 - R(\tau, \lambda_\gamma(\tau)) = \gamma
\tag{2.4}
$$

where the function R is defined for any ρ and η both greater than 0 as R(ρ, η) = F_{χ²_d(ρ²)}(η²), and F_{χ²_d(ρ²)} denotes the cumulative distribution function of the noncentral chi-squared distribution with d degrees of freedom and non-centrality parameter ρ². In section 3.5, we will define the p-value of a random distortion test and study its performance as a kernel function in spectral clustering contexts.
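To make the relation between γ, τ and λγ(τ) concrete, here is a minimal numerical sketch (not the authors' code, and assuming R(ρ, η) is indeed the noncentral chi-squared CDF described above) that inverts equation (2.4) with SciPy:

```python
# Sketch of the RDT threshold lambda_gamma(tau), assuming
# R(rho, eta) = F_{chi2_d(rho^2)}(eta^2) as in equation (2.4).
import numpy as np
from scipy.stats import ncx2

def rdt_threshold(gamma, tau, d):
    """Return lambda_gamma(tau): the unique eta >= 0 such that 1 - R(tau, eta) = gamma."""
    # 1 - F(eta^2) = gamma  <=>  eta^2 = F^{-1}(1 - gamma)
    return np.sqrt(ncx2.ppf(1.0 - gamma, df=d, nc=tau**2))

# Example: the test T_{lambda_gamma(tau)} then has false-alarm probability at most gamma.
print(rdt_threshold(gamma=0.05, tau=1.0, d=2))
```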
Chapter 3
Spectral Clustering

3.1 INTRODUCTION

Data partitioning aims at deciding which sets of data have "similar" properties. These sets are commonly called clusters. One of the most widely used algorithms in data partitioning is k-means clustering. Given a data set {Y1, Y2, ..., Yn}, the objective is to partition it into k (k ≤ n) sets called clusters, {C1, C2, ..., Ck}, so as to minimize the quantity:

$$
\sum_{i=1}^{k} \sum_{Y_j \in C_i} \| Y_j - m_i \|^2
$$

where mi denotes the mean of the cluster Ci. For more information about how to implement k-means clustering, see [2]; a minimal implementation is sketched below.

The spectral clustering approach consists in building a graph where the vertices encode the data and the edges encode the "degree of connection" between these vertices. We generally weight the edges using a similarity matrix S, where sij ≥ 0 represents a notion of similarity between the i-th and the j-th elements of the observed data. We then partition the graph into k clusters of vertices. In each cluster, the vertices must be highly connected within the cluster and well separated from the rest of the graph.

Since spectral clustering algorithms are used in graph partitioning contexts, spectral clustering is closely related to graph theory. In the next section, we introduce the notions of graph theory necessary to approach spectral clustering algorithms. Then in section 3.3 we define the graph Laplacian matrices, which are the main tools of these algorithms. We will see in section 3.4 how these matrices are used to approximate the optimal solutions of the classical NCut and RatioCut problems. Finally, we will discuss in section 3.5 the partitioning performance of the RDT p-value compared to the Gaussian kernel and conclude.
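The following is a minimal sketch of Lloyd's iteration for the k-means objective above. It is not the implementation of [2], only an illustration under simple assumptions (random initialization, a fixed number of iterations, no empty-cluster handling):

```python
# Minimal k-means (Lloyd's algorithm) sketch for the objective above.
import numpy as np

def kmeans(Y, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=k, replace=False)]   # random initial means
    for _ in range(n_iter):
        # assign each point to the nearest center
        labels = np.argmin(((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its cluster
        # (no empty-cluster handling: this is only a sketch)
        centers = np.array([Y[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```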
3.2 DEFINITIONS

Definition 3.2.1 (Graph). A graph is generally defined by a couple G = (V, E). V is called the set of vertices (or nodes) and E the set of edges, which is a subset of V × V, where A × B denotes the Cartesian product defined as A × B = {(a, b) | a ∈ A and b ∈ B}. Two vertices x and y are connected if (x, y) ∈ E.

Definition 3.2.2 (Directed and undirected graphs). A graph G = (V, E) is said to be undirected if for all (x, y) in V × V, (x, y) ∈ E if and only if (y, x) ∈ E. Otherwise, the graph is said to be directed. The direction has no effect on the connection between the vertices of an undirected graph.

Definition 3.2.3 (Weighted and unweighted graphs - The degree of a vertex). We call weighted a graph where we assign to each element (x, y) ∈ E a real number called the weight of the edge (x, y). When an undirected graph is weighted, the edges (x, y) and (y, x) must be equally weighted. We call degree of the vertex i the quantity:

$$
d_i := \sum_{\substack{j \in V \\ (i,j) \in E}} s_{ij}
\tag{3.1}
$$

where sij denotes the weight of the edge (i, j).

Definition 3.2.4 (Simple graphs and multigraphs). A simple graph is a graph that contains no loops. A loop is an edge that connects a vertex to itself. A graph where loops are allowed is called a multigraph.

Definition 3.2.5 (Connected components). Two vertices x and y are linked if there is a "path" of edges that connects them. More formally, there is a positive integer n and a sequence of vertices (x_k), k ∈ {0, 1, ..., n}, such that x_0 = x, x_n = y and, for all k ∈ {0, 1, ..., n − 1}, x_k and x_{k+1} are connected. Let A denote a subset of V. A is said to be a connected component if it satisfies these two conditions:

i. ∀(x, y) ∈ A × A : x and y are linked by a path
ii. ∀(x, y) ∈ A × (V \ A) : x and y are not linked
Figure 3.1: Graphical representation of a simple, undirected and weighted graph. The graph contains 6 vertices labeled by their indexes {0, 1, ..., 5} and 5 weighted edges. As we can find a path between any couple of vertices, this graph contains only one connected component.

Definition 3.2.6 (Similarity and degree matrices). One of the commonly used ways to represent a graph is the similarity matrix. It is a square matrix S whose entry sij is equal to the weight of the edge connecting the i-th and the j-th vertices if they are connected. Otherwise, the entry sij is zero. The degree matrix is defined as D := diag(d1, ..., dn), where di denotes the degree of the vertex i.

Example: the similarity and degree matrices corresponding to the graph represented in figure 3.1 are:

$$
S =
\begin{pmatrix}
0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 3.5 & 0 & 0 & 0 \\
2 & 3.5 & 0 & 0.1 & 0 & 0 \\
0 & 0 & 0.1 & 0 & 10 & 3 \\
0 & 0 & 0 & 10 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 & 0
\end{pmatrix}
\quad\text{and}\quad
D =
\begin{pmatrix}
2 & 0 & 0 & 0 & 0 & 0 \\
0 & 3.5 & 0 & 0 & 0 & 0 \\
0 & 0 & 5.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 13.1 & 0 & 0 \\
0 & 0 & 0 & 0 & 10 & 0 \\
0 & 0 & 0 & 0 & 0 & 3
\end{pmatrix}
$$

In the following, let G = (V, E) denote a simple, undirected and weighted graph where all the edges have non-negative weights. Thus, the matrix S is symmetric and its diagonal elements are zero. Let n = |V| denote the number of its vertices. Figure 3.1 represents an example of such a graph.
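As a sanity check, here is a short sketch (assuming the edge weights of figure 3.1 listed above) that builds S and D numerically:

```python
# Build the similarity and degree matrices of the example graph of figure 3.1.
import numpy as np

n = 6
S = np.zeros((n, n))
# undirected weighted edges (i, j, weight) read from figure 3.1
for i, j, w in [(0, 2, 2.0), (1, 2, 3.5), (2, 3, 0.1), (3, 4, 10.0), (3, 5, 3.0)]:
    S[i, j] = S[j, i] = w          # S is symmetric for an undirected graph

D = np.diag(S.sum(axis=1))          # D = diag(d_1, ..., d_n)
print(np.diag(D))                    # [ 2.   3.5  5.6 13.1 10.   3. ]
```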
3.3 GRAPH LAPLACIAN MATRICES

The graph Laplacian matrices are, along with similarity matrices, the key elements of spectral clustering and spectral graph theory (see [5] for an overview of this theory). The three main types are the unnormalized (also called standard), the symmetric normalized and the random walk normalized graph Laplacians. Let D and S respectively denote the degree and similarity matrices of the graph G.

- Unnormalized graph Laplacian: L := D − S, with entries (L)ii = di and (L)ij = −sij for i ≠ j.
- Symmetric normalized graph Laplacian: Lsym := D^{−1/2} L D^{−1/2} = In − D^{−1/2} S D^{−1/2}, with entries (Lsym)ii = 1 and (Lsym)ij = −sij / √(di dj) for i ≠ j.
- Random walk normalized graph Laplacian: Lrw := D^{−1} L = In − D^{−1} S, with entries (Lrw)ii = 1 and (Lrw)ij = −sij / di for i ≠ j.

Table 3.1: Definition and form of graph Laplacian matrices.

Lsym and Lrw are said to be normalized because the entries Lij of the unnormalized graph Laplacian are divided either by the degree of the i-th vertex (Lrw) or by the square root of the product of the latter and the degree of the j-th vertex (Lsym); see table 3.1 above. Note that the forms given in table 3.1 are not necessarily the same for all types of graphs. First, recall that G is a simple graph: the weights sii are therefore zero and do not appear in the diagonal entries. Second, if there is an isolated vertex i, all the terms sij and sji with j ∈ {1, ..., n} would be zero and so would be di. In that case, D^{−1} would be replaced by the generalized inverse of D to avoid divisions by zero while computing Lsym and Lrw; the i-th row would then be zero in both Lsym and Lrw, and so would be the i-th column of Lsym.

Now that we have defined the graph Laplacian matrices, it is time to show some of their mathematical properties that will be helpful to understand spectral clustering algorithms.
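A small sketch (assuming every vertex has a non-zero degree, as noted above) computing the three Laplacians of table 3.1 from a similarity matrix:

```python
# Compute the unnormalized, symmetric normalized and random walk normalized Laplacians.
import numpy as np

def graph_laplacians(S):
    d = S.sum(axis=1)                            # vertex degrees
    D = np.diag(d)
    L = D - S                                    # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt          # I - D^{-1/2} S D^{-1/2}
    L_rw = np.diag(1.0 / d) @ L                  # I - D^{-1} S
    return L, L_sym, L_rw
```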
Proposition 3.3.1 (Properties of L).

i. For all f = (f1, ..., fn)^T ∈ R^n:
$$
f^T L f = \frac{1}{2} \sum_{1 \le i \le n} \sum_{1 \le j \le n} s_{ij} (f_i - f_j)^2
$$
ii. L is a symmetric positive semi-definite matrix with n real non-negative eigenvalues.
iii. 0 is an eigenvalue of L with the constant one vector 1 as a corresponding eigenvector.
iv. The number of connected components of the graph G equals the multiplicity of the eigenvalue 0. Let A1, ..., Ak denote the connected components of the graph G. Then the eigenspace of 0 is spanned by the indicator vectors 1_{A1}, ..., 1_{Ak} of these subsets, where 1_{Aj}(i) = 1 if i ∈ Aj and 0 if i ∉ Aj.

Proof.
i. Let f = (f1, ..., fn)^T be a vector of R^n. Then:
$$
f^T L f = f^T D f - f^T S f = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} s_{ij} f_i f_j
= \frac{1}{2} \left( \sum_{i=1}^{n} d_i f_i^2 - 2 \sum_{i,j=1}^{n} s_{ij} f_i f_j + \sum_{j=1}^{n} d_j f_j^2 \right)
$$
According to the definition of di, we have $d_i = \sum_{j=1}^{n} s_{ij} = \sum_{j=1}^{n} s_{ji}$, because the graph is undirected and therefore its similarity matrix S is symmetric. Thus:
$$
f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} s_{ij} \left( f_i^2 - 2 f_i f_j + f_j^2 \right) = \frac{1}{2} \sum_{i,j=1}^{n} s_{ij} (f_i - f_j)^2
$$
ii. L = D − S. G is undirected, so S is symmetric; D is diagonal and therefore symmetric. Then L is symmetric as a difference of two symmetric matrices. We have supposed that all the weights are non-negative, so it follows from i. that L is positive semi-definite and therefore has n real non-negative eigenvalues.
iii. The sum of each row of L is zero by the definition of the degree of a vertex. Thus, 0 is an eigenvalue of L and the constant one vector 1 is a corresponding eigenvector.
iv. See section 3.1 in [4].
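A quick numerical illustration of property iv., on a toy graph made of two disjoint triangles (the example graph is an assumption chosen here, not taken from the report):

```python
# Property iv.: multiplicity of the eigenvalue 0 of L = number of connected components.
import numpy as np

S = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:   # two disjoint triangles
    S[i, j] = S[j, i] = 1.0

L = np.diag(S.sum(axis=1)) - S
eigvals = np.linalg.eigvalsh(L)                 # L is symmetric: real eigenvalues
print(np.sum(np.isclose(eigvals, 0.0)))         # prints 2: two connected components
```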
Proposition 3.3.2 (Properties of Lsym and Lrw).

i. For all f = (f1, ..., fn)^T ∈ R^n:
$$
f^T L_{sym} f = \frac{1}{2} \sum_{1 \le i \le n} \sum_{1 \le j \le n} s_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2
$$
ii. Lsym is symmetric positive semi-definite with n real non-negative eigenvalues.
iii. λ is an eigenvalue of Lrw with eigenvector u if and only if λ is an eigenvalue of Lsym with eigenvector w = D^{1/2} u.
iv. λ is an eigenvalue of Lrw with eigenvector u if and only if λ and u solve the generalized eigenproblem Lu = λDu.
v. 0 is an eigenvalue of Lrw with the constant one vector 1 as eigenvector, and an eigenvalue of Lsym with eigenvector D^{1/2} 1.
vi. The number of connected components of the graph G is equal to the multiplicity of the eigenvalue 0 of Lrw and of Lsym. Let A1, ..., Ak denote the connected components of the graph G. Then the eigenspace of 0 of Lrw is spanned by the indicator vectors 1_{A1}, ..., 1_{Ak} of these subsets. For Lsym, the eigenspace of 0 is spanned by the vectors D^{1/2} 1_{A1}, ..., D^{1/2} 1_{Ak}.

Proof. Here we suppose that all the vertices have non-zero degrees.
i. Let f = (f1, ..., fn)^T be a vector of R^n. Then:
$$
f^T L_{sym} f = f^T D^{-1/2} L D^{-1/2} f = (D^{-1/2} f)^T L (D^{-1/2} f) = g^T L g,
\quad\text{where } g = D^{-1/2} f = \left( \frac{f_1}{\sqrt{d_1}}, \dots, \frac{f_n}{\sqrt{d_n}} \right)^T.
$$
The statement then follows from i. of proposition 3.3.1.
ii. $L_{sym}^T = (D^{-1/2} L D^{-1/2})^T = D^{-1/2} L^T D^{-1/2} = D^{-1/2} L D^{-1/2} = L_{sym}$. Then Lsym is symmetric (hence the name of this graph Laplacian). It follows from i. that Lsym is positive semi-definite.
iii. Let λ be an eigenvalue of Lrw with eigenvector u. Then Lrw u = λu and D^{−1} L u = λu. Thus, D^{−1/2} L u = λ D^{1/2} u and D^{−1/2} L D^{−1/2} (D^{1/2} u) = λ D^{1/2} u. Therefore, Lsym D^{1/2} u = λ D^{1/2} u and λ is an eigenvalue of Lsym with eigenvector w = D^{1/2} u. Conversely, if λ is an eigenvalue of Lsym with eigenvector w, then Lsym w = λw. Multiplying by D^{−1/2} from the left gives D^{−1} L D^{−1/2} w = λ D^{−1/2} w and therefore Lrw u = λu with u = D^{−1/2} w.
iv. λ is an eigenvalue of Lrw with eigenvector u if and only if D^{−1} L u = λu, if and only if Lu = λDu.
v. It follows from proposition 3.3.1 that L1 = 0, hence D^{−1} L 1 = 0 and Lrw 1 = 0. The second part of the statement follows from iii.
vi. It follows from iv. of proposition 3.3.1 and the previous statements of proposition 3.3.2.

3.4 NCUT AND RATIOCUT PROBLEMS

Given a graph G = (V, E) and a positive integer k, the NCut and RatioCut problems aim at finding a partition A1, ..., Ak of the set V. A good graph partitioning must satisfy two conditions: the first is maximizing the within-cluster connectivity and the second is minimizing the connectivity between different clusters. Sections 5 and 8.5 in [4] were very useful for us to understand the approach of these two graph partitioning problems. Given two subsets A and B of V, let Ā, |A|, vol(A) and W(A, B) denote the following quantities:

- Ā := the complement of A in V
- |A| := the number of vertices in A
- vol(A) := $\sum_{i \in A} d_i$
- W(A, B) := $\sum_{i \in A, j \in B} s_{ij}$

The quantities |A| and vol(A) are two ways to measure the size of the subset A, whereas the quantity W(A, B) measures the connectivity between the two subsets A and B. A simple way to find a suitable partition is to find the subsets A1, ..., Ak that minimize Cut(A1, ..., Ak), defined as:

$$
\mathrm{Cut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A_i})
\tag{3.2}
$$

Recall that in this section the graph G is still supposed to be undirected, simple and with positively-weighted edges. Thus, minimizing the non-negative quantity Cut(A1, ..., Ak) is the same as minimizing the connectivity between the clusters Ai and the rest of the graph. As for the factor 1/2, it is introduced because G is undirected, so that each edge between two different clusters is not counted twice. A short sketch computing these quantities is given below.
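Here is a minimal sketch (written for this report's notation; the partition is assumed to be given as a list of vertex-index arrays) of the quantities |A|, vol(A), W(A, B) and Cut defined above:

```python
# Sketch of |A|, vol(A), W(A, B) and Cut(A_1, ..., A_k) for a similarity matrix S.
import numpy as np

def W(S, A, B):
    """Sum of the weights of the edges between the vertex sets A and B."""
    return S[np.ix_(A, B)].sum()

def vol(S, A):
    """Sum of the degrees of the vertices of A."""
    return S[A, :].sum()

def cut(S, partition):
    """Cut(A_1, ..., A_k) = 1/2 * sum_i W(A_i, complement of A_i), cf. eq. (3.2)."""
    n = S.shape[0]
    return 0.5 * sum(W(S, A, np.setdiff1d(np.arange(n), A)) for A in partition)
```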
However, this method is not really appealing because it does not impose any condition on the size of the clusters. For example, when k = 2, it often leads to separating one vertex from the rest of the graph (more specifically, the vertex with the lowest degree). The NCut and RatioCut problems (introduced respectively by Shi and Malik in [3] and Hagen and Kahng in [6]) use the degree of a subset and the number of its vertices, respectively, to overcome the problem of "badly-sized" clusters. In the next two subsections, we will define these problems, see how the graph Laplacians are used to relax them and introduce the spectral clustering algorithms that correspond to each of them.

3.4.1 RatioCut

We keep the same notation as above. The RatioCut is defined with respect to the partition A1, ..., Ak as:

$$
\mathrm{RatioCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{|A_i|} = \sum_{i=1}^{k} \frac{\mathrm{Cut}(A_i, \bar{A_i})}{|A_i|}
\tag{3.3}
$$

The objective of this problem is to find the partition that minimizes the quantity 3.3. Here, it is clear that a subset with too few vertices would raise the RatioCut; that is how the RatioCut avoids the problem of badly-sized subsets. The only issue is that the RatioCut problem is NP-hard, hence the necessity of a relaxation method.

Case k = 2

Let us start with the case k = 2, which is easy to express and understand. Here we try to find a subset A that minimizes RatioCut(A, Ā) = Cut(A, Ā)/|A| + Cut(Ā, A)/|Ā|. Let A denote a subset of V and f a vector of R^n defined as f = (f1, ..., fn)^T with:

$$
f_i =
\begin{cases}
\sqrt{|\bar{A}| / |A|}, & \text{if } i \in A \\
-\sqrt{|A| / |\bar{A}|}, & \text{if } i \notin A
\end{cases}
\tag{3.4}
$$

According to proposition 3.3.1:

$$
f^T L f = \frac{1}{2} \sum_{1 \le i, j \le n} s_{ij} (f_i - f_j)^2
= \frac{1}{2} \sum_{i \in A, j \in \bar{A}} s_{ij} (f_i - f_j)^2 + \frac{1}{2} \sum_{i \in \bar{A}, j \in A} s_{ij} (f_i - f_j)^2
$$
Indeed, the sums over i, j ∈ A and over i, j ∈ Ā are null, because the entries fi and fj are equal within each of these sums. Then:

$$
f^T L f
= \frac{1}{2} \sum_{i \in A, j \in \bar{A}} s_{ij} \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2
+ \frac{1}{2} \sum_{i \in \bar{A}, j \in A} s_{ij} \left( -\sqrt{\frac{|A|}{|\bar{A}|}} - \sqrt{\frac{|\bar{A}|}{|A|}} \right)^2
= \mathrm{Cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right)
$$

Thus:

$$
f^T L f = \mathrm{Cut}(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right) = |V| \, \mathrm{RatioCut}(A, \bar{A})
\tag{3.5}
$$

Let ⟨·,·⟩ denote the canonical inner product of R^n and ‖·‖₂ the Euclidean norm. We can notice that:

$$
\langle f, \mathbb{1} \rangle = \sum_{i=1}^{n} f_i = \sum_{i \in A} f_i + \sum_{i \in \bar{A}} f_i
= |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = 0,
$$

so f is orthogonal to 1. We can also notice that:

$$
\|f\|_2^2 = \sum_{i=1}^{n} f_i^2 = \sum_{i \in A} f_i^2 + \sum_{i \in \bar{A}} f_i^2
= |A| \, \frac{|\bar{A}|}{|A|} + |\bar{A}| \, \frac{|A|}{|\bar{A}|} = n.
$$

The RatioCut minimization problem in the case k = 2 can therefore be expressed as finding:

$$
\min_{\substack{A \subset V \\ f \perp \mathbb{1},\ \|f\|_2 = \sqrt{n}}} f^T L f
$$

where f can only take entries fi as defined in equation 3.4. This is a discrete optimization problem, since the entries fi are only allowed to take two different values. The relaxation of this problem consists in allowing these entries to take any real value. The relaxed RatioCut minimization problem can therefore be written as:

$$
\min_{\substack{f \in \mathbb{R}^n \\ f \perp \mathbb{1},\ \|f\|_2 = \sqrt{n}}} f^T L f
\tag{3.6}
$$

which can be easily solved using basic linear algebra. In fact, the solution of this relaxed problem is the eigenvector corresponding to the second smallest eigenvalue of the standard graph Laplacian matrix (remember that the smallest eigenvalue is zero, with eigenvector 1). This can be proven using the Rayleigh-Ritz theorem. This eigenvector can thus be used to approximate the RatioCut minimizing solution. Note that the latter is not necessarily the optimal solution: the entries of the optimal solution must be as defined in equation 3.4, whereas the solution of the relaxed problem is an arbitrary real-valued vector. We should therefore find a way to transform it into a discrete indicator vector so as to classify the vertices accordingly. A simple way consists in using the sign of the entry fi of the solution vector to decide
whether the vertex i is in A or not. But most spectral clustering algorithms use k-means clustering instead: the sign-based heuristic is too simple, especially when we have to deal with more than two clusters. We will now see how to relax the RatioCut minimization problem for k ≥ 2 using the properties of the standard graph Laplacian L.

General case k ≥ 2

Let A1, ..., Ak denote a partition of V and h1, ..., hk denote k vectors of R^n defined as hj = (h1j, ..., hnj)^T, where:

$$
h_{ij} =
\begin{cases}
\dfrac{1}{\sqrt{|A_j|}}, & \text{if } i \in A_j \\
0, & \text{if } i \notin A_j
\end{cases}
\tag{3.7}
$$

Let H denote the n-by-k matrix whose entries are the hij. Since A1, ..., Ak is supposed to be a partition of V, each row of H contains exactly one non-zero entry, namely the j-th one if i ∈ Aj. The factor 1/√|Aj| is used in order to normalize the indicator vectors so as to have ⟨hi, hj⟩ = δij, where δij denotes the Kronecker delta, equal to one if i = j and zero otherwise. We can notice that H^T H = Ik. We have:

$$
h_l^T L h_l = \frac{1}{2} \sum_{1 \le i, j \le n} s_{ij} (h_{il} - h_{jl})^2
= \frac{1}{2} \sum_{i \in A_l, j \in \bar{A_l}} s_{ij} \, \frac{1}{|A_l|} + \frac{1}{2} \sum_{i \in \bar{A_l}, j \in A_l} s_{ij} \, \frac{1}{|A_l|}
= \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|}
$$

Let E_l denote the l-th element of the canonical basis of R^k. We then have:

$$
(H^T L H)_{ll} = E_l^T H^T L H E_l = (H E_l)^T L (H E_l) = h_l^T L h_l = \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|}
$$

We finally obtain:

$$
\mathrm{RatioCut}(A_1, \dots, A_k) = \sum_{l=1}^{k} \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|} = \sum_{l=1}^{k} (H^T L H)_{ll} = \mathrm{Tr}(H^T L H)
\tag{3.8}
$$

where Tr denotes the trace of a matrix. According to equation 3.8, the k ≥ 2 RatioCut minimization problem can be rewritten as:

$$
\min_{\substack{A_1, \dots, A_k \subset V \\ H^T H = I_k}} \mathrm{Tr}(H^T L H)
\tag{3.9}
$$
where H can only take entries hij as defined in 3.7. As in the case k = 2, this is an NP-hard discrete optimization problem. To relax it, we allow the entries hij of the matrix H to take any real value. The relaxed problem can thus be rewritten as:

$$
\min_{\substack{H \in \mathbb{R}^{n \times k} \\ H^T H = I_k}} \mathrm{Tr}(H^T L H)
\tag{3.10}
$$

According to the Rayleigh-Ritz theorem, the solution of the relaxed problem is the matrix U containing the k first eigenvectors of the standard graph Laplacian. By "k first eigenvectors" we mean the k first column vectors of the orthogonal matrix P such that L = P diag(λ1, ..., λn) P^T with λ1 ≤ ... ≤ λn. Again, the solution of the relaxed problem does not necessarily contain discrete indicator vectors like the ones defined in 3.7, hence the necessity of a clustering algorithm. Most spectral clustering algorithms use k-means clustering on the rows of U. Let Y1, ..., Yn denote the n rows of U. The outputs are clusters C1, ..., Ck, and we classify the vertices by choosing i ∈ Aj if Yi ∈ Cj. In fact, two vertices i and j belong to the same cluster if and only if the corresponding rows Yi and Yj of the matrix U belong to the same cluster as well. To understand this equivalence, consider the simple case where the solution U is optimal. Then its column vectors would be similar to the indicator vectors defined in 3.7. It is clear that two vertices i and j belong to the subset Al if and only if hil = hjl, and therefore the i-th and j-th rows of U are equal as well.

The spectral clustering algorithm corresponding to the RatioCut minimization problem is the unnormalized spectral clustering described below.

Algorithm 1: Unnormalized spectral clustering
1: Input: n vectors X1, ..., Xn of R^d encoding the data and the number k of clusters to construct
2: Output: Clusters A1, ..., Ak
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the standard graph Laplacian L
5: Compute v1, ..., vk, the k first eigenvectors of L
6: U ← the matrix containing v1, ..., vk as columns
7: for i = 1, ..., n do
8:     Yi ← the i-th row of U
9: Cluster the vectors Y1, ..., Yn of R^k into clusters C1, ..., Ck
10: Aj = {Xi | Yi ∈ Cj}
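As an illustration, here is a compact sketch of Algorithm 1 in Python (the report's own experiments use MATLAB scripts; this NumPy/scikit-learn version is only an assumed, equivalent reading of the steps above, with a fully connected Gaussian similarity graph):

```python
# Sketch of Algorithm 1 (unnormalized spectral clustering) with a Gaussian kernel.
import numpy as np
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(X, k, sigma=1.0):
    # step 3: fully connected similarity graph with the Gaussian kernel (3.21)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2.0 * sigma**2))
    np.fill_diagonal(S, 0.0)                      # simple graph: no loops
    # step 4: standard graph Laplacian L = D - S
    L = np.diag(S.sum(axis=1)) - S
    # steps 5-6: k first eigenvectors of L (eigh returns eigenvalues in ascending order)
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    # steps 7-10: k-means on the rows of U
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels
```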
3.4.2 NCut

As in the case of the RatioCut, we keep the notation of section 3.4. The NCut is defined with respect to the partition A1, ..., Ak as:

$$
\mathrm{NCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{\mathrm{vol}(A_i)} = \sum_{i=1}^{k} \frac{\mathrm{Cut}(A_i, \bar{A_i})}{\mathrm{vol}(A_i)}
\tag{3.11}
$$

The NCut problem measures the size of a subset A ⊂ V by the quantity vol(A). Recall that vol(A) is equal to the sum of the degrees of the vertices of A. It is clear that a small value of vol(Ai) will raise the value of the NCut; this is how the NCut problem avoids the problem of badly-sized subsets (see section 3.4 above). The advantage of using vol(A) instead of |A| is that vol(A) measures how strong the connectivity between the vertices of A is, unlike |A|, which gives no information about this connectivity. Another advantage of the NCut over the RatioCut is that the NCut minimizes the probability that a random walk on the graph jumps from one cluster to another; see proposition 5 in section 6 of [4] for more explanations. As in the case of the RatioCut problem, the objective here is to minimize the quantity 3.11, which is an NP-hard problem. We will see below how the graph Laplacian matrices are used to relax it.

Case k = 2

The objective here is to find a subset A ⊂ V that minimizes NCut(A, Ā) = Cut(A, Ā)/vol(A) + Cut(Ā, A)/vol(Ā). Given a subset A ⊂ V, let f = (f1, ..., fn)^T be a vector of R^n such that:

$$
f_i =
\begin{cases}
\sqrt{\mathrm{vol}(\bar{A}) / \mathrm{vol}(A)}, & \text{if } i \in A \\
-\sqrt{\mathrm{vol}(A) / \mathrm{vol}(\bar{A})}, & \text{if } i \notin A
\end{cases}
\tag{3.12}
$$

As we did in the RatioCut case, we can check that:

$$
f^T L f = \mathrm{vol}(V) \, \mathrm{NCut}(A, \bar{A}), \qquad f^T D f = \mathrm{vol}(V), \qquad (D f)^T \mathbb{1} = 0
$$

Then we can write the NCut minimization problem as:

$$
\min_{\substack{A \subset V \\ D f \perp \mathbb{1},\ f^T D f = \mathrm{vol}(V)}} f^T L f
\tag{3.13}
$$
such that f can only take entries fi as defined in 3.12. Again, we allow the entries fi to take any real value in order to relax the NP-hard NCut minimization problem. The relaxed problem can therefore be written as:

$$
\min_{\substack{f \in \mathbb{R}^n \\ D f \perp \mathbb{1},\ f^T D f = \mathrm{vol}(V)}} f^T L f
\tag{3.14}
$$

It can also be rewritten in an equivalent way using the substitution g := D^{1/2} f:

$$
\min_{\substack{g \in \mathbb{R}^n \\ g \perp D^{1/2} \mathbb{1},\ \|g\|_2^2 = \mathrm{vol}(V)}} g^T D^{-1/2} L D^{-1/2} g
= \min_{\substack{g \in \mathbb{R}^n \\ g \perp D^{1/2} \mathbb{1},\ \|g\|_2^2 = \mathrm{vol}(V)}} g^T L_{sym} g
\tag{3.15}
$$

According to propositions 3.3.1 and 3.3.2, D^{1/2} 1 is the first eigenvector of Lsym. Then, using the Rayleigh-Ritz theorem, we can conclude that the solution g of the relaxed problem is equal to the second eigenvector of Lsym. If we re-substitute f = D^{−1/2} g, we can notice using proposition 3.3.2 that f is equal to the second eigenvector of Lrw, or equivalently of the generalized eigenproblem Lf = λDf. After finding the solution of the relaxed problem, we carry on the classification process the same way we did in the case of the RatioCut minimization problem.

General case k ≥ 2

As in the case of the RatioCut, given a partition A1, ..., Ak of the set of vertices V, we define the vectors h1, ..., hk of R^n, where hj = (h1j, ..., hnj)^T such that:

$$
h_{ij} =
\begin{cases}
\dfrac{1}{\sqrt{\mathrm{vol}(A_j)}}, & \text{if } i \in A_j \\
0, & \text{if } i \notin A_j
\end{cases}
\tag{3.16}
$$

Again, let H denote the n-by-k matrix containing the vectors h1, ..., hk as columns. We can check that (H^T L H)_{ll} = h_l^T L h_l = Cut(A_l, Ā_l) / vol(A_l), and therefore:

$$
\mathrm{NCut}(A_1, \dots, A_k) = \sum_{l=1}^{k} \frac{\mathrm{Cut}(A_l, \bar{A_l})}{\mathrm{vol}(A_l)} = \sum_{l=1}^{k} (H^T L H)_{ll} = \mathrm{Tr}(H^T L H)
\tag{3.17}
$$
We can also see that H^T D H = Ik. Therefore the NCut minimization problem can be re-expressed as:

$$
\min_{\substack{A_1, \dots, A_k \subset V \\ H^T D H = I_k}} \mathrm{Tr}(H^T L H)
\tag{3.18}
$$

where H can only take the entries hij as defined in 3.16. Again, the relaxation consists in allowing these entries to take any real value. The relaxed problem can then be rewritten as:

$$
\min_{\substack{H \in \mathbb{R}^{n \times k} \\ H^T D H = I_k}} \mathrm{Tr}(H^T L H)
\tag{3.19}
$$

or, after the substitution F := D^{1/2} H:

$$
\min_{\substack{F \in \mathbb{R}^{n \times k} \\ F^T F = I_k}} \mathrm{Tr}(F^T D^{-1/2} L D^{-1/2} F)
= \min_{\substack{F \in \mathbb{R}^{n \times k} \\ F^T F = I_k}} \mathrm{Tr}(F^T L_{sym} F)
\tag{3.20}
$$

As in the case of the RatioCut problem, the Rayleigh-Ritz theorem allows us to conclude that the solution of the relaxed problem is given by the matrix F containing the first k eigenvectors of the symmetric normalized graph Laplacian Lsym. Then, if we re-substitute H = D^{−1/2} F, we can notice using proposition 3.3.2 that H is equal to the matrix containing the k first eigenvectors of Lrw, or equivalently of the generalized eigenproblem Lu = λDu. The spectral clustering algorithms corresponding to the NCut minimization problem are the random walk normalized spectral clustering introduced by Shi and Malik in [3] and the symmetric normalized spectral clustering introduced by Ng, Jordan and Weiss in [7].

Algorithm 2: Random walk normalized spectral clustering (Shi and Malik [3])
1: Input: n vectors X1, ..., Xn of R^d encoding the data and the number k of clusters to construct
2: Output: Clusters A1, ..., Ak
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the standard graph Laplacian L
5: Compute v1, ..., vk, the k first eigenvectors of the generalized eigenproblem Lu = λDu (which are also the k first eigenvectors of Lrw)
6: U ← the matrix containing v1, ..., vk as columns
7: for i = 1, ..., n do
8:     Yi ← the i-th row of U
9: Cluster the vectors Y1, ..., Yn of R^k into clusters C1, ..., Ck
10: Aj = {Xi | Yi ∈ Cj}
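A minimal sketch of Algorithm 2 (again an assumed Python reading of the pseudo-code, not the report's MATLAB scripts), solving the generalized eigenproblem Lu = λDu with SciPy:

```python
# Sketch of Algorithm 2 (random walk normalized spectral clustering, Shi and Malik).
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def shi_malik_spectral_clustering(S, k):
    d = S.sum(axis=1)
    D = np.diag(d)
    L = D - S
    # generalized eigenproblem L u = lambda D u; eigh returns ascending eigenvalues
    _, vecs = eigh(L, D)
    U = vecs[:, :k]                               # k first generalized eigenvectors
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels
```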
Algorithm 3: Symmetric normalized spectral clustering (Ng, Jordan and Weiss [7])
1: Input: n vectors X1, ..., Xn of R^d encoding the data and the number k of clusters to construct
2: Output: Clusters A1, ..., Ak
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the symmetric normalized graph Laplacian Lsym
5: Compute v1, ..., vk, the k first eigenvectors of Lsym
6: U ← the matrix containing v1, ..., vk as columns
7: Compute T, the n-by-k matrix obtained by normalizing the rows of U:
8: for i = 1, ..., n and j = 1, ..., k do
9:     tij ← uij / (Σ_{p=1}^{k} u_{ip}²)^{1/2}
10: for i = 1, ..., n do Yi ← the i-th row of T
11: Cluster the vectors Y1, ..., Yn of R^k into clusters C1, ..., Ck
12: Aj = {Xi | Yi ∈ Cj}

Note that the row-normalization step in the symmetric normalized spectral clustering aims at avoiding certain computational problems, related to the limited precision of numerical computations. Note also that, in their experiments, the authors of [3] and [7] used the fully connected graph with the Gaussian kernel as a similarity graph. We will do the same in the next section, because the two other options (see section 3.5) require the choice of additional parameters; this choice is generally based on heuristics, which might have a negative impact on the experiments.

3.5 RANDOM DISTORTION TESTING p-VALUE KERNEL

In all spectral clustering algorithms, the first step consists in building a similarity graph. Recall that data partitioning is based on the notion of similarity: a cluster must contain data that are "similar" within the cluster and "dissimilar" to the rest of the data set. A natural way to model the similarity within the data consists in building a graph whose vertices encode the data. The weights of the connections are then chosen to be proportional to the degree of "similarity" between the vertices. A kernel function (also called a similarity function) is a map from the set of edges E to the set of real numbers; it assigns to each pair of connected vertices a real number called the weight. In practice, the data set contains vectors of R^d. Thus we can say that two vectors are similar if the distance between them (e.g. the Euclidean distance) is "small". Therefore, the smaller the distance between two vertices, the greater the weight of the edge connecting them.
The most popular kernel function is the Gaussian kernel, defined with respect to a parameter σ and two d-dimensional vectors xi and xj as:

$$
s_{ij} := \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)
\tag{3.21}
$$

The Gaussian kernel is clearly a positive, decreasing function of the Euclidean distance. The parameter σ is homogeneous to a standard deviation, hence its notation. We can also notice that the Gaussian kernel is homogeneous to a probability. Given a set of d-dimensional vectors and a kernel function, there are different ways to build a similarity graph:

• The fully connected graph: we connect all the vectors and then weight the edges using the kernel function.
• The k-nearest neighbors graph: this method consists in connecting each vector to its k nearest vectors in terms of the Euclidean distance. We then weight the edges using the kernel function.
• The ε-neighborhood graph: we connect the vectors xi and xj if ‖xi − xj‖₂ ≤ ε and then weight the edges using the kernel function.

In this internship we study the impact on the partitioning performance of choosing the kernel function equal to the p-value of random distortion testing. First of all, let us give a formal definition of the p-value of a random distortion test.

Definition 3.5.1 (p-value of a random distortion test). The p-value of a random distortion test is defined with respect to its tolerance τ, its deterministic model θ0, its noise covariance matrix C and a vector y of R^d as:

$$
\hat{\gamma}(y) = \inf\{ \gamma \in (0,1) : T_{\lambda_\gamma(\tau)}(y) = 1 \} = \inf\{ \gamma \in (0,1) : \lambda_\gamma(\tau) < \|y - \theta_0\| \}
$$

Like the Gaussian kernel, the p-value of RDT is homogeneous to a probability. It can be seen as the optimal value of the false-alarm probability. In fact, for a fixed value of the tolerance τ, it is proven in [1] that λγ(τ) is a strictly decreasing function of γ. Hence the p-value is optimal in the sense that it is neither too large, which would lead to frequent false alarms, nor too small, which would lead to missed detections. Given a vector y and a tolerance τ, it follows from the definition of the p-value that a small value of the latter corresponds to a high value of λ_{γ̂(y)}(τ), and therefore to a high value of the distortion ‖y − θ0‖, and vice versa.
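A hedged numerical sketch of this p-value (assuming, as in section 2.3, that 1 − R(τ, ·) is the survival function of a noncentral chi-squared distribution; the function and parameter names below are chosen for this illustration only):

```python
# Sketch of the RDT p-value gamma_hat(y) of definition 3.5.1, using equation (2.4):
# gamma_hat(y) = 1 - R(tau, ||y - theta_0||) = 1 - F_{chi2_d(tau^2)}(||y - theta_0||^2).
import numpy as np
from scipy.stats import ncx2

def rdt_pvalue(y, theta_0, C, tau):
    y = np.asarray(y, dtype=float)
    theta_0 = np.asarray(theta_0, dtype=float)
    d = len(y)
    diff = y - theta_0
    maha = np.sqrt(diff @ np.linalg.solve(C, diff))     # Mahalanobis norm ||y - theta_0||
    return ncx2.sf(maha**2, df=d, nc=tau**2)            # survival function of chi2_d(tau^2)
```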
Now let us consider the case where θ0 = 0_{R^d}, the zero vector. Then, given two vectors yi and yj of R^d, a small value of γ̂(yi − yj) corresponds to a high value of ‖yi − yj‖ and vice versa. This is the first reason why we thought of using the p-value as a similarity (kernel) function. Another reason is that when d = 2, τ = 0 and the covariance matrix is C = I2, the p-value and the Gaussian kernel (with σ = 1) are equal. In fact, since λγ(τ) is a strictly decreasing function of γ, γ̂(y) satisfies:

$$
\lambda_{\hat{\gamma}(y)}(\tau) = \|y - \theta_0\|
\tag{3.22}
$$

It then follows from equation 2.4 (see section 2.3) that:

$$
\hat{\gamma}(y) = 1 - R(\tau, \lambda_{\hat{\gamma}(y)}(\tau)) = 1 - R(\tau, \|y - \theta_0\|) = Q_{d/2}(\tau, \|y - \theta_0\|)
\tag{3.23}
$$

where Q_M denotes the (generalized) Marcum Q-function, defined for two real numbers x and y as:

$$
Q_M(x, y) = \exp\left( -\frac{x^2 + y^2}{2} \right) \sum_{k=1-M}^{\infty} \left( \frac{x}{y} \right)^k I_k(x y)
\tag{3.24}
$$

where I_k denotes the modified Bessel function of the first kind of order k. Thus, it follows from 3.23 and 3.24 that in the case τ = 0, d = 2 and C = I2:

$$
\|y - \theta_0\| = \sqrt{(y - \theta_0)^T I_2 (y - \theta_0)} = \|y - \theta_0\|_2
$$

and:

$$
\hat{\gamma}(y) = Q_1(0, \|y - \theta_0\|_2)
= \exp\left( -\frac{\|y - \theta_0\|_2^2}{2} \right) \sum_{k=0}^{\infty} \left( \frac{0}{\|y - \theta_0\|_2} \right)^k I_k(0)
= \exp\left( -\frac{\|y - \theta_0\|_2^2}{2} \right)
\tag{3.25}
$$

since I_0(0) = 1 and 0^k = 0 for every k ≥ 1. It follows from these calculations that the p-value can be considered as a generalization of the Gaussian kernel.

3.6 EXPERIMENTS

In this section we compare the clustering performances of the p-value and the Gaussian kernel in the context of disturbed data. In order to visualize the clustering results, we only consider bi-dimensional and tri-dimensional data. We still suppose in this section that the data set contains vectors Y = Θ + X of R^d, where Θ denotes the vector of interest and X ∼ N(0, σ² I_d) denotes the noise vector.
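The special case above is easy to check numerically; the following short sketch verifies equation (3.25) directly (with τ = 0 the noncentral chi-squared distribution of section 2.3 reduces to the central one):

```python
# Check equation (3.25): for tau = 0, d = 2 and C = I_2, the p-value equals exp(-||y||^2 / 2),
# i.e. the Gaussian kernel with sigma = 1 applied to y = y_i - y_j (theta_0 = 0).
import numpy as np
from scipy.stats import chi2

y = np.array([0.7, -1.2])
gaussian = np.exp(-np.dot(y, y) / 2.0)
# with tau = 0, the p-value is the survival function of a central chi-squared with 2 dof
pvalue = chi2.sf(np.dot(y, y), df=2)
print(np.isclose(pvalue, gaussian))        # True
```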
We start with the unnormalized spectral clustering, where the results are very satisfying.

Figure 3.2: Two circles with additive noise drawn from the normal distribution. A good clustering must separate the two circles. The number of points in each circle is equal to 50.

Figure 3.3: On the left, the resulting clustering using the RDT p-value kernel. In the middle and on the right, the resulting clustering of the Gaussian kernel with parameter σ equal to the noise standard deviation and to twice that value, respectively.

As we can see in figure 3.3, the performances of the Gaussian kernel and the RDT p-value are both satisfying. In fact, the number of points in each circle is large enough to obtain good clustering results.
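For readers who want to reproduce a similar setting, here is a small sketch of how a toy data set like the one of figure 3.2 could be generated (this is an assumption about the setup, not the report's actual script; radii and noise level are illustrative):

```python
# Sketch of a "two noisy circles" toy data set similar to figure 3.2.
import numpy as np

def two_noisy_circles(n_per_circle=50, radii=(1.0, 3.0), noise_sigma=0.15, seed=0):
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for label, r in enumerate(radii):
        angles = rng.uniform(0.0, 2.0 * np.pi, size=n_per_circle)
        circle = np.c_[r * np.cos(angles), r * np.sin(angles)]
        points.append(circle + rng.normal(scale=noise_sigma, size=circle.shape))
        labels.append(np.full(n_per_circle, label))
    return np.vstack(points), np.concatenate(labels)

X, y_true = two_noisy_circles()   # X can then be fed to the spectral clustering sketches above
```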
Figure 3.4: The figure on the left represents the resulting clustering using the RDT p-value, whereas the two others represent the clustering results of the Gaussian kernel (16 points in each circle).

In figure 3.4 above we see the impact of reducing the number of points on the clustering performance. Here, the RDT p-value leads to far better results than the Gaussian kernel does. In figure 3.5 below, we try the two similarity functions on tri-dimensional data, where a good clustering must separate the two helices. Again we notice that the RDT p-value leads to good results.

Figure 3.5: The figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel.
The script used to obtain all the figures above is available in the attachments and is named "USCmain.m". Put a break-point between the two blocks before running the script. Try changing N, the number of points (line 81), to see its impact on the clustering results.

In the following we will see that the symmetric and the random walk normalized spectral clustering algorithms do not always lead to good clustering results. But, as we will notice in the figures, there is an interval of tolerance values for which the RDT p-value kernel leads to the correct clusters.

Figure 3.6: This figure represents the original data set to partition using the symmetric normalized spectral clustering algorithm. The standard deviation of the noise is equal to 0.5.

After running the symmetric normalized spectral clustering algorithm several times with different initial conditions, it seems that it does not lead to satisfying results, in particular with the Gaussian similarity function. However, choosing an appropriate value of the tolerance τ makes the RDT p-value kernel work very well (see figures 3.7 and 3.8 below). As we can see in figure 3.7, the symmetric normalized spectral clustering did not manage to recover the correct clusters for a tolerance between 0 and 1.33. However, for a tolerance between 1.7778 and 4, the results are very satisfying (figure 3.8). To obtain figures 3.7 and 3.8, run the file "SymSCmain.m" in the attachments. As for the random walk normalized spectral clustering, the results are very similar (see figures 3.9, 3.10 and 3.11 below): they are not satisfying for 0 < τ < 1.1111 and 3.3333 < τ < 5 (figures 3.9 and 3.11), whereas a tolerance 1.6667 < τ < 2.7778 leads to the correct clusters (figure 3.10).
CONCLUSION

The experiments show that choosing the p-value of random distortion tests as a similarity function in spectral clustering algorithms leads to promising results. The results were very satisfying when we used the unnormalized spectral clustering algorithm to partition bi-dimensional and tri-dimensional data. In the normalized spectral clustering cases, the experiments showed that an appropriate tolerance must be found in order to obtain satisfying results. We have also proved that in some cases the p-value and the Gaussian kernel are equal. Finding the appropriate value of the tolerance can be a difficult task; this is also true for the parameter σ of the Gaussian kernel. In general, such parameters are chosen in order to optimize certain criteria.
REFERENCES

[1] Dominique Pastor and Quang-Thang Nguyen, Random Distortion Testing and Optimality of Thresholding Tests, IEEE Transactions on Signal Processing, Vol. 61, No. 16, August 15, 2013.
[2] David MacKay (2003), "Chapter 20. An Example Inference Task: Clustering", Information Theory, Inference and Learning Algorithms, Cambridge University Press, pp. 284-292.
[3] Jianbo Shi and Jitendra Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, August 2000.
[4] Ulrike von Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing, 17 (4), 2007.
[5] Fan R. K. Chung (1997), Spectral Graph Theory (Vol. 92 of the CBMS Regional Conference Series in Mathematics).
[6] Lars Hagen and Andrew B. Kahng (1992), New spectral methods for ratio cut partitioning and clustering, IEEE Trans. Computer-Aided Design, 11(9), 1074-1085.
[7] Andrew Y. Ng, Michael I. Jordan and Yair Weiss (2002), On spectral clustering: analysis and an algorithm, in T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14.
Figure 3.7: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 0 < τ < 1.33.

Figure 3.8: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 1.7778 < τ < 4.

Figure 3.9: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 0 < τ < 1.1111.

Figure 3.10: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 1.6667 < τ < 2.7778.

Figure 3.11: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 3.3333 < τ < 5.