INTERNSHIP REPORT
RANDOM DISTORTION TESTS APPLIED IN A
SPECTRAL CLUSTERING CONTEXT
October 19, 2016
Hamza Ameur
Institut Mines-Telecom - Telecom Bretagne
Department of Signal and Communications
Contents
1 Introduction
2 Random distortion testing
   2.1 Introduction
   2.2 Notation
   2.3 Theorem
3 Spectral Clustering
   3.1 Introduction
   3.2 Definitions
   3.3 Graph Laplacian matrices
   3.4 NCut and RatioCut problems
       3.4.1 RatioCut
       3.4.2 NCut
   3.5 Random distortion testing p-value kernel
   3.6 Experiments
References
Chapter 1
Introduction
Spectral clustering algorithms are widely used in machine learning and data partitioning contexts. They consist in building a graph whose vertices represent the data and then partitioning it into k logically separated clusters. By "logically" we mean that the elements of each cluster must be highly connected within the cluster and well separated from the rest of the graph.
The quality of partitioning is based on a judicious choice of the similarity function between
the vertices of the graph. A natural way is to connect all the vertices of the graph and then
weight the edges using a kernel function. The most popular function used in this regard is
the Gaussian kernel.
Recently at Telecom Bretagne, Dominique Pastor and Quang-Thang Nguyen have developed
in [1] a new theory called Random Distortion Testing (RDT). The objective of this theory is to
guarantee any level of false-alarm probability while detecting distortion in disturbed data
using threshold-based tests. In this internship, we will show how this theoretical framework
leads to choosing the similarity function equal to the p-value of random distortion tests.
We will study the partitioning performance of this kernel function compared to the Gaussian
kernel in a context of disturbed data.
ACKNOWLEDGMENT
I am very grateful to Dominique Pastor and Redouane Lguensat, who supervised my one-month internship. Their help with the difficulties I encountered during my work was invaluable.
Chapter 2
Random distortion testing
2.1 INTRODUCTION
In signal processing contexts, we generally observe the signal of interest in additive independent Gaussian noise. In [1] the observed data is supposed to be a vector Y = Θ + X where Θ is a random signal of interest with unknown distribution and X is an additive independent Gaussian noise with positive definite covariance matrix C.
In most signal processing problems we either try to estimate one or more parameters of a certain signal of interest (e.g. its mean or its variance) or to decide whether this signal follows a certain model, as is the case in [1]. In the latter case, threshold-based tests are commonly used to decide between the null hypothesis, which is that the signal actually follows the known model, and the alternative hypothesis, which is that the observed signal is an anomaly. The false-alarm probability is one of the quantities we try to control in such tests. It is defined as the probability of deciding that there is an anomaly whereas the null hypothesis is actually true. The RDT theoretical framework enables us to guarantee any level of this probability regardless of the unknown distribution of the signal of interest. Hence the strength of this framework.
2.2 NOTATION
As we mentioned above, we suppose in [1] that the observed data is a vector Y = Θ + X such that Θ is a random vector of ℝ^d with unknown distribution encoding the signal of interest, and X is an additive independent Gaussian noise with zero mean and positive definite covariance matrix C. Let θ₀ denote a known deterministic model of the signal of interest.
Let H₀ denote the null hypothesis and H₁ the alternative one. We have:

\[ \begin{cases} H_0 : \|\Theta - \theta_0\| \le \tau \\ H_1 : \|\Theta - \theta_0\| > \tau \end{cases} \tag{2.1} \]

where ‖·‖ denotes the Mahalanobis norm defined with respect to the covariance matrix C as:

\[ \forall X \in \mathbb{R}^d : \quad \|X\| = \sqrt{X^T C^{-1} X} \tag{2.2} \]

τ is called the tolerance; it is the maximum value of the distortion ‖Θ − θ₀‖ that we tolerate. By "tolerate" we mean that we decide that Θ still follows the deterministic model θ₀ if the distortion ‖Θ − θ₀‖ is below this value. Otherwise we decide that there is an anomaly.

Finally, given any η ≥ 0, a threshold-based test with threshold height η on the Mahalanobis distance to θ₀ is any measurable map T_η of ℝ^d into {0,1} such that:

\[ \forall y \in \mathbb{R}^d : \quad T_\eta(y) = \begin{cases} 1 & \text{if } \|y - \theta_0\| > \eta \\ 0 & \text{if } \|y - \theta_0\| \le \eta \end{cases} \tag{2.3} \]
2.3 THEOREM
In most signal processing problems, we cannot actually decide whether the distortion ‖Θ − θ₀‖ satisfies H₀ or H₁ with respect to the true value of ‖Θ − θ₀‖. In fact, the observed data Y is a corrupted version of Θ due to the noise X. In [1] it is proven that, for any γ and τ both greater than 0, the false-alarm probability of the threshold-based test T_{λ_γ(τ)} does not exceed γ. Here λ_γ(τ) denotes the unique solution of the equation:

\[ 1 - R(\tau, \lambda_\gamma(\tau)) = \gamma \tag{2.4} \]

where the function R is defined for any ρ and η both greater than 0 as:

\[ R(\rho, \eta) = F_{\chi^2_d(\rho^2)}(\eta^2) \]

where F_{χ²_d(ρ²)} denotes the cumulative distribution function of the noncentral chi-squared distribution with d degrees of freedom and noncentrality parameter ρ². In section 3.5, we will define the p-value of a random distortion test and study its performance as a kernel function in spectral clustering contexts.
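As an illustration of equation 2.4, the threshold λ_γ(τ) can be computed numerically from the quantile function of the noncentral chi-squared distribution. The sketch below is not part of the original report (whose experiments rely on MATLAB scripts in the attachments); it is a minimal Python/SciPy version with the hypothetical function name rdt_threshold, assuming θ₀ = 0 and C = I_d so that the Mahalanobis norm reduces to the Euclidean norm.

import numpy as np
from scipy.stats import ncx2

def rdt_threshold(tau, gamma, d):
    """Solve 1 - R(tau, lam) = gamma for lam, where
    R(rho, eta) = F_{chi2_d(rho^2)}(eta^2) is the noncentral chi-squared CDF.
    Equivalently, lam^2 is the (1 - gamma)-quantile of chi2_d(tau^2)."""
    return np.sqrt(ncx2.ppf(1.0 - gamma, df=d, nc=tau ** 2))

# Monte Carlo check of the false-alarm guarantee (theta_0 = 0, C = I_d):
d, tau, gamma = 2, 1.0, 0.05
lam = rdt_threshold(tau, gamma, d)
rng = np.random.default_rng(0)
theta = tau * np.array([1.0, 0.0])                # signal exactly at the tolerance boundary
y = theta + rng.standard_normal((100000, d))      # Y = Theta + X with X ~ N(0, I_d)
p_fa = np.mean(np.linalg.norm(y, axis=1) > lam)
print(lam, p_fa)   # empirical false-alarm rate, close to gamma and not exceeding it under H0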
Chapter 3
Spectral Clustering
3.1 INTRODUCTION
Data partitioning aims at deciding which sets of data have "similar" properties. These sets are commonly called clusters. One of the most widely used algorithms in data partitioning is k-means clustering. Given a data set {Y₁, Y₂, ..., Y_n}, the objective of this algorithm is to partition it into k (k ≤ n) sets called clusters, {C₁, C₂, ..., C_k}, so as to minimize the quantity:

\[ \sum_{i=1}^{k} \sum_{Y_j \in C_i} \| Y_j - m_i \|^2 \]

where m_i denotes the mean of the cluster C_i. For more information about how to implement k-means clustering see [2]; a minimal implementation is sketched below.
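As a minimal sketch (not taken from [2] or from the report; plain NumPy, with the hypothetical function name kmeans), one Lloyd-type iteration scheme for minimizing the objective above could look as follows.

import numpy as np

def kmeans(Y, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternately assign points to the nearest
    mean and recompute the means, which decreases the k-means objective
    sum_i sum_{Y_j in C_i} ||Y_j - m_i||^2."""
    rng = np.random.default_rng(seed)
    means = Y[rng.choice(len(Y), size=k, replace=False)].astype(float)  # initial means
    for _ in range(n_iter):
        # assignment step: index of the closest mean for every point
        labels = np.argmin(((Y[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: recompute each mean (keep the old one if a cluster empties)
        for i in range(k):
            if np.any(labels == i):
                means[i] = Y[labels == i].mean(axis=0)
    return labels, means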
The spectral clustering approach consists in building a graph where the vertices encode the data and the edges encode the "degree of connection" between these vertices. We generally weight the edges using a certain similarity matrix S where s_ij ≥ 0 represents a certain notion of similarity between the i-th and the j-th elements of the observed data. We then partition the graph into k clusters of vertices. In each cluster, the vertices must be highly connected within the cluster and well separated from the rest of the graph. Since spectral clustering algorithms are used in graph partitioning contexts, spectral clustering is closely related to graph theory. In the next section, we will introduce some notions of graph theory that we find necessary to approach spectral clustering algorithms. Then in section 3.3 we will define the graph Laplacian matrices, which are the main tools of these algorithms. We will see in section 3.4 how these matrices are used to approximate the optimal solutions of the classical NCut and RatioCut problems. Finally, we will define the RDT p-value kernel in section 3.5, discuss its partitioning performance compared to the Gaussian kernel in section 3.6, and conclude.
3.2 DEFINITIONS
Definition 3.2.1 (Graph). A graph is generally defined by a couple G = (V,E). V is called the
set of vertices or nodes and E the set of edges which is a subset of V ×V where A ×B denotes the
Cartesian product defined as: A ×B = {(a,b) | a ∈ A and b ∈ B}. Two vertices x and y are
connected if (x, y) ∈ E.
Definition 3.2.2 (Directed and undirected graphs). A graph G = (V,E) is said to be undi-
rected if for all (x, y) in V ×V , (x, y) ∈ E if and only if (y,x) ∈ E. Otherwise, the graph is said to
be directed. The direction doesn’t have any effect on the connection between the vertices of an
undirected graph.
Definition 3.2.3 (Weighted and unweighted graphs - The degree of a vertex). We call weighted
a graph where we assign to each element (x, y) ∈ E a real number called the weight of the edge
(x, y). When an undirected graph is weighted, the edges (x, y) and (y,x) must be equally-
weighted. We call degree of the vertex i the quantity:
\[ d_i := \sum_{\substack{j \in V \\ (i,j) \in E}} s_{ij} \tag{3.1} \]

where s_ij denotes the weight of the edge (i, j).
Definition 3.2.4 (Simple graphs and multigraphs). A simple graph is a graph that contains
no loops. A loop is an edge that connects a vertex to itself. A graph where the loops are allowed
is called a multigraph.
Definition 3.2.5 (Connected components). Two vertices x and y are linked if there is a "path" of edges that connects them. More formally, there is a positive integer n and a sequence of vertices (x_k)_{k∈{0,1,...,n}} such that x₀ = x, x_n = y and, for all k ∈ {0,1,...,n−1}, x_k and x_{k+1} are connected.
Denote by A a subset of V. A is said to be a connected component if it verifies these two conditions:
i. ∀(x, y) ∈ A × A : x and y are linked by a path
ii. ∀(x, y) ∈ A × (V \ A) : x and y are not linked
Figure 3.1: Graphical representation of a simple, undirected and weighted graph. Here the graph
contains 6 vertices labeled by their indexes {0,1,··· ,5} and 5 weighted edges. As we can find a path
between any couple of vertices, this graph contains only one connected component.
Definition 3.2.6 (Similarity and degree matrices). One of the commonly used ways to represent a graph is the similarity matrix. It is a square matrix S where the entry s_ij is equal to the weight of the edge connecting the i-th and the j-th vertices if they are connected. Otherwise, the entry s_ij is null. The degree matrix is defined as D := diag(d₁, ..., d_n) where d_i denotes the degree of the vertex i.
Example : the similarity and degree matrices corresponding to the graph represented in
figure 3.1 above:
\[ S = \begin{pmatrix}
0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 3.5 & 0 & 0 & 0 \\
2 & 3.5 & 0 & 0.1 & 0 & 0 \\
0 & 0 & 0.1 & 0 & 10 & 3 \\
0 & 0 & 0 & 10 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 & 0
\end{pmatrix}
\quad \text{and} \quad
D = \begin{pmatrix}
2 & 0 & 0 & 0 & 0 & 0 \\
0 & 3.5 & 0 & 0 & 0 & 0 \\
0 & 0 & 5.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 13.1 & 0 & 0 \\
0 & 0 & 0 & 0 & 10 & 0 \\
0 & 0 & 0 & 0 & 0 & 3
\end{pmatrix} \]
In the following, let G = (V, E) denote a simple, undirected and weighted graph where all the edges have non-negative weights. Thus, the matrix S is symmetric and its diagonal elements are null. Let n = |V| denote the number of its vertices. Figure 3.1 shows an example of such a graph; a small numerical sketch of its similarity and degree matrices is given below.
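As a small numerical illustration (not in the original report; plain NumPy, using the edge weights read off figure 3.1), the similarity and degree matrices of the example graph can be built as follows.

import numpy as np

# Undirected weighted edges of the example graph: (i, j, weight)
edges = [(0, 2, 2.0), (1, 2, 3.5), (2, 3, 0.1), (3, 4, 10.0), (3, 5, 3.0)]

n = 6
S = np.zeros((n, n))
for i, j, w in edges:
    S[i, j] = S[j, i] = w        # symmetric: the graph is undirected

D = np.diag(S.sum(axis=1))       # d_i = sum_j s_ij
print(np.diag(D))                # -> [ 2.   3.5  5.6 13.1 10.   3. ]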
3.3 GRAPH LAPLACIAN MATRICES
The graph Laplacian matrices are along with similarity matrices the key elements of spectral
clustering and spectral graph theory (see [5] for an overview of this theory). The three main
types are the unnormalized (it is also called standard), the symmetric and the random walk
normalized graph Laplacians.
Let D and S respectively denote the degree and similarity matrices of the graph G.
Unnormalized graph Laplacian: L := D − S, with form

\[ L = \begin{pmatrix} d_1 & -s_{12} & \cdots & -s_{1n} \\ -s_{21} & d_2 & \cdots & -s_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -s_{n1} & -s_{n2} & \cdots & d_n \end{pmatrix} \]

Symmetric normalized graph Laplacian: L_sym := D^{-1/2} L D^{-1/2} = I_n − D^{-1/2} S D^{-1/2}, with form

\[ L_{sym} = \begin{pmatrix} 1 & \frac{-s_{12}}{\sqrt{d_1 d_2}} & \cdots & \frac{-s_{1n}}{\sqrt{d_1 d_n}} \\ \frac{-s_{21}}{\sqrt{d_2 d_1}} & 1 & \cdots & \frac{-s_{2n}}{\sqrt{d_2 d_n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{-s_{n1}}{\sqrt{d_n d_1}} & \frac{-s_{n2}}{\sqrt{d_n d_2}} & \cdots & 1 \end{pmatrix} \]

Random walk normalized graph Laplacian: L_rw := D^{-1} L = I_n − D^{-1} S, with form

\[ L_{rw} = \begin{pmatrix} 1 & \frac{-s_{12}}{d_1} & \cdots & \frac{-s_{1n}}{d_1} \\ \frac{-s_{21}}{d_2} & 1 & \cdots & \frac{-s_{2n}}{d_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{-s_{n1}}{d_n} & \frac{-s_{n2}}{d_n} & \cdots & 1 \end{pmatrix} \]

Table 3.1: Definition and form of the graph Laplacian matrices.
L_sym and L_rw are said to be normalized because the entries L_ij of the unnormalized graph Laplacian are divided either by the degree of the i-th vertex (L_rw) or by the square root of the product of the latter and the degree of the j-th vertex (L_sym) (see table 3.1 above).
Note that the forms given in table 3.1 are not necessarily the same for all types of graphs. First, recall that G is a simple graph. The weights s_ii are therefore null and do not appear in the diagonal entries. Second, if there is an isolated vertex i, all the terms s_ij and s_ji with j ∈ {1, ..., n} would be null and so would be d_i. In that case, D^{-1} would be substituted by the generalized inverse of D to avoid divisions by zero while computing L_sym and L_rw; the i-th row would then be null in both L_sym and L_rw, and so would be the i-th column of L_sym. A short numerical sketch computing the three Laplacians is given below.
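As a minimal sketch (not part of the report; plain NumPy, and assuming every vertex has a non-zero degree so that D is invertible), the three Laplacians of table 3.1 can be computed from a similarity matrix S as follows.

import numpy as np

def graph_laplacians(S):
    """Return (L, L_sym, L_rw) for a symmetric similarity matrix S with
    non-negative weights and no isolated vertex (all degrees > 0)."""
    d = S.sum(axis=1)                    # vertex degrees
    D = np.diag(d)
    L = D - S                            # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt  # = I - D^{-1/2} S D^{-1/2}
    L_rw = np.diag(1.0 / d) @ L          # = I - D^{-1} S
    return L, L_sym, L_rw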
Now that we have defined graph Laplacian matrices, it is time to show some of their
mathematical properties that will be helpful to understand spectral clustering algorithms.
Proposition 3.3.1 (Properties of L).

i. For all f = (f₁, ..., f_n)^T ∈ ℝ^n:
\[ f^T L f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij} (f_i - f_j)^2 . \]

ii. L is a symmetric positive semi-definite matrix with n real non-negative eigenvalues.

iii. 0 is an eigenvalue of L with the constant one vector 1 as a corresponding eigenvector.

iv. The number of connected components in the graph G equals the multiplicity of the eigenvalue 0. Let A₁, ..., A_k denote the connected components of the graph G. Then the eigenspace of 0 is spanned by the indicator vectors of these subsets 1_{A₁}, ..., 1_{A_k}:
\[ 1_{A_j}(i) = \begin{cases} 1 & \text{if } i \in A_j \\ 0 & \text{if } i \notin A_j \end{cases} \]

Proof.

i. Let f = (f₁, ..., f_n)^T be a vector of ℝ^n. Then:
\[ f^T L f = f^T D f - f^T S f = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} s_{ij} f_i f_j = \frac{1}{2} \left( \sum_{i=1}^{n} d_i f_i^2 - 2 \sum_{i,j=1}^{n} s_{ij} f_i f_j + \sum_{j=1}^{n} d_j f_j^2 \right) \]
According to the definition of d_i, we have d_i = Σ_{j=1}^{n} s_ij = Σ_{j=1}^{n} s_ji because the graph is undirected and therefore its similarity matrix S is symmetric. Thus:
\[ f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} s_{ij} \left( f_i^2 - 2 f_i f_j + f_j^2 \right) = \frac{1}{2} \sum_{i,j=1}^{n} s_{ij} (f_i - f_j)^2 \]

ii. L = D − S. G is undirected, so S is symmetric. D is diagonal and therefore symmetric. Then L is symmetric as a difference of two symmetric matrices. We have supposed that all the weights are non-negative, so it follows from i. that L is positive semi-definite and therefore has n real non-negative eigenvalues.

iii. The sum of each row of L is null by the definition of the degree of a vertex. Thus, 0 is an eigenvalue of L and the constant one vector 1 is a corresponding eigenvector.

iv. See section 3.1 in [4].
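Property iv can be checked numerically. The following sketch is not from the report; it uses NumPy on a hypothetical two-component toy graph (the example graph of figure 3.1 plus an isolated pair of extra vertices) and counts the near-zero eigenvalues of L.

import numpy as np

# Two-component toy graph: the 6-vertex example graph of figure 3.1
# plus a separate pair of vertices {6, 7} connected only to each other.
n = 8
S = np.zeros((n, n))
for i, j, w in [(0, 2, 2.0), (1, 2, 3.5), (2, 3, 0.1),
                (3, 4, 10.0), (3, 5, 3.0), (6, 7, 1.0)]:
    S[i, j] = S[j, i] = w

L = np.diag(S.sum(axis=1)) - S
eigvals = np.linalg.eigvalsh(L)                  # real eigenvalues, ascending
n_components = np.sum(np.abs(eigvals) < 1e-10)
print(n_components)                              # -> 2, one zero eigenvalue per component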
Proposition 3.3.2 (Properties of L_sym and L_rw).

i. For all f = (f₁, ..., f_n)^T ∈ ℝ^n:
\[ f^T L_{sym} f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2 \]

ii. L_sym is symmetric positive semi-definite with n real non-negative eigenvalues.

iii. λ is an eigenvalue of L_rw with eigenvector u if and only if λ is an eigenvalue of L_sym with eigenvector w = D^{1/2} u.

iv. λ is an eigenvalue of L_rw with eigenvector u if and only if λ and u solve the generalized eigenproblem Lu = λDu.

v. 0 is an eigenvalue of L_rw with eigenvector the constant one vector 1, and an eigenvalue of L_sym with eigenvector D^{1/2} 1.

vi. The number of connected components of the graph G is equal to the multiplicity of the eigenvalue 0 of L_rw and L_sym. Let A₁, ..., A_k denote the connected components of the graph G. Then the eigenspace of 0 of L_rw is spanned by the indicator vectors of these subsets 1_{A₁}, ..., 1_{A_k}. For L_sym, the eigenspace of 0 is spanned by the vectors D^{1/2} 1_{A₁}, ..., D^{1/2} 1_{A_k}.

Proof. Here we suppose that all the vertices have non-zero degrees.

i. Let f = (f₁, ..., f_n)^T be a vector of ℝ^n. Then:
\[ f^T L_{sym} f = f^T D^{-1/2} L D^{-1/2} f = (D^{-1/2} f)^T L (D^{-1/2} f) = g^T L g \quad \text{where } g = D^{-1/2} f \]
Since g = (f₁/√d₁, ..., f_n/√d_n)^T, the statement follows from i. of proposition 3.3.1.

ii. L_sym^T = (D^{-1/2} L D^{-1/2})^T = D^{-1/2} L^T D^{-1/2} = D^{-1/2} L D^{-1/2} = L_sym. Then L_sym is symmetric (hence the name of this graph Laplacian). It follows from i. that L_sym is positive semi-definite.

iii. Let λ be an eigenvalue of L_rw with eigenvector u. Then L_rw u = λu and D^{-1} L u = λu. Thus D^{-1/2} L u = λ D^{1/2} u and D^{-1/2} L D^{-1/2} D^{1/2} u = λ D^{1/2} u. Therefore L_sym D^{1/2} u = λ D^{1/2} u and λ is an eigenvalue of L_sym with eigenvector w = D^{1/2} u. Conversely, if λ is an eigenvalue of L_sym with eigenvector w, then L_sym w = λw. Multiplying by D^{-1/2} from the left gives D^{-1} L D^{-1/2} w = λ D^{-1/2} w, and therefore L_rw u = λu with u = D^{-1/2} w.

iv. λ is an eigenvalue of L_rw with eigenvector u if and only if D^{-1} L u = λu, if and only if Lu = λDu.

v. It follows from proposition 3.3.1 that L1 = 0, hence D^{-1} L 1 = 0 and L_rw 1 = 0. The second part of the statement follows from iii.

vi. It follows from iv. of proposition 3.3.1 and the previous statements of proposition 3.3.2.
3.4 NCUT AND RATIOCUT PROBLEMS
Given a graph G = (V,E) and a positive integer k, the NCut and RatioCut problems aim
at finding a partition A1,··· , Ak of the set V . A good graph partitioning must satisfy two
conditions. The first one is maximizing the within-cluster connectivity and the second is
minimizing the connectivity between different clusters. Sections 5 and 8.5 in [4] were very
useful for us to understand the approach of these two graph partitioning problems.
Given two subsets A and B of V, let Ā, |A|, vol(A) and W(A,B) denote the following quantities:

\[ \bar{A} := \text{the complement of } A \]
\[ |A| := \text{the number of vertices in } A \]
\[ \mathrm{vol}(A) := \sum_{i \in A} d_i \]
\[ W(A,B) := \sum_{i \in A, \, j \in B} s_{ij} \]

The quantities |A| and vol(A) are two ways to measure the size of the subset A, whereas the quantity W(A,B) measures the connectivity between the two subsets A and B. A simple way to find a suitable partition is to find the subsets A₁, ..., A_k that minimize Cut(A₁, ..., A_k) defined as:

\[ \mathrm{Cut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A_i}) \tag{3.2} \]
Recall that in this section the graph G is still supposed to be undirected, simple and with non-negative edge weights. Thus, minimizing the non-negative quantity Cut(A₁, ..., A_k) amounts to minimizing the connectivity between the clusters A_i and the rest of the graph. As for the factor 1/2, it is introduced to avoid counting each edge twice, since G is undirected.
However, this method is not really appealing because it doesn’t impose any condition
upon the size of the clusters. For example when k = 2, it often leads to separating one vertex
from the rest of the graph (more specifically the vertex that has the lowest degree).
The NCut and RatioCut problems (introduced respectively by Shi and Malik in [3] and by Hagen and Kahng in [6]) measure the size of a subset by its volume and by its number of vertices respectively, in order to overcome the problem of "badly-sized" clusters. In the next two subsections, we will define these problems, see how the graph Laplacians are used to relax them, and introduce the spectral clustering algorithms that correspond to each of them.
3.4.1 RatioCut
Let us keep the same notation as above. The RatioCut is defined with respect to the partition A₁, ..., A_k as:

\[ \mathrm{RatioCut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{|A_i|} = \sum_{i=1}^{k} \frac{\mathrm{Cut}(A_i, \bar{A_i})}{|A_i|} \tag{3.3} \]

The objective of this problem is to find the partition that minimizes the quantity (3.3). Here, it is clear that a subset with too few vertices would raise the RatioCut; that is how the RatioCut avoids the problem of badly-sized subsets. The only issue is that the RatioCut minimization problem is NP-hard, hence the necessity of a relaxation method.
Case k = 2
Let us start with the case k = 2, which is easy to express and understand. Here we try to find a subset A that minimizes RatioCut(A, Ā) = Cut(A, Ā)/|A| + Cut(A, Ā)/|Ā|. Let A denote a subset of V and f a vector of ℝ^n defined as f = (f₁, ..., f_n)^T such that:

\[ f_i = \begin{cases} \sqrt{|\bar{A}| / |A|} & \text{if } i \in A \\ -\sqrt{|A| / |\bar{A}|} & \text{if } i \notin A \end{cases} \tag{3.4} \]
According to proposition 3.3.1:

\[ f^T L f = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} s_{ij}(f_i - f_j)^2 = \frac{1}{2} \sum_{i \in A, \, j \in \bar{A}} s_{ij}(f_i - f_j)^2 + \frac{1}{2} \sum_{i \in \bar{A}, \, j \in A} s_{ij}(f_i - f_j)^2 \]

In fact, the sums over i ∈ A, j ∈ A and over i ∈ Ā, j ∈ Ā are null because the entries f_i and f_j are equal in both sums. Then:

\[ f^T L f = \frac{1}{2} \sum_{i \in A, \, j \in \bar{A}} s_{ij} \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2 + \frac{1}{2} \sum_{i \in \bar{A}, \, j \in A} s_{ij} \left( -\sqrt{\frac{|A|}{|\bar{A}|}} - \sqrt{\frac{|\bar{A}|}{|A|}} \right)^2 = \mathrm{Cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right) \]

Thus:

\[ f^T L f = \mathrm{Cut}(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right) = |V| \, \mathrm{RatioCut}(A, \bar{A}) \tag{3.5} \]
Let ⟨·,·⟩ denote the canonical inner product of ℝ^n and ‖·‖₂ the Euclidean norm. We can notice that:

\[ \langle f, 1 \rangle = \sum_{i=1}^{n} f_i = \sum_{i \in A} f_i + \sum_{i \in \bar{A}} f_i = |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = 0, \]

so f is orthogonal to 1. We can also notice that:

\[ \|f\|_2^2 = \sum_{i=1}^{n} f_i^2 = \sum_{i \in A} f_i^2 + \sum_{i \in \bar{A}} f_i^2 = |A| \frac{|\bar{A}|}{|A|} + |\bar{A}| \frac{|A|}{|\bar{A}|} = n. \]

Then the RatioCut minimization problem in the case k = 2 can be expressed as finding:

\[ \min_{\substack{A \subset V \\ f \perp 1, \ \|f\|_2 = \sqrt{n}}} f^T L f \]

where f can only take entries f_i as defined in equation 3.4, which is a discrete optimization problem since the entries f_i are only allowed to take two different values. The relaxation of this problem consists in allowing these entries to take any real value. The relaxed RatioCut minimization problem can therefore be written as follows:

\[ \min_{\substack{f \in \mathbb{R}^n \\ f \perp 1, \ \|f\|_2 = \sqrt{n}}} f^T L f \tag{3.6} \]

which can be easily solved using basic linear algebra. In fact, the solution of this relaxed problem is the eigenvector corresponding to the second smallest eigenvalue of the standard graph Laplacian matrix (remember that the smallest is zero, with eigenvector 1). This can be proven using the Rayleigh-Ritz theorem. Thus, this eigenvector can be used to approximate the RatioCut minimizing solution. Note that the latter is not necessarily the optimal solution. In fact, the entries of the optimal solution must be as defined in equation 3.4, whereas the solution of the relaxed problem is an arbitrary real-valued vector. We should therefore find a way to transform it into a discrete indicator vector so as to classify the vertices accordingly.
A simple way consists in using the sign of the entry f_i of the solution vector to decide whether the vertex i is in A or not. But most spectral clustering algorithms use k-means clustering instead; the sign-based heuristic is too simple, mainly when we have to deal with more than two clusters. We will see below how to relax a k ≥ 2 RatioCut minimization problem using the properties of the standard graph Laplacian L.
General case k ≥ 2
Let A₁, ..., A_k denote a partition of V and h₁, ..., h_k denote k vectors of ℝ^n defined as h_j = (h_{1j}, ..., h_{nj})^T where:

\[ h_{ij} = \begin{cases} \dfrac{1}{\sqrt{|A_j|}} & \text{if } i \in A_j \\[2mm] 0 & \text{if } i \notin A_j \end{cases} \tag{3.7} \]

Let H denote the n-by-k matrix whose entries are equal to h_{ij}. Since A₁, ..., A_k is supposed to be a partition of V, each row of H contains only one non-zero entry, namely the j-th one if i ∈ A_j. The factor 1/√|A_j| is used in order to normalize the indicator vectors so as to have ⟨h_i, h_j⟩ = δ_{ij}, where δ_{ij} denotes the Kronecker delta, which is equal to one if i = j and null otherwise. We can notice that H^T H = I_k. We have:

\[ h_l^T L h_l = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} s_{ij}(h_{il} - h_{jl})^2 = \frac{1}{2} \sum_{i \in A_l, \, j \in \bar{A_l}} s_{ij}(h_{il} - h_{jl})^2 + \frac{1}{2} \sum_{i \in \bar{A_l}, \, j \in A_l} s_{ij}(h_{il} - h_{jl})^2 \]

Thus:

\[ h_l^T L h_l = \frac{1}{2} \sum_{i \in A_l, \, j \in \bar{A_l}} s_{ij} \left( \frac{1}{\sqrt{|A_l|}} \right)^2 + \frac{1}{2} \sum_{i \in \bar{A_l}, \, j \in A_l} s_{ij} \left( -\frac{1}{\sqrt{|A_l|}} \right)^2 = \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|} \]

Let E_l denote the l-th element of the canonical basis of ℝ^k. We then have:

\[ (H^T L H)_{ll} = E_l^T H^T L H E_l = (H E_l)^T L (H E_l) = h_l^T L h_l \]

Thus (H^T L H)_{ll} = h_l^T L h_l = Cut(A_l, Ā_l)/|A_l|. We finally obtain:

\[ \mathrm{RatioCut}(A_1, \ldots, A_k) = \sum_{l=1}^{k} \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|} = \sum_{l=1}^{k} (H^T L H)_{ll} = \mathrm{Tr}(H^T L H) \tag{3.8} \]

where Tr denotes the trace of a matrix. According to equation 3.8, the k ≥ 2 RatioCut minimization problem can be rewritten as follows:

\[ \min_{\substack{A_1, \ldots, A_k \subset V \\ H^T H = I_k}} \mathrm{Tr}(H^T L H) \tag{3.9} \]
such that H can only take entries h_{ij} as defined in 3.7. As in the case k = 2, this is an NP-hard discrete optimization problem. To relax it, we allow the entries h_{ij} of the matrix H to take any real value. Thus, the relaxed problem can be rewritten as follows:

\[ \min_{\substack{H \in \mathbb{R}^{n \times k} \\ H^T H = I_k}} \mathrm{Tr}(H^T L H) \tag{3.10} \]

According to the Rayleigh-Ritz theorem, the solution of the relaxed problem is the matrix U containing the k first eigenvectors of the standard graph Laplacian. By "k first eigenvectors" we mean the k first column vectors of the orthogonal matrix P such that:

\[ L = P \, \mathrm{diag}(\lambda_1, \ldots, \lambda_n) \, P^T \quad \text{with} \quad \lambda_1 \le \cdots \le \lambda_n \]

Again, the solution of the relaxed problem does not necessarily contain discrete indicator vectors like the ones defined in 3.7, hence the necessity of a clustering algorithm. Most spectral clustering algorithms apply k-means clustering to the rows of U. Let Y₁, ..., Y_n denote the n rows of U. The outputs are clusters C₁, ..., C_k, and we classify the vertices by choosing i ∈ A_j if Y_i ∈ C_j. In fact, two vertices i and j belong to the same cluster if and only if the corresponding rows Y_i and Y_j of the matrix U belong to the same cluster as well. To understand this equivalence, consider the simple case where the solution U is optimal. Then its column vectors would be similar to the indicator vectors defined in 3.7. It is clear that two vertices i and j belong to the subset A_l if and only if h_{il} = h_{jl}, and therefore the i-th and j-th rows are equal as well.
The spectral clustering algorithm corresponding to the RatioCut minimization problem is the unnormalized spectral clustering described below:

Algorithm 1 Unnormalized spectral clustering
1: Input: n vectors X₁, ..., X_n of ℝ^d encoding the data and the number of clusters k to construct
2: Output: Clusters A₁, ..., A_k
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the standard graph Laplacian L
5: Compute v₁, ..., v_k, the k first eigenvectors of L
6: U ← the matrix containing v₁, ..., v_k as columns
7: for i = 1, ..., n do
8:     Y_i ← the i-th row of U
9: Cluster the vectors Y₁, ..., Y_n of ℝ^k into clusters C₁, ..., C_k
10: A_j = {X_i | Y_i ∈ C_j}
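As a concrete sketch of Algorithm 1 (this is not the report's MATLAB script USCmain.m; it is a hypothetical Python/NumPy transcription assuming a fully connected Gaussian similarity graph and scikit-learn's KMeans for the last step), the unnormalized spectral clustering could be written as follows.

import numpy as np
from sklearn.cluster import KMeans  # k-means used in the final step of Algorithm 1

def unnormalized_spectral_clustering(X, k, sigma=1.0):
    """Algorithm 1 with a fully connected Gaussian similarity graph."""
    # Step 3: similarity matrix (Gaussian kernel, zero diagonal for a simple graph)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    # Step 4: unnormalized Laplacian L = D - S
    L = np.diag(S.sum(axis=1)) - S
    # Steps 5-6: the k eigenvectors associated with the k smallest eigenvalues
    _, eigvecs = np.linalg.eigh(L)        # eigh returns eigenvalues in ascending order
    U = eigvecs[:, :k]
    # Steps 7-10: cluster the rows of U with k-means
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)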
3.4.2 NCut
As in the case of the RatioCut, we keep the notation of section 3.4. The NCut is then defined with respect to the partition A₁, ..., A_k as:

\[ \mathrm{NCut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{\mathrm{vol}(A_i)} = \sum_{i=1}^{k} \frac{\mathrm{Cut}(A_i, \bar{A_i})}{\mathrm{vol}(A_i)} \tag{3.11} \]

The NCut problem measures the size of a subset A ⊂ V by the quantity vol(A). Recall that vol(A) is equal to the sum of the degrees of the vertices of A. It is clear that a small value of vol(A_i) will raise the value of the NCut; this is how the NCut problem avoids badly-sized subsets (see section 3.4 above). The advantage of using vol(A) instead of |A| is that vol(A) measures how strong the connectivity between the vertices of A is, whereas |A| does not give any information about this connectivity. Another advantage of the NCut over the RatioCut is that minimizing the NCut tends to minimize the probability that a random walker goes from one cluster to another; see proposition 5 in section 6 of [4] for more explanations.
As in the case of the RatioCut problem, the objective here is to minimize the quantity (3.11), which is an NP-hard problem. We will see below how the graph Laplacian matrices are used to relax it.
Case k = 2
The objective here is to find a subset A ⊂ V that minimizes NCut(A, Ā) = Cut(A, Ā)/vol(A) + Cut(A, Ā)/vol(Ā). Given a subset A ⊂ V, let f = (f₁, ..., f_n)^T be a vector of ℝ^n such that:

\[ f_i = \begin{cases} \sqrt{\mathrm{vol}(\bar{A}) / \mathrm{vol}(A)} & \text{if } i \in A \\ -\sqrt{\mathrm{vol}(A) / \mathrm{vol}(\bar{A})} & \text{if } i \notin A \end{cases} \tag{3.12} \]

As we did in the RatioCut case, we can check that:

\[ f^T L f = \mathrm{vol}(V) \, \mathrm{NCut}(A, \bar{A}), \qquad f^T D f = \mathrm{vol}(V), \qquad (D f)^T 1 = 0 \]

Then we can write the NCut minimization problem:

\[ \min_{\substack{A \subset V \\ D f \perp 1, \ f^T D f = \mathrm{vol}(V)}} f^T L f \tag{3.13} \]
such that f can only take entries f_i as defined in 3.12. Again, we allow the entries f_i to take any real value in order to relax the NP-hard NCut minimization problem. The relaxed problem can therefore be written as follows:

\[ \min_{\substack{f \in \mathbb{R}^n \\ D f \perp 1, \ f^T D f = \mathrm{vol}(V)}} f^T L f \tag{3.14} \]

It can also be rewritten in an equivalent way using the substitution g := D^{1/2} f:

\[ \min_{\substack{g \in \mathbb{R}^n \\ g \perp D^{1/2} 1, \ \|g\|_2^2 = \mathrm{vol}(V)}} g^T D^{-1/2} L D^{-1/2} g = \min_{\substack{g \in \mathbb{R}^n \\ g \perp D^{1/2} 1, \ \|g\|_2^2 = \mathrm{vol}(V)}} g^T L_{sym} g \tag{3.15} \]

According to propositions 3.3.1 and 3.3.2, D^{1/2} 1 is the first eigenvector of L_sym. Then, using the Rayleigh-Ritz theorem, we can conclude that the solution g of the relaxed problem is equal to the second eigenvector of L_sym. If we re-substitute f = D^{-1/2} g, we can notice, using proposition 3.3.2, that f is equal to the second eigenvector of L_rw, or equivalently of the generalized eigenproblem L f = λ D f.
After finding the solution of the relaxed problem, we carry on the classification process the same way we did in the case of the RatioCut minimization problem.
General case k ≥ 2
As in the case of the RatioCut, given a partition A₁, ..., A_k of the set of vertices V, we define the vectors h₁, ..., h_k of ℝ^n where h_j = (h_{1j}, ..., h_{nj})^T such that:

\[ h_{ij} = \begin{cases} \dfrac{1}{\sqrt{\mathrm{vol}(A_j)}} & \text{if } i \in A_j \\[2mm] 0 & \text{if } i \notin A_j \end{cases} \tag{3.16} \]

Again let H denote the n-by-k matrix containing the vectors h₁, ..., h_k as columns. We can check that (H^T L H)_{ll} = h_l^T L h_l = Cut(A_l, Ā_l)/vol(A_l) and therefore:

\[ \mathrm{NCut}(A_1, \ldots, A_k) = \sum_{l=1}^{k} \frac{\mathrm{Cut}(A_l, \bar{A_l})}{\mathrm{vol}(A_l)} = \sum_{l=1}^{k} (H^T L H)_{ll} = \mathrm{Tr}(H^T L H) \tag{3.17} \]

We can also see that H^T D H = I_k. Therefore the NCut minimization problem can be re-expressed as:

\[ \min_{\substack{A_1, \ldots, A_k \subset V \\ H^T D H = I_k}} \mathrm{Tr}(H^T L H) \tag{3.18} \]

where H can only take the entries h_{ij} as defined in 3.16. Again, the relaxation consists in allowing these entries to take any real value. The relaxed problem can then be rewritten as follows:

\[ \min_{\substack{H \in \mathbb{R}^{n \times k} \\ H^T D H = I_k}} \mathrm{Tr}(H^T L H) \tag{3.19} \]

or, after the substitution F := D^{1/2} H:

\[ \min_{\substack{F \in \mathbb{R}^{n \times k} \\ F^T F = I_k}} \mathrm{Tr}(F^T D^{-1/2} L D^{-1/2} F) = \min_{\substack{F \in \mathbb{R}^{n \times k} \\ F^T F = I_k}} \mathrm{Tr}(F^T L_{sym} F) \tag{3.20} \]

As in the case of the RatioCut problem, the Rayleigh-Ritz theorem allows us to conclude that the solution of the relaxed problem is given by the matrix F containing the first k eigenvectors of the symmetric normalized graph Laplacian L_sym. If we re-substitute H = D^{-1/2} F, we can notice, using proposition 3.3.2, that H is equal to the matrix containing the k first eigenvectors of L_rw, or equivalently of the generalized eigenproblem L f = λ D f.
The spectral clustering algorithms corresponding to the NCut minimization problem are the random walk normalized spectral clustering introduced by Shi and Malik in [3] and the symmetric normalized spectral clustering introduced by Ng, Jordan and Weiss in [7].

Algorithm 2 Random walk normalized spectral clustering (Shi and Malik [3])
1: Input: n vectors X₁, ..., X_n of ℝ^d encoding the data and the number of clusters k to construct
2: Output: Clusters A₁, ..., A_k
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the standard graph Laplacian L
5: Compute v₁, ..., v_k, the k first eigenvectors of the generalized eigenproblem Lu = λDu (which are also the k first eigenvectors of L_rw)
6: U ← the matrix containing v₁, ..., v_k as columns
7: for i = 1, ..., n do
8:     Y_i ← the i-th row of U
9: Cluster the vectors Y₁, ..., Y_n of ℝ^k into clusters C₁, ..., C_k
10: A_j = {X_i | Y_i ∈ C_j}

Algorithm 3 Symmetric normalized spectral clustering (Ng, Jordan and Weiss [7])
1: Input: n vectors X₁, ..., X_n of ℝ^d encoding the data and the number of clusters k to construct
2: Output: Clusters A₁, ..., A_k
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the symmetric normalized graph Laplacian L_sym
5: Compute v₁, ..., v_k, the k first eigenvectors of L_sym
6: U ← the matrix containing v₁, ..., v_k as columns
7: Compute T, the n-by-k matrix obtained by normalizing the rows of U:
8: for i = 1, ..., n and j = 1, ..., k do
9:     t_{ij} ← u_{ij} / sqrt(Σ_{p=1}^{k} u_{ip}²)
10: for i = 1, ..., n do Y_i ← the i-th row of T
11: Cluster the vectors Y₁, ..., Y_n of ℝ^k into clusters C₁, ..., C_k
12: A_j = {X_i | Y_i ∈ C_j}
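Similarly, a hypothetical Python/NumPy transcription of Algorithm 3 (again not the report's SymSCmain.m script; a fully connected Gaussian similarity graph and scikit-learn's KMeans are assumed) could look like this.

import numpy as np
from sklearn.cluster import KMeans

def symmetric_normalized_spectral_clustering(X, k, sigma=1.0):
    """Algorithm 3 (Ng, Jordan and Weiss) with a fully connected Gaussian graph."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ S @ D_inv_sqrt   # L_sym = I - D^{-1/2} S D^{-1/2}
    _, eigvecs = np.linalg.eigh(L_sym)                      # ascending eigenvalues
    U = eigvecs[:, :k]
    T = U / np.linalg.norm(U, axis=1, keepdims=True)        # row normalization (step 9)
    return KMeans(n_clusters=k, n_init=10).fit_predict(T)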
Note that the normalization step in the symmetric normalized spectral clustering aims at avoiding certain computational problems related to the finite numerical precision of the computations. Note also that, in their experiments, the authors of [3] and [7] used the fully connected graph with the Gaussian kernel as a similarity graph. We will do the same in the next section, because the two other options (see section 3.5) require a choice of additional parameters; this choice is generally based on certain heuristics, which might have a negative impact on the experiments.
3.5 RANDOM DISTORTION TESTING p-VALUE KERNEL
In all spectral clustering algorithms, the first step consists in building a similarity graph. Recall that data partitioning is based on the notion of similarity: a cluster must contain data that are "similar" within the cluster and "dissimilar" to the rest of the data set. A natural way to model the similarity within the data consists in building a graph whose vertices encode the data. The weights of the connections are then chosen to be proportional to the degree of "similarity" between the vertices. A kernel function (also called similarity function) is a map from the set of edges E to the set of real numbers; it assigns to each pair of connected vertices a real number called the weight. In practice, the data set is supposed to contain vectors of ℝ^d. Thus we can say that two vectors are similar if the distance between them (e.g. the Euclidean distance) is "small". Therefore, the smaller the distance between two vertices, the greater the weight of the edge connecting them.
The most popular kernel function is the Gaussian kernel, which is defined with respect to a parameter σ and two d-dimensional vectors x_i and x_j as:

\[ s_{ij} := \exp\left( - \frac{\| x_i - x_j \|_2^2}{2 \sigma^2} \right) \tag{3.21} \]

The Gaussian kernel is obviously a positive, decreasing function of the Euclidean distance. The parameter σ plays the role of a standard deviation, hence its notation. We can also notice that, like a probability, the Gaussian kernel takes values between 0 and 1.
Given a set of d-dimensional vectors and a kernel function, there are different ways to build a similarity graph (a short sketch of these constructions is given below):
• The fully connected graph: we connect all the vectors and then weight the edges using the kernel function.
• The k-nearest neighbors graph: this method consists in connecting each vector to its k nearest vectors in terms of the Euclidean distance. We then weight the edges using the kernel function.
• The ε-neighborhood graph: we connect the vectors x_i and x_j if ‖x_i − x_j‖₂ ≤ ε and then we weight the edges using the kernel function.
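The sketch below is not from the report; it is plain NumPy with the hypothetical helper name similarity_matrix, and it illustrates the fully connected and ε-neighborhood constructions for an arbitrary kernel function.

import numpy as np

def similarity_matrix(X, kernel, eps=None):
    """Build a similarity matrix from data X (n x d) and a kernel function.
    kernel(xi, xj) returns the weight of edge (i, j).  With eps=None the graph
    is fully connected; otherwise only pairs with ||xi - xj||_2 <= eps are kept
    (the epsilon-neighborhood graph)."""
    n = len(X)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(X[i] - X[j])
            if eps is None or dist <= eps:
                S[i, j] = S[j, i] = kernel(X[i], X[j])   # undirected: symmetric weights
    return S

# Example: fully connected graph with the Gaussian kernel of equation 3.21
sigma = 1.0
gaussian = lambda xi, xj: np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))
# S = similarity_matrix(X, gaussian)           # X is an (n, d) data array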
In this internship we study the impact on the partitioning performance of choosing the kernel function equal to the p-value of random distortion testing. First of all, let us give a formal definition of the p-value of a random distortion test.

Definition 3.5.1 (p-value of a random distortion test). The p-value of a random distortion test is defined with respect to its tolerance τ, its corresponding deterministic model θ₀, its corresponding noise covariance matrix C and a vector y of ℝ^d as:

\[ \hat{\gamma}(y) = \inf\{ \gamma \in (0,1) : T_{\lambda_\gamma(\tau)}(y) = 1 \} = \inf\{ \gamma \in (0,1) : \lambda_\gamma(\tau) < \| y - \theta_0 \| \} \]

Like the Gaussian kernel, the RDT p-value takes values between 0 and 1 and is thus homogeneous to a probability. It can be seen as the optimal value of the false-alarm probability. In fact, given a fixed value of the tolerance τ, it is proven in [1] that λ_γ(τ) is a strictly decreasing function of γ. Hence the p-value is optimal in the sense that it is neither so large that false alarms are always raised, nor so small that true outliers are missed. Given a vector y and a tolerance τ, it follows from the definition of the p-value that a small value of the latter corresponds to a high value of λ_{γ̂(y)}(τ) and therefore to a high value of the distortion ‖y − θ₀‖, and vice versa.
Now let us consider the case where θ₀ = 0_{ℝ^d}, the zero vector. Then, given two vectors y_i and y_j of ℝ^d, a small value of γ̂(y_i − y_j) corresponds to a high value of ‖y_i − y_j‖ and vice versa. This is the first reason why we thought of using the p-value as a similarity (kernel) function. Another reason is that when d = 2, τ = 0 and the covariance matrix C = I₂, the p-value and the Gaussian kernel are equal. In fact, since λ_γ(τ) is a strictly decreasing function of γ, γ̂(y) satisfies:

\[ \lambda_{\hat{\gamma}(y)}(\tau) = \| y - \theta_0 \| \tag{3.22} \]

It then follows from equation 2.4 (see section 2.3) that:

\[ \hat{\gamma}(y) = 1 - R(\tau, \lambda_{\hat{\gamma}(y)}(\tau)) = 1 - R(\tau, \| y - \theta_0 \|) = Q_{d/2}(\tau, \| y - \theta_0 \|) \tag{3.23} \]

where Q_M denotes the Marcum Q-function, defined with respect to two real numbers x and y as:

\[ Q_M(x, y) = \exp\left( -\frac{x^2 + y^2}{2} \right) \sum_{k=1-M}^{\infty} \left( \frac{x}{y} \right)^k I_k(x y) \tag{3.24} \]

where I_k denotes the modified Bessel function of the first kind of order k. Thus, it follows from 3.23 and 3.24 that in the case τ = 0, d = 2 and C = I₂:

\[ \| y - \theta_0 \| = \sqrt{(y - \theta_0)^T I_2 (y - \theta_0)} = \sqrt{(y - \theta_0)^T (y - \theta_0)} = \| y - \theta_0 \|_2 \]

and:

\[ \hat{\gamma}(y) = Q_1(0, \| y - \theta_0 \|_2) = \exp\left( -\frac{\| y - \theta_0 \|_2^2}{2} \right) \sum_{k=0}^{\infty} \left( \frac{0}{\| y - \theta_0 \|_2} \right)^k I_k(0) = \exp\left( -\frac{\| y - \theta_0 \|_2^2}{2} \right) \tag{3.25} \]

It follows from these calculations that the p-value can be considered as a generalization of the Gaussian kernel.
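As a sketch of how this kernel can be evaluated (not from the report; Python/SciPy, assuming θ₀ = 0 and C = σ² I_d so that the Mahalanobis norm reduces to a scaled Euclidean norm), equation 3.23 can be computed through the noncentral chi-squared survival function, since Q_{d/2}(τ, x) = 1 − F_{χ²_d(τ²)}(x²).

import numpy as np
from scipy.stats import ncx2, chi2

def rdt_pvalue_kernel(yi, yj, tau, sigma=1.0):
    """RDT p-value similarity between two vectors, with theta_0 = 0 and
    C = sigma^2 * I_d, so the Mahalanobis norm is ||yi - yj||_2 / sigma.
    Returns gamma_hat(yi - yj) = 1 - F_{chi2_d(tau^2)}(||yi - yj||^2)."""
    d = len(yi)
    maha = np.linalg.norm(yi - yj) / sigma
    if tau == 0.0:
        return chi2.sf(maha ** 2, df=d)              # central case, as in equation 3.25
    return ncx2.sf(maha ** 2, df=d, nc=tau ** 2)

# Sanity check of equation 3.25: for d = 2, tau = 0, sigma = 1 the p-value
# coincides with the Gaussian kernel exp(-||yi - yj||^2 / 2).
yi, yj = np.array([0.3, -1.2]), np.array([1.0, 0.5])
print(rdt_pvalue_kernel(yi, yj, tau=0.0),
      np.exp(-np.sum((yi - yj) ** 2) / 2.0))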
3.6 EXPERIMENTS
In this section we compare the clustering performance of the p-value and the Gaussian kernel in the context of disturbed data. In order to visualize the clustering results, we only consider bi-dimensional and tri-dimensional data. We still suppose in this section that the data set contains vectors Y = Θ + X of ℝ^d, where Θ denotes the vector of interest and X ∼ N(0, σ² I_d) denotes the noise vector.
We will start with the unnormalized spectral clustering where the results are very satisfying.
Figure 3.2: Two circles with additive noise drawn from the normal distribution. A good clustering
must separate the two circles. The number of points in each circle is equal to 50.
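As a rough sketch of how such a test set can be generated (not the report's script; the radii and noise level below are assumed for illustration, since the report only states the number of points per circle and the Gaussian noise model), two noisy concentric circles can be drawn as follows.

import numpy as np

def two_noisy_circles(n_per_circle=50, radii=(1.0, 3.0), noise_std=0.2, seed=0):
    """Sample n_per_circle points on each of two concentric circles and add
    Gaussian noise N(0, noise_std^2 I_2), in the spirit of figure 3.2."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for label, r in enumerate(radii):
        angles = rng.uniform(0.0, 2.0 * np.pi, n_per_circle)
        circle = r * np.column_stack([np.cos(angles), np.sin(angles)])
        points.append(circle + noise_std * rng.standard_normal((n_per_circle, 2)))
        labels.append(np.full(n_per_circle, label))
    return np.vstack(points), np.concatenate(labels)

X, true_labels = two_noisy_circles()   # ready for the clustering sketches above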
Figure 3.3: On the left, the resulting clustering using the RDT p-value kernel. In the middle and on the right, the resulting clustering of the Gaussian kernel with the parameter σ respectively equal to the noise standard deviation and to twice that value.
As we can see in figure 3.3, the performances of the Gaussian kernel and of the RDT p-value are both satisfying. In fact, the number of points in each circle is large enough to lead to good clustering results.
Figure 3.4: The figure on the left shows the resulting clustering using the RDT p-value, whereas the two others show the clustering results of the Gaussian kernel (16 points in each circle).
In figure 3.4 above we see the impact of reducing the number of points on the clustering performance. Here, the RDT p-value leads to far better results than the Gaussian kernel does. In figure 3.5 below, we try the two similarity functions on tri-dimensional data. Here, a good clustering must separate the two helices. Again we notice that the RDT p-value leads to interesting results.
Figure 3.5: The figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel.
The script we used to obtain all the figures above is available in the attachments under the name "USCmain.m". Put a breakpoint between the two blocks before running the script, and try changing N, the number of points (line 81), to see its impact on the clustering results.
In the following we will see that the symmetric and random walk normalized spectral clustering algorithms do not always lead to good clustering results. But as we will notice in the figures, there is an interval of tolerance values over which the RDT p-value kernel leads to the correct clusters.
Figure 3.6: This figure represents the original data set to partition using the symmetric normalized spectral clustering algorithm. The standard deviation of the noise is equal to 0.5.
After running the symmetric normalized spectral clustering algorithm several times with different initial conditions, it seems that it does not lead to satisfying results, in particular with the Gaussian similarity function. However, choosing an appropriate value of the tolerance τ makes the RDT p-value kernel work very well (see figures 3.7 and 3.8 below). As we can see in figure 3.7, the symmetric normalized spectral clustering did not manage to reach the correct clusters for a tolerance between 0 and 1.33. However, for a tolerance between 1.7778 and 4, the results are very satisfying (figure 3.8). To obtain figures 3.7 and 3.8, run the file "SymSCmain.m" in the attachments.
As for the random walk normalized spectral clustering, the results were very similar (see figures 3.9, 3.10 and 3.11 below): they are not satisfying for 0 < τ < 1.1111 and 3.3333 < τ < 5 (figures 3.9 and 3.11), whereas a tolerance 1.6667 < τ < 2.7778 leads to the correct clusters (figure 3.10).
CONCLUSION
The experiments show that choosing the p-value of random distortion tests as a similarity function in spectral clustering algorithms leads to promising results. In fact, the results were very satisfying when we used the unnormalized spectral clustering algorithm to partition bi-dimensional and tri-dimensional data. In the normalized spectral clustering cases, the experiments showed that we must find an appropriate tolerance to reach satisfying results. We have also proved that in some cases the p-value and the Gaussian kernel are equal.
Finding the appropriate value of the tolerance can be a difficult task. This is also true for the parameter σ of the Gaussian kernel. But in general, such parameters are chosen so as to optimize certain criteria.
REFERENCES
[1] D. Pastor and Q.-T. Nguyen, "Random Distortion Testing and Optimality of Thresholding Tests", IEEE Transactions on Signal Processing, Vol. 61, No. 16, August 15, 2013.
[2] D. MacKay, "Chapter 20. An Example Inference Task: Clustering", Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003, pp. 284-292.
[3] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, August 2000.
[4] U. von Luxburg, "A Tutorial on Spectral Clustering", Statistics and Computing, 17 (4), 2007.
[5] F. R. K. Chung, Spectral Graph Theory, Vol. 92 of the CBMS Regional Conference Series in Mathematics, 1997.
[6] L. Hagen and A. B. Kahng, "New spectral methods for ratio cut partitioning and clustering", IEEE Trans. Computer-Aided Design, 11(9), 1992, pp. 1074-1085.
[7] A. Y. Ng, M. I. Jordan and Y. Weiss, "On spectral clustering: analysis and an algorithm", in T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, 2002.
Figure 3.7: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 0 < τ < 1.33.

Figure 3.8: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 1.7778 < τ < 4.

Figure 3.9: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 0 < τ < 1.1111.

Figure 3.10: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 1.6667 < τ < 2.7778.

Figure 3.11: In each row, the figure on the left shows the resulting clustering using the RDT p-value, while the two others show the clustering results of the Gaussian kernel. Here 3.3333 < τ < 5.
Page 31 of 31

Weitere ähnliche Inhalte

Was ist angesagt?

A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...IJDKP
 
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...ijfls
 
Tensor representations in signal processing and machine learning (tutorial ta...
Tensor representations in signal processing and machine learning (tutorial ta...Tensor representations in signal processing and machine learning (tutorial ta...
Tensor representations in signal processing and machine learning (tutorial ta...Tatsuya Yokota
 
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...ijdpsjournal
 
Second or fourth-order finite difference operators, which one is most effective?
Second or fourth-order finite difference operators, which one is most effective?Second or fourth-order finite difference operators, which one is most effective?
Second or fourth-order finite difference operators, which one is most effective?Premier Publishers
 
Solving the Poisson Equation
Solving the Poisson EquationSolving the Poisson Equation
Solving the Poisson EquationShahzaib Malik
 
International journal of engineering issues vol 2015 - no 2 - paper5
International journal of engineering issues   vol 2015 - no 2 - paper5International journal of engineering issues   vol 2015 - no 2 - paper5
International journal of engineering issues vol 2015 - no 2 - paper5sophiabelthome
 
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...IJERA Editor
 
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...Adam Fausett
 
Trust Region Algorithm - Bachelor Dissertation
Trust Region Algorithm - Bachelor DissertationTrust Region Algorithm - Bachelor Dissertation
Trust Region Algorithm - Bachelor DissertationChristian Adom
 
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...ijsrd.com
 
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...sipij
 
Ch2 probability and random variables pg 81
Ch2 probability and random variables pg 81Ch2 probability and random variables pg 81
Ch2 probability and random variables pg 81Prateek Omer
 
Icitam2019 2020 book_chapter
Icitam2019 2020 book_chapterIcitam2019 2020 book_chapter
Icitam2019 2020 book_chapterBan Bang
 
Robust Fuzzy Data Clustering In An Ordinal Scale Based On A Similarity Measure
Robust Fuzzy Data Clustering In An Ordinal Scale Based On A Similarity MeasureRobust Fuzzy Data Clustering In An Ordinal Scale Based On A Similarity Measure
Robust Fuzzy Data Clustering In An Ordinal Scale Based On A Similarity MeasureIJRES Journal
 
hankel_norm approximation_fir_ ijc
hankel_norm approximation_fir_ ijchankel_norm approximation_fir_ ijc
hankel_norm approximation_fir_ ijcVasilis Tsoulkas
 

Was ist angesagt? (19)

Numerical Solution and Stability Analysis of Huxley Equation
Numerical Solution and Stability Analysis of Huxley EquationNumerical Solution and Stability Analysis of Huxley Equation
Numerical Solution and Stability Analysis of Huxley Equation
 
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
 
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...
 
Tensor representations in signal processing and machine learning (tutorial ta...
Tensor representations in signal processing and machine learning (tutorial ta...Tensor representations in signal processing and machine learning (tutorial ta...
Tensor representations in signal processing and machine learning (tutorial ta...
 
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
AN EFFICIENT PARALLEL ALGORITHM FOR COMPUTING DETERMINANT OF NON-SQUARE MATRI...
 
Second or fourth-order finite difference operators, which one is most effective?
Second or fourth-order finite difference operators, which one is most effective?Second or fourth-order finite difference operators, which one is most effective?
Second or fourth-order finite difference operators, which one is most effective?
 
Solving the Poisson Equation
Solving the Poisson EquationSolving the Poisson Equation
Solving the Poisson Equation
 
Intro
IntroIntro
Intro
 
International journal of engineering issues vol 2015 - no 2 - paper5
International journal of engineering issues   vol 2015 - no 2 - paper5International journal of engineering issues   vol 2015 - no 2 - paper5
International journal of engineering issues vol 2015 - no 2 - paper5
 
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
A Mathematical Model to Solve Nonlinear Initial and Boundary Value Problems b...
 
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
 
Trust Region Algorithm - Bachelor Dissertation
Trust Region Algorithm - Bachelor DissertationTrust Region Algorithm - Bachelor Dissertation
Trust Region Algorithm - Bachelor Dissertation
 
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
Sensitivity Analysis of GRA Method for Interval Valued Intuitionistic Fuzzy M...
 
NFSFIXES
NFSFIXESNFSFIXES
NFSFIXES
 
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Ch...
 
Ch2 probability and random variables pg 81
certain model, as is the case in [1]. In the latter case, threshold-based tests are commonly used to decide between the null hypothesis, namely that the signal actually follows the known model, and the alternative hypothesis, namely that the observed signal is an anomaly. The false-alarm probability is one of the quantities we try to control in such tests. It is defined as the probability of deciding that there is an anomaly whereas the null hypothesis actually holds. The RDT theoretical framework enables us to guarantee any level of this probability regardless of the unknown distribution of the signal of interest. Hence the strength of this framework.

2.2 NOTATION

As mentioned above, we suppose in [1] that the observed data is a vector Y = Θ + X such that Θ is a random vector of R^d with unknown distribution encoding the signal of interest, whereas X is an additive independent Gaussian noise with zero mean and positive definite covariance matrix C. Let θ0 denote a known deterministic model of the signal of interest.
Let H0 denote the null hypothesis and H1 the alternative one. We have:

$$
\begin{cases}
H_0 : \|\Theta - \theta_0\| \le \tau \\
H_1 : \|\Theta - \theta_0\| > \tau
\end{cases}
\tag{2.1}
$$

where ‖·‖ denotes the Mahalanobis norm defined with respect to the covariance matrix C as:

$$
\forall x \in \mathbb{R}^d : \quad \|x\| = \sqrt{x^T C^{-1} x}
\tag{2.2}
$$

τ is called the tolerance; it is the maximum value of the distortion ‖Θ − θ0‖ that we tolerate. By "tolerate" we mean that we decide that Θ still follows the deterministic model θ0 if the distortion ‖Θ − θ0‖ is below this value. Otherwise we decide that there is an anomaly. Finally, given any η ≥ 0, a threshold-based test with threshold height η on the Mahalanobis distance to θ0 is any measurable map Tη of R^d into {0,1} such that:

$$
\forall y \in \mathbb{R}^d : \quad T_\eta(y) =
\begin{cases}
1 & \text{if } \|y - \theta_0\| > \eta \\
0 & \text{if } \|y - \theta_0\| \le \eta
\end{cases}
\tag{2.3}
$$

2.3 THEOREM

In most signal processing problems, we cannot decide whether the distortion ‖Θ − θ0‖ satisfies H0 or H1 on the basis of its true value. In fact, the observed data Y might be a corrupted version of Θ due to the noise X. In [1] it is proven that for any γ and τ both greater than 0, the false-alarm probability of the threshold-based test T_{λγ(τ)} is at most γ. Here λγ(τ) denotes the unique solution of the equation:

$$
1 - R(\tau, \lambda_\gamma(\tau)) = \gamma
\tag{2.4}
$$

where the function R is defined for any ρ and η both greater than 0 as R(ρ, η) = F_{χ²_d(ρ²)}(η²), and F_{χ²_d(ρ²)} denotes the cumulative distribution function of the noncentral chi-squared distribution with d degrees of freedom and non-centrality parameter ρ². In section 3.5, we will define the p-value of a random distortion test and study its performance as a kernel function in spectral clustering contexts.
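To make the relation between γ, τ and λγ(τ) concrete, here is a minimal numerical sketch (not the authors' code, and assuming R(ρ, η) is indeed the noncentral chi-squared CDF described above) that inverts equation (2.4) with SciPy:

```python
# Sketch of the RDT threshold lambda_gamma(tau), assuming
# R(rho, eta) = F_{chi2_d(rho^2)}(eta^2) as in equation (2.4).
import numpy as np
from scipy.stats import ncx2

def rdt_threshold(gamma, tau, d):
    """Return lambda_gamma(tau): the unique eta >= 0 such that 1 - R(tau, eta) = gamma."""
    # 1 - F(eta^2) = gamma  <=>  eta^2 = F^{-1}(1 - gamma)
    return np.sqrt(ncx2.ppf(1.0 - gamma, df=d, nc=tau**2))

# Example: the test T_{lambda_gamma(tau)} then has false-alarm probability at most gamma.
print(rdt_threshold(gamma=0.05, tau=1.0, d=2))
```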
Chapter 3
Spectral Clustering

3.1 INTRODUCTION

Data partitioning aims at deciding which sets of data have "similar" properties. These sets are commonly called clusters. One of the most widely used algorithms in data partitioning is k-means clustering. Given a data set {Y1, Y2, ..., Yn}, the objective is to partition it into k (k ≤ n) sets called clusters, {C1, C2, ..., Ck}, so as to minimize the quantity:

$$
\sum_{i=1}^{k} \sum_{Y_j \in C_i} \| Y_j - m_i \|^2
$$

where mi denotes the mean of the cluster Ci. For more information about how to implement k-means clustering, see [2]; a minimal implementation is sketched below.

The spectral clustering approach consists in building a graph where the vertices encode the data and the edges encode the "degree of connection" between these vertices. We generally weight the edges using a similarity matrix S, where sij ≥ 0 represents a notion of similarity between the i-th and the j-th elements of the observed data. We then partition the graph into k clusters of vertices. In each cluster, the vertices must be highly connected within the cluster and well separated from the rest of the graph.

Since spectral clustering algorithms are used in graph partitioning contexts, spectral clustering is closely related to graph theory. In the next section, we introduce the notions of graph theory necessary to approach spectral clustering algorithms. Then in section 3.3 we define the graph Laplacian matrices, which are the main tools of these algorithms. We will see in section 3.4 how these matrices are used to approximate the optimal solutions of the classical NCut and RatioCut problems. Finally, we will discuss in section 3.5 the partitioning performance of the RDT p-value compared to the Gaussian kernel and conclude.
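The following is a minimal sketch of Lloyd's iteration for the k-means objective above. It is not the implementation of [2], only an illustration under simple assumptions (random initialization, a fixed number of iterations, no empty-cluster handling):

```python
# Minimal k-means (Lloyd's algorithm) sketch for the objective above.
import numpy as np

def kmeans(Y, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=k, replace=False)]   # random initial means
    for _ in range(n_iter):
        # assign each point to the nearest center
        labels = np.argmin(((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its cluster
        # (no empty-cluster handling: this is only a sketch)
        centers = np.array([Y[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```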
3.2 DEFINITIONS

Definition 3.2.1 (Graph). A graph is generally defined by a couple G = (V, E). V is called the set of vertices (or nodes) and E the set of edges, which is a subset of V × V, where A × B denotes the Cartesian product defined as A × B = {(a, b) | a ∈ A and b ∈ B}. Two vertices x and y are connected if (x, y) ∈ E.

Definition 3.2.2 (Directed and undirected graphs). A graph G = (V, E) is said to be undirected if for all (x, y) in V × V, (x, y) ∈ E if and only if (y, x) ∈ E. Otherwise, the graph is said to be directed. The direction has no effect on the connection between the vertices of an undirected graph.

Definition 3.2.3 (Weighted and unweighted graphs - The degree of a vertex). We call weighted a graph where we assign to each element (x, y) ∈ E a real number called the weight of the edge (x, y). When an undirected graph is weighted, the edges (x, y) and (y, x) must be equally weighted. We call degree of the vertex i the quantity:

$$
d_i := \sum_{\substack{j \in V \\ (i,j) \in E}} s_{ij}
\tag{3.1}
$$

where sij denotes the weight of the edge (i, j).

Definition 3.2.4 (Simple graphs and multigraphs). A simple graph is a graph that contains no loops. A loop is an edge that connects a vertex to itself. A graph where loops are allowed is called a multigraph.

Definition 3.2.5 (Connected components). Two vertices x and y are linked if there is a "path" of edges that connects them. More formally, there is a positive integer n and a sequence of vertices (x_k), k ∈ {0, 1, ..., n}, such that x_0 = x, x_n = y and, for all k ∈ {0, 1, ..., n − 1}, x_k and x_{k+1} are connected. Let A denote a subset of V. A is said to be a connected component if it satisfies these two conditions:

i. ∀(x, y) ∈ A × A : x and y are linked by a path
ii. ∀(x, y) ∈ A × (V \ A) : x and y are not linked
Figure 3.1: Graphical representation of a simple, undirected and weighted graph. The graph contains 6 vertices labeled by their indexes {0, 1, ..., 5} and 5 weighted edges. As we can find a path between any couple of vertices, this graph contains only one connected component.

Definition 3.2.6 (Similarity and degree matrices). One of the commonly used ways to represent a graph is the similarity matrix. It is a square matrix S whose entry sij is equal to the weight of the edge connecting the i-th and the j-th vertices if they are connected. Otherwise, the entry sij is zero. The degree matrix is defined as D := diag(d1, ..., dn), where di denotes the degree of the vertex i.

Example: the similarity and degree matrices corresponding to the graph represented in figure 3.1 are:

$$
S =
\begin{pmatrix}
0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 3.5 & 0 & 0 & 0 \\
2 & 3.5 & 0 & 0.1 & 0 & 0 \\
0 & 0 & 0.1 & 0 & 10 & 3 \\
0 & 0 & 0 & 10 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 & 0
\end{pmatrix}
\quad\text{and}\quad
D =
\begin{pmatrix}
2 & 0 & 0 & 0 & 0 & 0 \\
0 & 3.5 & 0 & 0 & 0 & 0 \\
0 & 0 & 5.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 13.1 & 0 & 0 \\
0 & 0 & 0 & 0 & 10 & 0 \\
0 & 0 & 0 & 0 & 0 & 3
\end{pmatrix}
$$

In the following, let G = (V, E) denote a simple, undirected and weighted graph where all the edges have non-negative weights. Thus, the matrix S is symmetric and its diagonal elements are zero. Let n = |V| denote the number of its vertices. Figure 3.1 represents an example of such a graph.
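As a sanity check, here is a short sketch (assuming the edge weights of figure 3.1 listed above) that builds S and D numerically:

```python
# Build the similarity and degree matrices of the example graph of figure 3.1.
import numpy as np

n = 6
S = np.zeros((n, n))
# undirected weighted edges (i, j, weight) read from figure 3.1
for i, j, w in [(0, 2, 2.0), (1, 2, 3.5), (2, 3, 0.1), (3, 4, 10.0), (3, 5, 3.0)]:
    S[i, j] = S[j, i] = w          # S is symmetric for an undirected graph

D = np.diag(S.sum(axis=1))          # D = diag(d_1, ..., d_n)
print(np.diag(D))                    # [ 2.   3.5  5.6 13.1 10.   3. ]
```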
3.3 GRAPH LAPLACIAN MATRICES

The graph Laplacian matrices are, along with similarity matrices, the key elements of spectral clustering and spectral graph theory (see [5] for an overview of this theory). The three main types are the unnormalized (also called standard), the symmetric normalized and the random walk normalized graph Laplacians. Let D and S respectively denote the degree and similarity matrices of the graph G.

- Unnormalized graph Laplacian: L := D − S, with entries (L)ii = di and (L)ij = −sij for i ≠ j.
- Symmetric normalized graph Laplacian: Lsym := D^{−1/2} L D^{−1/2} = In − D^{−1/2} S D^{−1/2}, with entries (Lsym)ii = 1 and (Lsym)ij = −sij / √(di dj) for i ≠ j.
- Random walk normalized graph Laplacian: Lrw := D^{−1} L = In − D^{−1} S, with entries (Lrw)ii = 1 and (Lrw)ij = −sij / di for i ≠ j.

Table 3.1: Definition and form of graph Laplacian matrices.

Lsym and Lrw are said to be normalized because the entries Lij of the unnormalized graph Laplacian are divided either by the degree of the i-th vertex (Lrw) or by the square root of the product of the latter and the degree of the j-th vertex (Lsym); see table 3.1 above. Note that the forms given in table 3.1 are not necessarily the same for all types of graphs. First, recall that G is a simple graph: the weights sii are therefore zero and do not appear in the diagonal entries. Second, if there is an isolated vertex i, all the terms sij and sji with j ∈ {1, ..., n} would be zero and so would be di. In that case, D^{−1} would be replaced by the generalized inverse of D to avoid divisions by zero while computing Lsym and Lrw; the i-th row would then be zero in both Lsym and Lrw, and so would be the i-th column of Lsym.

Now that we have defined the graph Laplacian matrices, it is time to show some of their mathematical properties that will be helpful to understand spectral clustering algorithms.
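A small sketch (assuming every vertex has a non-zero degree, as noted above) computing the three Laplacians of table 3.1 from a similarity matrix:

```python
# Compute the unnormalized, symmetric normalized and random walk normalized Laplacians.
import numpy as np

def graph_laplacians(S):
    d = S.sum(axis=1)                            # vertex degrees
    D = np.diag(d)
    L = D - S                                    # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt          # I - D^{-1/2} S D^{-1/2}
    L_rw = np.diag(1.0 / d) @ L                  # I - D^{-1} S
    return L, L_sym, L_rw
```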
Proposition 3.3.1 (Properties of L).

i. For all f = (f1, ..., fn)^T ∈ R^n:
$$
f^T L f = \frac{1}{2} \sum_{1 \le i \le n} \sum_{1 \le j \le n} s_{ij} (f_i - f_j)^2
$$
ii. L is a symmetric positive semi-definite matrix with n real non-negative eigenvalues.
iii. 0 is an eigenvalue of L with the constant one vector 1 as a corresponding eigenvector.
iv. The number of connected components of the graph G equals the multiplicity of the eigenvalue 0. Let A1, ..., Ak denote the connected components of the graph G. Then the eigenspace of 0 is spanned by the indicator vectors 1_{A1}, ..., 1_{Ak} of these subsets, where 1_{Aj}(i) = 1 if i ∈ Aj and 0 if i ∉ Aj.

Proof.
i. Let f = (f1, ..., fn)^T be a vector of R^n. Then:
$$
f^T L f = f^T D f - f^T S f = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} s_{ij} f_i f_j
= \frac{1}{2} \left( \sum_{i=1}^{n} d_i f_i^2 - 2 \sum_{i,j=1}^{n} s_{ij} f_i f_j + \sum_{j=1}^{n} d_j f_j^2 \right)
$$
According to the definition of di, we have $d_i = \sum_{j=1}^{n} s_{ij} = \sum_{j=1}^{n} s_{ji}$, because the graph is undirected and therefore its similarity matrix S is symmetric. Thus:
$$
f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} s_{ij} \left( f_i^2 - 2 f_i f_j + f_j^2 \right) = \frac{1}{2} \sum_{i,j=1}^{n} s_{ij} (f_i - f_j)^2
$$
ii. L = D − S. G is undirected, so S is symmetric; D is diagonal and therefore symmetric. Then L is symmetric as a difference of two symmetric matrices. We have supposed that all the weights are non-negative, so it follows from i. that L is positive semi-definite and therefore has n real non-negative eigenvalues.
iii. The sum of each row of L is zero by the definition of the degree of a vertex. Thus, 0 is an eigenvalue of L and the constant one vector 1 is a corresponding eigenvector.
iv. See section 3.1 in [4].
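A quick numerical illustration of property iv., on a toy graph made of two disjoint triangles (the example graph is an assumption chosen here, not taken from the report):

```python
# Property iv.: multiplicity of the eigenvalue 0 of L = number of connected components.
import numpy as np

S = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:   # two disjoint triangles
    S[i, j] = S[j, i] = 1.0

L = np.diag(S.sum(axis=1)) - S
eigvals = np.linalg.eigvalsh(L)                 # L is symmetric: real eigenvalues
print(np.sum(np.isclose(eigvals, 0.0)))         # prints 2: two connected components
```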
Proposition 3.3.2 (Properties of Lsym and Lrw).

i. For all f = (f1, ..., fn)^T ∈ R^n:
$$
f^T L_{sym} f = \frac{1}{2} \sum_{1 \le i \le n} \sum_{1 \le j \le n} s_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2
$$
ii. Lsym is symmetric positive semi-definite with n real non-negative eigenvalues.
iii. λ is an eigenvalue of Lrw with eigenvector u if and only if λ is an eigenvalue of Lsym with eigenvector w = D^{1/2} u.
iv. λ is an eigenvalue of Lrw with eigenvector u if and only if λ and u solve the generalized eigenproblem Lu = λDu.
v. 0 is an eigenvalue of Lrw with the constant one vector 1 as eigenvector, and an eigenvalue of Lsym with eigenvector D^{1/2} 1.
vi. The number of connected components of the graph G is equal to the multiplicity of the eigenvalue 0 of Lrw and of Lsym. Let A1, ..., Ak denote the connected components of the graph G. Then the eigenspace of 0 of Lrw is spanned by the indicator vectors 1_{A1}, ..., 1_{Ak} of these subsets. For Lsym, the eigenspace of 0 is spanned by the vectors D^{1/2} 1_{A1}, ..., D^{1/2} 1_{Ak}.

Proof. Here we suppose that all the vertices have non-zero degrees.
i. Let f = (f1, ..., fn)^T be a vector of R^n. Then:
$$
f^T L_{sym} f = f^T D^{-1/2} L D^{-1/2} f = (D^{-1/2} f)^T L (D^{-1/2} f) = g^T L g,
\quad\text{where } g = D^{-1/2} f = \left( \frac{f_1}{\sqrt{d_1}}, \dots, \frac{f_n}{\sqrt{d_n}} \right)^T.
$$
The statement then follows from i. of proposition 3.3.1.
ii. $L_{sym}^T = (D^{-1/2} L D^{-1/2})^T = D^{-1/2} L^T D^{-1/2} = D^{-1/2} L D^{-1/2} = L_{sym}$. Then Lsym is symmetric (hence the name of this graph Laplacian). It follows from i. that Lsym is positive semi-definite.
iii. Let λ be an eigenvalue of Lrw with eigenvector u. Then Lrw u = λu and D^{−1} L u = λu. Thus, D^{−1/2} L u = λ D^{1/2} u and D^{−1/2} L D^{−1/2} (D^{1/2} u) = λ D^{1/2} u. Therefore, Lsym D^{1/2} u = λ D^{1/2} u and λ is an eigenvalue of Lsym with eigenvector w = D^{1/2} u. Conversely, if λ is an eigenvalue of Lsym with eigenvector w, then Lsym w = λw. Multiplying by D^{−1/2} from the left gives D^{−1} L D^{−1/2} w = λ D^{−1/2} w and therefore Lrw u = λu with u = D^{−1/2} w.
iv. λ is an eigenvalue of Lrw with eigenvector u if and only if D^{−1} L u = λu, if and only if Lu = λDu.
v. It follows from proposition 3.3.1 that L1 = 0, hence D^{−1} L 1 = 0 and Lrw 1 = 0. The second part of the statement follows from iii.
vi. It follows from iv. of proposition 3.3.1 and the previous statements of proposition 3.3.2.

3.4 NCUT AND RATIOCUT PROBLEMS

Given a graph G = (V, E) and a positive integer k, the NCut and RatioCut problems aim at finding a partition A1, ..., Ak of the set V. A good graph partitioning must satisfy two conditions: the first is maximizing the within-cluster connectivity and the second is minimizing the connectivity between different clusters. Sections 5 and 8.5 in [4] were very useful for us to understand the approach of these two graph partitioning problems. Given two subsets A and B of V, let Ā, |A|, vol(A) and W(A, B) denote the following quantities:

- Ā := the complement of A in V
- |A| := the number of vertices in A
- vol(A) := $\sum_{i \in A} d_i$
- W(A, B) := $\sum_{i \in A, j \in B} s_{ij}$

The quantities |A| and vol(A) are two ways to measure the size of the subset A, whereas the quantity W(A, B) measures the connectivity between the two subsets A and B. A simple way to find a suitable partition is to find the subsets A1, ..., Ak that minimize Cut(A1, ..., Ak), defined as:

$$
\mathrm{Cut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A_i})
\tag{3.2}
$$

Recall that in this section the graph G is still supposed to be undirected, simple and with positively-weighted edges. Thus, minimizing the non-negative quantity Cut(A1, ..., Ak) is the same as minimizing the connectivity between the clusters Ai and the rest of the graph. As for the factor 1/2, it is introduced because G is undirected, so that each edge between two different clusters is not counted twice. A short sketch computing these quantities is given below.
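Here is a minimal sketch (written for this report's notation; the partition is assumed to be given as a list of vertex-index arrays) of the quantities |A|, vol(A), W(A, B) and Cut defined above:

```python
# Sketch of |A|, vol(A), W(A, B) and Cut(A_1, ..., A_k) for a similarity matrix S.
import numpy as np

def W(S, A, B):
    """Sum of the weights of the edges between the vertex sets A and B."""
    return S[np.ix_(A, B)].sum()

def vol(S, A):
    """Sum of the degrees of the vertices of A."""
    return S[A, :].sum()

def cut(S, partition):
    """Cut(A_1, ..., A_k) = 1/2 * sum_i W(A_i, complement of A_i), cf. eq. (3.2)."""
    n = S.shape[0]
    return 0.5 * sum(W(S, A, np.setdiff1d(np.arange(n), A)) for A in partition)
```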
However, this method is not really appealing because it does not impose any condition on the size of the clusters. For example, when k = 2, it often leads to separating one vertex from the rest of the graph (more specifically, the vertex with the lowest degree). The NCut and RatioCut problems (introduced respectively by Shi and Malik in [3] and Hagen and Kahng in [6]) use the degree of a subset and the number of its vertices, respectively, to overcome the problem of "badly-sized" clusters. In the next two subsections, we will define these problems, see how the graph Laplacians are used to relax them and introduce the spectral clustering algorithms that correspond to each of them.

3.4.1 RatioCut

We keep the same notation as above. The RatioCut is defined with respect to the partition A1, ..., Ak as:

$$
\mathrm{RatioCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{|A_i|} = \sum_{i=1}^{k} \frac{\mathrm{Cut}(A_i, \bar{A_i})}{|A_i|}
\tag{3.3}
$$

The objective of this problem is to find the partition that minimizes the quantity 3.3. Here, it is clear that a subset with too few vertices would raise the RatioCut; that is how the RatioCut avoids the problem of badly-sized subsets. The only issue is that the RatioCut problem is NP-hard, hence the necessity of a relaxation method.

Case k = 2

Let us start with the case k = 2, which is easy to express and understand. Here we try to find a subset A that minimizes RatioCut(A, Ā) = Cut(A, Ā)/|A| + Cut(Ā, A)/|Ā|. Let A denote a subset of V and f a vector of R^n defined as f = (f1, ..., fn)^T with:

$$
f_i =
\begin{cases}
\sqrt{|\bar{A}| / |A|}, & \text{if } i \in A \\
-\sqrt{|A| / |\bar{A}|}, & \text{if } i \notin A
\end{cases}
\tag{3.4}
$$

According to proposition 3.3.1:

$$
f^T L f = \frac{1}{2} \sum_{1 \le i, j \le n} s_{ij} (f_i - f_j)^2
= \frac{1}{2} \sum_{i \in A, j \in \bar{A}} s_{ij} (f_i - f_j)^2 + \frac{1}{2} \sum_{i \in \bar{A}, j \in A} s_{ij} (f_i - f_j)^2
$$
Indeed, the sums over i, j ∈ A and over i, j ∈ Ā are null, because the entries fi and fj are equal within each of these sums. Then:

$$
f^T L f
= \frac{1}{2} \sum_{i \in A, j \in \bar{A}} s_{ij} \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2
+ \frac{1}{2} \sum_{i \in \bar{A}, j \in A} s_{ij} \left( -\sqrt{\frac{|A|}{|\bar{A}|}} - \sqrt{\frac{|\bar{A}|}{|A|}} \right)^2
= \mathrm{Cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right)
$$

Thus:

$$
f^T L f = \mathrm{Cut}(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right) = |V| \, \mathrm{RatioCut}(A, \bar{A})
\tag{3.5}
$$

Let ⟨·,·⟩ denote the canonical inner product of R^n and ‖·‖₂ the Euclidean norm. We can notice that:

$$
\langle f, \mathbb{1} \rangle = \sum_{i=1}^{n} f_i = \sum_{i \in A} f_i + \sum_{i \in \bar{A}} f_i
= |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = 0,
$$

so f is orthogonal to 1. We can also notice that:

$$
\|f\|_2^2 = \sum_{i=1}^{n} f_i^2 = \sum_{i \in A} f_i^2 + \sum_{i \in \bar{A}} f_i^2
= |A| \, \frac{|\bar{A}|}{|A|} + |\bar{A}| \, \frac{|A|}{|\bar{A}|} = n.
$$

The RatioCut minimization problem in the case k = 2 can therefore be expressed as finding:

$$
\min_{\substack{A \subset V \\ f \perp \mathbb{1},\ \|f\|_2 = \sqrt{n}}} f^T L f
$$

where f can only take entries fi as defined in equation 3.4. This is a discrete optimization problem, since the entries fi are only allowed to take two different values. The relaxation of this problem consists in allowing these entries to take any real value. The relaxed RatioCut minimization problem can therefore be written as:

$$
\min_{\substack{f \in \mathbb{R}^n \\ f \perp \mathbb{1},\ \|f\|_2 = \sqrt{n}}} f^T L f
\tag{3.6}
$$

which can be easily solved using basic linear algebra. In fact, the solution of this relaxed problem is the eigenvector corresponding to the second smallest eigenvalue of the standard graph Laplacian matrix (remember that the smallest eigenvalue is zero, with eigenvector 1). This can be proven using the Rayleigh-Ritz theorem. This eigenvector can thus be used to approximate the RatioCut minimizing solution. Note that the latter is not necessarily the optimal solution: the entries of the optimal solution must be as defined in equation 3.4, whereas the solution of the relaxed problem is an arbitrary real-valued vector. We should therefore find a way to transform it into a discrete indicator vector so as to classify the vertices accordingly. A simple way consists in using the sign of the entry fi of the solution vector to decide
whether the vertex i is in A or not. But most spectral clustering algorithms use k-means clustering instead: the sign-based heuristic is too simple, especially when we have to deal with more than two clusters. We will now see how to relax the RatioCut minimization problem for k ≥ 2 using the properties of the standard graph Laplacian L.

General case k ≥ 2

Let A1, ..., Ak denote a partition of V and h1, ..., hk denote k vectors of R^n defined as hj = (h1j, ..., hnj)^T, where:

$$
h_{ij} =
\begin{cases}
\dfrac{1}{\sqrt{|A_j|}}, & \text{if } i \in A_j \\
0, & \text{if } i \notin A_j
\end{cases}
\tag{3.7}
$$

Let H denote the n-by-k matrix whose entries are the hij. Since A1, ..., Ak is supposed to be a partition of V, each row of H contains exactly one non-zero entry, namely the j-th one if i ∈ Aj. The factor 1/√|Aj| is used in order to normalize the indicator vectors so as to have ⟨hi, hj⟩ = δij, where δij denotes the Kronecker delta, equal to one if i = j and zero otherwise. We can notice that H^T H = Ik. We have:

$$
h_l^T L h_l = \frac{1}{2} \sum_{1 \le i, j \le n} s_{ij} (h_{il} - h_{jl})^2
= \frac{1}{2} \sum_{i \in A_l, j \in \bar{A_l}} s_{ij} \, \frac{1}{|A_l|} + \frac{1}{2} \sum_{i \in \bar{A_l}, j \in A_l} s_{ij} \, \frac{1}{|A_l|}
= \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|}
$$

Let E_l denote the l-th element of the canonical basis of R^k. We then have:

$$
(H^T L H)_{ll} = E_l^T H^T L H E_l = (H E_l)^T L (H E_l) = h_l^T L h_l = \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|}
$$

We finally obtain:

$$
\mathrm{RatioCut}(A_1, \dots, A_k) = \sum_{l=1}^{k} \frac{\mathrm{Cut}(A_l, \bar{A_l})}{|A_l|} = \sum_{l=1}^{k} (H^T L H)_{ll} = \mathrm{Tr}(H^T L H)
\tag{3.8}
$$

where Tr denotes the trace of a matrix. According to equation 3.8, the k ≥ 2 RatioCut minimization problem can be rewritten as:

$$
\min_{\substack{A_1, \dots, A_k \subset V \\ H^T H = I_k}} \mathrm{Tr}(H^T L H)
\tag{3.9}
$$
where H can only take entries hij as defined in 3.7. As in the case k = 2, this is an NP-hard discrete optimization problem. To relax it, we allow the entries hij of the matrix H to take any real value. The relaxed problem can thus be rewritten as:

$$
\min_{\substack{H \in \mathbb{R}^{n \times k} \\ H^T H = I_k}} \mathrm{Tr}(H^T L H)
\tag{3.10}
$$

According to the Rayleigh-Ritz theorem, the solution of the relaxed problem is the matrix U containing the k first eigenvectors of the standard graph Laplacian. By "k first eigenvectors" we mean the k first column vectors of the orthogonal matrix P such that L = P diag(λ1, ..., λn) P^T with λ1 ≤ ... ≤ λn. Again, the solution of the relaxed problem does not necessarily contain discrete indicator vectors like the ones defined in 3.7, hence the necessity of a clustering algorithm. Most spectral clustering algorithms use k-means clustering on the rows of U. Let Y1, ..., Yn denote the n rows of U. The outputs are clusters C1, ..., Ck, and we classify the vertices by choosing i ∈ Aj if Yi ∈ Cj. In fact, two vertices i and j belong to the same cluster if and only if the corresponding rows Yi and Yj of the matrix U belong to the same cluster as well. To understand this equivalence, consider the simple case where the solution U is optimal. Then its column vectors would be similar to the indicator vectors defined in 3.7. It is clear that two vertices i and j belong to the subset Al if and only if hil = hjl, and therefore the i-th and j-th rows of U are equal as well.

The spectral clustering algorithm corresponding to the RatioCut minimization problem is the unnormalized spectral clustering described below.

Algorithm 1: Unnormalized spectral clustering
1: Input: n vectors X1, ..., Xn of R^d encoding the data and the number k of clusters to construct
2: Output: Clusters A1, ..., Ak
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the standard graph Laplacian L
5: Compute v1, ..., vk, the k first eigenvectors of L
6: U ← the matrix containing v1, ..., vk as columns
7: for i = 1, ..., n do
8:     Yi ← the i-th row of U
9: Cluster the vectors Y1, ..., Yn of R^k into clusters C1, ..., Ck
10: Aj = {Xi | Yi ∈ Cj}
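As an illustration, here is a compact sketch of Algorithm 1 in Python (the report's own experiments use MATLAB scripts; this NumPy/scikit-learn version is only an assumed, equivalent reading of the steps above, with a fully connected Gaussian similarity graph):

```python
# Sketch of Algorithm 1 (unnormalized spectral clustering) with a Gaussian kernel.
import numpy as np
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(X, k, sigma=1.0):
    # step 3: fully connected similarity graph with the Gaussian kernel (3.21)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2.0 * sigma**2))
    np.fill_diagonal(S, 0.0)                      # simple graph: no loops
    # step 4: standard graph Laplacian L = D - S
    L = np.diag(S.sum(axis=1)) - S
    # steps 5-6: k first eigenvectors of L (eigh returns eigenvalues in ascending order)
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    # steps 7-10: k-means on the rows of U
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels
```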
3.4.2 NCut

As in the case of the RatioCut, we keep the notation of section 3.4. The NCut is defined with respect to the partition A1, ..., Ak as:

$$
\mathrm{NCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{\mathrm{vol}(A_i)} = \sum_{i=1}^{k} \frac{\mathrm{Cut}(A_i, \bar{A_i})}{\mathrm{vol}(A_i)}
\tag{3.11}
$$

The NCut problem measures the size of a subset A ⊂ V by the quantity vol(A). Recall that vol(A) is equal to the sum of the degrees of the vertices of A. It is clear that a small value of vol(Ai) will raise the value of the NCut; this is how the NCut problem avoids the problem of badly-sized subsets (see section 3.4 above). The advantage of using vol(A) instead of |A| is that vol(A) measures how strong the connectivity between the vertices of A is, unlike |A|, which gives no information about this connectivity. Another advantage of the NCut over the RatioCut is that the NCut minimizes the probability that a random walk on the graph jumps from one cluster to another; see proposition 5 in section 6 of [4] for more explanations. As in the case of the RatioCut problem, the objective here is to minimize the quantity 3.11, which is an NP-hard problem. We will see below how the graph Laplacian matrices are used to relax it.

Case k = 2

The objective here is to find a subset A ⊂ V that minimizes NCut(A, Ā) = Cut(A, Ā)/vol(A) + Cut(Ā, A)/vol(Ā). Given a subset A ⊂ V, let f = (f1, ..., fn)^T be a vector of R^n such that:

$$
f_i =
\begin{cases}
\sqrt{\mathrm{vol}(\bar{A}) / \mathrm{vol}(A)}, & \text{if } i \in A \\
-\sqrt{\mathrm{vol}(A) / \mathrm{vol}(\bar{A})}, & \text{if } i \notin A
\end{cases}
\tag{3.12}
$$

As we did in the RatioCut case, we can check that:

$$
f^T L f = \mathrm{vol}(V) \, \mathrm{NCut}(A, \bar{A}), \qquad f^T D f = \mathrm{vol}(V), \qquad (D f)^T \mathbb{1} = 0
$$

Then we can write the NCut minimization problem as:

$$
\min_{\substack{A \subset V \\ D f \perp \mathbb{1},\ f^T D f = \mathrm{vol}(V)}} f^T L f
\tag{3.13}
$$
such that f can only take entries fi as defined in 3.12. Again, we allow the entries fi to take any real value in order to relax the NP-hard NCut minimization problem. The relaxed problem can therefore be written as:

$$
\min_{\substack{f \in \mathbb{R}^n \\ D f \perp \mathbb{1},\ f^T D f = \mathrm{vol}(V)}} f^T L f
\tag{3.14}
$$

It can also be rewritten in an equivalent way using the substitution g := D^{1/2} f:

$$
\min_{\substack{g \in \mathbb{R}^n \\ g \perp D^{1/2} \mathbb{1},\ \|g\|_2^2 = \mathrm{vol}(V)}} g^T D^{-1/2} L D^{-1/2} g
= \min_{\substack{g \in \mathbb{R}^n \\ g \perp D^{1/2} \mathbb{1},\ \|g\|_2^2 = \mathrm{vol}(V)}} g^T L_{sym} g
\tag{3.15}
$$

According to propositions 3.3.1 and 3.3.2, D^{1/2} 1 is the first eigenvector of Lsym. Then, using the Rayleigh-Ritz theorem, we can conclude that the solution g of the relaxed problem is equal to the second eigenvector of Lsym. If we re-substitute f = D^{−1/2} g, we can notice using proposition 3.3.2 that f is equal to the second eigenvector of Lrw, or equivalently of the generalized eigenproblem Lf = λDf. After finding the solution of the relaxed problem, we carry on the classification process the same way we did in the case of the RatioCut minimization problem.

General case k ≥ 2

As in the case of the RatioCut, given a partition A1, ..., Ak of the set of vertices V, we define the vectors h1, ..., hk of R^n, where hj = (h1j, ..., hnj)^T such that:

$$
h_{ij} =
\begin{cases}
\dfrac{1}{\sqrt{\mathrm{vol}(A_j)}}, & \text{if } i \in A_j \\
0, & \text{if } i \notin A_j
\end{cases}
\tag{3.16}
$$

Again, let H denote the n-by-k matrix containing the vectors h1, ..., hk as columns. We can check that (H^T L H)_{ll} = h_l^T L h_l = Cut(A_l, Ā_l) / vol(A_l), and therefore:

$$
\mathrm{NCut}(A_1, \dots, A_k) = \sum_{l=1}^{k} \frac{\mathrm{Cut}(A_l, \bar{A_l})}{\mathrm{vol}(A_l)} = \sum_{l=1}^{k} (H^T L H)_{ll} = \mathrm{Tr}(H^T L H)
\tag{3.17}
$$
We can also see that H^T D H = Ik. Therefore the NCut minimization problem can be re-expressed as:

$$
\min_{\substack{A_1, \dots, A_k \subset V \\ H^T D H = I_k}} \mathrm{Tr}(H^T L H)
\tag{3.18}
$$

where H can only take the entries hij as defined in 3.16. Again, the relaxation consists in allowing these entries to take any real value. The relaxed problem can then be rewritten as:

$$
\min_{\substack{H \in \mathbb{R}^{n \times k} \\ H^T D H = I_k}} \mathrm{Tr}(H^T L H)
\tag{3.19}
$$

or, after the substitution F := D^{1/2} H:

$$
\min_{\substack{F \in \mathbb{R}^{n \times k} \\ F^T F = I_k}} \mathrm{Tr}(F^T D^{-1/2} L D^{-1/2} F)
= \min_{\substack{F \in \mathbb{R}^{n \times k} \\ F^T F = I_k}} \mathrm{Tr}(F^T L_{sym} F)
\tag{3.20}
$$

As in the case of the RatioCut problem, the Rayleigh-Ritz theorem allows us to conclude that the solution of the relaxed problem is given by the matrix F containing the first k eigenvectors of the symmetric normalized graph Laplacian Lsym. Then, if we re-substitute H = D^{−1/2} F, we can notice using proposition 3.3.2 that H is equal to the matrix containing the k first eigenvectors of Lrw, or equivalently of the generalized eigenproblem Lu = λDu. The spectral clustering algorithms corresponding to the NCut minimization problem are the random walk normalized spectral clustering introduced by Shi and Malik in [3] and the symmetric normalized spectral clustering introduced by Ng, Jordan and Weiss in [7].

Algorithm 2: Random walk normalized spectral clustering (Shi and Malik [3])
1: Input: n vectors X1, ..., Xn of R^d encoding the data and the number k of clusters to construct
2: Output: Clusters A1, ..., Ak
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the standard graph Laplacian L
5: Compute v1, ..., vk, the k first eigenvectors of the generalized eigenproblem Lu = λDu (which are also the k first eigenvectors of Lrw)
6: U ← the matrix containing v1, ..., vk as columns
7: for i = 1, ..., n do
8:     Yi ← the i-th row of U
9: Cluster the vectors Y1, ..., Yn of R^k into clusters C1, ..., Ck
10: Aj = {Xi | Yi ∈ Cj}
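A minimal sketch of Algorithm 2 (again an assumed Python reading of the pseudo-code, not the report's MATLAB scripts), solving the generalized eigenproblem Lu = λDu with SciPy:

```python
# Sketch of Algorithm 2 (random walk normalized spectral clustering, Shi and Malik).
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def shi_malik_spectral_clustering(S, k):
    d = S.sum(axis=1)
    D = np.diag(d)
    L = D - S
    # generalized eigenproblem L u = lambda D u; eigh returns ascending eigenvalues
    _, vecs = eigh(L, D)
    U = vecs[:, :k]                               # k first generalized eigenvectors
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels
```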
Algorithm 3: Symmetric normalized spectral clustering (Ng, Jordan and Weiss [7])
1: Input: n vectors X1, ..., Xn of R^d encoding the data and the number k of clusters to construct
2: Output: Clusters A1, ..., Ak
3: Build the similarity graph (see section 3.5 below) and compute the similarity matrix S
4: Compute the symmetric normalized graph Laplacian Lsym
5: Compute v1, ..., vk, the k first eigenvectors of Lsym
6: U ← the matrix containing v1, ..., vk as columns
7: Compute T, the n-by-k matrix obtained by normalizing the rows of U:
8: for i = 1, ..., n and j = 1, ..., k do
9:     tij ← uij / (Σ_{p=1}^{k} u_{ip}²)^{1/2}
10: for i = 1, ..., n do Yi ← the i-th row of T
11: Cluster the vectors Y1, ..., Yn of R^k into clusters C1, ..., Ck
12: Aj = {Xi | Yi ∈ Cj}

Note that the row-normalization step in the symmetric normalized spectral clustering aims at avoiding certain computational problems, related to the limited precision of numerical computations. Note also that, in their experiments, the authors of [3] and [7] used the fully connected graph with the Gaussian kernel as a similarity graph. We will do the same in the next section, because the two other options (see section 3.5) require the choice of additional parameters; this choice is generally based on heuristics, which might have a negative impact on the experiments.

3.5 RANDOM DISTORTION TESTING p-VALUE KERNEL

In all spectral clustering algorithms, the first step consists in building a similarity graph. Recall that data partitioning is based on the notion of similarity: a cluster must contain data that are "similar" within the cluster and "dissimilar" to the rest of the data set. A natural way to model the similarity within the data consists in building a graph whose vertices encode the data. The weights of the connections are then chosen to be proportional to the degree of "similarity" between the vertices. A kernel function (also called a similarity function) is a map from the set of edges E to the set of real numbers; it assigns to each pair of connected vertices a real number called the weight. In practice, the data set contains vectors of R^d. Thus we can say that two vectors are similar if the distance between them (e.g. the Euclidean distance) is "small". Therefore, the smaller the distance between two vertices, the greater the weight of the edge connecting them.
The most popular kernel function is the Gaussian kernel, defined with respect to a parameter σ and two d-dimensional vectors xi and xj as:

$$
s_{ij} := \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right)
\tag{3.21}
$$

The Gaussian kernel is clearly a positive, decreasing function of the Euclidean distance. The parameter σ is homogeneous to a standard deviation, hence its notation. We can also notice that the Gaussian kernel is homogeneous to a probability. Given a set of d-dimensional vectors and a kernel function, there are different ways to build a similarity graph:

• The fully connected graph: we connect all the vectors and then weight the edges using the kernel function.
• The k-nearest neighbors graph: this method consists in connecting each vector to its k nearest vectors in terms of the Euclidean distance. We then weight the edges using the kernel function.
• The ε-neighborhood graph: we connect the vectors xi and xj if ‖xi − xj‖₂ ≤ ε and then weight the edges using the kernel function.

In this internship we study the impact on the partitioning performance of choosing the kernel function equal to the p-value of random distortion testing. First of all, let us give a formal definition of the p-value of a random distortion test.

Definition 3.5.1 (p-value of a random distortion test). The p-value of a random distortion test is defined with respect to its tolerance τ, its deterministic model θ0, its noise covariance matrix C and a vector y of R^d as:

$$
\hat{\gamma}(y) = \inf\{ \gamma \in (0,1) : T_{\lambda_\gamma(\tau)}(y) = 1 \} = \inf\{ \gamma \in (0,1) : \lambda_\gamma(\tau) < \|y - \theta_0\| \}
$$

Like the Gaussian kernel, the p-value of RDT is homogeneous to a probability. It can be seen as the optimal value of the false-alarm probability. In fact, for a fixed value of the tolerance τ, it is proven in [1] that λγ(τ) is a strictly decreasing function of γ. Hence the p-value is optimal in the sense that it is neither too large, which would lead to frequent false alarms, nor too small, which would lead to missed detections. Given a vector y and a tolerance τ, it follows from the definition of the p-value that a small value of the latter corresponds to a high value of λ_{γ̂(y)}(τ), and therefore to a high value of the distortion ‖y − θ0‖, and vice versa.
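A hedged numerical sketch of this p-value (assuming, as in section 2.3, that 1 − R(τ, ·) is the survival function of a noncentral chi-squared distribution; the function and parameter names below are chosen for this illustration only):

```python
# Sketch of the RDT p-value gamma_hat(y) of definition 3.5.1, using equation (2.4):
# gamma_hat(y) = 1 - R(tau, ||y - theta_0||) = 1 - F_{chi2_d(tau^2)}(||y - theta_0||^2).
import numpy as np
from scipy.stats import ncx2

def rdt_pvalue(y, theta_0, C, tau):
    y = np.asarray(y, dtype=float)
    theta_0 = np.asarray(theta_0, dtype=float)
    d = len(y)
    diff = y - theta_0
    maha = np.sqrt(diff @ np.linalg.solve(C, diff))     # Mahalanobis norm ||y - theta_0||
    return ncx2.sf(maha**2, df=d, nc=tau**2)            # survival function of chi2_d(tau^2)
```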
Now let us consider the case where θ0 = 0_{R^d}, the zero vector. Then, given two vectors yi and yj of R^d, a small value of γ̂(yi − yj) corresponds to a high value of ‖yi − yj‖ and vice versa. This is the first reason why we thought of using the p-value as a similarity (kernel) function. Another reason is that when d = 2, τ = 0 and the covariance matrix is C = I2, the p-value and the Gaussian kernel (with σ = 1) are equal. In fact, since λγ(τ) is a strictly decreasing function of γ, γ̂(y) satisfies:

$$
\lambda_{\hat{\gamma}(y)}(\tau) = \|y - \theta_0\|
\tag{3.22}
$$

It then follows from equation 2.4 (see section 2.3) that:

$$
\hat{\gamma}(y) = 1 - R(\tau, \lambda_{\hat{\gamma}(y)}(\tau)) = 1 - R(\tau, \|y - \theta_0\|) = Q_{d/2}(\tau, \|y - \theta_0\|)
\tag{3.23}
$$

where Q_M denotes the (generalized) Marcum Q-function, defined for two real numbers x and y as:

$$
Q_M(x, y) = \exp\left( -\frac{x^2 + y^2}{2} \right) \sum_{k=1-M}^{\infty} \left( \frac{x}{y} \right)^k I_k(x y)
\tag{3.24}
$$

where I_k denotes the modified Bessel function of the first kind of order k. Thus, it follows from 3.23 and 3.24 that in the case τ = 0, d = 2 and C = I2:

$$
\|y - \theta_0\| = \sqrt{(y - \theta_0)^T I_2 (y - \theta_0)} = \|y - \theta_0\|_2
$$

and:

$$
\hat{\gamma}(y) = Q_1(0, \|y - \theta_0\|_2)
= \exp\left( -\frac{\|y - \theta_0\|_2^2}{2} \right) \sum_{k=0}^{\infty} \left( \frac{0}{\|y - \theta_0\|_2} \right)^k I_k(0)
= \exp\left( -\frac{\|y - \theta_0\|_2^2}{2} \right)
\tag{3.25}
$$

since I_0(0) = 1 and 0^k = 0 for every k ≥ 1. It follows from these calculations that the p-value can be considered as a generalization of the Gaussian kernel.

3.6 EXPERIMENTS

In this section we compare the clustering performances of the p-value and the Gaussian kernel in the context of disturbed data. In order to visualize the clustering results, we only consider bi-dimensional and tri-dimensional data. We still suppose in this section that the data set contains vectors Y = Θ + X of R^d, where Θ denotes the vector of interest and X ∼ N(0, σ² I_d) denotes the noise vector.
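The special case above is easy to check numerically; the following short sketch verifies equation (3.25) directly (with τ = 0 the noncentral chi-squared distribution of section 2.3 reduces to the central one):

```python
# Check equation (3.25): for tau = 0, d = 2 and C = I_2, the p-value equals exp(-||y||^2 / 2),
# i.e. the Gaussian kernel with sigma = 1 applied to y = y_i - y_j (theta_0 = 0).
import numpy as np
from scipy.stats import chi2

y = np.array([0.7, -1.2])
gaussian = np.exp(-np.dot(y, y) / 2.0)
# with tau = 0, the p-value is the survival function of a central chi-squared with 2 dof
pvalue = chi2.sf(np.dot(y, y), df=2)
print(np.isclose(pvalue, gaussian))        # True
```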
We start with the unnormalized spectral clustering, where the results are very satisfying.

Figure 3.2: Two circles with additive noise drawn from the normal distribution. A good clustering must separate the two circles. The number of points in each circle is equal to 50.

Figure 3.3: On the left, the resulting clustering using the RDT p-value kernel. In the middle and on the right, the resulting clustering of the Gaussian kernel with parameter σ equal to the noise standard deviation and to twice that value, respectively.

As we can see in figure 3.3, the performances of the Gaussian kernel and the RDT p-value are both satisfying. In fact, the number of points in each circle is large enough to obtain good clustering results.
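For readers who want to reproduce a similar setting, here is a small sketch of how a toy data set like the one of figure 3.2 could be generated (this is an assumption about the setup, not the report's actual script; radii and noise level are illustrative):

```python
# Sketch of a "two noisy circles" toy data set similar to figure 3.2.
import numpy as np

def two_noisy_circles(n_per_circle=50, radii=(1.0, 3.0), noise_sigma=0.15, seed=0):
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for label, r in enumerate(radii):
        angles = rng.uniform(0.0, 2.0 * np.pi, size=n_per_circle)
        circle = np.c_[r * np.cos(angles), r * np.sin(angles)]
        points.append(circle + rng.normal(scale=noise_sigma, size=circle.shape))
        labels.append(np.full(n_per_circle, label))
    return np.vstack(points), np.concatenate(labels)

X, y_true = two_noisy_circles()   # X can then be fed to the spectral clustering sketches above
```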
Figure 3.4: The figure on the left represents the resulting clustering using the RDT p-value, whereas the two others represent the clustering results of the Gaussian kernel (16 points in each circle).

In figure 3.4 above we see the impact of reducing the number of points on the clustering performance. Here, the RDT p-value leads to far better results than the Gaussian kernel does. In figure 3.5 below, we try the two similarity functions on tri-dimensional data, where a good clustering must separate the two helices. Again we notice that the RDT p-value leads to good results.

Figure 3.5: The figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel.
The script used to obtain all the figures above is available in the attachments and is named "USCmain.m". Put a break-point between the two blocks before running the script. Try changing N, the number of points (line 81), to see its impact on the clustering results.

In the following we will see that the symmetric and the random walk normalized spectral clustering algorithms do not always lead to good clustering results. But, as we will notice in the figures, there is an interval of tolerance values for which the RDT p-value kernel leads to the correct clusters.

Figure 3.6: This figure represents the original data set to partition using the symmetric normalized spectral clustering algorithm. The standard deviation of the noise is equal to 0.5.

After running the symmetric normalized spectral clustering algorithm several times with different initial conditions, it seems that it does not lead to satisfying results, in particular with the Gaussian similarity function. However, choosing an appropriate value of the tolerance τ makes the RDT p-value kernel work very well (see figures 3.7 and 3.8 below). As we can see in figure 3.7, the symmetric normalized spectral clustering did not manage to recover the correct clusters for a tolerance between 0 and 1.33. However, for a tolerance between 1.7778 and 4, the results are very satisfying (figure 3.8). To obtain figures 3.7 and 3.8, run the file "SymSCmain.m" in the attachments. As for the random walk normalized spectral clustering, the results are very similar (see figures 3.9, 3.10 and 3.11 below): they are not satisfying for 0 < τ < 1.1111 and 3.3333 < τ < 5 (figures 3.9 and 3.11), whereas a tolerance 1.6667 < τ < 2.7778 leads to the correct clusters (figure 3.10).
CONCLUSION

The experiments show that choosing the p-value of random distortion tests as a similarity function in spectral clustering algorithms leads to promising results. The results were very satisfying when we used the unnormalized spectral clustering algorithm to partition bi-dimensional and tri-dimensional data. In the normalized spectral clustering cases, the experiments showed that an appropriate tolerance must be found in order to obtain satisfying results. We have also proved that in some cases the p-value and the Gaussian kernel are equal. Finding the appropriate value of the tolerance can be a difficult task; this is also true for the parameter σ of the Gaussian kernel. In general, such parameters are chosen in order to optimize certain criteria.
REFERENCES

[1] Dominique Pastor and Quang-Thang Nguyen, Random Distortion Testing and Optimality of Thresholding Tests, IEEE Transactions on Signal Processing, Vol. 61, No. 16, August 15, 2013.
[2] David MacKay (2003), "Chapter 20. An Example Inference Task: Clustering", Information Theory, Inference and Learning Algorithms, Cambridge University Press, pp. 284-292.
[3] Jianbo Shi and Jitendra Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, August 2000.
[4] Ulrike von Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing, 17 (4), 2007.
[5] Fan R. K. Chung (1997), Spectral Graph Theory (Vol. 92 of the CBMS Regional Conference Series in Mathematics).
[6] Lars Hagen and Andrew B. Kahng (1992), New spectral methods for ratio cut partitioning and clustering, IEEE Trans. Computer-Aided Design, 11(9), 1074-1085.
[7] Andrew Y. Ng, Michael I. Jordan and Yair Weiss (2002), On spectral clustering: analysis and an algorithm, in T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14.
Figure 3.7: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 0 < τ < 1.33.

Figure 3.8: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 1.7778 < τ < 4.

Figure 3.9: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 0 < τ < 1.1111.

Figure 3.10: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 1.6667 < τ < 2.7778.

Figure 3.11: In each row, the figure on the left represents the resulting clustering using the RDT p-value, while the two others represent the clustering results of the Gaussian kernel. Here 3.3333 < τ < 5.