Graph Cluster Theory
Haris Sarfraz
Graph Cluster Theory
Presented to the Faculty of the
Department of Computer Sciences
Federal Urdu University of Arts, Science & Technology
Gulshan Campus
In partial fulfillment of the course
“Machine Learning”
BSCS Evening Program
Semester 7th
Submitted to:
Sir Usman
By:
Haris Sarfraz
Enrollment No.
BS/E/20048/13/CS
Table of Contents
1.1 Introduction
1.1.1 Stage 1: Graph theory
1.1.2 Stage 2: Generation models for clustered graphs
1.1.3 Stage 3: Desirable cluster properties
1.1.4 Stage 4: Representations of clusters for different classes of graphs
1.2 Data Item 2
1.2.1 Stage 1: Bipartite graphs
1.2.2 Stage 2: Directed graphs
1.2.3 Stage 3: Graphs, structure, and optimization
1.3 Data Item 3
1.3.1 Stage 1: Graph partitioning and clustering
1.3.2 Stage 2: Graph partitioning applications
1.3.3 Stage 3: Clustering as a preprocessing step in graph partitioning
1.3.4 Stage 4: Clustering in weighted complete versus simple graphs
1.4 Data Item 4
1.4.1 Stage 1: Vector clustering and graph clustering
1.4.2 Stage 2: Applications in graph clustering
1.4.3 Stage 3: Graph clustering and graph partitioning
Introduction
In graph theory, a branch of mathematics, a cluster graph is a
graph formed from the disjoint union of complete graphs.
Equivalently, a graph is a cluster graph if and only if it has no three-
vertex induced path; for this reason, cluster graphs are also
called P3-free graphs. They are the complement graphs of the
complete multipartite graphs and the 2-leaf powers.
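To make the definition concrete, the following small Python sketch (our own illustration; the graph is assumed to be given as a dictionary mapping each vertex to its set of neighbours) tests whether a graph is a cluster graph by checking that every connected component induces a clique:

```python
def is_cluster_graph(adj):
    """True iff every connected component of the graph is a clique,
    i.e. the graph is a disjoint union of complete graphs (P3-free)."""
    seen = set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]       # collect the component of `start`
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        # a component is a clique iff each member is adjacent
        # to all of the other |comp| - 1 members
        if any(len(adj[v] & comp) != len(comp) - 1 for v in comp):
            return False
    return True

# two disjoint complete graphs form a cluster graph
g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
print(is_cluster_graph(g))  # True
```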
We begin the difficult work of defining what constitutes a cluster in a
graph and what a clustering should be like; we also discuss some
special classes of graphs. In some of the clustering literature, a
cluster in a graph is called a community. Formally, given a data set,
the goal of clustering is to divide the data set into clusters such that
the elements assigned to a particular cluster are similar or
connected in some predefined sense. However, not all graphs have
a structure with natural clusters. Nonetheless, a clustering algorithm
outputs a clustering for any input graph. If the structure of the graph
is completely uniform, with the edges evenly distributed over the set
of vertices, the clustering computed by any algorithm will be rather
arbitrary. Quality measures – and if feasible, visualizations – will help
to determine whether there are significant clusters present in the
graph and whether a given clustering reveals them or not. In order
to give a more concrete idea of what clusters are, we present here
a small example. On the left in Fig. 1 we have an adjacency matrix
of a graph with n = 210 vertices and m = 1505 edges: the 2m black
dots (two for each edge) represent the ones of the matrix, whereas
white areas correspond to zero elements. When the vertices are
ordered randomly, there is no apparent structure in the adjacency
matrix and one cannot trivially interpret the presence, number, or
quality of clusters inherent in the graph. However, once we run a
graph clustering algorithm (in this example, an algorithm of
Schaeffer [205]) and re-order the vertices according to their
respective clusters, we obtain a diagonalized version of the
adjacency matrix, shown on the right. Now the cluster structure is
evident: there are 17 dense clusters of varying orders and some
sparse connections between the clusters. Matrix diagonalization in
itself is an important application of clustering algorithms, as there are
efficient computational methods available for processing
diagonalized matrices, for example, to solve linear systems. Such
computations in turn enable efficient algorithms for graph
partitioning, as the graph partitioning problem can be written in the
form of a set of linear equations.
(In graph partitioning, the goal is to minimize the number of edges
that cross from one subgroup of vertices to another, usually posing
limits on the number of groups as well as on the relative sizes of the
groups.)
Fig. 1 – The adjacency matrix of a 210-vertex graph with 1505 edges
composed of 17 dense clusters. On the left, the vertices are ordered
randomly and the graph structure can hardly be observed. On the
right, the vertex ordering is by cluster and the 17-cluster structure is
evident. Each dot corresponds to an element of the adjacency
matrix that has the value one; the white areas correspond to
elements with the value zero.
Fig. 2 – Two graphs, both of which have 84 vertices and 358 edges.
The graph on the left is a uniform random graph of the Gn,m model
[84, 85] and the graph on the right has a relaxed caveman structure
[228]. Both graphs were drawn with spring-force visualization [203].
Generation models for clustered graphs
Gilbert presented in 1959 a process for generating uniform random
graphs with n vertices: each of the n(n−1)/2 possible edges is included in
the graph with probability p, considering each pair of vertices
independently. In such uniform random graphs, the degree
distribution is Poisson. Also, the presence of dense clusters is unlikely,
as the edges are by construction distributed uniformly, and hence
no dense clusters can be expected. A generalization of the Gilbert
model, especially designed to produce clusters, is the planted
ℓ-partition model: a graph is generated with n = ℓ·k vertices that are
partitioned into ℓ groups, each with k vertices. Two probability
parameters p and q < p are used to construct the edge set: each
pair of vertices in the same group shares an edge with the
higher probability p, whereas each pair of vertices in different groups
shares an edge with the lower probability q. The goal of the planted
partition problem is to recover such
a planted partition into ℓ clusters of k vertices each, instead of
optimizing some measure on the partition. McSherry discusses also
planted versions of other problems such as clique and graph
coloring. Of the two graphs in Fig. 2, one is a uniform random graph
and the other has a clearly clustered structure. The graph on the
right is a relaxed caveman graph. Caveman graphs were an early
attempt in the social sciences to capture the clustering properties of
social networks, produced by linking together a ring of small
complete graphs called "caves" by moving one of the edges in
each cave to point to another cave. The graph on the right in Fig. 2
was generated with a relaxed caveman model. These graphs are
especially constructed to contain a clear cluster structure, possibly
with a hierarchical structure where clusters can be further divided
into natural sub-clusters. The procedure to create a relaxed
caveman graph is the following: a connection probability
p ∈ (0, 1] of the top level of the hierarchy is
given as a parameter, together with a scaling coefficient s that
adjusts the density of the lower-level caves. The minimum and
maximum numbers of subcomponents (sub-caves at higher levels,
vertices at the bottom level) are given as parameters.
The generation procedure is recursive. The graph on the right is a
first-level construction. Planted partition models and relaxed caveman
graphs provide good and controllable artificial input data for
evaluating and benchmarking clustering algorithms. The underlying
cluster structure is determined by the construction, but by choosing
lower values for the inter-cluster densities, the structure can be
blurred, which increases the difficulty of clustering. The planted
partition model is theoretically easier to handle. However, the
relaxed caveman model incorporates the generation of a hierarchy
and the possibility to generate unequal-sized clusters — both often
present in natural data.
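To illustrate the planted ℓ-partition model described above, the following Python sketch (function and parameter names are our own) generates a graph with n = ℓ·k vertices, intra-group edge probability p and inter-group edge probability q < p:

```python
import random

def planted_partition(ell, k, p, q, seed=None):
    """Planted l-partition model: ell groups of k vertices each;
    edges inside a group appear with probability p, edges between
    groups with the lower probability q."""
    rng = random.Random(seed)
    n = ell * k
    group = [v // k for v in range(n)]       # vertex -> planted group
    edges = []
    for v in range(n):
        for w in range(v + 1, n):
            prob = p if group[v] == group[w] else q
            if rng.random() < prob:
                edges.append((v, w))
    return edges, group

# e.g. 4 planted clusters of 25 vertices: dense inside, sparse between;
# narrowing the gap between p and q blurs the planted structure
edges, truth = planted_partition(4, 25, p=0.5, q=0.05, seed=1)
```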
Desirable cluster properties
Unfortunately, no single definition of a cluster in graphs is universally
accepted, and the variants used in the literature are numerous. In the
setting of graphs, each cluster should intuitively be connected: there
should be at least one, preferably several, paths connecting each
pair of vertices within a cluster. If a vertex u cannot be reached from
a vertex v, they should not be grouped in the same cluster.
Furthermore, the paths should be internal to the cluster: in addition to
the vertex set C being connected in G, the subgraph induced by C
should be connected in itself, meaning that it is not sufficient for two
vertices v and u in C to be connected by a path that passes through
vertices in V \ C; they also need to be connected by a path that
only visits vertices included in C. As a consequence, when clustering
a disconnected graph with known components, the clustering
should usually be conducted on each component separately, unless
some global restriction on the resulting clusters is imposed. In some
applications one may wish to obtain clusters of similar order and/or
density, in which case the clusters computed in one component also
influence the clusterings of other components. We classify the
edges incident on v ∈ C into two groups: internal edges that
connect v to other vertices also in C, and external edges that
connect v to vertices that are not included in the cluster C:

degint(v, C) = |Γ(v) ∩ C|,
degext(v, C) = |Γ(v) ∩ (V \ C)|,
deg(v) = degint(v, C) + degext(v, C).   (2.1)

Clearly, degext(v, C) = 0 implies that C containing v could be a good
cluster, as v has no connections outside of it. Similarly, if
degint(v, C) = 0, v should not be included in C, as it is not connected
to any of the other vertices included. It is
generally agreed upon that a subset of vertices forms a good cluster
if the induced subgraph is dense, but there are relatively few
connections from the included vertices to vertices in the rest of the
graph. One measure that helps to evaluate the sparsity of
connections from the cluster to the rest of the graph is the cut size
c(C, V \ C): the smaller the cut size, the better "isolated" the cluster.
Determining when a cluster is dense is naturally achieved by
computing the graph density. We refer to the density of the
subgraph induced by the cluster as the internal or intra-cluster
density:

δint(C) = |{{v, u} | v ∈ C, u ∈ C}| / (|C|(|C| − 1)).   (2.2)

The intra-cluster density of a given clustering of a graph G into k
clusters C1, C2, ..., Ck is the average of the intra-cluster densities of
the included clusters:

δint(G | C1, ..., Ck) = (1/k) Σi δint(Ci).   (2.3)

[Figure: All cluster members are drawn in black and their internal
edges are drawn thicker than the other edges of the graph. The
cluster on the left is of good quality: dense and introverted. The one
in the middle has the same number of internal edges, but many
more edges to outside vertices, making it a worse cluster. The cluster
on the right has very few connections to the outside, but lacks
internal density and hence is not a good cluster.]
The external or inter-cluster density of a clustering is defined as the
ratio of the number of inter-cluster edges to the maximum possible
number of inter-cluster edges, which is effectively the sum of the cut
sizes of all the clusters, normalized to the range [0, 1]:

δext(G | C1, ..., Ck) = |{{v, u} | v ∈ Ci, u ∈ Cj, i ≠ j}| / (n(n − 1) − Σℓ |Cℓ|(|Cℓ| − 1)).   (2.4)

Globally speaking, the internal density of a good clustering should
be notably higher than the density of the graph δ(G), and the
inter-cluster density of the clustering should be lower than the graph
density [184]. Considering the above requirements of connectivity
and density, the loosest possible definition of a graph cluster is that
of a connected component, and the strictest definition is that each
cluster should be a maximal clique (i.e. a subgraph to which no
vertex could be added without losing the clique property). In most
occasions, the semantically useful clusters lie somewhere in between
these two extremes. Connected components are easily computed
in O(n + m) time with a breadth-first search, whereas clique
detection is NP-complete. Typically, interesting clustering measures
tend to correspond to NP-hard decision problems. When the input
graph is very large, such as the Internet at router level or the Web
graph (see Section 3.3.2), it is highly infeasible to rely on algorithms
with exponential running time, as even linear time computation gets
tedious. It is not always clear whether each vertex should be
assigned fully to a cluster, or whether it could instead have different
"levels of membership" in several clusters. In document clustering,
such a situation is easily imaginable: a document can be mainly
about fishing, for example, but also address sailing-related issues,
and hence could be clustered into "fishing" with a membership level
of 0.9, for example, and into "sailing" with a level of 0.3. Another solution
would be creating a super cluster to include all documents related
to fishing and sailing, but the downside is that there can be
documents on fishing that have no relation to sailing whatsoever. For
general clustering tasks, fuzzy clustering algorithms have been
proposed, as well as validity measures. Within graph clustering, not
much work can be found on fuzzy clustering, and in general, the
past decade has been quiet in the area of fuzzy clustering. Yang
and Hsiao present a fuzzy graph-clustering algorithm and apply it to
circuit partitioning. A study on general clustering methods using fuzzy
set theory is presented by Dave and Krishnapuram [68]. A fuzzy
graph GR = (V, R) is composed of a set of vertices and a fuzzy edge-
relation R that is reflexive and symmetrical, together with a
membership function µR that assigns to each fuzzy edge a level of
"presence" in the graph. Different non-fuzzy graphs Gτ can be
obtained by thresholding µR such that only those edges {v, u} for
which µR(v, u) ≥ τ are included as edges in Gτ. The graph Gτ is
called a cut graph of GR. Dong et al. [75] present a clustering
method based on a connectivity property of fuzzy graphs, assuming
that the vertices represent a set of objects that is being clustered
based on a distance measure. The algorithm first pre-clusters the
data into sub-clusters based on the distance measure, after which a
fuzzy graph is constructed for each sub-cluster and a cut graph of
the resulting graph is used to define what constitutes a cluster. Dong
et al. also discuss the modifications needed in the current clustering
upon updates in the database that contains the objects to be clustered.
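As a small illustration of the thresholding just described (a sketch of ours, with the fuzzy edge relation stored as a dictionary from edges to membership levels), the cut graph Gτ keeps exactly the fuzzy edges whose membership level reaches the threshold τ:

```python
def cut_graph(mu, tau):
    """Cut graph G_tau of a fuzzy graph: keep exactly the edges {v, u}
    whose membership level mu({v, u}) is at least tau."""
    return {edge: level for edge, level in mu.items() if level >= tau}

# mu maps each fuzzy edge to its level of "presence" in (0, 1]
mu = {frozenset((1, 2)): 0.9, frozenset((2, 3)): 0.4, frozenset((1, 3)): 0.7}
print(cut_graph(mu, 0.5))  # only the edges present at level >= 0.5 survive
```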
Fuzzy clustering has not been established as a widely accepted
approach for graph clustering, but it offers a more relaxed
alternative for applications where assigning each vertex to just one
cluster seems restrictive, even though the vertex does relate more
strongly to some of the candidate clusters than to others.
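The degree and density measures defined in this section translate directly into code. The sketch below is our own illustration (the graph is again a dictionary of neighbour sets); it computes degint and degext for every member of a candidate cluster, the internal density δint(C) and the cut size c(C, V \ C):

```python
def cluster_measures(adj, C):
    """Quality measures for a candidate cluster C of a graph given as a
    dict mapping each vertex to its set of neighbours."""
    C = set(C)
    deg_int = {v: len(adj[v] & C) for v in C}    # deg_int(v, C)
    deg_ext = {v: len(adj[v] - C) for v in C}    # deg_ext(v, C)
    internal_edges = sum(deg_int.values()) // 2  # each edge counted twice
    pairs = len(C) * (len(C) - 1) // 2           # possible internal edges
    delta_int = internal_edges / pairs if pairs else 0.0
    cut_size = sum(deg_ext.values())             # c(C, V \ C)
    return delta_int, cut_size, deg_int, deg_ext
```

A good cluster in the sense described above combines a δint close to one with a small cut size; note that this sketch divides by the number of unordered vertex pairs, the usual density convention.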
Representations of clusters for different classes of graphs
It is common that in applications, the graphs are not just simple,
unweighted and undirected. If more than one edge is allowed
between two vertices, instead of a binary adjacency matrix it is
customary to use a matrix that determines for each pair of vertices
how many edges they share. Graphs with such edge multiplicities
are called multigraphs. Also, should the graph be weighted, cutting
an important edge (with a large weight) when separating a cluster is
to be punished more heavily than cutting a few unimportant edges
(with very small weights). Edge multiplicities can in essence be
treated as edge weights, but the situation naturally gets more
complicated if the multiple edges themselves have weights. Luckily,
many measures extend rather fluently to incorporate weights or
multiplicities. It is especially easy when the possible values are
confined to a known range, as this
range can be transformed into the interval [0,1] where one
corresponds to a “full” edge, intermediate values to “partial” edges,
and zero to there being no edge between two vertices. With such a
transformation, we may compute density not by counting edges but
by summing over the edge weights in the unit line, to account for
the degree of "presence" of the edges. Now a
cluster of high density has either many edges or important edges,
and a low-density cluster has either few or unimportant edges. It
may be desirable to do a nonlinear transformation from the original
weight set to the unit line to adjust the distribution of edge
importance if the clustering results obtained by the linear
transformation appear noisy or otherwise unsatisfactory.
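The transformation described above can be sketched as follows (our own illustration; w_max stands for the largest possible weight, which we assume is known or, failing that, estimated by the observed maximum):

```python
def weighted_density(edge_weights, n, w_max=None):
    """Density of a weighted graph on n vertices: weights are rescaled
    linearly to [0, 1] and summed, so "partial" edges count partially."""
    edge_weights = list(edge_weights)
    if w_max is None:                 # fall back to the observed maximum
        w_max = max(edge_weights, default=1.0)
    pairs = n * (n - 1) / 2           # number of possible edges
    return sum(w / w_max for w in edge_weights) / pairs if pairs else 0.0
```

A nonlinear rescaling, as suggested above, would simply replace the expression w / w_max.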
Bipartite graphs
A bipartite graph is a graph where the vertex set V can be split in
two sets A and B such that all edges lie between those two sets: if
{v,w} ∈ E, either v ∈ A and w ∈ B or v ∈ B and w ∈ A. Such graphs are
natural for many application areas where the vertices represent two
distinct classes of objects, such as customers and products; an edge
could signify, for example, that a certain customer has bought a
certain product. Possible clustering tasks could be grouping the
customers by the types of products they purchase or grouping
products purchased by the same people — the motivation could be
targeted marketing, for instance. Carrasco et al. study a graph of
advertisers and keywords used in advertisements to identify
submarkets by clustering. A bipartite graph G = (A∪B,E) with edges
crossing only between A and B and not within can be transformed
into two graphs GA and GB. Consider two vertices v and w in A. As
the graph is bipartite, Γ(v) ⊆ B as well as Γ(w) ⊆ B, and these two
neighbourhoods may overlap. The more neighbours the two vertices
in A share, the more "similar" they are. Hence, we create a graph
GA = (A, EA) such that {v, w} ∈ EA if and only if Γ(v) ∩ Γ(w) ≠ ∅. Similarly,
a graph GB results from connecting vertices whose neighborhoods in
A overlap. Weighted versions of GA and GB can be obtained by setting
ω(v, w) = |Γ(v) ∩ Γ(w)|, possibly normalizing with the maximum degree
of the graph. Clustering can be computed either for the original
bipartite graph G or for the derived graphs GA and GB [41]. An
intuition on how this works can be developed by thinking of the set A
as books in a bookstore and the set B as the customers of the store,
with an edge connecting a book to a customer if the customer has
bought that book. Two customers have a similar taste in books if they have
bought many of the same books, and two books appeal to a similar
audience if several customers have purchased both. Hence the
overlap of the neighbourhoods on the one side of the graph reflects
the similarity of the vertices on the other side, and vice versa.
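The derivation of GA (and, symmetrically, GB) is easy to express in code. In this illustration of ours, the bipartite graph is a dictionary mapping each vertex of A to its neighbour set in B, and the projection carries the overlap weight ω(v, w) = |Γ(v) ∩ Γ(w)|:

```python
from itertools import combinations
from collections import defaultdict

def project_onto_A(adj, A):
    """Project a bipartite graph onto the side A: two vertices of A are
    joined iff their neighbourhoods in B overlap; the overlap size
    serves as the edge weight."""
    GA = defaultdict(dict)
    for v, w in combinations(sorted(A), 2):
        shared = len(adj[v] & adj[w])    # |Gamma(v) & Gamma(w)|
        if shared:                       # connect only on non-empty overlap
            GA[v][w] = GA[w][v] = shared
    return dict(GA)

# customers 'a' and 'b' bought book 1 in common, 'c' shares nothing
adj = {'a': {1, 2}, 'b': {1, 3}, 'c': {4}}
print(project_onto_A(adj, {'a', 'b', 'c'}))  # {'a': {'b': 1}, 'b': {'a': 1}}
```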
Bipartite graphs often arise as representations of hypergraphs, that
is, graphs where the edges are generalized to subsets of V of
arbitrary order instead of being restricted to pairs of vertices. An
example is factor graphs. Direct clustering of such hypergraphs is
another problem of interest.
Directed graphs
Up to now, we have been dealing with undirected graphs. Let us
turn to directed graphs, which require special attention, as the
connections that are used in defining the clustering are
asymmetrical, and so is the adjacency matrix. This causes relevant
changes in the structure of the Laplacian matrix and hence makes
spectral analysis more complex. Web graphs are directed graphs
formed by web pages as vertices and hyperlinks as edges. A
clustering of a higher-level web graph formed by all Chilean
domains was presented by Virtanen. Clustering of web pages can
help identify topics and group similar pages. This opens applications
in search-engine technology; building artificial clusters is known to be
a popular trick among websites of adult content to try to fool the
PageRank algorithm [35] used by Google to rate the quality of
websites. The basic PageRank algorithm initially assigns to each web
page the same importance value, which is then iteratively
distributed uniformly among all of its neighbors. The goal of the
PageRank algorithm is to reach a value distribution close to a
stationary converged state using relatively few iterations. The
amount of "importance" left at each vertex at the end of the
iteration becomes the PageRank of that vertex, the higher the
better. A large cluster formation of N vertices can be constructed to
"maintain" the initial values assigned within the cluster without letting
them drain out, eventually accumulating them at a single target
vertex that would hence receive a value close to N times the initial
value assigned to each vertex when the iterative computation of
PageRank is stopped, even though the true value after global
convergence would be low. Identifying such clusters helps to
overcome the problem. Another example of directed graphs is web
logs (widely known as blogs), which are web pages on which users
can make comments — identifying communities of users who
regularly comment on each other is not only a task of clustering a directed
graph, but also involves a temporal element: the blogs evolve over
time and hence the graph is under constant change as new links
between existing blogs appear and new blogs are created. The blog
graph can be viewed directly as the graph of the links between the
blogs or as a bipartite graph of blogs and users. Also the content of
the online shared-effort encyclopedia Wikipedia forms an interesting
directed graph.
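The iterative value distribution sketched above can be written as a simple power iteration. The following is a minimal illustration of ours, not the production algorithm: the damping factor 0.85 and the uniform treatment of dangling pages are conventional choices that the text does not spell out.

```python
def pagerank(adj_out, damping=0.85, iters=50):
    """Power-iteration sketch of PageRank: every page starts with the
    same importance, which is repeatedly spread over its out-links."""
    n = len(adj_out)
    rank = {v: 1.0 / n for v in adj_out}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in adj_out}
        for v, out in adj_out.items():
            if out:
                share = damping * rank[v] / len(out)
                for w in out:                # distribute uniformly over links
                    new[w] += share
            else:                            # dangling page: spread everywhere
                for w in new:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank
```

A link farm of N colluding pages keeps re-circulating its mass internally before the iteration converges, which is exactly the manipulation described above.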
Graphs, structure, and optimization
The concept of a simple graph is tightly associated with the study of
its structural properties. Examples are the study of regular objects
such as distance regular graphs, the study of graph spectra and
graph invariants, and the theory of randomly generated graphs. The
latter concept is introduced in some more detail, because it is used
in defining a benchmark model for graph clustering in Chapter 12. A
generic model via which a simple graph on n vertices is randomly
generated is to independently realize each edge with some fixed
probability p (a parameter in this model). Typical questions in this
branch of random graph theory concern the behavior of quantities
such as the number of components, the girth, the chromatic
number, the maximum clique size, and the diameter, as n goes to
infinity. One direction is to hold p fixed and look at the behavior of
the quantity under consideration; another is to find bounds for p as a
function of n such that almost every random graph generated by
the model has a prescribed property (e.g. diameter k, k fixed).
In the theory of nonnegative matrices, structural properties of simple
graphs are taken into account, especially regarding the 0/1
structure of matrix powers. Concepts such as cyclicity and primitivity
are needed in order to overcome the complications arising from this
0/1 structure. A prime application in the theory of Markov matrices is
the classification of states in terms of their graph-theoretical
properties, and the complete understanding of limit matrices in terms
of the analytic and structural (with respect to the underlying graph)
properties of the generating matrix.
A different perspective is found in the retrieval or recognition of
combinatorial notions such as above. In this line of research,
algorithms are sought to find the diameter of a graph, the largest
clique, a maximum matching, whether a second graph is
embeddable in the first, or whether a simple graph has a
Hamiltonian circuit or not. Some of these problems are tractable:
they are solvable in polynomial time, like the diameter and the matching
problems. Other problems are extremely hard, like the clique
problem and the problem of telling whether a given graph has a
Hamiltonian cycle or not. For these problems, no algorithm is known
that works in polynomial time, and the existence of such an
algorithm would imply that a very large class of hard problems
admits polynomial-time algorithms for their solution. Other problems
in this class are (the recognition version of) the Travelling Salesman
Problem, many network flow and scheduling problems, the
satisfiability problem for logic formulas, and (a combinatorial version
of) the clustering problem. Technically, these problems are all in both
the class known as NP, and the subclass of NP–complete problems.
Roughly speaking, the class NP consists of problems for which it can
be verified in polynomial time that a solution to the problem is
indeed what it claims to be. The NP–complete problems are a close-
knit party from a complexity point of view, in the sense that if any
NP–complete problem admits a polynomial–time solution, then so do
all NP–complete problems; see the standard references for an
extensive presentation of this subject. The subset of problems in NP
for which a polynomial
algorithm exists is known as P. The abbreviations NP and P stand for
nondeterministic polynomial and polynomial respectively.
Historically, the class NP was first introduced in terms of
nondeterministic Turing machines. The study of this type of problem
is generally listed under the denominator of combinatorial
optimization.
Graph partitioning and clustering
In graph partitioning, well–defined problems are studied which
closely resemble the problem of finding good clusterings, in the
settings of both simple and weighted graphs. Restricting
attention first to the bi-partitioning of a graph, let (S, Sc) denote a
partition of the vertex set V of a graph (V, E) and let Σ(S, Sc) be the
total sum of weights of edges between S and Sc, which is the cost
associated with (S, Sc). The maximum cut problem is to find a
partition (S, Sc) which maximizes Σ(S, Sc), and the minimum quotient
cut problem is to find a partition which minimizes
Σ(S, Sc)/min(|S|, |Sc|). The recognition versions of these two
problems (i.e. given a number L, is there a partition with a cost not
exceeding L?) are NP–complete; the optimization problems
themselves are NP–hard [62], which means that there is essentially
only one way known to prove that a given solution is optimal: list all
other solutions with their associated cost or yield. In the standard
graph partitioning problem, the sizes of the partition elements are
prescribed, and one is asked to minimize the total weight of the
edges connecting nodes in distinct subsets of the partition. Thus sizes
mi > 0, i = 1, ..., k, satisfying k < n and Σ mi = n are specified, where n is
the size of V, and a partition of V into subsets Si of size mi is sought
which minimizes the given cost function. The fact that this problem is
NP–hard (even for simple graphs and k = 2) implies that it is
inadvisable to seek exact (best) solutions. Instead, the general policy
is to devise good heuristics and approximation algorithms, and to
find bounds on the optimal solution. The role of a cost/yield function is then
mainly useful for comparison of different strategies with respect to
common benchmarks. The clustering problem in the setting of simple
graphs and weighted structured graphs is a kind of mixture of the
standard graph partitioning problem with the minimum quotient cut
problem, where the fact that the partition may freely vary is an
enormously complicating factor.
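Both objectives are easy to evaluate for a given bipartition, even though optimizing them is hard. A minimal Python sketch of ours (the weighted graph is assumed to be a dictionary mapping edge tuples (v, w) to weights, and S is one side of the bipartition of the n vertices):

```python
def cut_cost(weights, S):
    """Sigma(S, Sc): total weight of the edges crossing the bipartition."""
    S = set(S)
    return sum(wt for (v, w), wt in weights.items() if (v in S) != (w in S))

def quotient_cut(weights, S, n):
    """Objective of the minimum quotient cut problem:
    Sigma(S, Sc) / min(|S|, |Sc|)."""
    s = len(set(S))
    return cut_cost(weights, S) / min(s, n - s)
```

Evaluating a candidate partition is cheap; as noted above, certifying that it is optimal essentially requires enumerating all other partitions.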
Graph partitioning applications
The first large application area of graph partitioning (GP) is found in
the setting of sparse matrix multiplication, parallel computation, and
the numerical solving of PDEs (partial differential equations). Parallel
computation of sparse matrix multiplication requires the distribution
of rows to different processors. Minimizing the amount of
communication between processors is a graph partitioning problem.
The computation of so-called fill-reducing orderings of sparse
matrices for solving large systems of linear equations can also be
formulated as a GP–problem [74]. PDE solvers act on a mesh or grid,
and parallel computation of such a solver requires a partitioning of
the mesh, again in order to minimize communication between
processors.
The second large application area of graph partitioning is found in
‘VLSI CAD’, or Very Large Scale Integration (in) Computer Aided
Design. In the survey article [8] Alpert and Kahng list several sub fields
and application areas (citations below are from.
• Design packaging, which is the partitioning of logic arrays into
clustering. “This problem (...)is still the canonical partitioning
application; it arises not only in chip floor planning and placement,
but also at all other levels of the system design.”
• Partitioning in order to improve mappings from HDL2 descriptions
to gate– or cell–level net lists.
• Estimation of wiring requirements and system performance in high–
level synthesis and floor planning.
• Partitioning supports “the mapping of complex circuit designs onto
hundreds or even thousands of interconnected FPGA’s” (Field–
Programmable Gate Array).
• A good partition will “minimize the number of inter-block signals
that must be multiplexed onto the bus architecture of a hardware
simulator or mapped to the global interconnect architecture of a
hardware emulator”.
Summing up, partitioning is used to model the distribution of tasks
into groups, or the division of designs into modules, such that there is
a minimal degree of interdependency or connectivity. Other
application areas mentioned by Elsner are hypertext browsing,
geographic information services, and mapping of DNA sequences.
Clustering as a preprocessing step in graph partitioning
There is an important dichotomy in graph partitioning with respect to
the role of clustering. Clustering can be used as a preprocessing step
if natural groups are present, in terms of graph connectivity. This is
clearly not the case for meshes and grids, but it is the case in the VLSI
CAD applications introduced above, where the input is most often
described in terms of hypergraphs.
The currently prevailing methods in hypergraph partitioning for VLSI
circuits are so-called multilevel approaches, where hypergraphs are
first transformed into normal graphs by some heuristic. Subsequently,
the dimensionality of the graph is reduced by clustering; this is usually
referred to as coarsening. Each cluster is contracted to a single
node, and edge weights between the new nodes are computed in
terms of the edge weights between the old nodes. The reduced graph is
partitioned by (a combination of) dedicated techniques such as
spectral or move-based partitioning.
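A single coarsening step can be sketched as follows (an illustration of ours, not a production implementation; edges are tuples (v, w) with weights, and cluster_of maps every old node to its cluster label):

```python
from collections import defaultdict

def coarsen(weights, cluster_of):
    """Contract each cluster to one node; the weight between two coarse
    nodes is the total weight of the old edges between the two clusters."""
    coarse = defaultdict(float)
    for (v, w), wt in weights.items():
        cv, cw = cluster_of[v], cluster_of[w]
        if cv != cw:                         # intra-cluster edges disappear
            coarse[frozenset((cv, cw))] += wt
    return dict(coarse)
```

Partitioning the reduced graph and projecting the result back to the original nodes is the essence of the multilevel scheme.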
The statement that multi-level methods are prevalent in graph
partitioning (for VLSI circuits) was communicated by Charles J. Alpert
in private correspondence. Graph partitioning research is
experiencing rapid growth and development, and seems primarily
US-based. It is a non-trivial task to analyze and classify this research.
One reason for this is the high degree of cohesion: the number of
articles with any combination of the phrases multi-level, multi-way,
spectral, hypergraph, circuit, netlist, partitioning, and clustering is
huge — witness
e.g. [7, 9, 31, 32, 38, 39, 73, 75, 81, 101, 102, 138, 174]. Spectral
techniques are firmly established in graph partitioning, but allow
different approaches, e.g. different transformation steps, use of
either the Laplacian of a graph or its adjacency matrix, varying
numbers of eigenvectors used, and different ways of mapping
eigenspaces onto partitions. Moreover, spectral techniques are
usually part of a larger process such as the multi-level process
described here. Each step in the process has its own merits. Thus
there are a large number of variables, giving way to an impressive
proliferation of research. The spectral theorems which are
fundamental to the fields are given in Chapter 8, together with
simple generalizations towards the class of matrices central to this
thesis, diagonally symmetric matrices.
Clustering in weighted complete versus simple graphs
The MCL algorithm was designed to meet the challenge of finding
cluster structure in simple graphs. In the data model thus far
prevailing in cluster analysis, here termed a vector model (Section
2.3 in the previous chapter), entities are represented by numerical
scores on attributes. The models are obviously totally different, since
the vector model induces a (metric) Euclidean space, along with
geometric notions such as convexity, density, and centers of gravity
(i.e. vector averages), notions not existing in the (simple) graph
model. Additionally, in the vector model all dissimilarities are
immediately available, but not so in the (simple) graph model. In
fact, there is no notion of similarity between disconnected nodes
except for third party support, e.g. the number of paths of higher
length connecting two nodes.
Simple graphs are in a certain sense much closer to the space of
partitions (equivalently, clusterings) than vector–derived data. The
class of simple graphs which are disjoint unions of complete graphs
is in a natural 1–1 correspondence with the space of partitions. A
disjoint union of simple complete graphs allows only one sensible
clustering. This clustering is perfect for the graph, and conversely,
there is no other simple graph for which the corresponding partition
is better justified. Both partitions and simple graphs are generalized
in the notion of a hypergraph. However, there is no best
constellation of points in Euclidean space for a given partition. The
only candidate is a constellation where vectors in the same partition
element are infinitely close, and where vectors from different
partition elements are infinitely far away from each other.
On the other hand, a cluster hierarchy resulting from a hierarchical
cluster method (most classic methods are of this type) can be
interpreted as a tree. If the input data is Euclidean, then it is natural
to identify the tree with a tree metric (which satisfies the so-called
ultrametric inequality), and consider a cluster method 'good' if the
tree metric produced by it is close to the metric corresponding with
the input data. Hazewinkel proved that single link clustering is
optimal if the (metric) distance between two metrics is taken to be
the Lipschitz distance. It is noteworthy that the tree associated with
single link clustering can be derived from the minimal spanning tree
of a given dissimilarity space.
What is the relevance of the differences between the data models
for clustering in the respective settings? This is basically answered by
considering what it requires to apply a vector method to the (simple)
graph model, and vice versa, what it requires to apply a graph
method to the vector model. For the former, it is in general an
insurmountable obstacle that generalized (dis)similarities are too
expensive (in terms of space, and consequently also time) to
compute. Dedicated graph methods solve this problem either by
randomization (Section 5.4) or by a combination of localizing and
pruning generalized similarities.
More can be said about the reverse direction, application of graph
methods to vector data. Methods such as single link and complete
link clustering are easily formulated in terms of threshold graphs
associated with a given dissimilarity space. This is mostly a notational
convenience though, as the methods used for going from graph to
clustering are straightforward and geometrically motivated. For
intermediate to large threshold levels the corresponding threshold
graph simply becomes too large to be represented physically if the
dimension of the dissimilarity space is large. Proposals of a
combinatorial nature have been made towards clustering of
threshold graphs, but these are computationally highly infeasible.
Another type of graph encountered in the vector model is that of a
neighborhood graph satisfying certain constraints. Examples are the
Minimum Spanning Tree, the Gabriel
Graph, the Relative Neighborhood Graph, and the Delaunay
Triangulation. For example, the Gabriel Graph of a constellation of
vectors is defined as follows. For each pair of vectors (corresponding
with entities) a sphere is drawn which has as centre the midpoint of
the two vectors, with radius equal to half the Euclidean distance
between the two vectors. Thus, the coordinates of the two vectors
identify points in Euclidean space lying diametrically opposed on the
surface of the sphere. Now, two entities are connected in the
associated Gabriel Graph if no other vector lies in the sphere
associated with the entities. The step from neighborhood graph to
clustering is made by formulating heuristics for the removal of edges.
The connected components of the resulting graph are then
interpreted as clusters. The heuristics are in general local in nature
and are based upon knowledge and/or expectations regarding the
process via which the graph was constructed.
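Following the definition just given, a direct (purely illustrative, brute-force) construction of the Gabriel Graph can be sketched as follows:

```python
from itertools import combinations

def gabriel_graph(points):
    """Gabriel graph of a point constellation: points i and j are joined
    iff no third point lies inside the sphere whose diameter is the
    segment between them."""
    def d2(a, b):                        # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))
    edges = []
    for i, j in combinations(range(len(points)), 2):
        centre = [(x + y) / 2 for x, y in zip(points[i], points[j])]
        r2 = d2(points[i], points[j]) / 4          # squared radius
        if all(d2(points[k], centre) > r2
               for k in range(len(points)) if k not in (i, j)):
            edges.append((i, j))
    return edges

# three collinear points: the middle one blocks the long edge
print(gabriel_graph([(0, 0), (1, 0), (2, 0)]))  # [(0, 1), (1, 2)]
```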
The difficulties in applying graph methods to threshold graphs and
neighborhood graphs are illustrated in Section 10.6. The MCL
algorithm is tested for graphs derived by a threshold criterion from a
grid-like constellation of points, and is shown to produce undesirable
clustering’s under certain conditions. The basic problem is that
natural clusters with many elements and with relatively large
diameter place severe constraints on the parameterization of the
MCL process needed to retrieve them. This causes the MCL process
to become expensive in terms of space and time requirements.
Moreover, it is easy to find constellations such that the small space of
successful parameterizations disappears entirely. This experiment
clearly shows that the MCL algorithm is not well suited to certain
classes of graphs, but it is argued that these limitations apply to a
larger class of (tractable) graph clustering algorithms.
Vector clustering and graph clustering
Applications are numerous in cluster analysis, and these have led to
a bewildering multitude of methods. However, classic applications
generally have one thing in common: They assume that entities are
represented by vectors, describing how each entity scores on a set
of characteristics or features. The dissimilarity between two entities is
calculated as a distance between the respective vectors describing
them. This implies that the dissimilarity corresponds with a distance in
a Euclidean geometry and that it is very explicit; it is immediately
available. Geometric notions such as centre of gravity, convex hull,
and density naturally come into play. Graphs are objects that have
a much more combinatorial nature, as witnessed by notions such as
degree (number of neighbours), path, cycle, connectedness, et
cetera. Edge weights usually do not correspond with a distance or
proximity embeddable in a Euclidean geometry, nor need they
resemble a metric. I refer to the two settings respectively as vector
clustering and graph clustering.
The graph model and the vector model do not exclude one
another, but one model may inspire methods which are hard to
conceive in the other model, and less fit to apply there. The MCL
process was designed to meet the specific challenge of finding
cluster structure in simple graphs, where dissimilarity between
vertices is implicitly defined by the connectivity characteristics of the
graph. Vector methods have little to offer for the graph model,
except that a generic paradigm such as single link clustering can be
given meaning in the graph model as well. Conversely, many
methods in the vector model construct objects such as proximity
graphs and neighborhood graphs, mostly as a notational
convenience. For these it is interesting to investigate the conditions
under
which graph methods can be sensibly applied to such derived
graphs.
Applications in graph clustering
The current need for and development of cluster methods in the
setting of graphs is mostly found within the field of graph
partitioning, where clustering is also referred to as coarsening, and in
the area of pattern recognition. Clustering serves as a preprocessing
step in some partitioning methods. Pattern recognition sciences can
be viewed as the natural surroundings for the whole of cluster
analysis from an application point of view. Due to its fundamental
and generic cognitive nature the clustering problem occurs in
various solutions for pattern recognition tasks, often as an
intermediate processing step in layered methods. Another reason
why the pattern recognition sciences are interesting is that many
problems and methods are formulated in terms of graphs. This is
natural in view of the fact that higher-level conceptual information is
often stored in the form of graphs, possibly with typed arcs. There are
two stochastic/graph-based models which suggest some kinship with
the MCL algorithm: Hidden Markov models and Markov Random
Fields. The kinship is rather distant; the foremost significance is that it
demonstrates the versatility and power of stochastic graph theory
applied to recognition problems. With the advent of the wired1 and
hyperlinked era the graph model is likely to gain further in
significance.
Graph clustering and graph partitioning
Cluster analysis in the setting of graphs is closely related to the field
of graph partitioning, where methods are studied to find the optimal
partition of a graph given certain restrictions. The generic meaning
of 'clustering' is in principle exactly that of 'partition'. The difference
is that the semantics of 'clustering' change when combined with
adjectives such as 'overlapping' and 'hierarchic'. A partition is
strictly defined as a division of some set S into subsets satisfying a) all
pairs of subsets are disjoint and b) the union of all subsets yields S. In
graph partitioning the required partition sizes are specified in
advance. The objective is to minimize some cost function associated
with links connecting different partition elements. Thus the burden of
finding natural groups disappears. The graphs which are to be
partitioned are sometimes extremely homogeneous, like for example
meshes and grids. The concept of ‘natural groups’ is hard to apply to
such graphs. However, graphs possessing natural groups do occur in
graph partitioning: an example graph is given in the article A
computational study of graph partitioning [55]. If natural groups are
present, then a clustering of the graph may aid in finding a good
partition, if the granularity of the clustering is fine compared with the
granularity of the required partition sizes. There is a small and rather
isolated stream of publications on graph clustering in the setting of
circuit lay-out. One cause for this isolation is that the primary data
model is that of a hypergraph with nodes that can be of different
type; another is the cohesive nature of the research in this area.

An analysis between different algorithms for the graph vertex coloring problem An analysis between different algorithms for the graph vertex coloring problem
An analysis between different algorithms for the graph vertex coloring problem IJECEIAES
 
Graph Dynamical System on Graph Colouring
Graph Dynamical System on Graph ColouringGraph Dynamical System on Graph Colouring
Graph Dynamical System on Graph ColouringClyde Shen
 
cis98010
cis98010cis98010
cis98010perfj
 
Quantum persistent k cores for community detection
Quantum persistent k cores for community detectionQuantum persistent k cores for community detection
Quantum persistent k cores for community detectionColleen Farrelly
 
On algorithmic problems concerning graphs of higher degree of symmetry
On algorithmic problems concerning graphs of higher degree of symmetryOn algorithmic problems concerning graphs of higher degree of symmetry
On algorithmic problems concerning graphs of higher degree of symmetrygraphhoc
 
Exploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationExploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationChristopher Peter Makris
 

Ähnlich wie Graph cluster Theory (20)

Scalable Graph Clustering with Pregel
Scalable Graph Clustering with PregelScalable Graph Clustering with Pregel
Scalable Graph Clustering with Pregel
 
FREQUENT SUBGRAPH MINING ALGORITHMS - A SURVEY AND FRAMEWORK FOR CLASSIFICATION
FREQUENT SUBGRAPH MINING ALGORITHMS - A SURVEY AND FRAMEWORK FOR CLASSIFICATIONFREQUENT SUBGRAPH MINING ALGORITHMS - A SURVEY AND FRAMEWORK FOR CLASSIFICATION
FREQUENT SUBGRAPH MINING ALGORITHMS - A SURVEY AND FRAMEWORK FOR CLASSIFICATION
 
Introduction to graph theory (All chapter)
Introduction to graph theory (All chapter)Introduction to graph theory (All chapter)
Introduction to graph theory (All chapter)
 
PAGE NUMBER PROBLEM: AN EVALUATION OF HEURISTICS AND ITS SOLUTION USING A HYB...
PAGE NUMBER PROBLEM: AN EVALUATION OF HEURISTICS AND ITS SOLUTION USING A HYB...PAGE NUMBER PROBLEM: AN EVALUATION OF HEURISTICS AND ITS SOLUTION USING A HYB...
PAGE NUMBER PROBLEM: AN EVALUATION OF HEURISTICS AND ITS SOLUTION USING A HYB...
 
Imagethresholding
ImagethresholdingImagethresholding
Imagethresholding
 
Graph theory ppt.pptx
Graph theory ppt.pptxGraph theory ppt.pptx
Graph theory ppt.pptx
 
GraphSignalProcessingFinalPaper
GraphSignalProcessingFinalPaperGraphSignalProcessingFinalPaper
GraphSignalProcessingFinalPaper
 
Colloquium.pptx
Colloquium.pptxColloquium.pptx
Colloquium.pptx
 
Finding Top-k Similar Graphs in Graph Database @ ReadingCircle
Finding Top-k Similar Graphs in Graph Database @ ReadingCircleFinding Top-k Similar Graphs in Graph Database @ ReadingCircle
Finding Top-k Similar Graphs in Graph Database @ ReadingCircle
 
Unit II_Graph.pptxkgjrekjgiojtoiejhgnltegjte
Unit II_Graph.pptxkgjrekjgiojtoiejhgnltegjteUnit II_Graph.pptxkgjrekjgiojtoiejhgnltegjte
Unit II_Graph.pptxkgjrekjgiojtoiejhgnltegjte
 
An analysis between different algorithms for the graph vertex coloring problem
An analysis between different algorithms for the graph vertex coloring problem An analysis between different algorithms for the graph vertex coloring problem
An analysis between different algorithms for the graph vertex coloring problem
 
Graph Dynamical System on Graph Colouring
Graph Dynamical System on Graph ColouringGraph Dynamical System on Graph Colouring
Graph Dynamical System on Graph Colouring
 
cis98010
cis98010cis98010
cis98010
 
Graph clustering
Graph clusteringGraph clustering
Graph clustering
 
Quantum persistent k cores for community detection
Quantum persistent k cores for community detectionQuantum persistent k cores for community detection
Quantum persistent k cores for community detection
 
On algorithmic problems concerning graphs of higher degree of symmetry
On algorithmic problems concerning graphs of higher degree of symmetryOn algorithmic problems concerning graphs of higher degree of symmetry
On algorithmic problems concerning graphs of higher degree of symmetry
 
Exhaustive Combinatorial Enumeration
Exhaustive Combinatorial EnumerationExhaustive Combinatorial Enumeration
Exhaustive Combinatorial Enumeration
 
Exploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationExploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image Segmentation
 
Planted Clique Research Paper
Planted Clique Research PaperPlanted Clique Research Paper
Planted Clique Research Paper
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 

Kürzlich hochgeladen

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Kürzlich hochgeladen (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

Graph cluster Theory

  • 5. Schaeffer [205]) and re-order the vertices according to their respective clusters, we obtain a diagonalized version of the adjacency matrix, shown on the right. Now the cluster structure is evident: there are 17 dense clusters of varying orders and some sparse connections between the clusters. Matrix diagonalization is in itself an important application of clustering algorithms, as there are efficient computational methods available for processing diagonalized matrices, for example, to solve linear systems. Such computations in turn enable efficient algorithms for graph partitioning, as the graph partitioning problem can be written in the form of a set of linear equations. Graph partitioning aims to minimize the number of edges that cross from one subgroup of vertices to another, usually posing limits on the number of groups as well as on the relative sizes of the groups.
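To make the re-ordering step concrete, here is a minimal sketch in Python; it is not the clustering algorithm of [205] itself, only the final permutation step that produces the block-diagonal picture. The adjacency matrix is a plain list of lists, the cluster labels are assumed given, and all names are illustrative.

```python
def reorder_by_cluster(A, labels):
    """Permute rows and columns of adjacency matrix A (list of lists)
    so that vertices sharing a cluster label become contiguous."""
    order = sorted(range(len(labels)), key=lambda v: labels[v])
    return [[A[i][j] for j in order] for i in order]

# Example: vertices 0 and 2 form one cluster, vertex 1 another.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
print(reorder_by_cluster(A, labels=[0, 1, 0]))
```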
  • 6. Fig. 1 – The adjacency matrix of a 210-vertex graph with 1505 edges, composed of 17 dense clusters. On the left, the vertices are ordered randomly and the graph structure can hardly be observed. On the right, the vertex ordering is by cluster and the 17-cluster structure is evident. Each black dot corresponds to an element of the adjacency matrix that has the value one; the white areas correspond to elements with the value zero. Fig. 2 – Two graphs, both of which have 84 vertices and 358 edges. The graph on the left is a uniform random graph of the G_{n,m} model [84, 85] and the graph on the right has a relaxed caveman structure [228]. Both graphs were drawn with spring-force visualization [203].
  • 7. Generation models for clustered graphs. Gilbert presented in 1959 a process for generating uniform random graphs with n vertices: each of the n(n−1)/2 possible edges is included in the graph with probability p, considering each pair of vertices independently. In such uniform random graphs the degree distribution is Poissonian, and the presence of dense clusters is unlikely, as the edges are by construction distributed uniformly over the set of vertices. A generalization of the Gilbert model, especially designed to produce clusters, is the planted ℓ-partition model: a graph is generated with n = ℓ·k vertices that are partitioned into ℓ groups, each with k vertices. Two probability parameters p and q < p are used to construct the edge set: each pair of vertices in the same group shares an edge with the higher probability p, whereas each pair of vertices in different groups shares an edge with the lower probability q. The goal of the planted partition problem is to recover such a planted partition into ℓ clusters of k vertices each, instead of optimizing some measure on the partition. McSherry discusses also planted versions of other problems such as clique and graph coloring. Of the two graphs in Fig. 2, one is a uniform random graph and the other has a clearly clustered structure: the graph on the right is a relaxed caveman graph, generated with the caveman model. Caveman graphs were an early attempt in the social sciences to capture the clustering properties of social networks; they are produced by linking together a ring of small complete graphs, called “caves”, by moving one of the edges in each cave to point to another cave. These graphs are especially constructed to contain a clear cluster structure, possibly with a hierarchical structure where clusters can be further divided into natural sub-clusters.
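The planted ℓ-partition construction is straightforward to express in code. The sketch below follows the model as stated (ℓ groups of k vertices, intra-group probability p, inter-group probability q < p); the function and variable names are illustrative.

```python
import random

def planted_partition_graph(l, k, p, q, seed=None):
    """Planted l-partition model: n = l*k vertices split into l groups
    of k; same-group pairs share an edge with probability p,
    different-group pairs with the lower probability q < p."""
    rng = random.Random(seed)
    n = l * k
    group = [v // k for v in range(n)]        # vertex -> group index
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            prob = p if group[u] == group[v] else q
            if rng.random() < prob:
                edges.add((u, v))
    return n, edges, group

# Example: 3 planted clusters of 50 vertices each, dense inside, sparse between.
n, edges, group = planted_partition_graph(l=3, k=50, p=0.3, q=0.01, seed=42)
```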
  • 8. The procedure to create a relaxed caveman graph is the following: a connection probability p ∈ (0, 1] for the top level of the hierarchy is given as a parameter, together with a scaling coefficient s that adjusts the density of the lower-level caves. The minimum and maximum numbers of subcomponents (sub-caves at the higher levels, vertices at the bottom level) are also given as parameters. The generation procedure is recursive; the graph on the right in Fig. 2 is a first-level construction. Planted partition models and relaxed caveman graphs provide good and controllable artificial input data for evaluating and benchmarking clustering algorithms. The underlying cluster structure is determined by the construction, but by choosing lower values for the inter-cluster densities, the structure can be blurred, which increases the difficulty of clustering. The planted partition model is theoretically easier to handle. However, the relaxed caveman model incorporates the generation of a hierarchy and the possibility of generating unequal-sized clusters, both of which are often present in natural data.
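A rough sketch of such a recursive generator is given below. It follows the parameters named above (a top-level connection probability p, a scaling coefficient s that makes the lower levels denser, and minimum/maximum numbers of subcomponents), but the exact linking rule of the relaxed caveman model [228] differs in its details; treat this as an illustration only.

```python
import random

def relaxed_caveman(levels, n_min, n_max, p, s, rng=None):
    """Recursive sketch: at each level, between n_min and n_max
    subcomponents are generated, then linked pairwise with the current
    connection probability; the probability is scaled by s (capped at 1)
    on the way down, so the bottom-level caves come out densest."""
    rng = rng or random.Random()
    counter = [0]

    def build(level, prob):
        if level == 0:                        # bottom level: a single vertex
            counter[0] += 1
            return [counter[0] - 1], set()
        verts, edges, parts = [], set(), []
        for _ in range(rng.randint(n_min, n_max)):
            pv, pe = build(level - 1, min(1.0, prob * s))
            parts.append(pv)
            verts += pv
            edges |= pe
        for i, A in enumerate(parts):         # link subcomponents at this level
            for B in parts[i + 1:]:
                for a in A:
                    for b in B:
                        if rng.random() < prob:
                            edges.add((a, b))
        return verts, edges

    return build(levels, p)

verts, edges = relaxed_caveman(levels=2, n_min=3, n_max=5, p=0.1, s=5.0,
                               rng=random.Random(1))
```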
  • 9. Desirable cluster properties. Unfortunately, no single definition of a cluster in graphs is universally accepted, and the variants used in the literature are numerous. In the setting of graphs, each cluster should intuitively be connected: there should be at least one, preferably several, paths connecting each pair of vertices within a cluster. If a vertex u cannot be reached from a vertex v, they should not be grouped in the same cluster. Furthermore, the paths should be internal to the cluster: in addition to the vertex set C being connected in G, the subgraph induced by C should be connected in itself, meaning that it is not sufficient for two vertices v and u in C to be connected by a path that passes through vertices in V \ C; they also need to be connected by a path that only visits vertices included in C. As a consequence, when clustering a disconnected graph with known components, the clustering should usually be conducted on each component separately, unless some global restriction on the resulting clusters is imposed. In some applications one may wish to obtain clusters of similar order and/or density, in which case the clusters computed in one component also influence the clusterings of the other components. We classify the edges incident on v ∈ C into two groups: internal edges that connect v to other vertices also in C, and external edges that connect v to vertices that are not included in the cluster C:

deg_int(v, C) = |Γ(v) ∩ C|
deg_ext(v, C) = |Γ(v) ∩ (V \ C)|
deg(v) = deg_int(v, C) + deg_ext(v, C).   (21)

Clearly, deg_ext(v) = 0 implies that C containing v could be a good cluster, as v has no connections outside of it. Similarly, if deg_int(v) = 0, v should not be included in C, as it is not connected to any of the other vertices included. It is generally agreed upon that a subset of vertices forms a good cluster if the induced subgraph is dense but there are relatively few connections from the included vertices to vertices in the rest of the graph. One measure that helps to evaluate the sparsity of connections from the cluster to the rest of the graph is the cut size c(C, V \ C); the smaller the cut size, the better “isolated” the cluster. Determining when a cluster is dense is naturally achieved by computing the graph density. We refer to the density of the subgraph induced by the cluster as the internal or intra-cluster density:

δ_int(C) = |{{v, u} | v ∈ C, u ∈ C}| / ( |C| (|C| − 1) ).   (22)

(Figure: all cluster members are drawn in black and their internal edges are drawn thicker than the other edges of the graph. The cluster on the left is of good quality, dense and introvert. The one in the middle has the same number of internal edges, but many more to outside vertices, making it a worse cluster. The cluster on the right has very few connections outside, but lacks internal density and hence is not a good cluster.)

The internal density of a given clustering of a graph G into k clusters C1, C2, ..., Ck is the average of the internal densities of the included clusters: δ_int(G | C1, ..., Ck) = (1/k) Σ_{i=1}^{k} δ_int(C_i).
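The degree split of Eq. (21) and the intra-cluster density of Eq. (22) translate directly into code. In the sketch below, the graph is assumed to be stored as a dict mapping each vertex to its set of neighbours; the density is computed against the maximum of |C|(|C|−1)/2 undirected internal edges. The names are illustrative.

```python
def degree_split(adj, C, v):
    """Split deg(v) into internal and external parts w.r.t. cluster C
    (Eq. 21): deg_int = |Γ(v) ∩ C|, deg_ext = |Γ(v) \\ C|."""
    C = set(C)
    deg_int = sum(1 for u in adj[v] if u in C)
    deg_ext = len(adj[v]) - deg_int
    return deg_int, deg_ext

def internal_density(adj, C):
    """Intra-cluster density (Eq. 22): internal edges over the maximum
    |C|(|C|-1)/2 possible in the induced subgraph."""
    C = set(C)
    internal_edges = sum(1 for v in C for u in adj[v] if u in C) // 2
    possible = len(C) * (len(C) - 1) / 2
    return internal_edges / possible if possible else 0.0

def cut_size(adj, C):
    """c(C, V\\C): number of edges crossing from C to the rest of the graph."""
    C = set(C)
    return sum(1 for v in C for u in adj[v] if u not in C)
```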
  • 10. The external or inter-cluster density of a clustering is defined as the ratio of inter-cluster edges to the maximum number of inter-cluster edges possible, which is effectively the sum of the cut sizes of all the clusters, normalized to the range [0, 1]:

δ_ext(G | C1, ..., Ck) = |{{v, u} | v ∈ Ci, u ∈ Cj, i ≠ j}| / ( n (n − 1) − Σ_l |C_l| (|C_l| − 1) ).

Globally speaking, the internal density of a good clustering should be notably higher than the density of the graph δ(G), and the inter-cluster density of the clustering should be lower than the graph density [184]. Considering the above requirements of connectivity and density, the loosest possible definition of a graph cluster is that of a connected component, and the strictest definition is that each cluster should be a maximal clique (i.e. a subgraph into which no vertex could be added without losing the clique property). On most occasions, the semantically useful clusters lie somewhere between these two extremes. Connected components are easily computed in O(n + m) time with a breadth-first search, whereas clique detection is NP-complete. Typically, interesting clustering measures tend to correspond to NP-hard decision problems. When the input graph is very large, such as the Internet at router level or the Web graph (see Section 3.3.2), it is highly infeasible to rely on algorithms with exponential running time, as even linear-time computation gets tedious. It is not always clear whether each vertex should be assigned fully to a cluster, or whether it could instead have different “levels of membership” in several clusters. In document clustering, such a situation is easily imaginable: a document can be mainly about fishing, for example, but also address sailing-related issues, and hence could be clustered into “fishing” with a membership level of 0.9, for example, and into “sailing” with a level of 0.3. Another solution would be creating a super-cluster to include all documents related to fishing and sailing, but the downside is that there can be documents on fishing that have no relation to sailing whatsoever. For general clustering tasks, fuzzy clustering algorithms have been proposed, as well as validity measures.
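A corresponding computation of δ_ext, under the same adjacency-as-dict assumption and with ordered-pair counting in both numerator and denominator (matching the |C|(|C|−1) convention above), might look as follows.

```python
def external_density(adj, clusters):
    """δ_ext of a clustering: inter-cluster edges divided by the maximum
    possible number of inter-cluster edges. Ordered pairs are counted on
    both sides of the fraction, so crossing edges appear twice each."""
    cluster_of = {v: i for i, C in enumerate(clusters) for v in C}
    n = len(cluster_of)
    inter = sum(1 for v in adj for u in adj[v]
                if cluster_of[v] != cluster_of[u])          # ordered pairs
    possible = n * (n - 1) - sum(len(C) * (len(C) - 1) for C in clusters)
    return inter / possible if possible else 0.0
```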
  • 11. Within graph clustering, not much work can be found on fuzzy clustering, and in general, the past decade has been quiet in the area of fuzzy clustering. Yan and Hsiao present a fuzzy graph-clustering algorithm and apply it to circuit partitioning. A study on general clustering methods using fuzzy set theory is presented by Dave and Krishnapuram [68]. A fuzzy graph G_R = (V, R) is composed of a set of vertices and a fuzzy edge-relation R that is reflexive and symmetrical, together with a membership function µ_R that assigns to each fuzzy edge a level of “presence” in the graph. Different non-fuzzy graphs can be obtained by thresholding µ_R such that only those edges {v, u} for which µ_R(v, u) ≥ τ are included as edges in G_τ. The graph G_τ is called a cut graph of G_R. Dong et al. [75] present a clustering method based on a connectivity property of fuzzy graphs, assuming that the vertices represent a set of objects that is being clustered based on a distance measure. The algorithm first pre-clusters the data into sub-clusters based on the distance measure, after which a fuzzy graph is constructed for each sub-cluster and a cut graph of the resulting graph is used to define what constitutes a cluster. Dong et al. also discuss the modifications needed in the current clustering upon updates in the database that contains the objects to be clustered. Fuzzy clustering has not been established as a widely accepted approach for graph clustering, but it offers a more relaxed alternative for applications where assigning each vertex to just one cluster seems restrictive, while the vertex does relate more strongly to some of the candidate clusters than to others.
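Thresholding a fuzzy edge-relation into a cut graph G_τ is a one-liner; the sketch below assumes the membership function µ_R is stored as a dict from unordered vertex pairs to levels in [0, 1]. The names are illustrative.

```python
def cut_graph(vertices, mu, tau):
    """Threshold a fuzzy edge-relation: keep edge {v,u} iff mu >= tau.
    `mu` maps unordered vertex pairs (frozensets here) to membership
    levels in [0, 1]; the result is the non-fuzzy cut graph G_tau."""
    edges = {pair for pair, level in mu.items() if level >= tau}
    return vertices, edges

# Example: one strong and one weak fuzzy edge; tau = 0.5 keeps only {a, b}.
mu = {frozenset({'a', 'b'}): 0.9, frozenset({'b', 'c'}): 0.2}
V, E = cut_graph({'a', 'b', 'c'}, mu, tau=0.5)
```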
  • 12. Representations of clusters for different classes of graphs. It is common that in applications the graphs are not just simple, unweighted and undirected. If more than one edge is allowed between two vertices, instead of a binary adjacency matrix it is customary to use a matrix that determines, for each pair of vertices, how many edges they share. Graphs with such edge multiplicities are called multigraphs. Also, should the graph be weighted, cutting an important edge (with a large weight) when separating a cluster is to be punished more heavily than cutting a few unimportant edges (with very small weights). Edge multiplicities can in essence be treated as edge weights, but the situation naturally gets more complicated if the multiple edges themselves have weights. Luckily, many measures extend rather fluently to incorporate weights or multiplicities. It is especially easy when the possible values are confined to a known range, as this range can be transformed into the interval [0, 1], where one corresponds to a “full” edge, intermediate values to “partial” edges, and zero to there being no edge between two vertices. With such a transformation, we may compute density not by counting edges but by summing over the edge weights on the unit interval, to account for the degree of “presence” of the edges. Now a cluster of high density has either many edges or important edges, and a low-density cluster has either few or unimportant edges. It may be desirable to perform a nonlinear transformation from the original weight set to the unit interval to adjust the distribution of edge importance, if the clustering results obtained by the linear transformation appear noisy or otherwise unsatisfactory.
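The weight transformation and the resulting weighted density can be sketched as follows; the linear mapping to [0, 1] and the dict-of-edge-weights layout are our own illustrative choices.

```python
def normalize_weights(weights):
    """Map edge weights from their observed range onto [0, 1], so that 1
    is a 'full' edge and intermediate values are 'partial' edges."""
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1.0            # degenerate case: all weights equal
    return {e: (w - lo) / span for e, w in weights.items()}

def weighted_internal_density(norm_weights, C):
    """Weighted analogue of Eq. 22: sum the normalized weights of the
    internal edges instead of counting them."""
    C = set(C)
    total = sum(w for (u, v), w in norm_weights.items() if u in C and v in C)
    possible = len(C) * (len(C) - 1) / 2
    return total / possible if possible else 0.0
```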
  • 13. Bipartite graphs. A bipartite graph is a graph where the vertex set V can be split into two sets A and B such that all edges lie between those two sets: if {v, w} ∈ E, then either v ∈ A and w ∈ B, or v ∈ B and w ∈ A. Such graphs are natural for many application areas where the vertices represent two distinct classes of objects, such as customers and products; an edge could signify, for example, that a certain customer has bought a certain product. Possible clustering tasks could be grouping the customers by the types of products they purchase, or grouping products purchased by the same people; the motivation could be targeted marketing, for instance. Carrasco et al. study a graph of advertisers and keywords used in advertisements to identify submarkets by clustering. A bipartite graph G = (A ∪ B, E), with edges crossing only between A and B and not within, can be transformed into two graphs GA and GB. Consider two vertices v and w in A. As the graph is bipartite, Γ(v) ⊆ B as well as Γ(w) ⊆ B, and these two neighbourhoods may overlap. The more neighbours the two vertices in A share, the more “similar” they are. Hence, we create a graph GA = (A, EA) such that {v, w} ∈ EA if and only if Γ(v) ∩ Γ(w) ≠ ∅. Similarly, a graph GB results from connecting vertices whose neighbourhoods in A overlap. Weighted versions of GA and GB can be obtained by setting ω(v, w) = |Γ(v) ∩ Γ(w)|, possibly normalizing with the maximum degree of the graph. Clustering can be computed either for the original bipartite graph G or for the derived graphs GA and GB [41]. An intuition on how this works can be developed by thinking of the set A as the books in a bookstore and the set B as the customers of the store, the edges connecting a book to a customer if the customer has bought that book. Two customers have a similar taste in books if they have bought many of the same books, and two books appeal to a similar audience if several customers have purchased both. Hence the overlap of the neighbourhoods on the one side of the graph reflects the similarity of the vertices of the other side, and vice versa. Bipartite graphs often arise as representations of hypergraphs, that is, graphs where the edges are generalized to subsets of V of arbitrary order instead of being restricted to pairs of vertices; an example is factor graphs. Direct clustering of such hypergraphs is another problem of interest.
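The projection from the bipartite graph G to the weighted graph GA, with ω(v, w) = |Γ(v) ∩ Γ(w)| as above, can be sketched as follows; the input maps each vertex of A to its neighbourhood in B, and the names are illustrative.

```python
def project_bipartite(adj_A):
    """Project a bipartite graph G = (A ∪ B, E) onto the A side:
    adj_A maps each a ∈ A to its neighbourhood Γ(a) ⊆ B (a set).
    Two vertices of A are joined iff their neighbourhoods overlap,
    with weight ω(v, w) = |Γ(v) ∩ Γ(w)|."""
    nodes = list(adj_A)
    weights = {}
    for i, v in enumerate(nodes):
        for w in nodes[i + 1:]:
            overlap = len(adj_A[v] & adj_A[w])
            if overlap:
                weights[(v, w)] = overlap
    return weights

# Example: two customers who bought the same book become connected in G_A.
books_bought = {'ann': {'moby_dick', 'dune'}, 'bob': {'dune'}, 'eve': {'ulysses'}}
print(project_bipartite(books_bought))   # {('ann', 'bob'): 1}
```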
  • 14. Directed graphs. Up to now, we have been dealing with undirected graphs. Let us now turn to directed graphs, which require special attention, as the connections used in defining the clustering are asymmetrical, and so is the adjacency matrix. This causes relevant changes in the structure of the Laplacian matrix and hence makes spectral analysis more complex. Web graphs are directed graphs formed by web pages as vertices and hyperlinks as edges. A clustering of a higher-level web graph formed by all Chilean domains was presented by Virtanen. Clustering of web pages can help identify topics and group similar pages. This opens applications in search-engine technology; building artificial clusters is known to be a popular trick among websites of adult content to try to fool the PageRank algorithm [35] used by Google to rate the quality of websites. The basic PageRank algorithm assigns initially to each web page the same importance value, which is then iteratively distributed uniformly among all of its neighbours. The goal of the PageRank algorithm is to reach a value distribution close to a stationary converged state using relatively few iterations. The amount of “importance” left at each vertex at the end of the iteration becomes the PageRank of that vertex, the higher the better. A large cluster formation of N vertices can be constructed to “maintain” the initial values assigned within the cluster without letting them drain out, eventually accumulating them at a single target vertex that would hence receive a value close to N times the initial value assigned to each vertex when the iterative computation of PageRank is stopped, even though the true value after global convergence would be low. Identifying such clusters helps to overcome the problem.
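A much-simplified power-iteration sketch of the PageRank idea described above is given below. The damping factor d and the uniform handling of dangling pages are standard in descriptions of PageRank but are additions relative to the basic scheme sketched in the text; all names are illustrative.

```python
def pagerank(out_links, iters=50, d=0.85):
    """Simplified PageRank: start uniform, repeatedly distribute each
    page's value evenly over its out-links, damped by d."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in out_links}
        for v, targets in out_links.items():
            if targets:
                share = d * rank[v] / len(targets)
                for u in targets:
                    new[u] += share
            else:                       # dangling page: spread value uniformly
                for u in new:
                    new[u] += d * rank[v] / n
        rank = new
    return rank

web = {'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}
print(pagerank(web, iters=20))
```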
  • 15. Another example of directed graphs are web logs (widely known as blogs), web pages on which users can make comments. Identifying communities of users regularly commenting on each other is not only a task of clustering a directed graph, but also involves a temporal element: the blogs evolve over time, and hence the graph is under constant change as new links between existing blogs appear and new blogs are created. The blog graph can be viewed directly as the graph of the links between the blogs, or as a bipartite graph of blogs and users. Also the content of the online shared-effort encyclopedia Wikipedia forms an interesting directed graph. Graphs, structure, and optimization. The concept of a simple graph is tightly associated with the study of its structural properties. Examples are the study of regular objects such as distance-regular graphs, the study of graph spectra and graph invariants, and the theory of randomly generated graphs. The latter concept is introduced in some more detail, because it is used in defining a benchmark model for graph clustering in Chapter 12. A generic model via which a simple graph on n nodes is randomly generated is to independently realize each edge with some fixed probability p (a parameter in this model). Typical questions in this branch of random graph theory concern the behaviour of quantities such as the number of components, the girth, the chromatic number, the maximum clique size, and the diameter, as n goes to infinity. One direction is to hold p fixed and look at the behaviour of the quantity under consideration; another is to find bounds for p as a function of n such that almost every random graph generated by the model has a prescribed property (e.g. diameter k, k fixed). In the theory of nonnegative matrices, structural properties of simple graphs are taken into account, especially regarding the 0/1 structure of matrix powers. Concepts such as cyclicity and primitivity are needed in order to overcome the complications arising from this 0/1 structure.
  • 16. A prime application in the theory of Markov matrices is the classification of states in terms of their graph-theoretical properties, and the complete understanding of limit matrices in terms of the analytic and structural (with respect to the underlying graph) properties of the generating matrix. A different perspective is found in the retrieval or recognition of combinatorial notions such as those above. In this line of research, algorithms are sought to find the diameter of a graph, the largest clique, a maximum matching, whether a second graph is embeddable in the first, or whether a simple graph has a Hamiltonian circuit or not. Some of these problems are doable: they are solvable in polynomial time, like the diameter and the matching problems. Other problems are extremely hard, like the clique problem and the problem of telling whether a given graph has a Hamiltonian cycle or not. For these problems, no algorithm is known that works in polynomial time, and the existence of such an algorithm would imply that a very large class of hard problems admits polynomial-time algorithms for their solution. Other problems in this class are (the recognition version of) the Travelling Salesman Problem, many network flow and scheduling problems, the satisfiability problem for logic formulas, and (a combinatorial version of) the clustering problem. Technically, these problems are all in both the class known as NP and the subclass of NP-complete problems. Roughly speaking, the class NP consists of problems for which it can be verified in polynomial time that a solution to the problem is indeed what it claims to be. The NP-complete problems are a close-knit party from a complexity point of view, in the sense that if any NP-complete problem admits a polynomial-time solution, then so do all NP-complete problems; see the literature for an extensive presentation of this subject. The subset of problems in NP for which a polynomial algorithm exists is known as P. The abbreviations NP and P stand for nondeterministic polynomial and polynomial, respectively.
  • 17. Historically, the class NP was first introduced in terms of nondeterministic Turing machines. The study of this type of problems is generally listed under the denominator of combinatorial optimization. Graph partitioning and clustering. In graph partitioning, well-defined problems are studied which closely resemble the problem of finding good clusterings, in the settings of both simple and weighted graphs. Restricting attention first to the bi-partitioning of a graph, let (S, S^c) denote a partition of the vertex set V of a graph (V, E), and let Σ(S, S^c) be the total sum of the weights of edges between S and S^c, which is the cost associated with (S, S^c). The maximum cut problem is to find a partition (S, S^c) which maximizes Σ(S, S^c), and the minimum quotient cut problem is to find a partition which minimizes Σ(S, S^c)/min(|S|, |S^c|). The recognition versions of these two problems (i.e. given a number L, is there a partition with a cost not exceeding L?) are NP-complete; the optimization problems themselves are NP-hard [62], which means that there is essentially only one way known to prove that a given solution is optimal: list all other solutions with their associated cost or yield. In the standard graph partitioning problem, the sizes of the partition elements are prescribed, and one is asked to minimize the total weight of the edges connecting nodes in distinct subsets of the partition. Thus sizes m_i > 0, i = 1, ..., k, satisfying k < n and Σ m_i = n are specified, where n is the size of V, and a partition of V into subsets S_i of size m_i is sought which minimizes the given cost function. The fact that this problem is NP-hard (even for simple graphs and k = 2) implies that it is inadvisable to seek exact (best) solutions. Instead, the general policy is to devise good heuristics and approximation algorithms, and to find bounds on the optimal solution.
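Both objectives are easy to evaluate for a given bipartition. The sketch below assumes a dict from edge tuples to weights; it evaluates the cost Σ(S, S^c) and the quotient-cut objective, and does not attempt the (NP-hard) optimization itself.

```python
def cut_weight(weights, S):
    """Σ(S, S^c): total weight of edges with exactly one endpoint in S."""
    S = set(S)
    return sum(w for (u, v), w in weights.items() if (u in S) != (v in S))

def quotient_cut(weights, vertices, S):
    """Objective of the minimum quotient cut problem:
    Σ(S, S^c) / min(|S|, |S^c|)."""
    S = set(S)
    return cut_weight(weights, S) / min(len(S), len(vertices) - len(S))

# Example on a tiny weighted path graph.
w = {('a', 'b'): 2.0, ('b', 'c'): 1.0, ('c', 'd'): 2.0}
print(quotient_cut(w, {'a', 'b', 'c', 'd'}, {'a', 'b'}))   # 1.0 / 2 = 0.5
```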
  • 18. The role of a cost/yield function is then mainly useful for the comparison of different strategies with respect to common benchmarks. The clustering problem in the setting of simple graphs and weighted structured graphs is a kind of mixture of the standard graph partitioning problem and the minimum quotient cut problem, where the fact that the partition may freely vary is an enormously complicating factor. Graph partitioning applications. The first large application area of graph partitioning (GP) is found in the setting of sparse matrix multiplication, parallel computation, and the numerical solution of PDEs (partial differential equations). Parallel computation of sparse matrix multiplication requires the distribution of rows to different processors; minimizing the amount of communication between processors is a graph partitioning problem. The computation of so-called fill-reducing orderings of sparse matrices for solving large systems of linear equations can also be formulated as a GP problem [74]. PDE solvers act on a mesh or grid, and parallel computation of such a solver requires a partitioning of the mesh, again in order to minimize communication between processors. The second large application area of graph partitioning is found in ‘VLSI CAD’, or Very Large Scale Integration (in) Computer Aided Design. In the survey article [8], Alpert and Kahng list several subfields and application areas (the quotations below are taken from [8]).
  • Design packaging, which is the partitioning of logic arrays into clusters. “This problem (...) is still the canonical partitioning application; it arises not only in chip floorplanning and placement, but also at all other levels of the system design.”
  • Partitioning in order to improve mappings from HDL descriptions to gate- or cell-level netlists.
  • 19.
  • Estimation of wiring requirements and system performance in high-level synthesis and floorplanning.
  • Partitioning supports “the mapping of complex circuit designs onto hundreds or even thousands of interconnected FPGAs” (Field-Programmable Gate Arrays).
  • A good partition will “minimize the number of inter-block signals that must be multiplexed onto the bus architecture of a hardware simulator or mapped to the global interconnect architecture of a hardware emulator”.
  Summing up, partitioning is used to model the distribution of tasks in groups, or the division of designs into modules, such that there is a minimal degree of interdependency or connectivity. Other application areas, mentioned by Elsner, are hypertext browsing, geographic information services, and the mapping of DNA sequences. Clustering as a preprocessing step in graph partitioning. There is an important dichotomy in graph partitioning with respect to the role of clustering. Clustering can be used as a preprocessing step if natural groups are present in terms of graph connectivity. This is clearly not the case for meshes and grids, but it is the case in the VLSI CAD applications introduced above, where the input is most often described in terms of hypergraphs. The currently prevailing methods in hypergraph partitioning for VLSI circuits are so-called multilevel approaches, in which hypergraphs are first transformed into normal graphs by some heuristic. Subsequently, the dimensionality of the graph is reduced by clustering; this is usually referred to as coarsening. Each cluster is contracted to a single node, and edge weights between the new nodes are computed in terms of the edge weights between the old nodes.
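The contraction step just described can be sketched as follows; summing the original edge weights between two clusters to obtain the coarse edge weight is one common choice, assumed here for illustration.

```python
def coarsen(weights, cluster_of):
    """Contract each cluster to a single coarse node; the weight between
    two coarse nodes is the summed weight of all original edges between
    the corresponding clusters (intra-cluster edges disappear)."""
    coarse = {}
    for (u, v), w in weights.items():
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:
            key = (min(cu, cv), max(cu, cv))
            coarse[key] = coarse.get(key, 0.0) + w
    return coarse

# Example: clusters {a,b} -> 0 and {c,d} -> 1 leave one coarse edge.
w = {('a', 'b'): 1.0, ('b', 'c'): 0.5, ('c', 'd'): 1.0, ('a', 'c'): 0.5}
print(coarsen(w, {'a': 0, 'b': 0, 'c': 1, 'd': 1}))   # {(0, 1): 1.0}
```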
  • 20. The reduced graph is partitioned by (a combination of) dedicated techniques such as spectral or move-based partitioning. The statement that multilevel methods are prevalent in graph partitioning (for VLSI circuits) was communicated by Charles J. Alpert in private correspondence. Graph partitioning research experiences rapid growth and development, and seems primarily US-based. It is a non-trivial task to analyze and classify this research. One reason for this is the high degree of cohesion: the number of articles with any combination of the phrases multi-level, multi-way, spectral, hypergraph, circuit, netlist, partitioning, and clustering is huge; witness e.g. [7, 9, 31, 32, 38, 39, 73, 75, 81, 101, 102, 138, 174]. Spectral techniques are firmly established in graph partitioning, but allow different approaches, e.g. different transformation steps, use of either the Laplacian of a graph or its adjacency matrix, varying numbers of eigenvectors used, and different ways of mapping eigenspaces onto partitions. Moreover, spectral techniques are usually part of a larger process, such as the multilevel process described here. Each step in the process has its own merits. Thus there are a large number of variables, giving way to an impressive proliferation of research. The spectral theorems which are fundamental to the field are given in Chapter 8, together with simple generalizations towards the class of matrices central to this thesis, diagonally symmetric matrices.
  • 21. Clustering in weighted complete versus simple graphs. The MCL algorithm was designed to meet the challenge of finding cluster structure in simple graphs. In the data model thus far prevailing in cluster analysis, here termed the vector model (Section 2.3 in the previous chapter), entities are represented by numerical scores on attributes. The models are obviously totally different, since the vector model induces a (metric) Euclidean space, along with geometric notions such as convexity, density, and centres of gravity (i.e. vector averages), notions not existing in the (simple) graph model. Additionally, in the vector model all dissimilarities are immediately available, but not so in the (simple) graph model. In fact, there is no notion of similarity between disconnected nodes except for third-party support, e.g. the number of paths of higher length connecting two nodes. Simple graphs are in a certain sense much closer to the space of partitions (equivalently, clusterings) than vector-derived data. The classes of simple graphs which are disjoint unions of complete graphs are in a natural 1-1 correspondence with the space of partitions. A disjoint union of simple complete graphs allows only one sensible clustering: this clustering is perfect for the graph, and conversely, there is no other simple graph for which the corresponding partition is better justified. Both partitions and simple graphs are generalized in the notion of a hypergraph. However, there is no best constellation of points in Euclidean space for a given partition: the only candidate is a constellation where vectors in the same partition element are infinitely close, and where vectors from different partition elements are infinitely far away from each other. On the other hand, a cluster hierarchy resulting from a hierarchical cluster method (most classic methods are of this type) can be interpreted as a tree.
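The 1-1 correspondence is easy to realize in code: given a graph that is a disjoint union of complete graphs, its connected components are exactly the partition. The sketch below also verifies the clique property; the dict-of-neighbour-sets layout and the names are illustrative.

```python
def clusters_from_clique_union(adj):
    """If `adj` is a disjoint union of complete graphs, its connected
    components are the clusters of the unique sensible clustering;
    returns None if some component is not a clique."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                       # collect one connected component
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v])
        for v in comp:                     # clique check: full internal degree
            if len(adj[v]) != len(comp) - 1:
                return None
        seen |= comp
        clusters.append(comp)
    return clusters

two_cliques = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
print(clusters_from_clique_union(two_cliques))   # [{1, 2, 3}, {4, 5}]
```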
  • 22. If the input data is Euclidean, then it is natural to identify the tree with a tree metric (which satisfies the so-called ultrametric inequality), and to consider a cluster method ‘good’ if the tree metric produced by it is close to the metric corresponding with the input data. Hazewinkel proved that single link clustering is optimal if the (metric) distance between two metrics is taken to be the Lipschitz distance. It is noteworthy that the tree associated with single link clustering can be derived from the minimal spanning tree of a given dissimilarity space. What is the relevance of the differences between the data models for clustering in the respective settings? This is basically answered by considering what it requires to apply a vector method to the (simple) graph model, and vice versa, what it requires to apply a graph method to the vector model. For the former, it is in general an insurmountable obstacle that generalized (dis)similarities are too expensive (in terms of space, and consequently also time) to compute. Dedicated graph methods solve this problem either by randomization (Section 5.4) or by a combination of localizing and pruning generalized similarities. More can be said about the reverse direction, the application of graph methods to vector data. Methods such as single link and complete link clustering are easily formulated in terms of threshold graphs associated with a given dissimilarity space. This is mostly a notational convenience though, as the methods used for going from graph to clustering are straightforward and geometrically motivated. For intermediate to large threshold levels the corresponding threshold graph simply becomes too large to be represented physically if the dimension of the dissimilarity space is large. Proposals of a combinatorial nature have been made towards the clustering of threshold graphs, but these are computationally highly infeasible.
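The minimal-spanning-tree view of single link clustering can be sketched with a Kruskal-style union-find: merging pairs in order of increasing dissimilarity and stopping at a threshold yields exactly the connected components of the corresponding threshold graph. The flat-threshold interface below is an illustrative simplification of the full hierarchy.

```python
def single_link(dissim, points, threshold):
    """Single-link clusters at a given threshold: process pairs in order
    of dissimilarity (Kruskal style) and merge clusters while the joining
    distance stays below `threshold`."""
    parent = {p: p for p in points}

    def find(p):                            # union-find with path halving
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), d in sorted(dissim.items(), key=lambda kv: kv[1]):
        if d >= threshold:
            break                           # remaining (MST) edges are cut
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                 # merge the two clusters
    groups = {}
    for p in points:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())

d = {('x', 'y'): 0.1, ('y', 'z'): 0.9, ('x', 'z'): 0.95}
print(single_link(d, ['x', 'y', 'z'], threshold=0.5))   # [{'x', 'y'}, {'z'}]
```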
  • 23. Another type of graph encountered in the vector model is that of a neighbourhood graph satisfying certain constraints (see the literature for a detailed account). Examples are the Minimum Spanning Tree, the Gabriel Graph, the Relative Neighborhood Graph, and the Delaunay Triangulation. For example, the Gabriel Graph of a constellation of vectors is defined as follows. For each pair of vectors (corresponding with entities) a sphere is drawn which has as its centre the geometric mean of the two vectors, with radius equal to half the Euclidean distance between them. Thus, the coordinates of the two vectors identify points in Euclidean space lying diametrically opposed on the surface of the sphere. Now, two entities are connected in the associated Gabriel Graph if no other vector lies in the sphere associated with the entities. The step from neighbourhood graph to clustering is made by formulating heuristics for the removal of edges. The connected components of the resulting graph are then interpreted as clusters. The heuristics are in general local in nature and are based upon knowledge and/or expectations regarding the process via which the graph was constructed. The difficulties in applying graph methods to threshold graphs and neighbourhood graphs are illustrated in Section 10.6. The MCL algorithm is tested for graphs derived by a threshold criterion from a grid-like constellation of points, and is shown to produce undesirable clusterings under certain conditions. The basic problem is that natural clusters with many elements and with relatively large diameter place severe constraints on the parameterization of the MCL process needed to retrieve them. This causes the MCL process to become expensive in terms of space and time requirements. Moreover, it is easy to find constellations such that the small space of successful parameterizations disappears entirely. This experiment clearly shows that the MCL algorithm is not well suited to certain classes of graphs, but it is argued that these limitations apply to a larger class of (tractable) graph clustering algorithms.
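The Gabriel Graph definition above translates into a direct brute-force sketch: an edge {p, q} survives if no third point falls strictly inside the sphere drawn on the diameter pq. The names and the O(n³) approach are illustrative.

```python
def gabriel_graph(pts):
    """Gabriel graph of a point constellation: p and q are adjacent iff
    no third point lies strictly inside the sphere whose diameter is the
    segment pq (centre = midpoint, radius = |pq| / 2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    names = list(pts)
    edges = []
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            centre = [(x + y) / 2 for x, y in zip(pts[p], pts[q])]
            r2 = dist2(pts[p], pts[q]) / 4          # (|pq| / 2)^2
            if all(dist2(pts[r], centre) >= r2
                   for r in names if r not in (p, q)):
                edges.append((p, q))
    return edges

pts = {'a': (0, 0), 'b': (2, 0), 'c': (1, 0.1)}
print(gabriel_graph(pts))   # 'c' blocks the a-b sphere: [('a','c'), ('b','c')]
```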
  • 24. Vector clustering and graph clustering. Applications are numerous in cluster analysis, and these have led to a bewildering multitude of methods. However, classic applications generally have one thing in common: they assume that entities are represented by vectors, describing how each entity scores on a set of characteristics or features. The dissimilarity between two entities is calculated as a distance between the respective vectors describing them. This implies that the dissimilarity corresponds with a distance in a Euclidean geometry and that it is very explicit: it is immediately available. Geometric notions such as centre of gravity, convex hull, and density naturally come into play. Graphs are objects of a much more combinatorial nature, as witnessed by notions such as degree (number of neighbours), path, cycle, connectedness, et cetera. Edge weights usually do not correspond with a distance or proximity embeddable in a Euclidean geometry, nor need they resemble a metric. I refer to the two settings respectively as vector clustering and graph clustering. The graph model and the vector model do not exclude one another, but one model may inspire methods which are hard to conceive in the other model, and less fit to apply there. The MCL process was designed to meet the specific challenge of finding cluster structure in simple graphs, where dissimilarity between vertices is implicitly defined by the connectivity characteristics of the graph. Vector methods have little to offer for the graph model, except that a generic paradigm such as single link clustering can be given meaning in the graph model as well. Conversely, many methods in the vector model construct objects such as proximity graphs and neighbourhood graphs, mostly as a notational convenience. For these it is interesting to investigate the conditions under which graph methods can be sensibly applied to such derived graphs.
• 25. 25 which graph methods can be sensibly applied to such derived graphs.
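Since the MCL process is the recurring point of reference here, a toy sketch of the expansion/inflation alternation that drives it may help. This is a simplified illustration, not the canonical implementation: the inflation power r, the added self-loops, and the attractor-based reading of clusters from the limit matrix are all assumptions of this sketch.

```python
import numpy as np

def mcl(adjacency, r=2.0, max_iter=100, tol=1e-9):
    """Toy Markov Cluster iteration on a simple undirected graph.

    Expansion (matrix squaring) lets flow spread along longer paths;
    inflation (entrywise power r, then rescaling each column to sum
    to one) strengthens strong flows and weakens weak ones, so that
    cluster structure emerges in the limit matrix.
    """
    A = adjacency.astype(float) + np.eye(len(adjacency))  # self-loops
    M = A / A.sum(axis=0)                # make columns stochastic
    for _ in range(max_iter):
        previous = M
        M = M @ M                        # expansion
        M = M ** r                       # inflation, part 1
        M = M / M.sum(axis=0)            # inflation, part 2
        if np.abs(M - previous).max() < tol:
            break
    # Read clusters off the limit matrix: group each vertex with the
    # attractor row carrying most of its column's mass.
    clusters = {}
    for v in range(M.shape[1]):
        clusters.setdefault(int(M[:, v].argmax()), []).append(v)
    return list(clusters.values())

# Two triangles joined by a single bridge: two natural clusters.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(mcl(A))
```

On such a graph the iteration separates the two triangles; on the homogeneous, large-diameter constellations discussed above, by contrast, only a narrow range of inflation values (if any) recovers the natural clusters, which is exactly the limitation described.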
• 26. 26 Applications in graph clustering

The current need for and development of cluster methods in the setting of graphs is mostly found within the field of graph partitioning, where clustering is also referred to as coarsening, and in the area of pattern recognition. Clustering serves as a preprocessing step in some partitioning methods. The pattern recognition sciences can be viewed as the natural surroundings for the whole of cluster analysis from an application point of view. Due to its fundamental and generic cognitive nature, the clustering problem occurs in various solutions for pattern recognition tasks, often as an intermediate processing step in layered methods. Another reason why the pattern recognition sciences are interesting is that many problems and methods are formulated in terms of graphs. This is natural in view of the fact that higher-level conceptual information is often stored in the form of graphs, possibly with typed arcs. There are two stochastic graph-based models which suggest some kinship with the MCL algorithm: hidden Markov models and Markov random fields. The kinship is rather distant; its foremost significance is that it demonstrates the versatility and power of stochastic graph theory applied to recognition problems. With the advent of the wired and hyperlinked era, the graph model is likely to gain further in significance.

Graph clustering and graph partitioning

Cluster analysis in the setting of graphs is closely related to the field of graph partitioning, where methods are studied to find the optimal partition of a graph given certain restrictions. The generic meaning
• 27. 27 of ‘clustering’ is in principle exactly that of ‘partition’. The difference is that the semantics of ‘clustering’ change when combined with adjectives such as ‘overlapping’ and ‘hierarchic’. A partition is strictly defined as a division of some set S into subsets satisfying (a) all pairs of subsets are disjoint and (b) the union of all subsets yields S. In graph partitioning the required partition sizes are specified in advance. The objective is to minimize some cost function associated with links connecting different partition elements; a small sketch of both requirements and such a cost function follows below. Thus the burden of finding natural groups disappears. The graphs which are to be partitioned are sometimes extremely homogeneous, for example meshes and grids. The concept of ‘natural groups’ is hard to apply to such graphs. However, graphs possessing natural groups do occur in graph partitioning: the graph in Figure 2 is taken from the article A computational study of graph partitioning [55]. If natural groups are present, then a clustering of the graph may aid in finding a good partition, provided the granularity of the clustering is fine compared with the granularity of the required partition sizes. There is a small and rather isolated stream of publications on graph clustering in the setting of circuit layout. One cause for this isolation is that the primary data model is that of a hypergraph with nodes that can be of different types; another is the cohesive nature of the research in this area.
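The two defining conditions of a partition, and the kind of cut cost that graph partitioning minimizes, can be stated in a few lines of Python; the function names, the unit edge weights, and the example graph below are hypothetical illustrations.

```python
def is_partition(subsets, S):
    """Check conditions (a) and (b): the subsets are pairwise
    disjoint and their union yields exactly S."""
    seen = set()
    for part in subsets:
        if seen & part:      # (a) overlap with an earlier subset
            return False
        seen |= part
    return seen == S         # (b) the union must give back S

def cut_cost(edges, subsets):
    """Cost of a partition: the number of edges whose endpoints lie
    in different partition elements (unit edge weights assumed)."""
    block = {v: i for i, part in enumerate(subsets) for v in part}
    return sum(1 for u, v in edges if block[u] != block[v])

# Two triangles joined by one bridge; splitting at the bridge cuts
# a single edge, the natural minimum for two equal-sized parts.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
parts = [{0, 1, 2}, {3, 4, 5}]
print(is_partition(parts, set(range(6))))  # True
print(cut_cost(edges, parts))              # 1
```

A clustering whose granularity is fine relative to the required partition sizes can then seed such a partition, as noted above.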