3. Small World: everyone and everything is six or fewer steps away, by way of "friend of a friend" introductions, from any other person in the world.
Power Law: the degree distribution follows a long-tailed power law.
Community Structure: community groups based on common location, interests, occupations, etc. are quite common in real networks.
4. Detect community structure in large and complex networks.
Communities can be viewed as a summary of the whole network, and are therefore easy to visualize and analyze.
Communities provide important information for applications such as market segmentation and building recommender systems.
5. Network data is a graph structure made up of 'nodes' and the 'links' that connect them.
Network data tends to have a 'discrete' similarity matrix, while most clustering algorithms work on a "continuous" distance or similarity matrix.
Real-world networks are usually very large, which makes efficiency in both time and space a hard constraint.
7. There is no statistically precise definition so far.
Generally speaking, a community is a set of nodes densely connected internally.
Nodes in different communities are loosely connected.
8. A random network without real clustering structure should not be split (type 1 error: over-splitting).
Two weakly connected communities should not be merged (type 2 error: under-splitting).
Modern network data is usually huge, so space- and time-efficient clustering is needed.
9. Minimum-cut method (spectral clustering)
Hierarchical clustering
Girvan-Newman algorithm (betweenness)
Modularity maximization
Stochastic block model and its variants, including the mixed membership model
Finding maximal cliques
10. A measure of the strength of division of a network into clusters or communities:

Q = (1/2m) Σ_{i,j} [ A_ij - k_i k_j / (2m) ] δ(c_i, c_j)    (1)

where i and j denote nodes, c denotes the cluster membership vector, A_ij is the (i, j) entry in the adjacency matrix A, k_i is the degree of node i, and m is the total number of links in the network.
11. Degrees of the 7 nodes are: k = (3, 3, 3, 4, 3, 2, 2).
Total degree: 2m = 20, so m = 10.
The modularity matrix below has (i, j) entry B_ij = A_ij - k_i k_j / (2m):

 i\j     1      2      3      4      5      6      7
  1   -0.45   0.55   0.55   0.40  -0.45  -0.30  -0.30
  2    0.55  -0.45   0.55   0.40  -0.45  -0.30  -0.30
  3    0.55   0.55  -0.45   0.40  -0.45  -0.30  -0.30
  4    0.40   0.40   0.40  -0.80   0.40  -0.40  -0.40
  5   -0.45  -0.45  -0.45   0.40  -0.45   0.70   0.70
  6   -0.30  -0.30  -0.30  -0.40   0.70  -0.20   0.80
  7   -0.30  -0.30  -0.30  -0.40   0.70   0.80  -0.20

Nodes 1, 2, 3, 4 tend to form one community and nodes 5, 6, 7 another. The modularity Q based on this division is the sum of all within-community cells (green in the slide) of the modularity matrix divided by 2m: Q = 7.1 / 20 = 0.355.
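The worked example above can be verified numerically; a minimal sketch (the edge list below is inferred from the modularity matrix on this slide, using 0-based node labels):

```python
import numpy as np

# Edge list inferred from the modularity matrix above:
# triangle 1-2-3, node 4 linked to 1, 2, 3 and 5, triangle 5-6-7.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3),
         (3, 4), (4, 5), (4, 6), (5, 6)]
n = 7
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

k = A.sum(axis=1)                    # degrees (3, 3, 3, 4, 3, 2, 2)
m = A.sum() / 2                      # total links: 10, so 2m = 20
B = A - np.outer(k, k) / (2 * m)     # modularity matrix B_ij = A_ij - k_i k_j / 2m

c = np.array([0, 0, 0, 0, 1, 1, 1])  # division into {1,2,3,4} and {5,6,7}
same = c[:, None] == c[None, :]      # delta(c_i, c_j)
Q = B[same].sum() / (2 * m)
print(round(Q, 3))                   # 0.355
```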
12. High modularity implies dense connections inside communities and sparse connections between communities.
Approximate maximization algorithms:
• Greedy algorithms
• Simulated annealing
• Leading eigenvector
• Louvain method
• Ensemble learning (currently the fastest)
13. Benchmark model: simulate stochastic block network 1 with built-in cluster structures, where each cluster has 40 nodes.
Modularity-based clustering is run on random networks drawn from the stochastic block model.
The modularity maximization approach works well if clusters have similar sizes.
14. A random network without cluster structure may be split (Erdos-Renyi network).
Small clusters in a large network may be merged (resolution limit).
Multi-resolution methods may not reduce both types of error simultaneously.
This is a bottleneck of many other network clustering algorithms.
15. Erdos-Renyi network of 40 nodes, density 0.1.
Modularity-maximized clustering: Q = 0.37.
16. Stochastic Block Model 2: two small clusters have 20 nodes each, and the largest cluster has 100 nodes.
The largest cluster is split.
Modularity maximization algorithms tend to fail in networks with clusters of very different sizes.
Modularity-maximized clustering gives Q = 0.429.
17. Stochastic Block Model 3.
Cluster sizes: [800, 400, 50, 20]
Modularity method clustering results:
• 7 nodes of cluster 3 are merged into cluster 1
• All 20 nodes of cluster 4 are merged into cluster 1
18. Algorithm 1
◦ Global algorithm
◦ Cluster internal link density above a user-defined threshold is guaranteed
Algorithm 2
◦ Local algorithm
◦ Risk of splitting a cluster is quantified and under user control
◦ Risk of merging clusters is minimized
19. Objective Function:

Q_p(c) = (1/2m) Σ_{i≠j} (A_ij - p) (2 δ(c_i, c_j) - 1)    (2)

where p is a user-defined parameter in [0, 1], δ is the Kronecker delta symbol, A is the adjacency matrix, c is the community membership vector, and m is the total link count.

Reward table:

                                        Connected pair   Disconnected pair
                                        of nodes         of nodes
Pair of nodes in the same cluster           1 - p              -p
Pair of nodes in different clusters        -1 + p               p
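The reward table can be sketched as code; the pairwise term (A_ij - p)(2δ - 1) reproduces all four cells, and the 1/(2m) normalisation is an assumption that does not change which division maximizes the objective:

```python
import numpy as np

def p_clique_objective(A, c, p):
    """Reward of a division c under threshold p, following the reward
    table: each ordered pair (i, j), i != j, contributes (A_ij - p)
    when i and j share a cluster and -(A_ij - p) otherwise.  The
    1/(2m) normalisation is an assumption."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(c)
    m = A.sum() / 2                               # total link count
    sign = np.where(c[:, None] == c[None, :], 1.0, -1.0)
    np.fill_diagonal(sign, 0.0)                   # exclude i == j
    return ((A - p) * sign).sum() / (2 * m)
```

For a single link with both endpoints in one cluster this evaluates to 1 - p, the first cell of the reward table; putting the endpoints in different clusters flips the sign to -1 + p.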
20. It is guaranteed that every found community has internal link density higher than the user-defined threshold p.
◦ If p = 1, every found community is a clique.
◦ If p = 25%, every found community has internal link density higher than 25%.
◦ Communities with link density "significantly" higher than p will not be split.
◦ Communities with link density lower than p will definitely be split into smaller communities.
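The quantity being guaranteed can be checked with a small helper; a sketch (the function name is illustrative):

```python
import numpy as np

def internal_link_density(A, nodes):
    """Realised fraction of the size*(size-1)/2 possible links inside
    a community; the guarantee says this exceeds p for every found
    community."""
    sub = np.asarray(A)[np.ix_(nodes, nodes)]
    size = len(nodes)
    # sub.sum() counts each link twice; size*(size-1) counts each pair twice
    return sub.sum() / (size * (size - 1))

# A triangle is a clique: density 1 (the p = 1 case).
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(internal_link_density(triangle, [0, 1, 2]))  # 1.0
print(internal_link_density(path, [0, 1, 2]))      # 2/3
```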
21. Maximize objective function (2):

max_s  s^T (A - p(J - I)) s    (3)

where s is the n-by-1 vector of community membership with binary entries 1 or -1, A is the adjacency matrix, J is the all-ones matrix, and I is the identity matrix.
Searching over all possible divisions is NP-hard.
Approximate spectral method:
◦ Find the largest eigenvalue w of the p-clique matrix

B_p = A - p(J - I)    (4)

◦ Choose a corresponding eigenvector v of w
◦ Use the sign of v to split the network of n nodes
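A dense sketch of the spectral step for small networks, following (3) and (4):

```python
import numpy as np

def spectral_bipartition(A, p):
    """Bipartition by the sign of a leading eigenvector of the
    p-clique matrix B_p = A - p(J - I)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    B = A - p * (np.ones((n, n)) - np.eye(n))
    w, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    v = vecs[:, -1]                           # eigenvector of the largest w
    s = np.where(v >= 0, 1, -1)               # membership vector in {+1, -1}
    return w[-1], s

# Two 4-cliques joined by a single link separate cleanly at p = 0.5.
A = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
A[3, 4] = A[4, 3] = 1.0
w, s = spectral_bipartition(A, 0.5)
```

The sign of v is only defined up to a global flip, so the two sides of the split may swap labels between runs; only the grouping matters.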
22. s = sign(v) is the best approximate solution to (3).
If s^T B_p s > 0, division by v will be executed.
If s^T B_p s ≤ 0 but w > 0, division by v will still be executed.
If s^T B_p s ≤ 0 and w ≤ 0, division by v will be cancelled.
23. Python-scipy wrapper of the ARPACK software: iterative matrix-vector products find eigenvalues of large sparse or structured matrices.
B_p = A - p(J - I) is dense but structured.
The matrix-vector product B_p v = A v - p(J v - v) requires much less than the usual O(n^2) operations:
◦ The adjacency matrix A is usually sparse
◦ A v requires only O(m) operations
◦ J v = (Σ_i v_i) 1 requires only O(n) operations
◦ Time complexity: O(m + n) per iteration
◦ Space complexity: O(m + n) (applicable to huge graphs)
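The structured product B_p v can be handed to ARPACK through scipy's LinearOperator so the dense B_p is never formed; a sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

def leading_eigenpair(A_sparse, p):
    """Leading eigenpair of B_p = A - p(J - I) without forming it:
    B_p v = A v - p * (sum(v) * ones - v), costing O(m) for the
    sparse A v and O(n) for the J v term."""
    n = A_sparse.shape[0]
    ones = np.ones(n)

    def matvec(v):
        return A_sparse @ v - p * (v.sum() * ones - v)

    B = LinearOperator((n, n), matvec=matvec, dtype=float)
    w, vecs = eigsh(B, k=1, which='LA')       # largest algebraic eigenvalue
    return w[0], vecs[:, 0]

# Same two-clique toy network as before, now through the sparse route.
A = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
A[3, 4] = A[4, 3] = 1.0
w, v = leading_eigenpair(csr_matrix(A), 0.5)
```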
24. Usually it is hard to tell how many communities there are in a large network.
First split the network into two parts, then divide these two parts, and so forth.
Use the bipartition criteria on slide 21 as the stopping criteria of this recursive dividing procedure.
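The recursive dividing procedure can be sketched as follows, with a simplified stopping rule (stop when the sign vector leaves one side empty or does not improve the objective); the deck's exact bipartition criteria may differ:

```python
import numpy as np

def recursive_bipartition(A, p, nodes=None):
    """Recursive bisection with the p-clique spectral split.
    Simplified stopping rule: cancel a division when the sign vector s
    of the leading eigenvector puts every node on one side, or when it
    fails to improve the objective, i.e. s^T B_p s <= 0."""
    A = np.asarray(A, dtype=float)
    if nodes is None:
        nodes = np.arange(A.shape[0])
    if len(nodes) < 2:
        return [list(nodes)]
    n = len(nodes)
    B = A[np.ix_(nodes, nodes)] - p * (np.ones((n, n)) - np.eye(n))
    s = np.where(np.linalg.eigh(B)[1][:, -1] >= 0, 1, -1)
    if abs(s.sum()) == n or s @ B @ s <= 0:
        return [list(nodes)]                  # cancel the division
    return (recursive_bipartition(A, p, nodes[s > 0]) +
            recursive_bipartition(A, p, nodes[s < 0]))

# Two 4-cliques joined by one link: recovered as two communities.
A = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
A[3, 4] = A[4, 3] = 1.0
parts = recursive_bipartition(A, 0.5)
```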
25. Stochastic Block Network 2: two small clusters have 20 nodes each, and the largest cluster has 100 nodes; expected link density 0.1125.
[Figure: clustering results at p = 0.1, p = 0.05, and p = 0.02.]
26. Karate Club member data (34 people).
Link density: 0.139
[Figure: clustering results at p = 0.1 and p = 0.15.]
27. Doubtful Sound dolphin network (62 dolphins).
Link density: 0.084
[Figure: clustering results at p = 0.03 and p = 0.2.]
28. Increasing p: zoom in
◦ Smaller communities are found.
◦ Risk of merging clusters (type 2 error) is lower.
◦ Risk of splitting a cluster or Erdos-Renyi sub-network (type 1 error) is higher.
Decreasing p: zoom out
◦ Larger communities are found.
◦ Risk of merging clusters (type 2 error) is higher.
◦ Risk of splitting a cluster or Erdos-Renyi sub-network (type 1 error) is lower.
29. Objective: choose the parameter p such that at most 2.5% of the nodes in an Erdos-Renyi sub-network will be trimmed off.
Cause of type 1 error:
◦ Due to random fluctuation in link formation, 2.5% of the nodes have fewer than 0.975np links to the remaining 97.5% of the nodes.
◦ The threshold p is higher than the link density between the 2.5% group and the 97.5% group of nodes.
Strategy:
◦ Choose p to be significantly smaller than the observed total link density.
30. Solution:
(5)
Intuition:
◦ Use a truncated normal distribution to approximate the distribution of link density between the 2.5% group and the 97.5% group.
Experiment results:
◦ In 100 SBM networks, the type 1 error is bounded by 5% (mostly around 3.5%).
◦ In SBM networks with average degree less than 5, the type 1 error is less than 2%.
31. When will two clusters of the given sizes and between-cluster link probability be merged?
The risk of type 2 error will be bounded by 2.5% if inequality (6), stated in terms of the observed link density, holds.
(6)
32. Challenge:
◦ In splitting a sub-network, we usually do not know the link density within or between two clusters.
◦ In theory, there may be cases where inequalities (5) and (6) conflict.
Solution:
◦ Choose p to be the upper bound in (5).
◦ Develop a more flexible algorithm which allows p to vary from one sub-network to another. This may reduce the chance of a conflict between inequalities (5) and (6).
33. A measure of consistency between found communities R and real communities F:

NMI(R, F) = -2 Σ_{i=1}^{c_R} Σ_{j=1}^{c_F} N_ij log( N_ij n / (N_i. N_.j) )  /  [ Σ_i N_i. log(N_i. / n) + Σ_j N_.j log(N_.j / n) ]    (7)

where the numerator is twice the mutual information I (a Kullback-Leibler divergence), the denominator is the sum of the entropies H of the two partitions, N is the confusion matrix whose entry N_ij counts the nodes shared by found community i and real community j, n is the number of nodes, and c_R and c_F are the numbers of found and real communities.
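A sketch of NMI from confusion-matrix counts, assuming the common 2 I(R;F) / (H(R) + H(F)) normalisation for (7):

```python
import numpy as np
from collections import Counter

def nmi(found, real):
    """Normalised mutual information between a found labeling R and a
    real labeling F, built from confusion-matrix counts and using the
    2 I / (H(R) + H(F)) normalisation (an assumption)."""
    n = len(found)
    nr, nf = Counter(found), Counter(real)      # community sizes
    N = Counter(zip(found, real))               # confusion-matrix counts N_ij
    I = sum(c / n * np.log(c * n / (nr[r] * nf[f]))
            for (r, f), c in N.items())         # mutual information
    H = lambda sizes: -sum(c / n * np.log(c / n) for c in sizes.values())
    denom = H(nr) + H(nf)
    return 1.0 if denom == 0 else 2 * I / denom
```

A relabeling of the same partition scores 1; an unrelated partition scores 0.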
34. Review stochastic block networks 1 through 3 using NMI.
Results:
◦ Type 1 error is overly controlled for small and sparse networks such as SBM 1.

        size   Link density   Auto-chosen p   Average NMI    s.e.    Number of simulations
SBM 1    120      0.0723          0.0239          0.8484     0.0195          100
SBM 2    140      0.1125          0.0579          0.9483     0.0078          100
SBM 3   1270      0.0722          0.0574          0.9993     0.0001          100
35. Stochastic Block Model 4.
Cluster sizes: (100, 20, 20)
Expected link density: 0.1507
Auto-chosen parameter p from (5): 0.0888
Using the auto-chosen parameter p ends up merging the small clusters 2 and 3.
Clusters 2 and 3 will be divided only if we zoom in further by increasing p.
The modularity method not only merged clusters 2 and 3, but also split cluster 1.
[Figure: clustering results for the modularity method and for p = 0.0888.]
36. [Diagram: binary tree of recursive bipartitions. The root sub-network S0 splits into S1 and S2 under local threshold p(S0); S1 splits into C1 and S3 under p(S1); S2 splits into C2 and C3 under p(S2); S3 splits into C4 and C5 under p(S3).]
The observed link density and node count in each sub-network S determine its local threshold p(S).
37. Maximize the localized clique-index:
(8)
where T is the binary tree representing the hierarchical clustering process, p(S) is the automatically chosen local threshold parameter p for sub-network S, and the indicator variable records whether nodes i and j are divided in the bipartition of S.
38. Every bipartition of a sub-network S brings a contribution

s^T (A_S - p(S)(J - I)) s    (9)

The best bipartition is obtained from the sign of the leading eigenvector of the matrix

B_{p(S)} = A_S - p(S)(J - I)    (10)

where A_S is the adjacency matrix restricted to S. The bipartition of S will be cancelled if the contribution (9) is not positive.
39. Each matrix-vector product takes time O(m).
Finding the leading eigenvector takes O(n) matrix-vector products.
On average, the height of the binary tree representing the hierarchical clustering procedure is O(log(n)).
For both the global and the localized algorithm, the time complexity is O(mn log n), or O(n^2 log n) for a sparse network.
40. Stochastic Block Model 4.
Cluster sizes: (100, 20, 20)
Average NMI among 100 simulations is 0.9717.
The localized clustering algorithm is able to detect the built-in community structure.
41. Stochastic Block Model 5 with 7000 nodes and 10 built-in clusters.
Cluster sizes with internal link densities:
[(3000, 0.08), (2000, 0.09), (1000, 0.1), (400, 0.15), (200, 0.2), (100, 0.25), (100, 0.25), (100, 0.25), (80, 0.3), (20, 0.7)]
Link density between different clusters is 0.005.
Average NMI among 20 simulations is 0.9895.
Average running time: 1.66 seconds.
42. Stochastic Block Model 6 with 20000 nodes and 25 clusters.
Cluster sizes with internal link density:
[(3350, 0.045), (3000, 0.05),(2000, 0.07),(2000, 0.07),(2000, 0.07),
(1000, 0.09), (1000, 0.09), (1000, 0.09), (1000, 0.09), (500, 0.12),
(500, 0.12), (400, 0.14), (400, 0.14), (400, 0.14), (400, 0.14),
(200, 0.30), (200, 0.30), (200, 0.30), (100, 0.40), (100, 0.40),
(50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80)]
Link density between clusters: 0.0001
Average NMI among 10 simulations: 0.8960
Average running time: 12.6 seconds
43. Review of SBM networks 1 through 6:
Clustering quality is high for large networks or networks with high link density.

        Built-in clusters   size    Link density   Average NMI    s.e.    Number of simulations
SBM1            3             120      0.0723         0.8972     0.0195          100
SBM2            3             140      0.1125         0.9476     0.0051          100
SBM3            4            1200      0.0722         0.9687     0.0028          100
SBM4            3             140      0.0888         0.9717     0.0033          100
SBM5           10            7000      0.0285         0.9895     0.0022           20
SBM6           25           20000      0.005          0.8960     0.0029           10
44. Global Algorithm:
◦ Good for applications with specific requirements on the internal link density of every found community.
Localized Algorithm:
◦ Good for finding statistically significant communities.
◦ Type 1 error seems to be overly controlled for sparse networks.
◦ The conflict between type 1 error and type 2 error is effectively avoided in the simulated sample networks.
45. The Erdos-Renyi model may not serve as a good null model of random networks without built-in community structure. Statistically significant communities under other null models need consideration.
Extend the algorithm to directed networks, networks with numerical values in the adjacency matrix, and networks with additional profile information at each node.
Develop a close-to-linear-time clustering algorithm.