3. Small World: everyone and everything is six or fewer steps away, by way of "friend of a friend" introductions, from any other person in the world.
Power Law: the degree distribution follows a long-tailed power law.
Community Structure: community groups based on common location, interests, occupations, etc. are quite common in real networks.
4. Detect community structure in large and complex networks.
Communities can be viewed as a summary of the whole network, and are therefore easy to visualize and analyze.
Communities provide important information for applications such as market segmentation and building recommender systems.
5. Network data is a graph structure made up of 'nodes' and the 'links' that connect them.
Network data tends to have a 'discrete' similarity matrix, while most clustering algorithms work on a "continuous" distance or similarity matrix.
Real-world networks are usually very large, which makes efficiency in both time and space a hard constraint.
7. There is no statistically precise definition so far.
Generally speaking, a community is a set of nodes densely connected internally.
Nodes in different communities are loosely connected.
8. A random network without real clustering structure should not be split (type 1 error: over-splitting).
Two weakly connected communities should not be merged (type 2 error: under-splitting).
Modern network data is usually huge, so space- and time-efficient clustering is needed.
9. Minimum-cut method (spectral clustering)
Hierarchical clustering
Girvan-Newman algorithm (betweenness)
Modularity maximization
Stochastic block model and its variants, including the mixed membership model
Finding maximal cliques
10. A measure of the strength of division of a network into clusters or communities:

Q = (1/2m) Σ_{i,j} [ A_ij - k_i k_j / (2m) ] δ(c_i, c_j)    (1)

where i and j denote nodes, c denotes the cluster membership vector, A_ij is the (i, j) entry in the adjacency matrix A, k_i is the degree of node i, and m is the total number of links in the network.
11. Degrees of the 7 nodes are: k = (3, 3, 3, 4, 3, 2, 2).
Total degree: 2m = 20, so m = 10.
The modularity matrix below has (i, j) entry B_ij = A_ij - k_i k_j / (2m):

 i\j     1      2      3      4      5      6      7
  1   -0.45   0.55   0.55   0.40  -0.45  -0.30  -0.30
  2    0.55  -0.45   0.55   0.40  -0.45  -0.30  -0.30
  3    0.55   0.55  -0.45   0.40  -0.45  -0.30  -0.30
  4    0.40   0.40   0.40  -0.80   0.40  -0.40  -0.40
  5   -0.45  -0.45  -0.45   0.40  -0.45   0.70   0.70
  6   -0.30  -0.30  -0.30  -0.40   0.70  -0.20   0.80
  7   -0.30  -0.30  -0.30  -0.40   0.70   0.80  -0.20

Nodes 1, 2, 3, 4 tend to form one community and nodes 5, 6, 7 another. The modularity Q based on this division is the sum of all within-community cells (green in the slide) of the modularity matrix divided by 2m: Q = 7.1 / 20 = 0.355.
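The worked example above can be verified numerically; a minimal sketch (the edge list below is inferred from the modularity matrix on this slide, using 0-based node labels):

```python
import numpy as np

# Edge list inferred from the modularity matrix above:
# triangle 1-2-3, node 4 linked to 1, 2, 3 and 5, triangle 5-6-7.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3),
         (3, 4), (4, 5), (4, 6), (5, 6)]
n = 7
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

k = A.sum(axis=1)                    # degrees (3, 3, 3, 4, 3, 2, 2)
m = A.sum() / 2                      # total links: 10, so 2m = 20
B = A - np.outer(k, k) / (2 * m)     # modularity matrix B_ij = A_ij - k_i k_j / 2m

c = np.array([0, 0, 0, 0, 1, 1, 1])  # division into {1,2,3,4} and {5,6,7}
same = c[:, None] == c[None, :]      # delta(c_i, c_j)
Q = B[same].sum() / (2 * m)
print(round(Q, 3))                   # 0.355
```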
12. High modularity implies dense connections inside communities and sparse connections between communities.
Approximate maximization algorithms:
• Greedy algorithms
• Simulated annealing
• Leading eigenvector
• Louvain method
• Ensemble learning (currently the fastest)
13. Benchmark model: simulate stochastic block network 1 with built-in cluster structures, where each cluster has 40 nodes.
Modularity-based clustering is run on random networks drawn from the stochastic block model.
The modularity maximization approach works well if clusters have similar sizes.
14. A random network without cluster structure may be split (Erdos-Renyi network).
Small clusters in a large network may be merged (resolution limit).
Multi-resolution methods may not reduce both types of error simultaneously.
This is a bottleneck of many other network clustering algorithms.
15. Erdos-Renyi network of 40 nodes, density 0.1.
Modularity-maximized clustering: Q = 0.37.
16. Stochastic Block Model 2: two small clusters have 20 nodes each, and the largest cluster has 100 nodes.
The largest cluster is split.
Modularity maximization algorithms tend to fail in networks with clusters of very different sizes.
Modularity-maximized clustering gives Q = 0.429.
17. Stochastic Block Model 3.
Cluster sizes: [800, 400, 50, 20]
Modularity method clustering results:
• 7 nodes of cluster 3 are merged into cluster 1
• All 20 nodes of cluster 4 are merged into cluster 1
18. Algorithm 1
◦ Global algorithm
◦ Cluster internal link density above a user-defined threshold is guaranteed
Algorithm 2
◦ Local algorithm
◦ Risk of splitting a cluster is quantified and under user control
◦ Risk of merging clusters is minimized
19. Objective Function:

Q_p(c) = (1/2m) Σ_{i≠j} (A_ij - p) (2 δ(c_i, c_j) - 1)    (2)

where p is a user-defined parameter in [0, 1], δ is the Kronecker delta symbol, A is the adjacency matrix, c is the community membership vector, and m is the total link count.

Reward table:

                                        Connected pair   Disconnected pair
                                        of nodes         of nodes
Pair of nodes in the same cluster           1 - p              -p
Pair of nodes in different clusters        -1 + p               p
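The reward table can be sketched as code; the pairwise term (A_ij - p)(2δ - 1) reproduces all four cells, and the 1/(2m) normalisation is an assumption that does not change which division maximizes the objective:

```python
import numpy as np

def p_clique_objective(A, c, p):
    """Reward of a division c under threshold p, following the reward
    table: each ordered pair (i, j), i != j, contributes (A_ij - p)
    when i and j share a cluster and -(A_ij - p) otherwise.  The
    1/(2m) normalisation is an assumption."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(c)
    m = A.sum() / 2                               # total link count
    sign = np.where(c[:, None] == c[None, :], 1.0, -1.0)
    np.fill_diagonal(sign, 0.0)                   # exclude i == j
    return ((A - p) * sign).sum() / (2 * m)
```

For a single link with both endpoints in one cluster this evaluates to 1 - p, the first cell of the reward table; putting the endpoints in different clusters flips the sign to -1 + p.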
20. It is guaranteed that every found community has internal link density higher than the user-defined threshold p.
◦ If p = 1, every found community is a clique.
◦ If p = 25%, every found community has internal link density higher than 25%.
◦ Communities with link density "significantly" higher than p will not be split.
◦ Communities with link density lower than p will definitely be split into smaller communities.
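The quantity being guaranteed can be checked with a small helper; a sketch (the function name is illustrative):

```python
import numpy as np

def internal_link_density(A, nodes):
    """Realised fraction of the size*(size-1)/2 possible links inside
    a community; the guarantee says this exceeds p for every found
    community."""
    sub = np.asarray(A)[np.ix_(nodes, nodes)]
    size = len(nodes)
    # sub.sum() counts each link twice; size*(size-1) counts each pair twice
    return sub.sum() / (size * (size - 1))

# A triangle is a clique: density 1 (the p = 1 case).
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(internal_link_density(triangle, [0, 1, 2]))  # 1.0
print(internal_link_density(path, [0, 1, 2]))      # 2/3
```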
21. Maximize objective function (2):

max_s  s^T (A - p(J - I)) s    (3)

where s is the n-by-1 vector of community membership with binary entries 1 or -1, A is the adjacency matrix, J is the all-ones matrix, and I is the identity matrix.
Searching over all possible divisions is NP-hard.
Approximate spectral method:
◦ Find the largest eigenvalue w of the p-clique matrix

B_p = A - p(J - I)    (4)

◦ Choose a corresponding eigenvector v of w
◦ Use the sign of v to split the network of n nodes
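A dense sketch of the spectral step for small networks, following (3) and (4):

```python
import numpy as np

def spectral_bipartition(A, p):
    """Bipartition by the sign of a leading eigenvector of the
    p-clique matrix B_p = A - p(J - I)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    B = A - p * (np.ones((n, n)) - np.eye(n))
    w, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    v = vecs[:, -1]                           # eigenvector of the largest w
    s = np.where(v >= 0, 1, -1)               # membership vector in {+1, -1}
    return w[-1], s

# Two 4-cliques joined by a single link separate cleanly at p = 0.5.
A = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
A[3, 4] = A[4, 3] = 1.0
w, s = spectral_bipartition(A, 0.5)
```

The sign of v is only defined up to a global flip, so the two sides of the split may swap labels between runs; only the grouping matters.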
22. s = sign(v) is the best approximate solution to (3).
If s^T B_p s > 0, division by v will be executed.
If s^T B_p s ≤ 0 but w > 0, division by v will still be executed.
If s^T B_p s ≤ 0 and w ≤ 0, division by v will be cancelled.
23. Python-scipy wrapper of the ARPACK software: iterative matrix-vector products find eigenvalues of large sparse or structured matrices.
B_p = A - p(J - I) is dense but structured.
The matrix-vector product B_p v = A v - p(J v - v) requires much less than the usual O(n^2) operations:
◦ The adjacency matrix A is usually sparse
◦ A v requires only O(m) operations
◦ J v = (Σ_i v_i) 1 requires only O(n) operations
◦ Time complexity: O(m + n) per iteration
◦ Space complexity: O(m + n) (applicable to huge graphs)
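The structured product B_p v can be handed to ARPACK through scipy's LinearOperator so the dense B_p is never formed; a sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

def leading_eigenpair(A_sparse, p):
    """Leading eigenpair of B_p = A - p(J - I) without forming it:
    B_p v = A v - p * (sum(v) * ones - v), costing O(m) for the
    sparse A v and O(n) for the J v term."""
    n = A_sparse.shape[0]
    ones = np.ones(n)

    def matvec(v):
        return A_sparse @ v - p * (v.sum() * ones - v)

    B = LinearOperator((n, n), matvec=matvec, dtype=float)
    w, vecs = eigsh(B, k=1, which='LA')       # largest algebraic eigenvalue
    return w[0], vecs[:, 0]

# Same two-clique toy network as before, now through the sparse route.
A = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
A[3, 4] = A[4, 3] = 1.0
w, v = leading_eigenpair(csr_matrix(A), 0.5)
```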
24. Usually it is hard to tell how many communities there are in a large network.
First split the network into two parts, then divide these two parts, and so forth.
Use the bipartition criteria on slide 21 as the stopping criteria of this recursive dividing procedure.
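The recursive dividing procedure can be sketched as follows, with a simplified stopping rule (stop when the sign vector leaves one side empty or does not improve the objective); the deck's exact bipartition criteria may differ:

```python
import numpy as np

def recursive_bipartition(A, p, nodes=None):
    """Recursive bisection with the p-clique spectral split.
    Simplified stopping rule: cancel a division when the sign vector s
    of the leading eigenvector puts every node on one side, or when it
    fails to improve the objective, i.e. s^T B_p s <= 0."""
    A = np.asarray(A, dtype=float)
    if nodes is None:
        nodes = np.arange(A.shape[0])
    if len(nodes) < 2:
        return [list(nodes)]
    n = len(nodes)
    B = A[np.ix_(nodes, nodes)] - p * (np.ones((n, n)) - np.eye(n))
    s = np.where(np.linalg.eigh(B)[1][:, -1] >= 0, 1, -1)
    if abs(s.sum()) == n or s @ B @ s <= 0:
        return [list(nodes)]                  # cancel the division
    return (recursive_bipartition(A, p, nodes[s > 0]) +
            recursive_bipartition(A, p, nodes[s < 0]))

# Two 4-cliques joined by one link: recovered as two communities.
A = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
A[3, 4] = A[4, 3] = 1.0
parts = recursive_bipartition(A, 0.5)
```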
25. Stochastic Block Network 2: two small clusters have 20 nodes each, and the largest cluster has 100 nodes; expected link density 0.1125.
[Figure: clustering results at p = 0.1, p = 0.05, and p = 0.02.]
26. Karate Club member data (34 people).
Link density: 0.139
[Figure: clustering results at p = 0.1 and p = 0.15.]
27. Doubtful Sound dolphin network (62 dolphins).
Link density: 0.084
[Figure: clustering results at p = 0.03 and p = 0.2.]
28. Increasing p: zoom in
◦ Smaller communities are found.
◦ Risk of merging clusters (type 2 error) is lower.
◦ Risk of splitting a cluster or Erdos-Renyi sub-network (type 1 error) is higher.
Decreasing p: zoom out
◦ Larger communities are found.
◦ Risk of merging clusters (type 2 error) is higher.
◦ Risk of splitting a cluster or Erdos-Renyi sub-network (type 1 error) is lower.
29. Objective: choose the parameter p such that at most 2.5% of the nodes in an Erdos-Renyi sub-network will be trimmed off.
Cause of type 1 error:
◦ Due to random fluctuation in link formation, 2.5% of the nodes have fewer than 0.975np links to the remaining 97.5% of the nodes.
◦ The threshold p is higher than the link density between the 2.5% group and the 97.5% group of nodes.
Strategy:
◦ Choose p to be significantly smaller than the observed total link density.
30. Solution:
(5)
Intuition:
◦ Use a truncated normal distribution to approximate the distribution of link density between the 2.5% group and the 97.5% group.
Experiment results:
◦ In 100 SBM networks, the type 1 error is bounded by 5% (mostly around 3.5%).
◦ In SBM networks with average degree less than 5, the type 1 error is less than 2%.
31. When will two clusters of the given sizes and between-cluster link probability be merged?
The risk of type 2 error will be bounded by 2.5% if inequality (6), stated in terms of the observed link density, holds.
(6)
32. Challenge:
◦ In splitting a sub-network, we usually do not know the link density within or between two clusters.
◦ In theory, there may be cases where inequalities (5) and (6) conflict.
Solution:
◦ Choose p to be the upper bound in (5).
◦ Develop a more flexible algorithm which allows p to vary from one sub-network to another. This may reduce the chance of a conflict between inequalities (5) and (6).
33. A measure of consistency between found communities R and real communities F:

NMI(R, F) = -2 Σ_{i=1}^{c_R} Σ_{j=1}^{c_F} N_ij log( N_ij n / (N_i. N_.j) )  /  [ Σ_i N_i. log(N_i. / n) + Σ_j N_.j log(N_.j / n) ]    (7)

where the numerator is twice the mutual information I (a Kullback-Leibler divergence), the denominator is the sum of the entropies H of the two partitions, N is the confusion matrix whose entry N_ij counts the nodes shared by found community i and real community j, n is the number of nodes, and c_R and c_F are the numbers of found and real communities.
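A sketch of NMI from confusion-matrix counts, assuming the common 2 I(R;F) / (H(R) + H(F)) normalisation for (7):

```python
import numpy as np
from collections import Counter

def nmi(found, real):
    """Normalised mutual information between a found labeling R and a
    real labeling F, built from confusion-matrix counts and using the
    2 I / (H(R) + H(F)) normalisation (an assumption)."""
    n = len(found)
    nr, nf = Counter(found), Counter(real)      # community sizes
    N = Counter(zip(found, real))               # confusion-matrix counts N_ij
    I = sum(c / n * np.log(c * n / (nr[r] * nf[f]))
            for (r, f), c in N.items())         # mutual information
    H = lambda sizes: -sum(c / n * np.log(c / n) for c in sizes.values())
    denom = H(nr) + H(nf)
    return 1.0 if denom == 0 else 2 * I / denom
```

A relabeling of the same partition scores 1; an unrelated partition scores 0.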
34. Review stochastic block networks 1 through 3 using NMI.
Results:
◦ Type 1 error is overly controlled for small and sparse networks such as SBM 1.

        size   Link density   Auto-chosen p   Average NMI    s.e.    Number of simulations
SBM 1    120      0.0723          0.0239          0.8484     0.0195          100
SBM 2    140      0.1125          0.0579          0.9483     0.0078          100
SBM 3   1270      0.0722          0.0574          0.9993     0.0001          100
35. Stochastic Block Model 4.
Cluster sizes: (100, 20, 20)
Expected link density: 0.1507
Auto-chosen parameter p from (5): 0.0888
Using the auto-chosen parameter p ends up merging the small clusters 2 and 3.
Clusters 2 and 3 will be divided only if we zoom in further by increasing p.
The modularity method not only merged clusters 2 and 3, but also split cluster 1.
[Figure: clustering results for the modularity method and for p = 0.0888.]
36. [Diagram: binary tree of recursive bipartitions. The root sub-network S0 splits into S1 and S2 under local threshold p(S0); S1 splits into C1 and S3 under p(S1); S2 splits into C2 and C3 under p(S2); S3 splits into C4 and C5 under p(S3).]
The observed link density and node count in each sub-network S determine its local threshold p(S).
37. Maximize the localized clique-index:
(8)
where T is the binary tree representing the hierarchical clustering process, p(S) is the automatically chosen local threshold parameter p for sub-network S, and the indicator variable records whether nodes i and j are divided in the bipartition of S.
38. Every bipartition of a sub-network S brings a contribution

s^T (A_S - p(S)(J - I)) s    (9)

The best bipartition is obtained from the sign of the leading eigenvector of the matrix

B_{p(S)} = A_S - p(S)(J - I)    (10)

where A_S is the adjacency matrix restricted to S. The bipartition of S will be cancelled if the contribution (9) is not positive.
39. Each matrix-vector product takes time O(m).
Finding the leading eigenvector takes O(n) matrix-vector products.
On average, the height of the binary tree representing the hierarchical clustering procedure is O(log(n)).
For both the global and the localized algorithm, the time complexity is O(mn log n), or O(n^2 log n) for a sparse network.
40. Stochastic Block Model 4.
Cluster sizes: (100, 20, 20)
Average NMI among 100 simulations is 0.9717.
The localized clustering algorithm is able to detect the built-in community structure.
41. Stochastic Block Model 5 with 7000 nodes and 10 built-in clusters.
Cluster sizes with internal link densities:
[(3000, 0.08), (2000, 0.09), (1000, 0.1), (400, 0.15), (200, 0.2), (100, 0.25), (100, 0.25), (100, 0.25), (80, 0.3), (20, 0.7)]
Link density between different clusters is 0.005.
Average NMI among 20 simulations is 0.9895.
Average running time: 1.66 seconds.
42. Stochastic Block Model 6 with 20000 nodes and 25 clusters.
Cluster sizes with internal link density:
[(3350, 0.045), (3000, 0.05),(2000, 0.07),(2000, 0.07),(2000, 0.07),
(1000, 0.09), (1000, 0.09), (1000, 0.09), (1000, 0.09), (500, 0.12),
(500, 0.12), (400, 0.14), (400, 0.14), (400, 0.14), (400, 0.14),
(200, 0.30), (200, 0.30), (200, 0.30), (100, 0.40), (100, 0.40),
(50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80)]
Link density between clusters: 0.0001
Average NMI among 10 simulations: 0.8960
Average running time: 12.6 seconds
43. Review of SBM networks 1 through 6:
Clustering quality is high for large networks or networks with high link density.

        Built-in clusters   size    Link density   Average NMI    s.e.    Number of simulations
SBM1            3             120      0.0723         0.8972     0.0195          100
SBM2            3             140      0.1125         0.9476     0.0051          100
SBM3            4            1200      0.0722         0.9687     0.0028          100
SBM4            3             140      0.0888         0.9717     0.0033          100
SBM5           10            7000      0.0285         0.9895     0.0022           20
SBM6           25           20000      0.005          0.8960     0.0029           10
44. Global Algorithm:
◦ Good for applications with specific requirements on the internal link density of every found community.
Localized Algorithm:
◦ Good for finding statistically significant communities.
◦ Type 1 error seems to be overly controlled for sparse networks.
◦ The conflict between type 1 error and type 2 error is effectively avoided in the simulated sample networks.
45. The Erdos-Renyi model may not serve as a good null model of random networks without built-in community structure. Statistically significant communities under other null models need consideration.
Extend the algorithm to directed networks, networks with numerical values in the adjacency matrix, and networks with additional profile information at each node.
Develop a close-to-linear-time clustering algorithm.