Clique-based Network Clustering
Guang Ouyang 
Advisor: Dipak Dey 
1
 Facebook 
 LinkedIn 
 Internet 
 Instagram 
 Twitter 
 Google+ 
 Quora 
 WeChat 
 Stack Overflow 
 ResearchGate 
2
 Small World: everyone and everything is six or 
fewer steps away, by way of introduction, from 
any other person in the world. 
 Power Law: the degree distribution follows a 
long-tailed power law. 
 Community Structure: community groups based 
on common location, interests, occupations, etc. 
are quite common in real networks. 
3
 Detect community structure in large and complex 
networks. 
 A community partition can be viewed as a summary 
of the whole network, and is therefore easy to 
visualize and analyze. 
 Communities provide important information for 
applications such as market segmentation and 
building recommender systems. 
4
 Network data is a graph structure made up of 
‘nodes’ and the ‘links’ that connect them. 
 Network data tends to have a ‘discrete’ similarity 
matrix. 
 Most clustering algorithms work on a 
“continuous” distance or similarity matrix. 
 Real-world networks are usually very large; even 
O(n^2) time or space is unaffordable. 
5
 Edge list:[(1,2),(1,3),(3,4),(4,5),(5,3), (3,6),(6,1), 
(7,4), (6,7)]. 
 Adjacency matrix: 
       1 2 3 4 5 6 7 
    1  0 1 1 0 0 1 0 
    2  1 0 0 0 0 0 0 
    3  1 0 0 1 1 1 0 
    4  0 0 1 0 1 0 1 
    5  0 0 1 1 0 0 0 
    6  1 0 1 0 0 0 1 
    7  0 0 0 1 0 1 0 
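The conversion between the two representations is mechanical; a minimal sketch with numpy:

```python
# Build the symmetric adjacency matrix of the 7-node example from its edge list.
import numpy as np

edges = [(1, 2), (1, 3), (3, 4), (4, 5), (5, 3), (3, 6), (6, 1), (7, 4), (6, 7)]
n = 7
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1   # undirected: fill both entries
print(A.sum() // 2)  # 9 links
```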
6
No statistically precise definition exists so far. 
Generally speaking, a community is a set of nodes 
that are densely connected internally, while nodes 
in different communities are loosely connected. 
7
 A random network without real clustering structure 
should not be split (type 1 error of over-splitting). 
 Two weakly connected communities should not be 
merged (type 2 error of under-splitting). 
 Modern network data is usually huge, so space- 
and time-efficient clustering is needed. 
8
 Minimum-cut method (spectral clustering) 
 Hierarchical clustering 
 Girvan-Newman algorithm (betweenness) 
 Modularity maximization 
 Stochastic block model as well as variants 
including mixed membership model 
 Finding maximal clique 
9
 A measure of the strength of a division of a 
network into clusters or communities: 

Q = (1/2m) Σ_ij [ A_ij − k_i k_j / (2m) ] δ(c_i, c_j)   (1) 

where i and j denote nodes, c denotes cluster 
memberships, A_ij is the (i, j) entry of the adjacency 
matrix A, k_i is the degree of node i, and m is the 
total number of links in the network. 
10
11 
Degrees of the 7 nodes: k = (3, 3, 3, 4, 3, 2, 2). 
Total degree: 2m = 20. 
The modularity matrix below has (i, j) entry B_ij = A_ij − k_i k_j / (2m): 

 i\j    1      2      3      4      5      6      7 
  1   -0.45   0.55   0.55   0.4   -0.45  -0.3   -0.3 
  2    0.55  -0.45   0.55   0.4   -0.45  -0.3   -0.3 
  3    0.55   0.55  -0.45   0.4   -0.45  -0.3   -0.3 
  4    0.4    0.4    0.4   -0.8    0.4   -0.4   -0.4 
  5   -0.45  -0.45  -0.45   0.4   -0.45   0.7    0.7 
  6   -0.3   -0.3   -0.3   -0.4    0.7   -0.2    0.8 
  7   -0.3   -0.3   -0.3   -0.4    0.7    0.8   -0.2 

Nodes 1, 2, 3, 4 tend to form one community and 
nodes 5, 6, 7 another. The modularity Q of this 
division is the sum of all within-community (green) 
cells of the modularity matrix divided by 2m: 0.355
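This value can be checked numerically. The edge list below is inferred from the modularity-matrix entries above (B_ij = A_ij − k_i k_j / 2m with 2m = 20), so treat it as a reconstructed example rather than original slide data:

```python
# Verify Q = 0.355 for the division {1,2,3,4} vs {5,6,7}.
import numpy as np

edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
A = np.zeros((7, 7))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
k = A.sum(axis=1)                      # degrees (3, 3, 3, 4, 3, 2, 2)
m = A.sum() / 2                        # 10 links, so 2m = 20
B = A - np.outer(k, k) / (2 * m)       # modularity matrix
c = np.array([0, 0, 0, 0, 1, 1, 1])    # community labels of nodes 1..7
same = c[:, None] == c[None, :]        # within-community mask
Q = B[same].sum() / (2 * m)
print(round(Q, 3))  # 0.355
```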
High modularity implies dense connections inside 
communities and sparse connections between 
communities. 
Approximate maximization algorithms: 
• Greedy algorithms 
• Simulated annealing 
• Leading eigenvector 
• Louvain’s method 
• Ensemble learning (currently fastest) 
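As a concrete illustration, networkx ships the greedy (Clauset-Newman-Moore) maximizer from the list above; the karate-club graph is just a stand-in example network, not one of the slides' benchmarks:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()                 # 34-node example network
parts = greedy_modularity_communities(G)   # greedy agglomerative maximization
print(len(parts), round(modularity(G, parts), 3))
```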
12
Benchmark: Stochastic Block Network 1, 
simulated from a stochastic block model with 
built-in cluster structure. 
Each cluster has 40 nodes. 
Modularity-based clustering on a random 
network drawn from this stochastic block 
model. 
The modularity maximization approach works 
well if clusters have similar sizes. 
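A benchmark of this shape can be generated directly with networkx; the slide's actual within/between link probabilities are not recoverable from this transcript, so p_in = 0.3 and p_out = 0.05 are assumed values for illustration:

```python
# Sketch of a Stochastic Block Network with three built-in clusters of 40 nodes.
import networkx as nx

sizes = [40, 40, 40]
p_in, p_out = 0.3, 0.05          # assumed link probabilities
probs = [[p_in if i == j else p_out for j in range(3)] for i in range(3)]
G = nx.stochastic_block_model(sizes, probs, seed=0)
print(G.number_of_nodes())  # 120
```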
13
 A random network without cluster structure may be 
split (Erdos-Renyi network). 
 Small clusters in a large network may be merged 
(resolution limit). 
 Multi-resolution methods may not reduce both 
types of error simultaneously. 
 This is a bottleneck of many other network 
clustering algorithms. 
14
Erdos-Renyi network of 40 nodes, density 0.1. 
Modularity-maximized clustering: Q = 0.37. 
15
Stochastic Block Model 2: two small clusters of 
20 nodes each, and the largest cluster of 100 
nodes. 
The largest cluster is split. 
Modularity maximization algorithms tend to fail 
in networks with clusters of very different sizes. 
Modularity-maximized clustering with Q = 0.429. 
16
 Stochastic Block Model 3 
 Cluster sizes: [800, 400, 50, 20] 
 Modularity method clustering results: 
• 7 nodes of cluster 3 are merged into cluster 1 
• All 20 nodes of cluster 4 are merged into cluster 1 
17
 Algorithm 1 
◦ Global algorithm 
◦ Cluster internal link density above a user-defined 
threshold is guaranteed 
 Algorithm 2 
◦ Local algorithm 
◦ The risk of splitting a cluster is quantified and under 
user control 
◦ The risk of merging clusters is minimized 
18
Objective Function: 

Q_p(c) = (1/2m) Σ_{i≠j} (A_ij − p) [ 2 δ(c_i, c_j) − 1 ]   (2) 

where p is a user-defined parameter in [0,1], δ is the Kronecker delta, A is the 
adjacency matrix, c is the community membership vector, and m is the total link count. 
Reward table: 

                                      Connected pair   Disconnected pair 
Pair of nodes in the same cluster          1 − p              −p 
Pair of nodes in different clusters       −1 + p               p 
19
 It is guaranteed that every found community has an 
internal link density higher than the user-defined 
threshold p. 
◦ If p = 1, every found community is a clique. 
◦ If p = 25%, every community has internal link density 
higher than 25%. 
◦ Communities with link density “significantly” higher than p 
will not be split. 
◦ Communities with link density lower than p will definitely 
be split into smaller communities. 
20
 Maximize objective function (2): 

max_s  s^T [ A − p (J − I) ] s   (3) 

where s is an n-by-1 community membership vector with binary entries 1 or 
−1, A is the adjacency matrix, J is the all-ones matrix, and I is the identity matrix. 
 Searching over all possible divisions is NP-hard. 
 Approximate spectral method: 
◦ Find the largest eigenvalue w of the p-clique matrix 

B_p = A − p (J − I)   (4) 

◦ Choose a corresponding eigenvector v of w 
◦ Use the sign of v to split the network of n nodes 
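The three spectral steps can be sketched with SciPy's ARPACK interface; the helper name p_clique_split and the two-triangle test graph are illustrative, not from the slides:

```python
# Leading eigenvector of the p-clique matrix B_p = A - p(J - I), via ARPACK,
# using only matrix-vector products so B_p is never formed densely.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

def p_clique_split(A, p):
    n = A.shape[0]
    def matvec(v):
        # B_p v = A v - p * sum(v) * 1 + p * v  (an O(m + n) product)
        return A @ v - p * v.sum() + p * v
    Bp = LinearOperator((n, n), matvec=matvec, dtype=float)
    w, V = eigsh(Bp, k=1, which='LA')       # largest algebraic eigenvalue
    return V[:, 0] >= 0, w[0]               # split by the sign of the eigenvector

# Two triangles joined by a single edge split apart for moderate p:
rows, cols = zip(*[(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
A = csr_matrix((np.ones(7), (rows, cols)), shape=(6, 6))
A = A + A.T
side, w = p_clique_split(A, p=0.4)
group0 = side if side[0] else ~side          # the side containing node 0
print(sorted(np.flatnonzero(group0).tolist()))  # [0, 1, 2]
```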
21
 is the best approximate solution to (3) 
 If , division by v will be executed. 
 If , but , division by v will still be 
executed. 
 If , and , division by v will be 
cancelled. 
22
 Python/SciPy wrapper of the ARPACK software. 
 Iterative matrix-vector products find eigenvalues 
of large sparse or structured matrices. 
 B_p = A − p(J − I) is dense but structured. 
 The matrix-vector product B_p v requires much 
less than the usual O(n^2) operations: 
◦ The adjacency matrix is usually sparse 
◦ A v requires only O(m) operations 
◦ p(J − I) v requires only O(n) operations 
◦ Time complexity: O(m + n) per iteration 
◦ Space complexity: O(m + n) (applicable to huge graphs) 
23
 It is usually hard to tell how many communities 
there are in a large network. 
 First split the network into two parts, then divide 
these two parts, and so forth. 
 Use the bipartition criteria on slide 21 as the 
stopping criteria of this recursive dividing 
procedure. 
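The recursion can be sketched as follows. Since the precise slide-21 stopping criteria are only partially legible in this transcript, a simple stand-in rule is used (cancel a split whose objective contribution is not positive), and the dense eigendecomposition is only for small illustrative graphs:

```python
import numpy as np

def leading_split(A, p):
    # Sign of the leading eigenvector of B_p = A - p(J - I); dense, small graphs only.
    n = len(A)
    Bp = A - p * (np.ones((n, n)) - np.eye(n))
    vals, vecs = np.linalg.eigh(Bp)
    return vecs[:, -1] >= 0

def recursive_bipartition(A, p, nodes=None):
    if nodes is None:
        nodes = np.arange(len(A))
    if len(nodes) < 2:
        return [list(nodes)]
    sub = A[np.ix_(nodes, nodes)]
    side = leading_split(sub, p)
    if side.all() or (~side).all():
        return [list(nodes)]                 # degenerate split: keep as one community
    s = np.where(side, 1.0, -1.0)
    gain = s @ (sub - p * (np.ones_like(sub) - np.eye(len(nodes)))) @ s
    if gain <= 0:                            # stand-in stopping rule
        return [list(nodes)]
    return (recursive_bipartition(A, p, nodes[side]) +
            recursive_bipartition(A, p, nodes[~side]))

# Two 4-cliques joined by a single edge:
A = np.zeros((8, 8))
for i in range(4):
    for j in range(i + 1, 4):
        A[i, j] = A[j, i] = A[i + 4, j + 4] = A[j + 4, i + 4] = 1
A[3, 4] = A[4, 3] = 1
parts = recursive_bipartition(A, p=0.5)
print(sorted(sorted(int(v) for v in part) for part in parts))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```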
24
Stochastic Block Network 2: two small clusters of 
20 nodes each, the largest cluster of 100 nodes; 
expected link density 0.1125. 
Clustering results shown for p = 0.1, p = 0.05, 
and p = 0.02. 
25
 Karate Club Member data (34 people) 
 Link density: 0.139 
Results shown for p = 0.1 and p = 0.15. 
26
 Doubtful Sound Dolphin (62 dolphins) 
 Link density: 0.084 
Results shown for p = 0.03 and p = 0.2. 
27
 Increasing p: zoom in 
◦ Smaller communities are found. 
◦ Risk of merging clusters (type 2 error) is lower. 
◦ Risk of splitting a cluster or an Erdos-Renyi sub-network 
(type 1 error) is higher. 
 Decreasing p: zoom out 
◦ Larger communities are found. 
◦ Risk of merging clusters (type 2 error) is higher. 
◦ Risk of splitting a cluster or an Erdos-Renyi sub-network 
(type 1 error) is lower. 
28
 Objective: choose the parameter p such that at most 
2.5% of the nodes in an Erdos-Renyi sub-network will 
be trimmed off. 
 Cause of type 1 error: 
◦ Due to random fluctuation in link formation, 2.5% of the 
nodes have fewer than 0.975np links with the remaining 
97.5% of the nodes. 
◦ The threshold p is higher than the link density between the 
2.5% group and the 97.5% group of nodes. 
 Strategy: 
◦ Choose p to be significantly smaller than the observed total 
link density. 
29
 Solution: 
(5) 
 Intuition: 
◦ Use a truncated normal distribution to approximate the 
distribution of link density between the 2.5% group and 
the 97.5% group. 
 Experiment results: 
◦ In 100 SBM networks, the type 1 error is bounded by 5% 
(mostly below 3.5%). 
◦ In SBM networks with average degree less than 5, the 
type 1 error is less than 2%. 
30
 When will two clusters of given sizes and a given 
between-cluster link density be merged? 
 The risk of type 2 error will be bounded by 2.5% if 
(6) 
31
 Challenge: 
◦ When splitting a sub-network, we usually don’t know the 
link density within or between the two clusters. 
◦ In theory, there may be cases where inequalities (5) and 
(6) conflict. 
 Solution: 
◦ Choose p to be the upper bound in (5). 
◦ Develop a more flexible algorithm that allows p to vary 
from one sub-network to another. This may reduce the 
chance of a conflict between inequalities (5) and (6). 
32
 A measure of consistency between found 
communities R and real communities F: 
(7) 
where I is the mutual information (a Kullback-Leibler divergence), H is entropy, 
N is the confusion matrix, and the remaining symbols are the numbers of real 
and found communities. 
33 
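A small self-contained sketch of an NMI computation from the confusion matrix; the normalization used here (arithmetic mean of the two entropies) is an assumption, since the exact formula (7) is not legible in this transcript:

```python
import numpy as np

def nmi(real, found):
    # Confusion matrix N[a, b] = number of nodes in real community a and found community b.
    real, found = np.asarray(real), np.asarray(found)
    _, r = np.unique(real, return_inverse=True)
    _, f = np.unique(found, return_inverse=True)
    N = np.zeros((r.max() + 1, f.max() + 1))
    np.add.at(N, (r, f), 1)
    P = N / N.sum()                                            # joint distribution
    pr, pf = P.sum(axis=1), P.sum(axis=0)                      # marginals
    nz = P > 0
    I = (P[nz] * np.log(P[nz] / np.outer(pr, pf)[nz])).sum()   # mutual information
    H = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()         # entropy
    return 2 * I / (H(pr) + H(pf))

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: NMI ignores label permutations
```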
 Review of Stochastic Block Networks 1 through 3 
using NMI. 
 Results: 
◦ Type 1 error is overly controlled for small and sparse 
networks such as SBM 1. 
34 
         Size   Link density   Auto-chosen p   Average NMI    s.e.    Simulations 
SBM 1     120      0.0723          0.0239          0.8484     0.0195       100 
SBM 2     140      0.1125          0.0579          0.9483     0.0078       100 
SBM 3    1270      0.0722          0.0574          0.9993     0.0001       100
35 
Stochastic Block Model 4: cluster sizes (100, 20, 20); 
expected link density 0.1507; auto-chosen parameter 
p from (5): 0.0888. 
 Using the auto-chosen p ends up merging the small 
clusters 2 and 3. 
Clusters 2 and 3 are divided only if we zoom in 
further by increasing p. 
The modularity method not only merged clusters 2 
and 3, but also split cluster 1. 
(Panels: modularity result vs. result at p = 0.0888.)
36 
Binary tree of the recursive bipartition: the root 
sub-network S0 splits into S1 and S2; S1 splits into 
community C1 and sub-network S3; S2 splits into 
communities C2 and C3; S3 splits into communities 
C4 and C5. Each split uses a local threshold p(S) 
computed from the observed link density and node 
count of the sub-network S.
 Maximize the localized clique-index: 
(8) 
where T is the binary tree representing the 
hierarchical clustering process, p(S) is the automatic 
choice of the local threshold parameter p for 
sub-network S, and the indicator records whether 
nodes i and j are divided in the bipartition of S. 
37 
 Every bipartition of a sub-network S contributes: 
(9) 
 The best bipartition is obtained from the sign of the 
leading eigenvector of the matrix: 
(10) 
 The bipartition of S is cancelled if its contribution 
(9) is not positive. 
38 
 Each matrix-vector product takes O(m) time. 
 Finding the leading eigenvector takes O(n) 
matrix-vector products. 
 On average, the height of the binary tree 
representing the hierarchical clustering procedure is 
O(log(n)). 
 For both the global and the localized algorithm, the 
time complexity is O(mn log n), or O(n^2 log n) for a 
sparse network. 
39
 Stochastic Block Model 4, 
cluster sizes (100, 20, 20) 
 Average NMI over 100 
simulations is 0.9717 
 Localized clustering algorithm 
is able to detect the built-in 
community structure. 
40
 Stochastic Block Model 5 with 7000 nodes and 10 built-in clusters 
 Cluster sizes with internal link density: 
[(3000,0.08), (2000, 0.09), (1000, 0.1), (400,0.15), (200,0.2), (100, 
0.25), (100, 0.25), (100, 0.25), (80, 0.3), (20, 0.7)] 
 Link density between different clusters is 0.005 
 Average NMI over 20 simulations is 0.9895 
 Average Running time: 1.66 seconds 
41
 Stochastic Block Model 6 with 20000 nodes and 25 built-in clusters 
 Cluster sizes with internal link density: 
[(3350, 0.045), (3000, 0.05),(2000, 0.07),(2000, 0.07),(2000, 0.07), 
(1000, 0.09), (1000, 0.09), (1000, 0.09), (1000, 0.09), (500, 0.12), 
(500, 0.12), (400, 0.14), (400, 0.14), (400, 0.14), (400, 0.14), 
(200, 0.30), (200, 0.30), (200, 0.30), (100, 0.40), (100, 0.40), 
(50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80)] 
 Link density between clusters: 0.0001 
 Average NMI over 10 simulations: 0.8960 
 Average running time: 12.6 seconds 
42
 Review of SBM Networks 1 through 6: 
 Clustering quality is high for large networks and for 
networks with high link density. 
43 
        Built-in clusters    Size    Link density   Average NMI    s.e.    Simulations 
SBM 1          3              120       0.0723         0.8972      0.0195      100 
SBM 2          3              140       0.1125         0.9476      0.0051      100 
SBM 3          4             1200       0.0722         0.9687      0.0028      100 
SBM 4          3              140       0.0888         0.9717      0.0033      100 
SBM 5         10             7000       0.0285         0.9895      0.0022       20 
SBM 6         25            20000       0.005          0.8960      0.0029       10
 Global Algorithm: 
◦ Good for applications with specific requirements on the 
internal link density of every found community. 
 Localized Algorithm: 
◦ Good for finding statistically significant communities. 
◦ Type 1 error seems to be overly controlled for sparse 
networks. 
◦ The conflict between type 1 and type 2 errors is 
effectively avoided in the simulated networks. 
44
 The Erdos-Renyi model may not serve as a good null 
model of a random network without built-in 
community structure. Statistically significant 
communities under other null models need 
consideration. 
 Extend the algorithm to directed networks, networks 
with numerical values in the adjacency matrix, and 
networks with additional profile information on each 
node. 
 Develop a close-to-linear-time clustering algorithm. 
45

Weitere ähnliche Inhalte

Was ist angesagt?

Community detection
Community detectionCommunity detection
Community detectionScott Pauls
 
Scalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithmScalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithmNavid Sedighpour
 
Network sampling, community detection
Network sampling, community detectionNetwork sampling, community detection
Network sampling, community detectionroberval mariano
 
Community detection in graphs
Community detection in graphsCommunity detection in graphs
Community detection in graphsNicola Barbieri
 
Action and content based Community Detection in Social Networks
Action and content based Community Detection in Social NetworksAction and content based Community Detection in Social Networks
Action and content based Community Detection in Social Networksritesh_11
 
Community detection in complex social networks
Community detection in complex social networksCommunity detection in complex social networks
Community detection in complex social networksAboul Ella Hassanien
 
Entropy based algorithm for community detection in augmented networks
Entropy based algorithm for community detection in augmented networksEntropy based algorithm for community detection in augmented networks
Entropy based algorithm for community detection in augmented networksJuan David Cruz-Gómez
 
Social network analysis basics
Social network analysis basicsSocial network analysis basics
Social network analysis basicsPradeep Kumar
 
Exploratory social network analysis with pajek
Exploratory social network analysis with pajekExploratory social network analysis with pajek
Exploratory social network analysis with pajekTHomas Plotkowiak
 
Graph Community Detection Algorithm for Distributed Memory Parallel Computing...
Graph Community Detection Algorithm for Distributed Memory Parallel Computing...Graph Community Detection Algorithm for Distributed Memory Parallel Computing...
Graph Community Detection Algorithm for Distributed Memory Parallel Computing...Alexander Pozdneev
 
CS6010 Social Network Analysis Unit III
CS6010 Social Network Analysis   Unit IIICS6010 Social Network Analysis   Unit III
CS6010 Social Network Analysis Unit IIIpkaviya
 
Taxonomy and survey of community
Taxonomy and survey of communityTaxonomy and survey of community
Taxonomy and survey of communityIJCSES Journal
 
Community Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief OverviewCommunity Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief OverviewSatyaki Sikdar
 
Overlapping community detection survey
Overlapping community detection surveyOverlapping community detection survey
Overlapping community detection survey煜林 车
 
Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU...
Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU...Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU...
Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU...Denis Parra Santander
 
Community detection in social networks
Community detection in social networksCommunity detection in social networks
Community detection in social networksFrancisco Restivo
 

Was ist angesagt? (20)

Community detection
Community detectionCommunity detection
Community detection
 
Scalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithmScalable community detection with the louvain algorithm
Scalable community detection with the louvain algorithm
 
Network sampling, community detection
Network sampling, community detectionNetwork sampling, community detection
Network sampling, community detection
 
Community detection in graphs
Community detection in graphsCommunity detection in graphs
Community detection in graphs
 
Action and content based Community Detection in Social Networks
Action and content based Community Detection in Social NetworksAction and content based Community Detection in Social Networks
Action and content based Community Detection in Social Networks
 
06 Community Detection
06 Community Detection06 Community Detection
06 Community Detection
 
Community detection in complex social networks
Community detection in complex social networksCommunity detection in complex social networks
Community detection in complex social networks
 
Entropy based algorithm for community detection in augmented networks
Entropy based algorithm for community detection in augmented networksEntropy based algorithm for community detection in augmented networks
Entropy based algorithm for community detection in augmented networks
 
Social network analysis basics
Social network analysis basicsSocial network analysis basics
Social network analysis basics
 
Community detection
Community detectionCommunity detection
Community detection
 
Exploratory social network analysis with pajek
Exploratory social network analysis with pajekExploratory social network analysis with pajek
Exploratory social network analysis with pajek
 
17 Statistical Models for Networks
17 Statistical Models for Networks17 Statistical Models for Networks
17 Statistical Models for Networks
 
Graph Community Detection Algorithm for Distributed Memory Parallel Computing...
Graph Community Detection Algorithm for Distributed Memory Parallel Computing...Graph Community Detection Algorithm for Distributed Memory Parallel Computing...
Graph Community Detection Algorithm for Distributed Memory Parallel Computing...
 
CS6010 Social Network Analysis Unit III
CS6010 Social Network Analysis   Unit IIICS6010 Social Network Analysis   Unit III
CS6010 Social Network Analysis Unit III
 
Taxonomy and survey of community
Taxonomy and survey of communityTaxonomy and survey of community
Taxonomy and survey of community
 
Community Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief OverviewCommunity Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief Overview
 
11 Keynote (2017)
11 Keynote (2017)11 Keynote (2017)
11 Keynote (2017)
 
Overlapping community detection survey
Overlapping community detection surveyOverlapping community detection survey
Overlapping community detection survey
 
Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU...
Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU...Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU...
Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU...
 
Community detection in social networks
Community detection in social networksCommunity detection in social networks
Community detection in social networks
 

Ähnlich wie Clique-based Network Clustering

A Proposed Algorithm to Detect the Largest Community Based On Depth Level
A Proposed Algorithm to Detect the Largest Community Based On Depth LevelA Proposed Algorithm to Detect the Largest Community Based On Depth Level
A Proposed Algorithm to Detect the Largest Community Based On Depth LevelEswar Publications
 
community Detection.pptx
community Detection.pptxcommunity Detection.pptx
community Detection.pptxBhuvana97
 
Higher-order clustering coefficients
Higher-order clustering coefficientsHigher-order clustering coefficients
Higher-order clustering coefficientsAustin Benson
 
SDC: A Distributed Clustering Protocol
SDC: A Distributed Clustering ProtocolSDC: A Distributed Clustering Protocol
SDC: A Distributed Clustering ProtocolCSCJournals
 
MSCX2023_Sergio Gomez_PartI
MSCX2023_Sergio Gomez_PartIMSCX2023_Sergio Gomez_PartI
MSCX2023_Sergio Gomez_PartImscx
 
Neural Networks in Data Mining - “An Overview”
Neural Networks  in Data Mining -   “An Overview”Neural Networks  in Data Mining -   “An Overview”
Neural Networks in Data Mining - “An Overview”Dr.(Mrs).Gethsiyal Augasta
 
A study of localized algorithm for self organized wireless sensor network and...
A study of localized algorithm for self organized wireless sensor network and...A study of localized algorithm for self organized wireless sensor network and...
A study of localized algorithm for self organized wireless sensor network and...eSAT Publishing House
 
A study of localized algorithm for self organized wireless sensor network and...
A study of localized algorithm for self organized wireless sensor network and...A study of localized algorithm for self organized wireless sensor network and...
A study of localized algorithm for self organized wireless sensor network and...eSAT Journals
 
Higher-order clustering coefficients at Purdue CSoI
Higher-order clustering coefficients at Purdue CSoIHigher-order clustering coefficients at Purdue CSoI
Higher-order clustering coefficients at Purdue CSoIAustin Benson
 
DISTRIBUTION OF MAXIMAL CLIQUE SIZE UNDER THE WATTS-STROGATZ MODEL OF EVOLUTI...
DISTRIBUTION OF MAXIMAL CLIQUE SIZE UNDER THE WATTS-STROGATZ MODEL OF EVOLUTI...DISTRIBUTION OF MAXIMAL CLIQUE SIZE UNDER THE WATTS-STROGATZ MODEL OF EVOLUTI...
DISTRIBUTION OF MAXIMAL CLIQUE SIZE UNDER THE WATTS-STROGATZ MODEL OF EVOLUTI...ijfcstjournal
 
2013 KDD conference presentation--"Multi-Label Relational Neighbor Classifica...
2013 KDD conference presentation--"Multi-Label Relational Neighbor Classifica...2013 KDD conference presentation--"Multi-Label Relational Neighbor Classifica...
2013 KDD conference presentation--"Multi-Label Relational Neighbor Classifica...Xi Wang
 
SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS
SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKSSCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS
SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKSIJDKP
 
Scalable Local Community Detection with Mapreduce for Large Networks
Scalable Local Community Detection with Mapreduce for Large NetworksScalable Local Community Detection with Mapreduce for Large Networks
Scalable Local Community Detection with Mapreduce for Large NetworksIJDKP
 
Distribution of maximal clique size under
Distribution of maximal clique size underDistribution of maximal clique size under
Distribution of maximal clique size underijfcstjournal
 
H030102052055
H030102052055H030102052055
H030102052055theijes
 
Jürgens diata12-communities
Jürgens diata12-communitiesJürgens diata12-communities
Jürgens diata12-communitiesPascal Juergens
 
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...Daniel Katz
 
A Survey Paper on Cluster Head Selection Techniques for Mobile Ad-Hoc Network
A Survey Paper on Cluster Head Selection Techniques for Mobile Ad-Hoc NetworkA Survey Paper on Cluster Head Selection Techniques for Mobile Ad-Hoc Network
A Survey Paper on Cluster Head Selection Techniques for Mobile Ad-Hoc NetworkIOSR Journals
 

Ähnlich wie Clique-based Network Clustering (20)

A Proposed Algorithm to Detect the Largest Community Based On Depth Level
A Proposed Algorithm to Detect the Largest Community Based On Depth LevelA Proposed Algorithm to Detect the Largest Community Based On Depth Level
A Proposed Algorithm to Detect the Largest Community Based On Depth Level
 
community Detection.pptx
community Detection.pptxcommunity Detection.pptx
community Detection.pptx
 
Higher-order clustering coefficients
Higher-order clustering coefficientsHigher-order clustering coefficients
Higher-order clustering coefficients
 
SDC: A Distributed Clustering Protocol
SDC: A Distributed Clustering ProtocolSDC: A Distributed Clustering Protocol
SDC: A Distributed Clustering Protocol
 
MSCX2023_Sergio Gomez_PartI
MSCX2023_Sergio Gomez_PartIMSCX2023_Sergio Gomez_PartI
MSCX2023_Sergio Gomez_PartI
 
Neural Networks in Data Mining - “An Overview”
Neural Networks  in Data Mining -   “An Overview”Neural Networks  in Data Mining -   “An Overview”
Neural Networks in Data Mining - “An Overview”
 
A study of localized algorithm for self organized wireless sensor network and...
A study of localized algorithm for self organized wireless sensor network and...A study of localized algorithm for self organized wireless sensor network and...
A study of localized algorithm for self organized wireless sensor network and...
 
A study of localized algorithm for self organized wireless sensor network and...
A study of localized algorithm for self organized wireless sensor network and...A study of localized algorithm for self organized wireless sensor network and...
A study of localized algorithm for self organized wireless sensor network and...
 
Higher-order clustering coefficients at Purdue CSoI
Higher-order clustering coefficients at Purdue CSoIHigher-order clustering coefficients at Purdue CSoI
Higher-order clustering coefficients at Purdue CSoI
 
DISTRIBUTION OF MAXIMAL CLIQUE SIZE UNDER THE WATTS-STROGATZ MODEL OF EVOLUTI...
DISTRIBUTION OF MAXIMAL CLIQUE SIZE UNDER THE WATTS-STROGATZ MODEL OF EVOLUTI...DISTRIBUTION OF MAXIMAL CLIQUE SIZE UNDER THE WATTS-STROGATZ MODEL OF EVOLUTI...
DISTRIBUTION OF MAXIMAL CLIQUE SIZE UNDER THE WATTS-STROGATZ MODEL OF EVOLUTI...
 
2013 KDD conference presentation--"Multi-Label Relational Neighbor Classifica...
2013 KDD conference presentation--"Multi-Label Relational Neighbor Classifica...2013 KDD conference presentation--"Multi-Label Relational Neighbor Classifica...
2013 KDD conference presentation--"Multi-Label Relational Neighbor Classifica...
 
SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS
SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKSSCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS
SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS
 
Scalable Local Community Detection with Mapreduce for Large Networks
Scalable Local Community Detection with Mapreduce for Large NetworksScalable Local Community Detection with Mapreduce for Large Networks
Scalable Local Community Detection with Mapreduce for Large Networks
 
Distribution of maximal clique size under
Distribution of maximal clique size underDistribution of maximal clique size under
Distribution of maximal clique size under
 
H030102052055
H030102052055H030102052055
H030102052055
 
Jürgens diata12-communities
Jürgens diata12-communitiesJürgens diata12-communities
Jürgens diata12-communities
 
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
 
F017123439
F017123439F017123439
F017123439
 
A Survey Paper on Cluster Head Selection Techniques for Mobile Ad-Hoc Network
A Survey Paper on Cluster Head Selection Techniques for Mobile Ad-Hoc NetworkA Survey Paper on Cluster Head Selection Techniques for Mobile Ad-Hoc Network
A Survey Paper on Cluster Head Selection Techniques for Mobile Ad-Hoc Network
 
ResNet.pptx
ResNet.pptxResNet.pptx
ResNet.pptx
 

Clique-based Network Clustering

  • 1. Guang Ouyang Advisor: Dipak Dey 1
  • 2.  Facebook  LinkedIn  Internet  Instagram  Tweets  Google+  Quora  Wechat  Stack Oversflow  Research Gate 2
  • 3.  Small World: everyone and everything is six or fewer steps away, by way of introduction, from any other person in the world.  Power Law: degree distribution has long tail power law distribution.  Community Structure: community groups based on common location, interests, occupations, etc. are quite common in real networks. 3
  • 4.  Detect community structure in large and complex networks.  Community can be viewed as a summary of the whole network, and therefore easy to visualize and analyze.  Communities provide important information for applications such as market segmentation, building recommender system. 4
  • 5.  Network data is a graph structure made up of ‘nodes’ and ‘links’ that connect them.  Network data tends to have ‘discrete’ similarity matrix.  Most clustering algorithms work on the “continuous” distance or similarity matrix.  Real-world networks usually very large. Even is unbearable for efficiency or space. 5
  • 6.  Edge list:[(1,2),(1,3),(3,4),(4,5),(5,3), (3,6),(6,1), (7,4), (6,7)].  Adjacency matrix: 6
  • 7. No statistically precise definition so far Generally speaking, a community is a set of nodes densely connected internally Nodes between two communities are loosely connected 7
  • 8.  A random network without real clustering structure should not be split (type 1 error of over-splitting).  Two weakly connected communities should not be merged (type 2 error of under-splitting).  Modern network data is usually huge, space and time efficient clustering is needed 8
  • 9.  Minimum-cut method(spectral clustering)  Hierarchical clustering  Girvan-Newman algorithm (betweenness)  Modularity maximization  Stochastic block model as well as variants including mixed membership model  Finding maximal clique 9
  • 10.  A measure of strength of division of a network into clusters or communities. (1) where i and j denotes nodes, c denotes clusters, is the (i,j) entry in adjacency matrix A, is the degree of node i, m is the total number of links in a network. 10
  • 11. 11 Degrees of the 7 nodes are: Total Degree: The modularity matrix below has (i, j) entry: j i 1 2 3 4 5 6 7 1 -0.45 0.55 0.55 0.4 -0.45 -0.3 -0.3 2 0.55 -0.45 0.55 0.4 -0.45 -0.3 -0.3 3 0.55 0.55 -0.45 0.4 -0.45 -0.3 -0.3 4 0.4 0.4 0.4 -0.8 0.4 -0.4 -0.4 5 -0.45 -0.45 -0.45 0.4 -0.45 0.7 0.7 6 -0.3 -0.3 -0.3 -0.4 0.7 -0.2 0.8 7 -0.3 -0.3 -0.3 -0.4 0.7 0.8 -0.2 Node 1, 2, 3, 4 tend to form one community and node 5, 6, 7 for another. The Modularity Q based on this division is the sum of all green cells in modularity matrix divided by 2m: 0.355
  • 12. High modularity implies dense connections inside communities and sparse connections between communities. Approximate maximization algorithms: • Greedy algorithms • Simulated annealing • Leading eigen-vector • Louvian’s method • Ensemble learning(Currently fastest) 12
  • 13. Benchmark model to simulate stochastic block network 1 with built-in cluster structures. where Each cluster has 40 nodes Modularity-based clustering on random network from stochastic block model. Modularity maximization approach works well if clusters have similar size 13
  • 14.  Random network without cluster structure may be splited. (Erodos Renyi network)  Small clusters in large network may be merged. (Resolution limitation)  Multi-resolution method may not reduce both types of error simultaneously.  A bottleneck of many other network clustering algorithms. 14
  • 15. Erdos Renyi network of 40 nodes, density 0.1. Modularity-maximized clustering: Q = 0.37. 15
  • 16. Stochastic Block Model 2: the two small clusters have 20 nodes each, and the largest clusters have 100 nodes each. The largest clusters are split. Modularity maximization algorithms tend to fail in networks with clusters of very different sizes. Modularity-maximized clustering with Q = 0.429. 16
  • 17.  Stochastic Block Model 3 with given link probabilities.  Cluster sizes: [800, 400, 50, 20].  Modularity method clustering results: • 7 nodes in cluster 3 are merged with cluster 1. • All 20 nodes in cluster 4 are merged with cluster 1. 17
  • 18.  Algorithm 1 ◦ Global algorithm ◦ Internal link density of every found cluster is guaranteed to exceed a user-defined threshold  Algorithm 2 ◦ Local algorithm ◦ The risk of splitting a cluster is quantified and under user control ◦ The risk of merging clusters is minimized 18
  • 19. Objective function: F(c) = (1/2m) Σ_{i<j} (A_ij − p)(2δ(c_i, c_j) − 1), (2) where p is a user-defined parameter in [0,1], δ is the Kronecker delta symbol, A is the adjacency matrix, c is the community membership vector, and m is the total link count.

  Reward table:
                                Connected pair of nodes   Disconnected pair of nodes
  Pair of nodes in the same cluster        1 - p                    -p
  Pair of nodes in different clusters     -1 + p                     p
  19
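The reward table can be turned into a direct score for any candidate partition. Below is a minimal O(n²) sketch of the unnormalized objective (the slide's version also divides by the link count m); the small two-triangle test graph is an illustrative assumption:

```python
import numpy as np

def clique_objective(A, c, p):
    """Sum the reward-table scores over all node pairs (counted once):
    same-cluster pairs contribute A_ij - p, cross-cluster pairs -(A_ij - p)."""
    total = 0.0
    n = len(c)
    for i in range(n):
        for j in range(i + 1, n):
            sign = 1.0 if c[i] == c[j] else -1.0
            total += sign * (A[i, j] - p)
    return total

# Two triangles joined by one edge; splitting them beats lumping them together.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
split_score = clique_objective(A, [0, 0, 0, 1, 1, 1], p=0.25)   # 5.75
lumped_score = clique_objective(A, [0, 0, 0, 0, 0, 0], p=0.25)  # 3.25
```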
  • 20.  It is guaranteed that every found community has internal link density higher than the user-defined threshold p. ◦ If p = 1, every found community is a clique. ◦ If p = 25%, every community has internal link density higher than 25%. ◦ Communities with link density "significantly" higher than p will not be split. ◦ Communities with link density lower than p will definitely be split into smaller communities. 20
  • 21.  Maximize objective function (2) over bipartitions: maximize sᵀ(A − p(J − I))s, (3) where s is an n-by-1 community membership vector with binary entries 1 or −1, A is the adjacency matrix, J is the all-ones matrix, and I is the identity matrix.  Searching over all possible divisions is NP-hard.  Approximate spectral method: ◦ Find the largest eigenvalue w of the p-clique matrix B_p = A − p(J − I). (4) ◦ Choose a corresponding eigenvector v of w. ◦ Use the signs of v to split the network of n nodes. 21
  • 22.  The sign vector of v is the best approximate solution to (3).  If the primary bipartition criterion holds, division by v is executed.  If it fails but the secondary criterion holds, division by v is still executed.  If both fail, division by v is cancelled. 22
  • 23.  Python-scipy wrapper of the ARPACK software.  Iterative matrix-vector products find eigenvalues of large sparse or structured matrices.  The p-clique matrix B_p = A − p(J − I) is dense but structured.  Its matrix-vector product requires far fewer than the usual O(n²) operations. ◦ The adjacency matrix is usually sparse, so Av requires only O(m) operations. ◦ Jv = (Σ_i v_i)·1 requires only O(n) operations. ◦ Time complexity: O(n + m) per iteration. ◦ Space complexity: O(n + m) (applicable to huge graphs). 23
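The structured matrix-vector product described above can be sketched with scipy's `LinearOperator` and its ARPACK wrapper `eigsh`. The p-clique form B_p = A − p(J − I) follows the earlier slides; the small two-triangle test graph is an illustrative assumption:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

def leading_eigpair(A_sparse, p):
    """Leading eigenpair of B_p = A - p(J - I) without forming it densely.
    Each matvec costs O(m + n): A @ v uses only the sparse edges, and
    (J - I) @ v = v.sum() * ones - v needs only O(n) work."""
    n = A_sparse.shape[0]
    matvec = lambda v: A_sparse @ v - p * (v.sum() - v)
    B = LinearOperator((n, n), matvec=matvec, dtype=float)
    w, V = eigsh(B, k=1, which='LA')      # largest algebraic eigenvalue
    return w[0], V[:, 0]

# Two triangles joined by one edge: the eigenvector's signs split them.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
w, v = leading_eigpair(csr_matrix(A), p=0.25)
```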
  • 24.  It is usually hard to tell how many communities there are in a large network.  First split the network into two parts, then divide these two parts, and so forth.  Use the bipartition criteria in slide 21 as the stopping criteria of this recursive dividing procedure. 24
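The recursive procedure can be sketched end-to-end. The stopping rule below (cancel a split whose contribution sᵀBs is not positive) is an assumption consistent with the bipartition criteria, and a dense eigensolver stands in for ARPACK on this tiny illustrative example:

```python
import numpy as np

def divide(A, nodes, p, clusters):
    """Recursively bipartition `nodes` by the signs of the leading eigenvector
    of the local p-clique matrix B = A_S - p(J - I); stop when the eigenvector
    does not split the set or the split's contribution s @ B @ s is <= 0."""
    sub = A[np.ix_(nodes, nodes)]
    n = len(nodes)
    B = sub - p * (np.ones((n, n)) - np.eye(n))
    vals, vecs = np.linalg.eigh(B)
    s = np.where(vecs[:, -1] >= 0, 1.0, -1.0)   # sign vector of leading eigvec
    if (s == s[0]).all() or s @ B @ s <= 0:
        clusters.append(sorted(nodes))           # accept as a final community
        return
    for side in (s > 0, s < 0):
        divide(A, [nodes[i] for i in np.flatnonzero(side)], p, clusters)

# Two triangles joined by one edge are recovered as two communities.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
clusters = []
divide(A, list(range(6)), p=0.25, clusters=clusters)
```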
  • 25. Stochastic Block Network 2: the two small clusters have 20 nodes each and the largest clusters have 100 nodes each; expected link density 0.1125. [Panels: clustering results at p = 0.1, p = 0.05, and p = 0.02.] 25
  • 26.  Karate Club member data (34 people).  Link density: 0.139. [Panels: p = 0.1 and p = 0.15.] 26
  • 27.  Doubtful Sound dolphin network (62 dolphins).  Link density: 0.084. [Panels: p = 0.03 and p = 0.2.] 27
  • 28.  Increasing p: zoom in. ◦ Smaller communities are found. ◦ Risk of merging clusters (type 2 error) is lower. ◦ Risk of splitting a cluster or Erdos Renyi sub-network (type 1 error) is higher.  Decreasing p: zoom out. ◦ Larger communities are found. ◦ Risk of merging clusters (type 2 error) is higher. ◦ Risk of splitting a cluster or Erdos Renyi sub-network (type 1 error) is lower. 28
  • 29.  Objective: choose the parameter p such that at most 2.5% of the nodes in an Erdos Renyi sub-network will be trimmed off.  Cause of type 1 error: ◦ Due to random fluctuation in link formation, 2.5% of the nodes have fewer than 0.975np links with the remaining 97.5% of nodes. ◦ The threshold p is higher than the link density between the 2.5% group and the 97.5% group of nodes.  Strategy: ◦ Choose p to be significantly smaller than the observed total link density. 29
  • 30.  Solution: inequality (5).  Intuition: ◦ Use a truncated normal distribution to approximate the distribution of link density between the 2.5% group and the 97.5% group.  Experiment results: ◦ In 100 SBM networks, the type 1 error is bounded by 5% (mostly below 3.5%). ◦ In SBM networks with average degree less than 5, the type 1 error is less than 2%. 30
  • 31.  When will two clusters of given sizes and between-cluster link probability be merged? The answer depends on the observed link density.  The risk of type 2 error is bounded by 2.5% if inequality (6) holds. 31
  • 32.  Challenge: ◦ When splitting a sub-network, we usually don't know the link density within or between the two clusters. ◦ In theory, there may be cases where inequalities (5) and (6) conflict.  Solution: ◦ Choose p to be the upper bound in (5). ◦ Develop a more flexible algorithm that allows p to vary from one sub-network to another; this may reduce the chance of a conflict between inequalities (5) and (6). 32
  • 33.  A measure of consistency between found communities R and real communities F: NMI(R, F) = 2 I(R; F) / (H(R) + H(F)), (7) where I is the mutual information (a Kullback-Leibler divergence), H is entropy, N is the confusion matrix, and the counts of real and found communities enter through N. 33
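NMI can be computed directly from the confusion counts of the two labelings. This sketch uses the 2I/(H_F + H_R) normalization shown above; other normalization variants exist:

```python
import math
from collections import Counter

def nmi(found, real):
    """Normalized mutual information between two labelings of the same nodes:
    NMI = 2 I(F; R) / (H(F) + H(R)), from confusion-matrix counts."""
    n = len(found)
    joint = Counter(zip(found, real))            # confusion matrix counts
    nf, nr = Counter(found), Counter(real)
    info = sum(c / n * math.log(c * n / (nf[f] * nr[r]))
               for (f, r), c in joint.items())
    h_f = -sum(c / n * math.log(c / n) for c in nf.values())
    h_r = -sum(c / n * math.log(c / n) for c in nr.values())
    return 2 * info / (h_f + h_r)
```

NMI is 1 for identical partitions up to relabeling and 0 for independent ones.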
  • 34.  Review of Stochastic Block Networks 1 through 3 using NMI.  Results: ◦ Type 1 error is overly controlled for small and sparse networks such as SBM 1.

          Size   Link density   Auto-chosen p   Average NMI    s.e.    Simulations
  SBM 1    120      0.0723          0.0239         0.8484     0.0195       100
  SBM 2    140      0.1125          0.0579         0.9483     0.0078       100
  SBM 3   1270      0.0722          0.0574         0.9993     0.0001       100
  34
  • 35. Stochastic Block Model 4 with cluster sizes (100, 20, 20); expected link density 0.1507; auto-chosen parameter p from (5): 0.0888.  Using the auto-chosen parameter p ends up merging the small clusters 2 and 3; clusters 2 and 3 are divided only if we zoom in further by increasing p.  The modularity method not only merged clusters 2 and 3 but also split cluster 1. [Panels: modularity clustering vs. clustering at p = 0.0888.] 35
  • 36. [Figure: binary tree of recursive bipartitions — the root sub-network S0 splits into S1 and S2; S1 splits into community C1 and sub-network S3; S2 splits into communities C2 and C3; S3 splits into communities C4 and C5; each sub-network S carries its own local threshold p(S).] The observed link density and node count in sub-network S determine p(S). 36
  • 37.  Maximize the localized clique-index (8), where T is the binary tree representing the hierarchical clustering process, p(S) is the automatically chosen local threshold parameter p for sub-network S, and the indicator records whether nodes i and j are separated in the bipartition of S. 37
  • 38.  Every bipartition of a sub-network S contributes a term (9) to the objective.  The best bipartition is obtained from the signs of the leading eigenvector of the localized p-clique matrix (10).  The bipartition of S is cancelled if its contribution (9) is not positive. 38
  • 39.  Each matrix-vector product takes time O(m).  Finding the leading eigenvector takes O(n) matrix-vector products.  On average, the height of the binary tree representing the hierarchical clustering procedure is O(log(n)).  For both the global and the localized algorithm, the time complexity is O(mn·log(n)), or O(n²·log(n)) for a sparse network. 39
  • 40.  Stochastic Block Model 4 with cluster sizes (100, 20, 20).  Average NMI among 100 simulations is 0.9717.  The localized clustering algorithm is able to detect the built-in community structure. 40
  • 41.  Stochastic Block Model 5 with 7000 nodes and 10 built-in clusters.  Cluster sizes with internal link density: [(3000, 0.08), (2000, 0.09), (1000, 0.1), (400, 0.15), (200, 0.2), (100, 0.25), (100, 0.25), (100, 0.25), (80, 0.3), (20, 0.7)].  Link density between different clusters is 0.005.  Average NMI among 20 simulations is 0.9895.  Average running time: 1.66 seconds. 41
  • 42.  Stochastic Block Model 6 with 20000 nodes and 25 clusters.  Cluster sizes with internal link density: [(3350, 0.045), (3000, 0.05), (2000, 0.07), (2000, 0.07), (2000, 0.07), (1000, 0.09), (1000, 0.09), (1000, 0.09), (1000, 0.09), (500, 0.12), (500, 0.12), (400, 0.14), (400, 0.14), (400, 0.14), (400, 0.14), (200, 0.30), (200, 0.30), (200, 0.30), (100, 0.40), (100, 0.40), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80)].  Link density between clusters: 0.0001.  Average NMI among 10 simulations: 0.8960.  Average running time: 12.6 seconds. 42
  • 43.  Review of SBM networks 1 through 6:

         Built-in clusters    Size    Link density   Average NMI    s.e.    Simulations
  SBM1           3             120       0.0723         0.8972     0.0195       100
  SBM2           3             140       0.1125         0.9476     0.0051       100
  SBM3           4            1200       0.0722         0.9687     0.0028       100
  SBM4           3             140       0.0888         0.9717     0.0033       100
  SBM5          10            7000       0.0285         0.9895     0.0022        20
  SBM6          25           20000       0.005          0.8960     0.0029        10

   Clustering quality is high for large networks or networks with high link density. 43
  • 44.  Global algorithm: ◦ Good for applications with specific requirements on the internal link density of every found community.  Localized algorithm: ◦ Good for finding statistically significant communities. ◦ Type 1 error seems to be overly controlled for sparse networks. ◦ The conflict between type 1 and type 2 error is effectively avoided in the simulated sample networks. 44
  • 45.  The Erdos Renyi model may not serve as a good null model of a random network without built-in community structure; statistically significant communities under other null models need consideration.  Extend the algorithm to directed networks, networks with numerical values in the adjacency matrix, and networks with additional profile information at each node.  Develop a close-to-linear-time clustering algorithm. 45