Community
Analysis
SOCIAL
MEDIA
MINING
2Social Media Mining Measures and Metrics 2Social Media Mining Community Analysishttp://socialmediamining.info/
Dear instructors/users of these slides:
Please feel free to include these slides in your own
material, or modify them as you see fit. If you decide
to incorporate these slides into your presentations,
please include the following note:
R. Zafarani, M. A. Abbasi, and H. Liu, Social Media Mining:
An Introduction, Cambridge University Press, 2014.
Free book and slides at http://socialmediamining.info/
or include a link to the website:
http://socialmediamining.info/
Social Community
A group of individuals with common economic,
social, or political interests or characteristics,
often living in relative proximity.
[real-world] community
Why analyze communities?
Analyzing communities helps better
understand users
– Users form groups based on their interests
Groups provide a clear global
view of user interactions
• E.g., find polarization
Some behaviors are only observable
in a group setting and not on an
individual level
– Some Republicans may agree with some
Democrats, even though their parties disagree
Social Media Communities
• Formation:
– When like-minded users on social media form a link and start
interacting with each other
• More Formal Formation:
1. A set of at least two nodes sharing some interest, and
2. Interactions with respect to that interest.
• Social Media Communities
– Explicit (emic): formed by user subscriptions
– Implicit (etic): implicitly formed by social interactions
• Example: individuals calling Canada from the United States
• Phone operator considers them one community for promotional offers
• Other community names: group, cluster, cohesive subgroup, or module
Examples of Explicit Social Media Communities
Facebook has groups and communities. Users can
– post messages and images,
– can comment on other messages,
– can like posts, and
– can view activities of other users
In Google+, Circles represent communities
In Twitter, communities form as lists.
– Users join lists to receive
information in the form of tweets
LinkedIn provides Groups and Associations.
– Users can join professional groups where
they can post and share information related
to the group
Finding Implicit Communities: An Example
A simple graph in which
three implicit communities
are found, enclosed by the
dashed circles
Overlapping vs. Disjoint Communities
Overlapping Communities Disjoint Communities
Implicit communities in other domains
Protein-protein interaction
networks
– Communities are likely to group
proteins having the same specific
function within the cell
World Wide Web
– Communities may correspond to groups of pages dealing with
the same or related topics
Metabolic networks
– Communities may be related to functional modules such as
cycles and pathways
Food webs
– Communities may identify compartments
Real-world Implicit Communities
Collaboration network
between scientists
working at the
Santa Fe Institute.
The colors indicate
high level communities
and correspond to
research divisions of
the institute
What is Community Analysis?
• Community detection
– Discovering implicit communities
• Community evolution
– Studying temporal evolution of communities
• Community evaluation
– Evaluating Detected Communities
Community Detection
What is community detection?
• The process of finding clusters of nodes
("communities")
– With strong internal connections and
– Weak connections between different communities
• Ideal decomposition of a large graph
– Completely disjoint communities
– There are no interactions between different communities.
• In practice,
– find community partitions that are maximally decoupled.
Why Is Detecting Communities Important?
• The club members split into two groups (gray and white)
• Disagreement between the administrator of the club (node 34)
and the club’s instructor (node 1),
• The members of one group left to start their own club
The same communities can be
found using community detection
Zachary's
karate club
Interactions between
34 members of a
karate club
for over two years
Why Community Detection?
Network
Summarization
– A community can
be considered as a
summary of the
whole network
– Easier to visualize
and understand
Preserve Privacy
– [Sometimes] a
community can
reveal some
properties without
releasing
individuals' private
information.
Community Detection vs. Clustering
Clustering
• Data is often non-linked (matrix rows)
• Clustering works on the distance or similarity
matrix, e.g., 𝑘-means.
• If you use 𝑘-means with adjacency matrix rows,
you are only considering the ego-centric network
Community detection
• Data is linked (a graph)
• Network data tends to be “discrete”, leading to
algorithms using the graph property directly
– 𝑘-clique, quasi-clique, or edge-betweenness
Community Detection Algorithms
Group users based on group attributes (group-based)
Group users based on member attributes (member-based)
Member-Based
Community Detection
Member-Based Community Detection
• Look at node characteristics; and
• Identify nodes with similar characteristics and consider
them a community
Node Characteristics
A. Degree
– Nodes with same (or similar) degrees are in one community
– Example: cliques
B. Reachability
– Nodes that are close (small shortest paths) are in one community
– Example: 𝑘-cliques, 𝑘-clubs, and 𝑘-clans
C. Similarity
– Similar nodes are in the same community
A. Node Degree
Most common subgraph searched for:
• Clique: a maximum complete
subgraph in which all nodes inside
the subgraph are adjacent to each other
Find communities by searching for
1. The maximum clique:
the one with the largest
number of vertices, or
2. All maximal cliques:
cliques that are not
subgraphs of a larger
clique; i.e., cannot be
further expanded
Both problems are NP-hard
To overcome this, we can
I. Brute Force
II. Relax cliques
III. Use cliques as the
core for larger
communities
I. Brute-Force Method
Can find all the
maximal cliques
in the graph
For each vertex
$v_x$, we find the
maximal clique
that contains
node $v_x$
Impractical for large networks:
• For a complete graph of only 100 nodes, the
algorithm will generate at least $2^{99} - 1$ different
cliques starting from any node in the graph
Enhancing the Brute-Force Performance
[Systematic] Pruning can help:
• When searching for cliques of size 𝑘 or larger
• If the clique is found, each node should have a
degree equal to or more than 𝑘 − 1
• We can first prune all nodes (and edges connected
to them) with degrees less than 𝑘 − 1
– More nodes will have degrees less than 𝑘 − 1
– Prune them recursively
• For large 𝑘, many nodes are pruned as social media
networks follow a power-law degree distribution
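The pruning step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the adjacency-dict representation, the function name `prune_for_clique`, and the toy graph are assumptions made for this example.

```python
def prune_for_clique(adj, k):
    """Recursively drop nodes whose degree is below k - 1.

    adj maps each node to the set of its neighbors (undirected graph).
    Nodes that survive are the only candidates for a clique of size >= k.
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) < k - 1:
                # Removing v lowers its neighbors' degrees, which may
                # make them prunable on this or a later sweep.
                for u in adj.pop(v):
                    adj[u].discard(v)
                changed = True
    return adj

# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a tail 3 - 4 - 5.
example = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4},
}
survivors = set(prune_for_clique(example, k=4))
```

Here the tail nodes 4 and 5 are pruned, so the clique search only needs to consider the four remaining nodes.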
Maximum Clique: Pruning…
Example: to find a clique of size ≥ 4,
remove all nodes with degree
≤ (4 − 1) − 1 = 2
– Remove nodes 2 and 9
– Remove nodes 1 and 3
– Remove node 4
Even with pruning, cliques are less desirable
– Cliques are rare
– A clique of 1000 nodes has 999 × 1000/2 = 499,500 edges
– A single edge removal destroys the clique
– That is about 0.0002% of the edges!
II. Relaxing Cliques
• 𝒌-plex: a set of vertices 𝑉 in which we have $d_v \geq |V| - k, \ \forall v \in V$
• $d_v$ is the degree of 𝑣 in the induced subgraph
– Number of nodes from 𝑉 that are connected to 𝑣
• Clique of size 𝑘 is a 1-plex
• Finding the maximum 𝑘-plex: NP-hard
– In practice, relatively easier due to smaller search space.
Maximal 𝑘-plexes
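A small check makes the relaxation concrete. The sketch below assumes the standard 𝑘-plex condition (every node of the set is adjacent to at least |V| − 𝑘 other nodes of the set); the function name `is_k_plex` and the example graph are hypothetical.

```python
def is_k_plex(adj, nodes, k):
    """True if every node in `nodes` has at least len(nodes) - k
    neighbors inside `nodes` (degree in the induced subgraph)."""
    nodes = set(nodes)
    return all(len(adj[v] & nodes) >= len(nodes) - k for v in nodes)

# Hypothetical example: a 4-clique with the edge (0, 1) removed.
near_clique = {0: {2, 3}, 1: {2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
as_clique = is_k_plex(near_clique, {0, 1, 2, 3}, k=1)   # clique test
as_two_plex = is_k_plex(near_clique, {0, 1, 2, 3}, k=2)
```

The set is not a 1-plex (not a clique), but it is a 2-plex because each node misses at most one other member.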
More Cliques Relaxing…
• 𝒌-core: a maximal
connected subgraph in
which all vertices have
degree at least 𝒌
– Difference with 𝒌-plex?
• 𝒌-shell: nodes that are
part of the 𝑘-core, but
are not part of the
(𝑘 + 1)-core.
Questions
• 0-core?
• 0-shell?
• 1-core?
• 𝑘-cores of the complete graph?
A Clique that wants to relax
A 2-plex or a 2-core?
III. Using Cliques as a seed of a Community
Clique Percolation Method (CPM)
– Uses cliques as seeds to find larger communities
– CPM finds overlapping communities
• Input
– A parameter 𝑘, and a network
• Procedure
– Find out all cliques of size 𝑘 in the given network
– Construct a clique graph.
• Two cliques are adjacent if they share 𝑘 − 1 nodes
– Each connected component in the clique graph
forms a community
Clique Percolation Method: Example
Cliques of size 3:
{𝑣1, 𝑣2, 𝑣3}, {𝑣3, 𝑣4, 𝑣5},
{𝑣4, 𝑣5, 𝑣6}, {𝑣4, 𝑣5, 𝑣7},
{𝑣4, 𝑣6, 𝑣7}, {𝑣5, 𝑣6, 𝑣7},
{𝑣6, 𝑣7, 𝑣8}, {𝑣8, 𝑣9, 𝑣10}
Communities:
{𝑣1, 𝑣2, 𝑣3},
{𝑣8, 𝑣9, 𝑣10},
{𝑣3, 𝑣4, 𝑣5, 𝑣6, 𝑣7, 𝑣8}
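The CPM procedure can be sketched as follows. This is a brute-force illustration suitable only for tiny graphs, not a definitive implementation. The edge list is an assumption: it is the set of edges implied by the eight 3-cliques listed in the example, since the original figure is not available here. The function name `clique_percolation` is hypothetical.

```python
from itertools import combinations

def clique_percolation(edges, k):
    """Return CPM communities (sorted node lists) for clique size k."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Step 1: all cliques of size exactly k (brute force over node subsets).
    cliques = [set(c) for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Step 2: clique graph, kept implicitly with union-find;
    # two cliques are adjacent if they share k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    # Step 3: each connected component of the clique graph is a community.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(sorted(c) for c in communities.values())

# Assumed edge list, reconstructed from the 3-cliques in the example.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5), (4, 6), (4, 7),
         (5, 6), (5, 7), (6, 7), (6, 8), (7, 8), (8, 9), (8, 10), (9, 10)]
communities = clique_percolation(edges, k=3)
```

Under that assumed edge list, the result matches the three communities listed above, with nodes 3 and 8 each belonging to two communities (the overlap CPM is designed to find).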
B. Node Reachability
The two extremes
Nodes are assumed to be in
the same community
1. If there is a path between
them (regardless of the
distance)
– How? Find connected components using BFS/DFS
– Challenge: most nodes are in one
community (giant component)
2. If they are so close as to be
immediate neighbors
– How? Find cliques
– Challenge: cliques are challenging
to find and are rarely observed
Solution: find communities that are in between cliques
and connected components in terms of connectivity
and have small shortest paths between their nodes
Special Subgraphs
1. 𝑘-Clique: a maximal subgraph in which the largest
shortest path distance between any two nodes is less
than or equal to 𝑘
2. 𝑘-Club: follows the same definition as a 𝑘-clique
– Additional constraint: nodes on the shortest paths should
be part of the subgraph (i.e., the diameter of the subgraph is at most 𝑘)
3. 𝑘-Clan: a 𝑘-clique where for all shortest paths within
the subgraph the distance is less than or equal to 𝑘.
– All 𝑘-clans are 𝑘-cliques, but not vice versa.
More Special Subgraphs!
• 𝒌-truss: the largest subgraph where all edges
belong to at least 𝑘 − 2 triangles
• What is the relationship between a 𝒌-core and
𝒌-truss?
C. Node Similarity
• Similar (or most similar) nodes are assumed to be in the same
community.
– A classical clustering algorithm (e.g., 𝑘-means) is applied to node similarities
to find communities.
• Node similarity can be defined
– Using the similarity of node neighborhoods (Structural Equivalence) – Ch. 3
– Similarity of social circles (Regular Equivalence) – Ch. 3
Structural equivalence: two nodes are structurally equivalent iff they
are connected to the same set of actors
Nodes 1 and 3 are
structurally equivalent,
So are nodes 5 and 7.
Node Similarity (Structural Equivalence)
Jaccard Similarity
Cosine similarity
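As a sketch of the two measures, the code below uses the standard neighborhood-based definitions, which are assumed here because the slide's formulas did not survive extraction: Jaccard is |N(u) ∩ N(v)| / |N(u) ∪ N(v)| and cosine is |N(u) ∩ N(v)| / sqrt(|N(u)| · |N(v)|). The graph and function names are hypothetical.

```python
from math import sqrt

def jaccard(adj, u, v):
    """Shared neighbors divided by all neighbors of either node."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def cosine(adj, u, v):
    """Shared neighbors divided by the geometric mean of the degrees."""
    denom = sqrt(len(adj[u]) * len(adj[v]))
    return len(adj[u] & adj[v]) / denom if denom else 0.0

# Hypothetical graph in which nodes 1 and 3 have identical neighborhoods.
g = {1: {2, 4}, 3: {2, 4}, 2: {1, 3}, 4: {1, 3, 5}, 5: {4}}
j13, c13 = jaccard(g, 1, 3), cosine(g, 1, 3)
j24, c24 = jaccard(g, 2, 4), cosine(g, 2, 4)
```

Structurally equivalent nodes (1 and 3 here) score 1.0 on both measures; nodes 2 and 4 share two of three distinct neighbors and score lower.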
Group-Based
Community Detection
Group-Based Community Detection
Group-based community detection: finding
communities that have certain group properties
Group Properties:
I. Balanced: Spectral clustering
II. Robust: 𝑘-connected graphs
III. Modular: Modularity Maximization
IV. Dense: Quasi-cliques
V. Hierarchical: Hierarchical clustering
I. Balanced Communities
• Community detection can be thought of as graph clustering
• Graph clustering: we cut the graph into several partitions and
assume these partitions represent communities
• Cut: partitioning (cut) of the graph into two (or more) sets (cutsets)
– The size of the cut is the number of edges that are being cut
• Minimum cut (min-cut) problem: find a graph partition such that
the number of edges between the two sets is minimized
Min-cuts can be
computed efficiently
using the max-flow
min-cut theorem
Min-cut often returns an imbalanced partition, with one set being a singleton
Min-cut
Ratio Cut and Normalized Cut
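• To mitigate the min-cut problem we can
change the objective function to consider
community size
$\text{Ratio Cut}(P) = \frac{1}{k}\sum_{i=1}^{k}\frac{cut(P_i, \bar{P}_i)}{|P_i|}$
$\text{Normalized Cut}(P) = \frac{1}{k}\sum_{i=1}^{k}\frac{cut(P_i, \bar{P}_i)}{vol(P_i)}$
• $\bar{P}_i = V - P_i$ is the complement cut set
• $cut(P_i, \bar{P}_i)$ is the size of the cut
• $vol(P_i) = \sum_{v \in P_i} d_v$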
Ratio Cut & Normalized Cut: Example
For Cut A
For Cut B
Both ratio cut and normalized cut prefer a balanced partition
A
B
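The preference for balanced partitions can be checked numerically. The sketch below assumes the usual definitions (each community's cut size divided by its node count for ratio cut, or by its volume for normalized cut, averaged over the 𝑘 communities). The graph is a hypothetical one (two triangles joined by one edge), not the graph from the slide's figure.

```python
def cut_size(adj, part):
    """Number of edges with exactly one endpoint in `part`."""
    return sum(1 for v in part for u in adj[v] if u not in part)

def ratio_cut(adj, partition):
    return sum(cut_size(adj, p) / len(p) for p in partition) / len(partition)

def normalized_cut(adj, partition):
    def vol(p):
        return sum(len(adj[v]) for v in p)  # sum of degrees in p
    return sum(cut_size(adj, p) / vol(p) for p in partition) / len(partition)

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
balanced = [{0, 1, 2}, {3, 4, 5}]
lopsided = [{0}, {1, 2, 3, 4, 5}]
rc_balanced, rc_lopsided = ratio_cut(g, balanced), ratio_cut(g, lopsided)
nc_balanced, nc_lopsided = normalized_cut(g, balanced), normalized_cut(g, lopsided)
```

Both objectives are lower (better) for the balanced split than for the split that isolates a single node.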
Spectral Clustering
Reformulating ratio cut (or normalized cut) in matrix format
• Let $X_{ij} = 1$ when node 𝑖 is a member of community 𝑗, and 0 otherwise
• Let $D = diag(d_1, d_2, \cdots, d_n)$ be the diagonal degree matrix
• The 𝑖th entry on the diagonal of $X^T A X$ is twice the number of edges
inside community 𝑖 (each internal edge is counted from both endpoints).
• The 𝑖th entry on the diagonal of $X^T D X$ is the sum of the degrees of
the members of community 𝑖, i.e., $vol(P_i)$.
• The 𝑖th entry on the diagonal of $X^T (D - A) X$ is the number of
edges in the cut that separates community 𝑖 from other nodes.
The 𝑖th diagonal element of
$X^T (D - A) X$ is equivalent to $cut(P_i, \bar{P}_i)$
Spectral Clustering
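So ratio cut is
$\text{Ratio Cut}(P) = \frac{1}{k}\sum_{i=1}^{k}\frac{X_i^T (D - A) X_i}{X_i^T X_i}$, where $X_i$ is the 𝑖th column of $X$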
Spectral Clustering
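Both ratio cut and normalized cut can be reformulated as
$\min_{\hat{X}} Tr(\hat{X}^T L \hat{X})$, with $L = D - A$ for ratio cut and $L = I - D^{-1/2} A D^{-1/2}$ for normalized cut
• Spectral relaxation: allow real-valued entries in $\hat{X}$ and require only $\hat{X}^T \hat{X} = I_k$
$D = diag(d_1, d_2, \cdots, d_n)$ is the diagonal degree matrix
Optimal Solution
$\hat{X}$ consists of the 𝑘 eigenvectors of $L$ with the smallest eigenvalues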
Recovering Integer Membership Values
• Because we performed spectral relaxation,
the matrix obtained is not integer valued
• To recover $X$ from $\hat{X}$ we can run 𝑘-means on $\hat{X}$
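For two communities, the whole pipeline can be sketched without any libraries. The code below makes several simplifying assumptions that are not in the slides: it uses the unnormalized Laplacian 𝐿 = 𝐷 − 𝐴, approximates the eigenvector with the second-smallest eigenvalue by shifted power iteration, and splits nodes by the sign of that eigenvector instead of running 𝑘-means. The graph and function names are hypothetical; a real implementation would normally call a linear-algebra library.

```python
def fiedler_vector(adj, iters=500):
    """Approximate the Laplacian eigenvector with the second-smallest eigenvalue."""
    nodes = sorted(adj)
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    shift = 2 * max(len(adj[v]) for v in nodes) + 1  # exceeds every eigenvalue of L
    x = [float(i) for i in range(n)]                 # arbitrary deterministic start
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]                  # remove the constant eigenvector
        # y = (shift * I - L) x, with L = D - A
        y = [(shift - len(adj[v])) * x[idx[v]] + sum(x[idx[u]] for u in adj[v])
             for v in nodes]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return dict(zip(nodes, x))

def spectral_bisection(adj):
    """Split nodes by the sign of their entry in the approximate eigenvector."""
    vec = fiedler_vector(adj)
    left = {v for v, value in vec.items() if value < 0}
    return left, set(adj) - left

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
part_a, part_b = spectral_bisection(barbell)
```

On this graph the sign split separates the two triangles, which is also the partition with the smallest ratio cut.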
Spectral Clustering: Example
𝐷 = 𝑑𝑖𝑎𝑔(2, 2, 4, 4, 4, 4, 4, 3, 1)
2 Eigenvectors
i.e., we want
2 communities
Two
communities:
{𝑣1, 𝑣2, 𝑣3}
{𝑣4, 𝑣5, 𝑣6, 𝑣7, 𝑣8, 𝑣9}
𝒌-means,𝒌=𝟐
More on Laplacian
Laplacian and Diffusion:
• Consider a substance that diffuses from node to
node on a graph (e.g., gas or fluid)
• The substance quantity for node 𝑗 is $\varphi_j$
• The diffusion depends on the difference in
quantity between nodes
– e.g., gas pressure difference
• Then we have
Laplacian and Diffusion
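Why do we call it the Laplacian? Because the diffusion equation has the same form as the heat equation
($\frac{d\varphi}{dt} + \alpha\nabla^2\varphi = 0$, up to the sign convention for $\nabla^2$), with $L$ taking the place of the Laplace operator $\nabla^2$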
Laplacian and Diffusion
State 𝜑(𝑡) in terms of eigenvalue-eigenvectors of 𝐿
– L is symmetric and has real eigenvalues.
Some coefficient!
Eigenvectors are orthogonal, so right multiply by any 𝑣𝑗
II. Robust Communities
• The goal is to find subgraphs robust enough that
removing some edges or vertices does not
disconnect the subgraph
• 𝒌-vertex connected (𝒌-connected) graph:
– 𝑘 is the minimum number of nodes that must be
removed to disconnect the graph
• 𝒌-edge connected: at least 𝑘 edges must be
removed to disconnect the graph
• Examples:
– Complete graph of size 𝑛: the unique (𝑛 − 1)-connected graph on 𝑛 nodes under the usual convention (it cannot be disconnected by removing nodes)
– A cycle: 2-connected graph
III. Modular Communities
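Consider a graph 𝐺(𝑉, 𝐸), where the degrees are known
beforehand but the edges are not
– Consider two vertices 𝑣𝑖 and 𝑣𝑗 with degrees 𝑑𝑖 and 𝑑𝑗.
What is the expected number of edges between 𝑣𝑖 and 𝑣𝑗?
• For any edge going out of 𝑣𝑖 randomly, the probability
of this edge getting connected to vertex 𝑣𝑗 is $\frac{d_j}{\sum_i d_i} = \frac{d_j}{2m}$, where $m = |E|$
• Since 𝑣𝑖 has 𝑑𝑖 such edges, the expected number of edges between 𝑣𝑖 and 𝑣𝑗 is $\frac{d_i d_j}{2m}$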
Modularity and Modularity Maximization
• Given a degree distribution, we know the
expected number of edges between any pairs
of vertices
• We assume that real-world networks should
be far from random. Therefore, the more
distant they are from this randomly generated
network, the more structural they are.
• Modularity defines this distance and
modularity maximization tries to maximize
this distance
Normalized Modularity
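Consider a partitioning of the data $P = (P_1, P_2, P_3, \ldots, P_k)$
For partition $P_x$, this distance
can be defined as $\sum_{v_i, v_j \in P_x} \left(A_{ij} - \frac{d_i d_j}{2m}\right)$,
where $A$ is the adjacency matrix and $m$ is the number of edges
This distance can be
generalized for a partitioning 𝑃: $\sum_{x=1}^{k}\sum_{v_i, v_j \in P_x} \left(A_{ij} - \frac{d_i d_j}{2m}\right)$
The normalized version of this
distance is defined as Modularity: $Q = \frac{1}{2m}\sum_{x=1}^{k}\sum_{v_i, v_j \in P_x} \left(A_{ij} - \frac{d_i d_j}{2m}\right)$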
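Modularity is straightforward to compute directly from its definition: sum 𝐴𝑖𝑗 minus 𝑑𝑖𝑑𝑗/(2𝑚) over all ordered node pairs inside the same community, then divide by 2𝑚, where 𝑚 is the number of edges. The sketch below assumes an unweighted, undirected graph stored as an adjacency dict; the function name and the example graph are hypothetical.

```python
def modularity(adj, partition):
    """Q = (1 / 2m) * sum over same-community pairs of (A_ij - d_i d_j / 2m)."""
    two_m = sum(len(ns) for ns in adj.values())  # sum of degrees = 2m
    q = 0.0
    for part in partition:
        for i in part:
            for j in part:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_two_triangles = modularity(g, [{0, 1, 2}, {3, 4, 5}])
q_single_group = modularity(g, [{0, 1, 2, 3, 4, 5}])
```

Putting every node into one community gives modularity 0, while the two-triangle partition scores 5/14 ≈ 0.357, reflecting more internal edges than the degree-preserving random model would predict.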
Modularity Maximization
Modularity matrix: $B = A - \frac{d d^T}{2m}$
$d \in \mathbb{R}^{n \times 1}$ is the degree vector for all nodes
Reformulation of the modularity: $Q = \frac{1}{2m} Tr(X^T B X)$
• $X \in \mathbb{R}^{n \times k}$ is the indicator (partition membership) function:
 $X_{ij} = 1$ iff $v_i \in P_j$
• Similar to spectral clustering,
• We relax $X$ to be orthogonal, i.e., a real-valued matrix $\hat{X}$
• The optimal solution for $\hat{X}$ is the top 𝑘 eigenvectors of $B$.
• To recover the original $X$, we can run 𝑘-means on $\hat{X}$
• Difference: in modularity, the top 𝑘 eigenvalues have to be positive!
Modularity Maximization: Example
Modularity Matrix
𝑘-means
Two
Communities:
{1, 2, 3, 4}
and
{5, 6, 7, 8, 9}
2 eigenvectors
Making Modularity Optimization Faster
• The matrix method discussed is slow
and does not scale to millions of
nodes (and billions of edges)
• We can perform greedy optimization
of modularity to speed-up the process
• Louvain Method
– A greedy modularity optimization
method for community detection
• Invented when all authors were affiliated with
Université catholique de Louvain (UCL)
Louvain Method
The algorithm has multiple passes
– Each pass has two phases
1. Modularity Optimization
2. Community Aggregation
Image from
Blondel, Vincent D., Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. "Fast unfolding of communities in large
networks." Journal of statistical mechanics: theory and experiment 2008, no. 10 (2008): P10008.
Louvain Method
Start with a weighted network where all nodes are in
their own communities (i.e., n communities)
First Phase:
• For each node 𝑣𝑖,
– For all neighbors 𝑣𝑗 ∈ 𝑁(𝑣𝑖):
• compute the modularity gain if 𝑣𝑖 is removed from its
community and placed in the community of 𝑣𝑗.
– Find the community with the maximum modularity gain
– If the maximum gain is positive, remove 𝑣𝑖 from its
community, and place it in that community
– If no positive gain, do not change communities
• Repeat until no node changes its community
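The first phase can be sketched as below. This is a deliberately simple illustration under stated assumptions, not the published algorithm's implementation: it handles unweighted graphs only, and it recomputes modularity from scratch for every candidate move instead of using the fast incremental gain formula. The function names and the example graph are hypothetical.

```python
def modularity(adj, comm):
    """Modularity of the assignment comm: node -> community label."""
    two_m = sum(len(ns) for ns in adj.values())
    vol, inside = {}, {}
    for v, ns in adj.items():
        c = comm[v]
        vol[c] = vol.get(c, 0) + len(ns)
        # Counts each internal edge twice (once per endpoint).
        inside[c] = inside.get(c, 0) + sum(1 for u in ns if comm[u] == c)
    return sum(inside[c] / two_m - (vol[c] / two_m) ** 2 for c in vol)

def louvain_phase_one(adj):
    comm = {v: v for v in adj}          # every node starts in its own community
    moved = True
    while moved:                        # repeat until no node changes community
        moved = False
        for v in adj:
            base = modularity(adj, comm)
            original = comm[v]
            best_gain, best_c = 0.0, original
            for c in sorted({comm[u] for u in adj[v]} - {original}):
                comm[v] = c             # tentatively move v into a neighbor's community
                gain = modularity(adj, comm) - base
                if gain > best_gain + 1e-12:
                    best_gain, best_c = gain, c
            comm[v] = best_c            # keep the best move, or stay put
            if best_c != original:
                moved = True
    return comm

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
singletons = {v: v for v in g}
assignment = louvain_phase_one(g)
```

Because a node only moves when the gain is strictly positive, modularity never decreases and the loop terminates. The result depends on the order in which nodes are visited.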
Important Points about Phase I
• A node can be considered multiple times
• A local maximum of modularity is
reached in phase I
• Phase I is order dependent
– The modularity achieved is more or less stable and
is less dependent on the initial order
– The computation time depends on the initial order.
Louvain Method
Second Phase:
– Build a new network
• Nodes are communities
• Edges are the edges between nodes in the corresponding
communities (weights are sum of the weights)
• Self-loops represent edges within the community
• The algorithm creates hierarchies of
communities
• It usually ends in fewer than 10 passes
• It seems to run in O(𝑛 log 𝑛) time in practice
Why is it fast?
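Modularity can be restated as
$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{d_i d_j}{2m}\right]\delta(c_i, c_j)$
where $c_i$ is the community of node 𝑖 and $\delta$ is the Kronecker delta function
Consider node 𝑖, in an isolated community,
when it moves to community 𝐶
• The gain in modularity can be computed easily
$\Delta Q = \left[\frac{\Sigma_{in} + 2k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right]$
where $\Sigma_{in}$ counts edges within 𝑪, $k_{i,in}$ counts edges between 𝒊 and 𝑪, $\Sigma_{tot}$ counts all edges connected to
members of 𝑪, and $k_i$ is the degree of 𝒊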
IV. Dense Communities: 𝛾-dense
• The density of a graph defines how close a graph is
to a clique
• A subgraph 𝐺(𝑉, 𝐸) is 𝛾-dense (a quasi-clique) if $|E| \geq \gamma \binom{|V|}{2}$
• A 1-dense graph is a clique
• We can find quasi-cliques using the brute force
algorithm discussed previously, but there are more
efficient methods.
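The density test itself is a one-liner. The sketch below assumes the usual definition of density as the number of edges inside the node set divided by the number of possible edges, |V|(|V| − 1)/2. The function names and example graph are hypothetical.

```python
def density(adj, nodes):
    """Edges inside `nodes` divided by the number of possible edges."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0  # convention chosen for this sketch
    edges = sum(len(adj[v] & nodes) for v in nodes) / 2  # each edge seen twice
    return edges / (n * (n - 1) / 2)

def is_quasi_clique(adj, nodes, gamma):
    return density(adj, nodes) >= gamma

# Hypothetical example: a 4-clique with the edge (0, 1) removed (5 of 6 edges).
near_clique = {0: {2, 3}, 1: {2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
dense_at_08 = is_quasi_clique(near_clique, {0, 1, 2, 3}, gamma=0.8)
dense_at_10 = is_quasi_clique(near_clique, {0, 1, 2, 3}, gamma=1.0)
```

The example set has density 5/6, so it is 0.8-dense but not 1-dense (it is not a clique).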
Finding Maximal 𝛾-dense Quasi-Cliques
We can use a two-step procedure consisting of "local search"
and "heuristic pruning"
Local search
• Sample a subgraph, and find a maximal 𝛾-dense quasi-clique
• A greedy approach is to expand a quasi-clique by all of its high-
degree neighbors until the density drops below 𝛾
Heuristic pruning
• For a 𝛾-dense quasi-clique of size 𝑘, we recursively remove
nodes with degree less than 𝛾𝑘 and their incident edges
• We can start from low-degree nodes and recursively remove all
nodes with degree less than 𝛾𝑘
V. Hierarchical Communities
• Previous methods consider communities at a
single level
– Communities may have hierarchies.
• Each community can have sub/super communities.
– Hierarchical clustering deals with this scenario and
generates community hierarchies.
• Initially 𝑛 members are considered as either 1
or 𝑛 communities in hierarchical clustering.
These communities are gradually
– merged (agglomerative hierarchical clustering) or
– split (divisive hierarchical clustering)
Hierarchical Community Detection
• Goal: build a hierarchical structure of
communities based on network topology
• Allow the analysis of a network at different
resolutions
• Representative approaches:
– Divisive Hierarchical Clustering
– Agglomerative Hierarchical clustering
Divisive Hierarchical Clustering
• Divisive clustering
– Partition nodes into several sets
– Each set is further divided into smaller ones
– Network-centric partition can be applied for the partition
• One particular example:
Girvan-Newman Algorithm: recursively remove the
"weakest" links within a "community"
– Find the weakest link (edge)
– Remove the edge and update the corresponding strength of
each edge
• Recursively apply the above two steps until the network
is decomposed into a desired number of connected
components.
• Each component forms a community
Edge Betweenness
• To determine weakest links, the algorithm uses
edge betweenness.
Edge betweenness is the number of shortest
paths that pass along the edge
• Edge betweenness measures the "bridgeness" of
an edge between two communities
• An edge with high betweenness tends to be a
bridge between two communities.
Edge Betweenness: Example
The edge betweenness of
𝑒(1, 2) is 6/2 + 1 = 4, as all the
shortest paths from 2 to
{4, 5, 6, 7, 8, 9} have to either pass
𝑒(1, 2) or 𝑒(2, 3), and 𝑒(1,2) is the
shortest path between 1 and 2
The Girvan-Newman Algorithm
1. Calculate edge betweenness for all
edges in the graph.
2. Remove the edge with the highest
betweenness.
3. Recalculate betweenness for all
edges affected by the edge removal.
4. Repeat until all edges are removed.
Edge Betweenness Divisive Clustering: Example
the first edge that needs to be
removed is e(4, 5) ( or e(4, 6) )
Initial betweenness value
By removing e(4, 5), we compute the edge
betweenness once again; this time, e(4, 6)
has the highest betweenness value: 20.
This is because all shortest paths between
nodes {1,2,3,4} to nodes {5,6,7,8,9} must pass
e(4, 6); therefore, it has betweenness 4×5 = 20.
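Edge betweenness and the Girvan-Newman loop can be sketched as below. Several things here are assumptions: the 9-node edge list is a reconstruction chosen to be consistent with the numbers quoted in the examples (the original figure is not available), the betweenness routine follows the well-known Brandes accumulation scheme for unweighted graphs, and the loop recomputes all betweenness values after each removal rather than only the affected ones. It also stops at a requested number of components instead of removing every edge.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    bet = defaultdict(float)
    for s in adj:
        dist, sigma = {s: 0}, defaultdict(float)
        sigma[s] = 1.0                      # number of shortest paths from s
        preds, order, queue = defaultdict(list), [], deque([s])
        while queue:                        # BFS from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):           # accumulate from the farthest nodes back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2.0 for e, b in bet.items()}  # each pair was counted twice

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman(adj, n_communities):
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while len(components(adj)) < n_communities:
        bet = edge_betweenness(adj)
        if not bet:
            break                                  # no edges left to remove
        u, v = max(bet, key=bet.get)               # edge with highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Assumed edge list, consistent with the values quoted in the examples.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)
betweenness = edge_betweenness(graph)
communities = girvan_newman(graph, 2)
```

On the assumed graph, 𝑒(1, 2) has betweenness 4 as in the earlier example, 𝑒(4, 5) and 𝑒(4, 6) tie for the highest initial value, and removing both splits the graph into {1, 2, 3, 4} and {5, 6, 7, 8, 9}.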
Community Detection Algorithms
Community Evolution
Network and Community Evolution
• How does a network change over time?
• How does a community change over time?
• What properties do you expect to remain
roughly constant?
• What properties do you expect to change?
• For example,
– Where do you expect new edges to form?
– Which edges do you expect to be dropped?
How Do Networks Evolve?
Network Growth Patterns
1. Network Segmentation
2. Graph Densification
3. Diameter Shrinkage
1. Network Segmentation
• Often, in evolving networks, segmentation takes
place, where the large network is decomposed
over time into three parts
1. Giant Component: As network
connections stabilize, a giant
component of nodes is formed, with a
large proportion of network nodes and
edges falling into this component.
2. Stars: These are isolated parts of the
network that form star structures. A
star is a tree with one internal node and
n leaves.
3. Singletons: These are orphan nodes
disconnected from all nodes in the
network.
2. Graph Densification
• The density of the graph increases as the
network grows
– The number of edges increases faster than the
number of nodes does
• Densification power law: $E(t) \propto V(t)^{\alpha}$
• Densification exponent: 1 ≤ 𝛼 ≤ 2:
– 𝛼 = 1: linear growth – constant out-degree
– 𝛼 = 2: quadratic growth – clique
𝐸(𝑡) and 𝑉(𝑡) are numbers of edges and nodes respectively at time 𝑡
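If the number of edges grows as a power of the number of nodes, the exponent 𝛼 is the slope of log 𝐸(𝑡) against log 𝑉(𝑡). The sketch below estimates it with an ordinary least-squares fit; the function name and the synthetic snapshot counts are made up for illustration and are not measurements from any real network.

```python
from math import log

def densification_exponent(node_counts, edge_counts):
    """Least-squares slope of log E(t) versus log V(t)."""
    xs = [log(v) for v in node_counts]
    ys = [log(e) for e in edge_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic snapshots generated with exponent 1.5 (illustrative only).
nodes_over_time = [10, 100, 1000, 10000]
edges_over_time = [v ** 1.5 for v in nodes_over_time]
alpha = densification_exponent(nodes_over_time, edges_over_time)
```

On the synthetic data the fit recovers the exponent used to generate it; on real snapshots the points will scatter around the fitted line.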
Densification in Real Networks
[Figure: log-log plots of E(t) versus V(t). Physics citations: densification exponent 1.69. Patent citations: densification exponent 1.66.]
3. Diameter Shrinking
• In networks, the diameter shrinks over time
[Figure: diameter over time for the ArXiv citation graph and an affiliation network]
How Do Communities Evolve?
Community Evolution
• Communities also expand, shrink, or dissolve in
dynamic networks
Community Detection in
Evolving Networks
Extending Previous Methods
1. Take t snapshots of the network, 𝐺1,𝐺2, ⋯ , 𝐺𝑡
where 𝐺𝑖 is a snapshot at time 𝑖
2. Perform a static community detection
algorithm (all methods discussed before) on
all snapshots independently
3. Assign community members based on
communities found in all 𝑡 time stamps.
– E.g., Assign nodes to communities based on voting
(assign nodes to communities they belong to the
most over time)
Unstable in highly dynamic networks as community
memberships are always changing
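The voting step can be sketched as follows. One strong assumption is made that the slides do not address: community labels are already matched across snapshots, so that label 0 at time 1 denotes the same community as label 0 at time 2. The function name and the toy snapshots are hypothetical.

```python
from collections import Counter

def vote_memberships(snapshots):
    """Assign each node to the community label it had most often.

    snapshots is a list of dicts (one per time step) mapping node -> label.
    """
    votes = {}
    for assignment in snapshots:
        for node, label in assignment.items():
            votes.setdefault(node, Counter())[label] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}

# Hypothetical result of running a static method on three snapshots.
snapshots = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 1},
    {"a": 0, "b": 0, "c": 1},
]
final = vote_memberships(snapshots)
```

Node "b" switches community in the middle snapshot but is assigned to community 0, where it spends most of its time. In a highly dynamic network such majorities become thin, which is the instability noted above.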
Evolutionary Clustering
• Assume that communities don’t change most
of the time
• Minimize an objective function that considers
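– Snapshot cost (SC): how well the communities fit the network at the current time
– Temporal cost (TC): how much the communities change from the previous time
• The objective function is defined as
$Cost = \alpha \cdot SC + (1 - \alpha) \cdot TC$, where $0 \le \alpha \le 1$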
Evolutionary Clustering
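• If we use spectral clustering for each snapshot,
then $Cost_t$, the objective function at time 𝑡, is
$Cost_t = \alpha \, Tr(X_t^T L X_t) + (1 - \alpha) \, TC$, subject to $X_t^T X_t = I$, with weight $0 \le \alpha \le 1$
• 𝑋𝑡 is the community membership matrix at 𝑡, and 𝐿 is the graph Laplacian of the snapshot at 𝑡
• To define TC we can use
$TC = \| X_t - X_{t-1} \|^2$
• Challenges with this definition
– Assumes that we have the same number of
communities at time 𝑡 − 1 and 𝑡
– 𝑋𝑡 is non-unique (any orthogonal transformation of it is still
a solution)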
Evolutionary Clustering
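• We can define TC as
$TC = \frac{1}{2} \| X_t X_t^T - X_{t-1} X_{t-1}^T \|^2$
• Hence the cost will be
$Cost_t = \alpha \, Tr(X_t^T L X_t) + (1 - \alpha) \, \frac{1}{2} \| X_t X_t^T - X_{t-1} X_{t-1}^T \|^2$
$= \alpha \, Tr(X_t^T L X_t) + (1 - \alpha) \, Tr(X_t^T X_t) - (1 - \alpha) \, Tr(X_t^T X_{t-1} X_{t-1}^T X_t)$
(using $X_t^T X_t = I$; the last step holds up to an additive constant that does not depend on 𝑋𝑡)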
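The following sketch illustrates why a temporal cost built from $X X^T$ is preferable to one built from $X$ directly. It assumes the projection-based form $TC = \frac{1}{2} \| X_t X_t^T - X_{t-1} X_{t-1}^T \|^2$ with the squared Frobenius norm. The helper names and the tiny membership matrices are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sq_frobenius(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def naive_tc(X_t, X_prev):
    """||X_t - X_prev||^2: sensitive to orthogonal transformations."""
    return sq_frobenius(X_t, X_prev)

def projection_tc(X_t, X_prev):
    """0.5 * ||X_t X_t^T - X_prev X_prev^T||^2: invariant to them."""
    P_t = matmul(X_t, transpose(X_t))
    P_prev = matmul(X_prev, transpose(X_prev))
    return 0.5 * sq_frobenius(P_t, P_prev)

# Two communities over four nodes, columns scaled to unit length.
s = 1 / math.sqrt(2)
X_prev = [[s, 0.0], [s, 0.0], [0.0, s], [0.0, s]]
# The same solution rotated by 90 degrees (an orthogonal transformation).
Q = [[0.0, 1.0], [-1.0, 0.0]]
X_t = matmul(X_prev, Q)

naive = naive_tc(X_t, X_prev)
proj = projection_tc(X_t, X_prev)
print(naive, proj)
```

The communities are identical in both matrices, yet the naive distance is large while the projection-based cost is zero. The projection form also works when the two matrices have different numbers of columns, since both products are n × n.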
Evolutionary Clustering
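• Assuming the normalized Laplacian is used, $L = I - D_t^{-1/2} A_t D_t^{-1/2}$, where $A_t$ and $D_t$ are the adjacency and degree matrices at time 𝑡,
$Cost_t = Tr(X_t^T \hat{L} X_t)$, where $\hat{L} = I - \alpha D_t^{-1/2} A_t D_t^{-1/2} - (1 - \alpha) X_{t-1} X_{t-1}^T$
• Similar to spectral clustering, 𝑋𝑡 can be obtained by taking the top eigenvectors of $\hat{L}$ (those with the smallest eigenvalues)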
Community Evaluation
Evaluating the Communities
• Evaluation with ground truth
• Evaluation without ground truth
We are given objects of two
different kinds (+, ×)
• The perfect
community: all objects
inside the community
are of the same type
Evaluation with Ground Truth
• When ground truth is available
– We have partial knowledge of what communities
should look like
– We are given the correct community (clustering)
assignments
• Measures
– Precision and Recall, or F-Measure
– Purity
– Normalized Mutual Information (NMI)
Precision and Recall
Precision = (Relevant and retrieved) / Retrieved = TP / (TP + FP)
Recall = (Relevant and retrieved) / Relevant = TP / (TP + FN)
Each decision below concerns a pair of members.
False Negative (FN):
• When similar members are
assigned to different communities
• An incorrect decision
False Positive (FP):
• When dissimilar members are
assigned to the same community
• An incorrect decision
True Positive (TP):
• When similar members are
assigned to the same community
• A correct decision
True Negative (TN):
• When dissimilar members are
assigned to different communities
• A correct decision
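A minimal sketch of the pair-counting scheme behind these four cases, assuming "similar" means two members share the same ground-truth label. The function name and the six-node example are hypothetical.

```python
from itertools import combinations

def pair_counts(found, truth):
    """found[i] and truth[i] are the community and true label of node i."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(found)), 2):
        same_community = found[i] == found[j]
        same_label = truth[i] == truth[j]
        if same_community and same_label:
            tp += 1  # similar members, same community
        elif same_community:
            fp += 1  # dissimilar members, same community
        elif same_label:
            fn += 1  # similar members, different communities
        else:
            tn += 1  # dissimilar members, different communities
    return tp, fp, fn, tn

found = [1, 1, 1, 2, 2, 2]
truth = ["x", "x", "o", "o", "o", "x"]
tp, fp, fn, tn = pair_counts(found, truth)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, precision, recall)
```

The loop visits every pair once, so it is quadratic in the number of nodes; for large networks the counts are usually derived from a contingency table instead.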
Precision and Recall: Example
F-Measure
𝑃 and 𝑅 each measure only one aspect of the
performance
– To integrate them into one measure, we can use the
harmonic mean of precision and recall
$F = 2 \cdot \frac{P \cdot R}{P + R}$
For the example earlier,
$F = 2 \times \frac{0.6 \times 0.61}{0.6 + 0.61} = 0.60$
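The arithmetic above can be checked with a few lines of code. This is a small sketch; the function name is an assumption, and the zero-denominator guard is a convention chosen here rather than something stated in the slides.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen for this sketch
    return 2 * precision * recall / (precision + recall)

f = f_measure(0.6, 0.61)
print(round(f, 2))  # 0.6
```

Because the harmonic mean is pulled toward the smaller value, a method cannot score a high F by maximizing only one of precision or recall.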
Purity
We can assume the majority of a community
represents the community
– We use the label of the majority against the label of
each member to evaluate the communities
Purity. The fraction of instances that have labels
equal to the community’s majority label
$Purity = \frac{1}{N} \sum_{i=1}^{k} \max_{j} |C_i \cap L_j|$
• 𝑘: the number of communities
• 𝑁: the total number of nodes
• 𝐿𝑗: the set of instances with label 𝑗 in all communities
• 𝐶𝑖: the set of members in community 𝑖
Purity can be easily manipulated by
• Points being singleton communities (of size 1); or by
• Very large communities
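A short sketch of purity, assuming it is computed as the sum of each community's majority-label count divided by the total number of nodes. The function name and the data are hypothetical; the second call shows the singleton weakness mentioned above.

```python
from collections import Counter

def purity(found, truth):
    """Fraction of nodes whose label equals their community's majority label."""
    members = {}
    for community, label in zip(found, truth):
        members.setdefault(community, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(found)

found = [1, 1, 1, 2, 2, 2]
truth = ["x", "x", "o", "o", "o", "x"]
score = purity(found, truth)                 # (2 + 2) / 6
singletons = purity(list(range(6)), truth)   # every node is its own community
print(score, singletons)
```

Putting every node in its own community yields a perfect score of 1.0 without revealing anything useful, which is why purity is usually reported alongside other measures.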
Mutual Information
• Mutual information (MI). The amount of
information that two random variables share.
– It measures how much knowing one of the variables
reduces the uncertainty about the other
$MI = I(L, H) = \sum_{h \in H} \sum_{l \in L} \frac{n_{h,l}}{n} \log \frac{n \cdot n_{h,l}}{n_h \, n_l}$
• 𝐿 and 𝐻 are labels and found communities;
• 𝑛ℎ and 𝑛𝑙 are the numbers of data points in community
ℎ and with label 𝑙, respectively;
• 𝑛ℎ,𝑙 is the number of nodes in community ℎ and with
label 𝑙; and 𝑛 is the number of nodes
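Using the counts defined above, mutual information can be sketched as a sum over the non-empty cells of the contingency table, assuming the standard form $\sum_h \sum_l \frac{n_{h,l}}{n} \log \frac{n \cdot n_{h,l}}{n_h n_l}$ with natural logarithms. The function name and the toy inputs are hypothetical.

```python
import math
from collections import Counter

def mutual_information(found, truth):
    """MI between found communities and labels, in nats."""
    n = len(found)
    n_h = Counter(found)              # community sizes
    n_l = Counter(truth)              # label sizes
    n_hl = Counter(zip(found, truth)) # contingency counts
    return sum(
        (count / n) * math.log(n * count / (n_h[h] * n_l[l]))
        for (h, l), count in n_hl.items()
    )

identical = mutual_information([1, 1, 2, 2], ["a", "a", "b", "b"])
independent = mutual_information([1, 1, 2, 2], ["a", "b", "a", "b"])
print(identical, independent)
```

When the communities reproduce the labels exactly, MI equals the entropy of the labels (log 2 here); when they carry no information about the labels, MI is zero.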
Normalizing Mutual Information (NMI)
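• Mutual information (MI) has no fixed upper bound, whereas measures with values in [0, 1] are easier to compare
• To address this issue, we can normalize MI
• How? We know that
$MI \le \min(H(L), H(H))$
• 𝐻(·) is the entropy function:
$H(L) = -\sum_{l \in L} \frac{n_l}{n} \log \frac{n_l}{n}$ and $H(H) = -\sum_{h \in H} \frac{n_h}{n} \log \frac{n_h}{n}$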
Normalized Mutual Information
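Since $MI \le H(L)$ and $MI \le H(H)$, we have $MI \le \sqrt{H(L)} \sqrt{H(H)}$, which gives
$NMI = \frac{MI}{\sqrt{H(L)} \sqrt{H(H)}}$
We can also define it as
$NMI = \frac{2 \cdot MI}{H(L) + H(H)}$
Note that $MI \le \frac{1}{2} (H(H) + H(L))$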
Normalized Mutual Information
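• NMI values close to one indicate high similarity
between communities found and labels
• Values close to zero indicate high dissimilarity
between them
$NMI = \frac{\sum_{h \in H} \sum_{l \in L} n_{h,l} \log \frac{n \cdot n_{h,l}}{n_h n_l}}{\sqrt{\left( \sum_{h \in H} n_h \log \frac{n_h}{n} \right) \left( \sum_{l \in L} n_l \log \frac{n_l}{n} \right)}}$
• where 𝑙 and ℎ are known (with labels) and found communities, respectively
• 𝑛ℎ and 𝑛𝑙 are the numbers of members in community ℎ and with label 𝑙, respectively,
• 𝑛ℎ,𝑙 is the number of members in community ℎ and labeled 𝑙,
• 𝑛 is the size of the dataset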
Normalized Mutual Information: Example
Found communities (H)
– [1,1,1,1,1,1, 2,2,2,2,2,2,2,2]
Actual Labels (L)
– [2,1,1,1,1,1, 2,2,2,2,2,2,1,1]
𝑛 = 14

𝑛ℎ:    ℎ = 1: 6    ℎ = 2: 8
𝑛𝑙:    𝑙 = 1: 7    𝑙 = 2: 7

𝑛ℎ,𝑙     𝑙 = 1   𝑙 = 2
ℎ = 1      5       1
ℎ = 2      2       6
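The example can be reproduced with a short script. This sketch assumes the geometric-mean normalization, NMI = MI / sqrt(H(H) · H(L)), and natural logarithms (the base cancels in the ratio). The function name is hypothetical, and no guard is included for the degenerate case where one side has a single group.

```python
import math
from collections import Counter

def nmi(found, truth):
    """Normalized mutual information with geometric-mean normalization."""
    n = len(found)
    n_h, n_l = Counter(found), Counter(truth)
    n_hl = Counter(zip(found, truth))
    mi = sum(
        c / n * math.log(n * c / (n_h[h] * n_l[l]))
        for (h, l), c in n_hl.items()
    )
    entropy_h = -sum(c / n * math.log(c / n) for c in n_h.values())
    entropy_l = -sum(c / n * math.log(c / n) for c in n_l.values())
    return mi / math.sqrt(entropy_h * entropy_l)

H = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]  # found communities
L = [2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1]  # actual labels

contingency = Counter(zip(H, L))  # keys are (h, l) pairs
score = nmi(H, L)
print(dict(contingency))
print(round(score, 2))  # 0.26
```

The contingency counts match the table above (5, 1, 2, 6), and the moderate score reflects that 3 of the 14 nodes sit in a community dominated by the other label.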
Evaluation without Ground Truth
• Evaluation with Semantics
– A simple way of analyzing detected communities is to analyze other attributes
(posts, profile information, content generated, etc.) of community members to see
whether the members are coherent
– Coherence is often checked by human subjects
• Or through labor markets such as Amazon Mechanical Turk
– To help analyze these communities, one can use word frequencies. Given
a list of frequent keywords for each community, human subjects determine
whether these keywords represent a coherent topic.
• Evaluation Using Clustering Quality Measures
– Use clustering quality measures such as the sum of squared errors (SSE)
– Run two or more community detection algorithms, compare the results, and
pick the algorithm with the better quality measure
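The word-frequency idea can be sketched as below: collect the posts of each detected community, drop very common words, and list the most frequent remaining words for human judges. The stopword list, function name, and posts are all hypothetical and far smaller than anything used in practice.

```python
from collections import Counter

# Illustrative stopword list only; real analyses use a much larger one.
STOPWORDS = frozenset({"the", "a", "an", "and", "about", "to", "of", "is"})

def top_keywords(posts_by_community, k=3):
    """Return the k most frequent non-stopwords for each community."""
    keywords = {}
    for community, posts in posts_by_community.items():
        counts = Counter(
            word
            for post in posts
            for word in post.lower().split()
            if word not in STOPWORDS
        )
        keywords[community] = [word for word, _ in counts.most_common(k)]
    return keywords

posts = {
    "c1": ["the election debate tonight", "debate about the election results"],
    "c2": ["new phone camera review", "camera and phone battery"],
}
keywords = top_keywords(posts, k=2)
print(keywords)
```

A judge would then decide whether each keyword list, such as the election-related words for "c1", describes a single coherent topic.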

Social Media Mining - Chapter 6 (Community Analysis)

  • 2. 2Social Media Mining Measures and Metrics 2Social Media Mining Community Analysishttp://socialmediamining.info/ Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate these slides into your presentations, please include the following note: R. Zafarani, M. A. Abbasi, and H. Liu, Social Media Mining: An Introduction, Cambridge University Press, 2014. Free book and slides at http://socialmediamining.info/ or include a link to the website: http://socialmediamining.info/
  • 3. 3Social Media Mining Measures and Metrics 3Social Media Mining Community Analysishttp://socialmediamining.info/ Social Community A group of individuals with common economic, social, or political interests or characteristics, often living in relative proximity. [real-world] community
  • 4. 4Social Media Mining Measures and Metrics 4Social Media Mining Community Analysishttp://socialmediamining.info/ Why analyze communities? Analyzing communities helps better understand users – Users form groups based on their interests Groups provide a clear global view of user interactions • E.g., find polarization Some behaviors are only observable in a group setting and not on an individual level – Some republican can agree with some democrats, but their parties can disagree
  • 5. 5Social Media Mining Measures and Metrics 5Social Media Mining Community Analysishttp://socialmediamining.info/ Social Media Communities • Formation: – When like-minded users on social media form a link and start interacting with each other • More Formal Formation: 1. A set of at least two nodes sharing some interest, and 2. Interactions with respect to that interest. • Social Media Communities – Explicit (emic): formed by user subscriptions – Implicit (etic): implicitly formed by social interactions • Example: individuals calling Canada from the United States • Phone operator considers them one community for promotional offers • Other community names: group, cluster, cohesive subgroup, or module
  • 6. 6Social Media Mining Measures and Metrics 6Social Media Mining Community Analysishttp://socialmediamining.info/ Examples of Explicit Social Media Communities Facebook has groups and communities. Users can – post messages and images, – can comment on other messages, – can like posts, and – can view activities of other users In Google+, Circles represent communities In Twitter, communities form as lists. – Users join lists to receive information in the form of tweets LinkedIn provides Groups and Associations. – Users can join professional groups where they can post and share information related to the group
  • 7. 7Social Media Mining Measures and Metrics 7Social Media Mining Community Analysishttp://socialmediamining.info/ Finding Implicit Communities: An Example A simple graph in which three implicit communities are found, enclosed by the dashed circles
  • 8. 8Social Media Mining Measures and Metrics 8Social Media Mining Community Analysishttp://socialmediamining.info/ Overlapping vs. Disjoint Communities Overlapping Communities Disjoint Communities
  • 9. 9Social Media Mining Measures and Metrics 9Social Media Mining Community Analysishttp://socialmediamining.info/ Implicit communities in other domains Protein-protein interaction networks – Communities are likely to group proteins having the same specific function within the cell World Wide Web – Communities may correspond to groups of pages dealing with the same or related topics Metabolic networks – Communities may be related to functional modules such as cycles and pathways Food webs – Communities may identify compartments
  • 10. 10Social Media Mining Measures and Metrics 10Social Media Mining Community Analysishttp://socialmediamining.info/ Real-world Implicit Communities Collaboration network between scientists working at the Santa Fe Institute. The colors indicate high level communities and correspond to research divisions of the institute
  • 11. 11Social Media Mining Measures and Metrics 11Social Media Mining Community Analysishttp://socialmediamining.info/ What is Community Analysis? • Community detection – Discovering implicit communities • Community evolution – Studying temporal evolution of communities • Community evaluation – Evaluating Detected Communities
  • 12. 12Social Media Mining Measures and Metrics 12Social Media Mining Community Analysishttp://socialmediamining.info/ Community Detection
  • 13. 13Social Media Mining Measures and Metrics 13Social Media Mining Community Analysishttp://socialmediamining.info/ What is community detection? • The process of finding clusters of nodes (‘‘communities’’) – With Strong internal connections and – Weak connections between different communities • Ideal decomposition of a large graph – Completely disjoint communities – There are no interactions between different communities. • In practice, – find community partitions that are maximally decoupled.
  • 14. 14Social Media Mining Measures and Metrics 14Social Media Mining Community Analysishttp://socialmediamining.info/ Why Detecting Communities is Important? • The club members split into two groups (gray and white) • Disagreement between the administrator of the club (node 34) and the club’s instructor (node 1), • The members of one group left to start their own club The same communities can be found using community detection Zachary's karate club Interactions between 34 members of a karate club for over two years
  • 15. 15Social Media Mining Measures and Metrics 15Social Media Mining Community Analysishttp://socialmediamining.info/ Why Community Detection? Network Summarization – A community can be considered as a summary of the whole network – Easier to visualize and understand Preserve Privacy – [Sometimes] a community can reveal some properties without releasing the individuals’ privacy information.
  • 16. 16Social Media Mining Measures and Metrics 16Social Media Mining Community Analysishttp://socialmediamining.info/ Community Detection vs. Clustering Clustering • Data is often non-linked (matrix rows) • Clustering works on the distance or similarity matrix, e.g., 𝑘-means. • If you use 𝑘-means with adjacency matrix rows, you are only considering the ego-centric network Community detection • Data is linked (a graph) • Network data tends to be “discrete”, leading to algorithms using the graph property directly – 𝑘-clique, quasi-clique, or edge-betweenness
  • 17. 17Social Media Mining Measures and Metrics 17Social Media Mining Community Analysishttp://socialmediamining.info/ Community Detection Algorithms Group Users based on Group attributes Group Users based on Member attributes
  • 18. 18Social Media Mining Measures and Metrics 18Social Media Mining Community Analysishttp://socialmediamining.info/ Member-Based Community Detection
  • 19. 19Social Media Mining Measures and Metrics 19Social Media Mining Community Analysishttp://socialmediamining.info/ Member-Based Community Detection • Look at node characteristics; and • Identify nodes with similar characteristics and consider them a community Node Characteristics A. Degree – Nodes with same (or similar) degrees are in one community – Example: cliques B. Reachability – Nodes that are close (small shortest paths) are in one community – Example: 𝑘-cliques, 𝑘-clubs, and 𝑘-clans C. Similarity – Similar nodes are in the same community
  • 20. 20Social Media Mining Measures and Metrics 20Social Media Mining Community Analysishttp://socialmediamining.info/ A. Node Degree Most common subgraph searched for: • Clique: a maximum complete subgraph in which all nodes inside the subgraph adjacent to each other Find communities by searching for 1. The maximum clique: the one with the largest number of vertices, or 2. All maximal cliques: cliques that are not subgraphs of a larger clique; i.e., cannot be further expanded Both problems are NP-hard To overcome this, we can I. Brute Force II. Relax cliques III. Use cliques as the core for larger communities
  • 21. 21Social Media Mining Measures and Metrics 21Social Media Mining Community Analysishttp://socialmediamining.info/ I. Brute-Force Method Can find all the maximal cliques in the graph For each vertex 𝑣 𝑥, we find the maximal clique that contains node 𝑣 𝑥 Impractical for large networks: • For a complete graph of only 100 nodes, the algorithm will generate at least 299 − 1 different cliques starting from any node in the graph
  • 22. 22Social Media Mining Measures and Metrics 22Social Media Mining Community Analysishttp://socialmediamining.info/ Enhancing the Brute-Force Performance [Systematic] Pruning can help: • When searching for cliques of size 𝑘 or larger • If the clique is found, each node should have a degree equal to or more than 𝑘 − 1 • We can first prune all nodes (and edges connected to them) with degrees less than 𝑘 − 1 – More nodes will have degrees less than 𝑘 − 1 – Prune them recursively • For large 𝑘, many nodes are pruned as social media networks follow a power-law degree distribution
  • 23. 23Social Media Mining Measures and Metrics 23Social Media Mining Community Analysishttp://socialmediamining.info/ Maximum Clique: Pruning… Example. to find a clique ≥ 4, remove all nodes with degree ≤ (4 − 1) − 1 = 2 – Remove nodes 2 and 9 – Remove nodes 1 and 3 – Remove node 4 Even with pruning, cliques are less desirable – Cliques are rare – A clique of 1000 nodes, has 999x1000/2 edges – A single edge removal destroys the clique – That is less than 0.0002% of the edges!
  • 24. 24Social Media Mining Measures and Metrics 24Social Media Mining Community Analysishttp://socialmediamining.info/ II. Relaxing Cliques • 𝒌-plex: a set of vertices 𝑉 in which we have • 𝑑 𝑣 is the degree of 𝑣 in the induced subgraph – Number of nodes from 𝑉 that are connected to 𝑣 • Clique of size 𝑘 is a 1-plex • Finding the maximum 𝑘-plex: NP-hard – In practice, relatively easier due to smaller search space. Maximal 𝑘-plexes
  • 25. More Clique Relaxations… • $k$-core: a maximal connected subgraph in which all vertices have degree at least $k$ – Difference from a $k$-plex? • $k$-shell: nodes that are part of the $k$-core but not part of the $(k + 1)$-core. Questions • 0-core? • 0-shell? • 1-core? • $k$-cores of the complete graph? [Figure captions: “A clique that wants to relax”; “A 2-plex or a 2-core?”]
  • 26. III. Using Cliques as the Seed of a Community. Clique Percolation Method (CPM) – Uses cliques as seeds to find larger communities – CPM finds overlapping communities • Input – A parameter $k$ and a network • Procedure – Find all cliques of size $k$ in the given network – Construct a clique graph, in which two cliques are adjacent if they share $k - 1$ nodes – Each connected component in the clique graph forms a community
  • 27. Clique Percolation Method: Example. Cliques of size 3: $\{v_1, v_2, v_3\}$, $\{v_3, v_4, v_5\}$, $\{v_4, v_5, v_6\}$, $\{v_4, v_5, v_7\}$, $\{v_4, v_6, v_7\}$, $\{v_5, v_6, v_7\}$, $\{v_6, v_7, v_8\}$, $\{v_8, v_9, v_{10}\}$. Communities: $\{v_1, v_2, v_3\}$, $\{v_8, v_9, v_{10}\}$, $\{v_3, v_4, v_5, v_6, v_7, v_8\}$
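A small Python sketch of the Clique Percolation Method, run on the example above. The edge list is an assumption: it contains exactly the edges implied by the eight listed 3-cliques, because the original figure is not available. The function name `clique_percolation` is illustrative, and the brute-force clique enumeration is only suitable for tiny graphs.

```python
from itertools import combinations


def clique_percolation(adj, k):
    """Return CPM communities: unions of k-cliques chained by k - 1 shared nodes."""
    # Step 1: enumerate all cliques of size k (brute force, small graphs only).
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    # Step 2: clique graph, kept implicitly with a union-find over clique indices.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    # Step 3: each connected component of the clique graph is one community.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=sorted)


# Assumed edge list, reconstructed from the 3-cliques listed on the slide.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5), (4, 6), (4, 7),
         (5, 6), (5, 7), (6, 7), (6, 8), (7, 8), (8, 9), (8, 10), (9, 10)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

communities = clique_percolation(adj, k=3)
print(communities)
```

Nodes 3 and 8 each appear in two communities, which illustrates the overlap that CPM allows.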
  • 28. B. Node Reachability. The two extremes: nodes are assumed to be in the same community 1. If there is a path between them (regardless of the distance). How? Find connected components using BFS/DFS. Challenge: most nodes are in one community (the giant component). 2. If they are so close as to be immediate neighbors. How? Find cliques. Challenge: cliques are hard to find and are rarely observed. Solution: find communities that lie between cliques and connected components in terms of connectivity and that have small shortest paths between their nodes
  • 29. Special Subgraphs 1. $k$-Clique: a maximal subgraph in which the largest shortest-path distance between any two nodes is less than or equal to $k$ 2. $k$-Club: follows the same definition as a $k$-clique – Additional constraint: nodes on the shortest paths should be part of the subgraph (i.e., the subgraph's diameter is at most $k$) 3. $k$-Clan: a $k$-clique where, for all shortest paths within the subgraph, the distance is less than or equal to $k$ – All $k$-clans are $k$-cliques, but not vice versa
  • 30. More Special Subgraphs! • $k$-truss: the largest subgraph in which every edge belongs to at least $k - 2$ triangles • What is the relationship between a $k$-core and a $k$-truss?
  • 31. C. Node Similarity • Similar (or most similar) nodes are assumed to be in the same community – A classical clustering algorithm (e.g., $k$-means) is applied to node similarities to find communities • Node similarity can be defined using – The similarity of node neighborhoods (structural equivalence) – Ch. 3 – The similarity of social circles (regular equivalence) – Ch. 3. Structural equivalence: two nodes are structurally equivalent iff they are connected to the same set of actors. [Figure: nodes 1 and 3 are structurally equivalent; so are nodes 5 and 7.]
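  • 32. Node Similarity (Structural Equivalence). Jaccard similarity: $\sigma_{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}$. Cosine similarity: $\sigma_{Cosine}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{\sqrt{|N(v_i)|\,|N(v_j)|}}$, where $N(v)$ is the set of neighbors of $v$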
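Both measures compare the neighbor sets of two nodes: Jaccard divides the number of shared neighbors by the size of the union of the two neighborhoods, and cosine divides it by the geometric mean of the two neighborhood sizes. A minimal Python sketch follows; the function names and the five-node graph are illustrative assumptions, not taken from the slides.

```python
import math


def jaccard(adj, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|, with 0.0 for two isolated nodes."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def cosine(adj, u, v):
    """|N(u) & N(v)| / sqrt(|N(u)| * |N(v)|), with 0.0 if a node is isolated."""
    denom = math.sqrt(len(adj[u]) * len(adj[v]))
    return len(adj[u] & adj[v]) / denom if denom else 0.0


# Hypothetical graph: nodes 1 and 3 have identical neighborhoods {2, 4}.
adj = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3, 5}, 5: {4}}

print(jaccard(adj, 1, 3), cosine(adj, 1, 3))  # structurally equivalent pair
print(jaccard(adj, 1, 5), cosine(adj, 1, 5))  # partially overlapping pair
```

The resulting pairwise similarities can then be passed to a standard clustering algorithm such as $k$-means.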
  • 33. Group-Based Community Detection
  • 34. Group-Based Community Detection: finding communities that have certain group properties. Group properties: I. Balanced: spectral clustering II. Robust: $k$-connected graphs III. Modular: modularity maximization IV. Dense: quasi-cliques V. Hierarchical: hierarchical clustering
  • 35. I. Balanced Communities • Community detection can be thought of as graph clustering • Graph clustering: we cut the graph into several partitions and assume these partitions represent communities • Cut: a partitioning of the graph into two (or more) sets (cutsets) – The size of the cut is the number of edges that are being cut • Minimum cut (min-cut) problem: find a graph partition such that the number of edges between the two sets is minimized. Min-cuts can be computed efficiently using the max-flow min-cut theorem, but min-cut often returns an imbalanced partition, with one set being a singleton. [Figure: min-cut]
  • 36. Ratio Cut and Normalized Cut • To mitigate the min-cut problem, we can change the objective function to take community size into account: $\text{Ratio Cut}(P) = \frac{1}{k}\sum_{i=1}^{k}\frac{cut(P_i, \bar{P_i})}{|P_i|}$ and $\text{Normalized Cut}(P) = \frac{1}{k}\sum_{i=1}^{k}\frac{cut(P_i, \bar{P_i})}{vol(P_i)}$ • $\bar{P_i} = V - P_i$ is the complement cut set • $cut(P_i, \bar{P_i})$ is the size of the cut • $vol(P_i) = \sum_{v \in P_i} d_v$
  • 37. Ratio Cut & Normalized Cut: Example. [Figure: a graph with two candidate cuts, A and B, and the ratio cut and normalized cut values for each.] Both ratio cut and normalized cut prefer a balanced partition
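To make the preference for balanced partitions concrete, here is a hedged Python sketch that evaluates both objectives, i.e., the average over the parts of cut size divided by part size (ratio cut) or by the sum of member degrees (normalized cut). The graph is a made-up example (two triangles joined by a bridge, plus one pendant node), not the graph from the slide's figure, and the function names are illustrative.

```python
def cut_size(adj, part):
    """Number of edges with exactly one endpoint inside `part`."""
    return sum(1 for u in part for v in adj[u] if v not in part)


def ratio_cut(adj, partition):
    """Average over parts of cut(P_i, complement) / |P_i|."""
    return sum(cut_size(adj, p) / len(p) for p in partition) / len(partition)


def normalized_cut(adj, partition):
    """Average over parts of cut(P_i, complement) / vol(P_i)."""
    total = 0.0
    for p in partition:
        vol = sum(len(adj[u]) for u in p)  # vol(P_i) = sum of member degrees
        total += cut_size(adj, p) / vol
    return total / len(partition)


# Hypothetical graph: triangles {1,2,3} and {4,5,6}, bridge 3-4, pendant node 7.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (6, 7)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

singleton = [{7}, {1, 2, 3, 4, 5, 6}]   # cuts one edge, very imbalanced
balanced = [{1, 2, 3}, {4, 5, 6, 7}]    # also cuts one edge, balanced

for name, part in [("singleton", singleton), ("balanced", balanced)]:
    print(name, round(ratio_cut(adj, part), 3), round(normalized_cut(adj, part), 3))
```

Both partitions cut exactly one edge, so min-cut cannot tell them apart, while both normalized objectives score the balanced partition lower.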
  • 38. Spectral Clustering. Reformulating ratio cut (or normalized cut) in matrix format • Let $X_{ij} = 1$ when node $i$ is a member of community $j$, and 0 otherwise • Let $D = diag(d_1, d_2, \cdots, d_n)$ be the diagonal degree matrix • The $i$th entry on the diagonal of $X^T A X$ is twice the number of edges inside community $i$ (each internal edge is counted from both endpoints) • The $i$th entry on the diagonal of $X^T D X$ is the sum of the degrees of the members of community $i$, i.e., $vol(P_i)$ • The $i$th entry on the diagonal of $X^T (D - A) X$ is the number of edges in the cut that separates community $i$ from the other nodes, i.e., it equals $cut(P_i, \bar{P_i})$
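  • 39. Spectral Clustering. So ratio cut is $\text{Ratio Cut}(P) = \frac{1}{k}\sum_{i=1}^{k}\frac{cut(P_i, \bar{P_i})}{|P_i|} = \frac{1}{k}\sum_{i=1}^{k}\frac{X_i^T (D - A) X_i}{X_i^T X_i} = \frac{1}{k}\sum_{i=1}^{k}\hat{X}_i^T (D - A) \hat{X}_i$, where $X_i$ is the $i$th column of $X$ and $\hat{X}_i = X_i / (X_i^T X_i)^{1/2}$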
  • 40. Spectral Clustering. Both ratio cut and normalized cut can be reformulated as $\min_{\hat{X}} Tr(\hat{X}^T L \hat{X})$, where $L = D - A$ for ratio cut and $L = I - D^{-1/2} A D^{-1/2}$ for normalized cut • Spectral relaxation: $\min_{\hat{X}} Tr(\hat{X}^T L \hat{X})$ s.t. $\hat{X}^T \hat{X} = I_k$ • $D = diag(d_1, d_2, \cdots, d_n)$ is the diagonal degree matrix. Optimal solution: $\hat{X}$ consists of the eigenvectors of $L$ with the smallest eigenvalues
  • 41. Recovering Integer Membership Values • Because we performed spectral relaxation, the matrix obtained is not integer valued • To recover $X$ from $\hat{X}$, we can run $k$-means on $\hat{X}$
  • 42. Spectral Clustering: Example. $D = diag(2, 2, 4, 4, 4, 4, 4, 3, 1)$. We keep 2 eigenvectors, i.e., we want 2 communities. Running $k$-means with $k = 2$ gives two communities: $\{v_1, v_2, v_3\}$ and $\{v_4, v_5, v_6, v_7, v_8, v_9\}$
  • 43. More on the Laplacian. Laplacian and diffusion: • Consider a substance that diffuses from node to node on a graph (e.g., gas or fluid) • The quantity of the substance at node $j$ is $\varphi_j$ • The diffusion rate depends on the difference in quantity between nodes – e.g., gas pressure difference • Then we have $\frac{d\varphi_i}{dt} = \alpha \sum_j A_{ij}(\varphi_j - \varphi_i)$, or in matrix form $\frac{d\varphi}{dt} + \alpha L \varphi = 0$ with $L = D - A$
  • 44. Laplacian and Diffusion. Why do we call it the Laplacian? Because the diffusion equation on the graph has the same form as the heat equation ($\frac{d\varphi}{dt} + \alpha \nabla^2 \varphi = 0$), with $L$ replacing the Laplace operator $\nabla^2$
  • 45. Laplacian and Diffusion. State $\varphi(t)$ in terms of the eigenvalues and eigenvectors of $L$ – $L$ is symmetric and has real eigenvalues – $\varphi(t)$ is written as a combination of the eigenvectors $v_i$ of $L$, each weighted by some time-dependent coefficient. Eigenvectors are orthogonal, so multiplying by any $v_j$ isolates its coefficient
  • 46. II. Robust Communities • The goal is to find subgraphs robust enough that removing some edges or vertices does not disconnect the subgraph • $k$-vertex connected ($k$-connected) graph: – $k$ is the minimum number of nodes that must be removed to disconnect the graph • $k$-edge connected: at least $k$ edges must be removed to disconnect the graph • Examples: – Complete graph of size $n$: $(n - 1)$-connected, since removing nodes can never disconnect it and its connectivity is set to $n - 1$ by convention – A cycle: 2-connected graph
  • 47. III. Modular Communities. Consider a graph $G(V, E)$ with $|E| = m$, where the degrees are known beforehand but the edges are not – Consider two vertices $v_i$ and $v_j$ with degrees $d_i$ and $d_j$. What is the expected number of edges between $v_i$ and $v_j$? • For any edge going out of $v_i$ at random, the probability that it connects to vertex $v_j$ is $\frac{d_j}{\sum_i d_i} = \frac{d_j}{2m}$, so the expected number of edges between $v_i$ and $v_j$ is $\frac{d_i d_j}{2m}$
  • 48. Modularity and Modularity Maximization • Given a degree distribution, we know the expected number of edges between any pair of vertices • We assume that real-world networks should be far from random. Therefore, the more distant they are from this randomly generated network, the more structure they have • Modularity measures this distance, and modularity maximization tries to maximize it
  • 49. Normalized Modularity. Consider a partitioning of the data $P = (P_1, P_2, P_3, \ldots, P_k)$. For partition $P_x$, this distance can be defined as $\sum_{v_i, v_j \in P_x}\left(A_{ij} - \frac{d_i d_j}{2m}\right)$. This distance can be generalized for a partitioning $P$ as $\sum_{x=1}^{k}\sum_{v_i, v_j \in P_x}\left(A_{ij} - \frac{d_i d_j}{2m}\right)$. The normalized version of this distance is defined as modularity: $Q = \frac{1}{2m}\sum_{x=1}^{k}\sum_{v_i, v_j \in P_x}\left(A_{ij} - \frac{d_i d_j}{2m}\right)$, where $m$ is the number of edges
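Modularity compares, for each community, the number of internal edges (counted over ordered node pairs) with the number expected from the degrees alone, $d_i d_j / 2m$, and normalizes the total by $2m$. The following Python sketch computes it directly from that description; the function name `modularity` and the two-triangle graph are assumptions made for illustration.

```python
def modularity(adj, partition):
    """Q = (1/2m) * sum over communities of (sum A_ij - sum d_i d_j / 2m)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # 2m = sum of all degrees
    q = 0.0
    for part in partition:
        # Sum of A_ij over ordered pairs inside the community (2 per edge).
        inside = sum(1 for u in part for v in adj[u] if v in part)
        # Sum of d_i * d_j over ordered pairs equals vol(P_x) squared.
        vol = sum(len(adj[u]) for u in part)
        q += inside - vol * vol / two_m
    return q / two_m


# Hypothetical graph: two triangles joined by the single edge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

q_two = modularity(adj, [{1, 2, 3}, {4, 5, 6}])   # the natural split
q_one = modularity(adj, [{1, 2, 3, 4, 5, 6}])     # everything in one community
print(q_two, q_one)
```

Putting every node in one community always gives $Q = 0$, so a positive value indicates more internal edges than the degree-preserving random model predicts.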
  • 50. Modularity Maximization. Modularity matrix: $B = A - \frac{d d^T}{2m}$, where $d \in \mathbb{R}^{n \times 1}$ is the degree vector for all nodes. Reformulation of the modularity: $Q = \frac{1}{2m} Tr(X^T B X)$ • $X \in \mathbb{R}^{n \times k}$ is the indicator (partition membership) matrix: $X_{ij} = 1$ iff $v_i \in P_j$ • Similar to spectral clustering, we relax $X$ to an orthogonal matrix $\hat{X}$ • The optimal solution for $\hat{X}$ is the top $k$ eigenvectors of $B$ (those with the largest eigenvalues) • To recover the original $X$, we can run $k$-means on $\hat{X}$ • Difference: in modularity maximization, the top $k$ eigenvalues have to be positive!
  • 51. Modularity Maximization: Example. [Figure: the modularity matrix and its top 2 eigenvectors.] Running $k$-means on the 2 eigenvectors gives two communities: {1, 2, 3, 4} and {5, 6, 7, 8, 9}
  • 52. Making Modularity Optimization Faster • The matrix method discussed is slow and does not scale to millions of nodes (and billions of edges) • We can perform greedy optimization of modularity to speed up the process • Louvain method – A greedy modularity optimization method for community detection – Invented when all of its authors were affiliated with the Université catholique de Louvain (UCL)
  • 53. Louvain Method. The algorithm has multiple passes – Each pass has two phases 1. Modularity Optimization 2. Community Aggregation. Image from Blondel, Vincent D., Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. "Fast unfolding of communities in large networks." Journal of statistical mechanics: theory and experiment 2008, no. 10 (2008): P10008.
  • 54. Louvain Method. Start with a weighted network where all nodes are in their own communities (i.e., $n$ communities). First phase: • For each node $v_i$, – For all neighbors $v_j \in N(v_i)$: compute the modularity gain if $v_i$ is removed from its community and placed in the community of $v_j$ – Find the community with the maximum modularity gain – If the maximum gain is positive, remove $v_i$ from its community and place it in that community – If there is no positive gain, do not change communities • Repeat until no node changes its community
  • 55. Important Points about Phase I • A node can be considered multiple times • Phase I reaches a local maximum of modularity • Phase I is order dependent – The modularity achieved is fairly stable and depends only weakly on the initial order – The computation time depends on the initial order
  • 56. Louvain Method. Second phase: – Build a new network • Nodes are communities • Edges are the edges between nodes in the corresponding communities (weights are the sums of the weights) • Self-loops represent edges within the community • The algorithm creates hierarchies of communities • It usually ends in fewer than 10 passes • It appears to run in O($n \log n$) time
  • 57. Why is it fast? Modularity can be restated as $Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{d_i d_j}{2m}\right]\delta(c_i, c_j)$, where $c_i$ is the community of node $i$ and $\delta$ is the Kronecker delta function. Consider node $i$, in an isolated community, when it moves to community $C$ • The gain in modularity can be computed easily from three quantities: the edges within $C$, the edges between $i$ and $C$, and all edges connected to members of $C$
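For an unweighted graph, moving an isolated node $i$ into community $C$ changes modularity by $\Delta Q = \frac{k_{i,in}}{m} - \frac{\Sigma_{tot}\,k_i}{2m^2}$, which follows from expanding $Q$ before and after the move. The symbol names are assumptions borrowed from the usual Louvain notation rather than from the slide: $k_i$ is the degree of $i$, $k_{i,in}$ the number of edges between $i$ and $C$, and $\Sigma_{tot}$ the total degree of the members of $C$. The Python sketch below checks this local formula against a full recomputation of modularity on a made-up graph.

```python
def modularity(adj, partition):
    """Reference implementation: recompute Q from scratch."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    q = 0.0
    for part in partition:
        inside = sum(1 for u in part for v in adj[u] if v in part)
        vol = sum(len(adj[u]) for u in part)
        q += inside - vol * vol / two_m
    return q / two_m


def modularity_gain(adj, node, community):
    """Gain from moving an isolated `node` into `community`, using local counts only."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    k_i = len(adj[node])                                   # degree of the node
    k_i_in = sum(1 for v in adj[node] if v in community)   # edges from node into C
    sigma_tot = sum(len(adj[u]) for u in community)        # total degree of C
    return k_i_in / m - sigma_tot * k_i / (2 * m * m)


# Hypothetical graph: two triangles joined by the edge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

q_before = modularity(adj, [{1}, {2, 3}, {4, 5, 6}])
q_after = modularity(adj, [{1, 2, 3}, {4, 5, 6}])
gain = modularity_gain(adj, 1, {2, 3})
print(gain, q_after - q_before)
```

The gain only needs quantities local to node $i$ and community $C$, so each candidate move is evaluated without touching the rest of the network.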
  • 58. IV. Dense Communities: $\gamma$-dense • The density of a graph defines how close a graph is to a clique • A subgraph $G(V, E)$ is $\gamma$-dense (a quasi-clique) if $|E| \ge \gamma \binom{|V|}{2}$ • A 1-dense graph is a clique • We can find quasi-cliques using the brute-force algorithm discussed previously, but there are more efficient methods
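A node set is $\gamma$-dense when the number of edges among its members is at least $\gamma$ times the number of possible node pairs, $\binom{|V|}{2}$. A minimal Python check of that condition, with `is_gamma_dense` and the example graph (a 4-clique with one edge missing) chosen here for illustration:

```python
from math import comb


def is_gamma_dense(nodes, adj, gamma):
    """True if the subgraph induced by `nodes` has at least gamma * C(|V|, 2) edges."""
    nodes = set(nodes)
    # Each internal edge is seen from both endpoints, hence the division by 2.
    internal_edges = sum(1 for u in nodes for v in adj[u] if v in nodes) // 2
    return internal_edges >= gamma * comb(len(nodes), 2)


# Hypothetical graph: a 4-clique on {1, 2, 3, 4} with the edge 3-4 removed.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(is_gamma_dense({1, 2, 3, 4}, adj, 0.8))  # 5 of 6 possible edges
print(is_gamma_dense({1, 2, 3, 4}, adj, 0.9))
print(is_gamma_dense({1, 2, 3}, adj, 1.0))     # a triangle is a clique
```

Unlike a clique, the quasi-clique survives the loss of a single edge as long as $\gamma$ is set below 1.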
  • 59. Finding Maximal $\gamma$-dense Quasi-Cliques. We can use a two-step procedure consisting of “local search” and “heuristic pruning”. Local search • Sample a subgraph and find a maximal $\gamma$-dense quasi-clique • A greedy approach is to expand a quasi-clique by all of its high-degree neighbors until the density drops below $\gamma$. Heuristic pruning • For a $\gamma$-dense quasi-clique of size $k$, we recursively remove nodes with degree less than $\gamma k$, along with their incident edges • We can start from low-degree nodes and recursively remove all nodes with degree less than $\gamma k$
  • 60. V. Hierarchical Communities • Previous methods consider communities at a single level – Communities may have hierarchies • Each community can have sub/super communities – Hierarchical clustering deals with this scenario and generates community hierarchies • Initially, the $n$ members are considered as either 1 community or $n$ communities. These communities are gradually – merged (agglomerative hierarchical clustering, starting from $n$ communities) or – split (divisive hierarchical clustering, starting from 1 community)
  • 61. Hierarchical Community Detection • Goal: build a hierarchical structure of communities based on network topology • Allows the analysis of a network at different resolutions • Representative approaches: – Divisive hierarchical clustering – Agglomerative hierarchical clustering
  • 62. Divisive Hierarchical Clustering • Divisive clustering – Partition nodes into several sets – Each set is further divided into smaller ones – Network-centric partitioning can be applied for the partition • One particular example: the Girvan-Newman algorithm, which recursively removes the “weakest” links within a “community” – Find the weakest edge – Remove the edge and update the corresponding strength of each edge • Recursively apply the above two steps until the network is decomposed into a desired number of connected components • Each component forms a community
  • 63. Edge Betweenness • To determine the weakest links, the algorithm uses edge betweenness. Edge betweenness is the number of shortest paths that pass along the edge • Edge betweenness measures the “bridgeness” of an edge between two communities • An edge with high betweenness tends to be a bridge between two communities
  • 64. Edge Betweenness: Example. The edge betweenness of $e(1, 2)$ is 6/2 + 1 = 4: the shortest paths from node 2 to each node in {4, 5, 6, 7, 8, 9} pass through either $e(1, 2)$ or $e(2, 3)$, so $e(1, 2)$ receives half of each of the 6 pairs (6/2 = 3), and $e(1, 2)$ is itself the shortest path between nodes 1 and 2 (contributing 1)
  • 65. The Girvan-Newman Algorithm 1. Calculate edge betweenness for all edges in the graph. 2. Remove the edge with the highest betweenness. 3. Recalculate betweenness for all edges affected by the edge removal. 4. Repeat until all edges are removed.
  • 66. Edge Betweenness Divisive Clustering: Example. Based on the initial betweenness values, the first edge that needs to be removed is $e(4, 5)$ (or $e(4, 6)$). After removing $e(4, 5)$, we compute the edge betweenness once again; this time, $e(4, 6)$ has the highest betweenness value: 20. This is because all shortest paths from nodes {1, 2, 3, 4} to nodes {5, 6, 7, 8, 9} must pass through $e(4, 6)$; therefore, it has betweenness 4 × 5 = 20
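The Girvan-Newman procedure can be sketched in plain Python with a Brandes-style edge betweenness computation. The nine-node edge list below is an assumption: the figure is missing from the transcript, and this graph was chosen because it is consistent with the numbers quoted in the example (betweenness 4 for $e(1, 2)$, and 20 for $e(4, 6)$ once $e(4, 5)$ is removed). For simplicity the sketch recomputes all betweenness values after each removal and stops at a requested number of communities.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an undirected, unweighted graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:  # BFS that also counts shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate path fractions back toward s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # Every unordered pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in score.items()}


def connected_components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman(adj, n_communities):
    """Remove highest-betweenness edges until n_communities components exist."""
    graph = {u: set(nbrs) for u, nbrs in adj.items()}
    while len(connected_components(graph)) < n_communities:
        scores = edge_betweenness(graph)
        if not scores:  # no edges left to remove
            break
        u, v = max(scores, key=scores.get)
        graph[u].discard(v)
        graph[v].discard(u)
    return connected_components(graph)


# Assumed example graph (not given in the transcript).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

initial = edge_betweenness(adj)
communities = sorted(girvan_newman(adj, 2), key=sorted)
print(initial[frozenset((1, 2))], communities)
```

Recomputing every betweenness value after each removal is the simplest correct choice; restricting the update to the affected component, as step 3 of the algorithm suggests, is an optimization left out of this sketch.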
  • 67. Community Detection Algorithms
  • 68. Community Evolution
  • 69. Network and Community Evolution • How does a network change over time? • How does a community change over time? • What properties do you expect to remain roughly constant? • What properties do you expect to change? • For example, – Where do you expect new edges to form? – Which edges do you expect to be dropped?
  • 70. How Do Networks Evolve?
  • 71. Network Growth Patterns 1. Network Segmentation 2. Graph Densification 3. Diameter Shrinkage
  • 72. 1. Network Segmentation • Often, in evolving networks, segmentation takes place, where the large network is decomposed over time into three parts 1. Giant Component: As network connections stabilize, a giant component of nodes is formed, with a large proportion of network nodes and edges falling into this component. 2. Stars: These are isolated parts of the network that form star structures. A star is a tree with one internal node and $n$ leaves. 3. Singletons: These are orphan nodes disconnected from all nodes in the network.
  • 73. 2. Graph Densification • The density of the graph increases as the network grows – The number of edges increases faster than the number of nodes does: $E(t) \propto V(t)^{\alpha}$ • Densification exponent $\alpha$, with $1 \le \alpha \le 2$: – $\alpha = 1$: linear growth (constant out-degree) – $\alpha = 2$: quadratic growth (clique) • $E(t)$ and $V(t)$ are the numbers of edges and nodes, respectively, at time $t$
  • 74. Densification in Real Networks. [Figure: log-log plots of $E(t)$ versus $V(t)$. Physics citations: densification exponent 1.69. Patent citations: densification exponent 1.66.]
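If the number of edges grows as a power of the number of nodes, the densification exponent is the slope of $\log E(t)$ against $\log V(t)$. The Python sketch below estimates that slope by ordinary least squares. The snapshot counts are synthetic, generated with an exponent of 1.5 purely to exercise the code, and are not measurements from the networks in the figure; the function name is illustrative.

```python
import math


def densification_exponent(node_counts, edge_counts):
    """Least-squares slope of log(E) versus log(V) across network snapshots."""
    xs = [math.log(v) for v in node_counts]
    ys = [math.log(e) for e in edge_counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic snapshots that follow E = V^1.5 exactly (illustration only).
node_counts = [100, 200, 400, 800, 1600]
edge_counts = [v ** 1.5 for v in node_counts]

alpha = densification_exponent(node_counts, edge_counts)
print(round(alpha, 3))
```

On real snapshot data the fit is only approximate, so the quality of the straight-line fit on the log-log plot should be inspected before reporting an exponent.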
  • 75. 3. Diameter Shrinking • In networks, the diameter shrinks over time. [Figure: diameter over time for the ArXiv citation graph and an affiliation network]
  • 76. How Do Communities Evolve?
  • 77. Community Evolution • Communities also expand, shrink, or dissolve in dynamic networks
  • 78. Community Detection in Evolving Networks
  • 79. Extending Previous Methods 1. Take $t$ snapshots of the network, $G_1, G_2, \cdots, G_t$, where $G_i$ is a snapshot at time $i$ 2. Run a static community detection algorithm (any of the methods discussed before) on each snapshot independently 3. Assign community members based on the communities found in all $t$ time stamps – E.g., assign nodes to communities based on voting (assign each node to the community it belongs to most often over time). This approach is unstable in highly dynamic networks, as community memberships are always changing
  • 80. Evolutionary Clustering • Assume that communities don’t change most of the time • Minimize an objective function that considers – Snapshot cost (SC): the quality of the communities at each time – Temporal cost (TC): how the communities evolve between times • The objective function is defined as $Cost = \alpha\,SC + (1 - \alpha)\,TC$
  • 81. Evolutionary Clustering • If we use spectral clustering for each snapshot, $SC = Tr(X_t^T L X_t)$ subject to $X_t^T X_t = I$ • Then $Cost_t$, the objective function at time $t$, is $Cost_t = \alpha\,Tr(X_t^T L X_t) + (1 - \alpha)\,TC$ • $X_t$ is the community membership matrix at time $t$ • To define TC we can use $TC = \|X_t - X_{t-1}\|^2$ • Challenges with this definition – Assumes that we have the same number of communities at times $t - 1$ and $t$ – $X_t$ is non-unique (any orthogonal transformation is still a solution)
  • 82. Evolutionary Clustering • We can define TC as $TC = \frac{1}{2}\|X_t X_t^T - X_{t-1} X_{t-1}^T\|^2$ • Hence the cost will be $Cost_t = \alpha\,Tr(X_t^T L X_t) + \frac{1 - \alpha}{2}\|X_t X_t^T - X_{t-1} X_{t-1}^T\|^2$
  • 83. Evolutionary Clustering • Assuming the normalized Laplacian $L = I - D_t^{-1/2} A_t D_t^{-1/2}$ is used, the cost becomes $Cost_t = Tr(X_t^T \hat{L} X_t)$ with $\hat{L} = I - \alpha\,D_t^{-1/2} A_t D_t^{-1/2} - (1 - \alpha)\,X_{t-1} X_{t-1}^T$. Similar to spectral clustering, $X_t$ can be obtained by taking the top eigenvectors of $\hat{L}$
  • 84. Community Evaluation
  • 85. Evaluating the Communities • Evaluation with ground truth • Evaluation without ground truth. We are given objects of two different kinds (+, ×) • The perfect community: all objects inside the community are of the same type
  • 86. Evaluation with Ground Truth • When ground truth is available – We have partial knowledge of what communities should look like – We are given the correct community (clustering) assignments • Measures – Precision and Recall, or F-Measure – Purity – Normalized Mutual Information (NMI)
  • 87. Precision and Recall. Precision = (relevant and retrieved) / retrieved = TP / (TP + FP). Recall = (relevant and retrieved) / relevant = TP / (TP + FN). True Positive (TP): • When similar members are assigned to the same community • A correct decision. True Negative (TN): • When dissimilar members are assigned to different communities • A correct decision. False Negative (FN): • When similar members are assigned to different communities • An incorrect decision. False Positive (FP): • When dissimilar members are assigned to the same community • An incorrect decision
  • 88. Precision and Recall: Example
  • 89. F-Measure. Either $P$ or $R$ alone measures only one aspect of the performance – To integrate them into one measure, we can use the harmonic mean of precision and recall: $F = \frac{2 \cdot P \cdot R}{P + R}$. For the example earlier, $F = \frac{2 \times 0.6 \times 0.61}{0.6 + 0.61} = 0.60$
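Pairwise precision, recall, and F-measure can be computed by classifying every pair of nodes as TP, FP, FN, or TN. The Python sketch below does this for the found-community and label vectors listed in the NMI example later in this deck; treating that data as the same example used for precision and recall is an assumption, supported by the matching community sizes (6 and 8) in the editor's notes. The function name `pair_counts` is illustrative.

```python
from itertools import combinations


def pair_counts(found, labels):
    """Count TP, FP, FN, TN over all unordered pairs of members."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(found)), 2):
        same_community = found[i] == found[j]
        same_label = labels[i] == labels[j]
        if same_community and same_label:
            tp += 1      # similar members, same community
        elif same_community:
            fp += 1      # dissimilar members, same community
        elif same_label:
            fn += 1      # similar members, different communities
        else:
            tn += 1      # dissimilar members, different communities
    return tp, fp, fn, tn


found = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
labels = [2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1]

tp, fp, fn, tn = pair_counts(found, labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

Working from unrounded precision and recall gives an F-measure slightly above the value obtained from the two-decimal figures quoted on the slide.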
  • 90. Purity. We can assume the majority of a community represents the community – We use the label of the majority against the label of each member to evaluate the communities. Purity: the fraction of instances that have labels equal to their community’s majority label, $Purity = \frac{1}{N}\sum_{i=1}^{k}\max_j |C_i \cap L_j|$ • $k$: the number of communities • $N$: the total number of nodes • $L_j$: the set of instances with label $j$ in all communities • $C_i$: the set of members in community $i$. Purity can be easily inflated by • Points being singleton communities (of size 1); or by • Very large communities
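Purity is the fraction of nodes whose label matches the majority label of their community. A short Python sketch follows, using the community and label vectors from the NMI example later in this deck as input (a choice made here for illustration); the function name `purity` is an assumption.

```python
from collections import Counter


def purity(found, labels):
    """Fraction of members whose label equals their community's majority label."""
    label_counts = {}
    for community, label in zip(found, labels):
        label_counts.setdefault(community, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(found)


found = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
labels = [2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1]

score = purity(found, labels)
# Degenerate case: every node in its own community always yields purity 1.
singleton_score = purity(list(range(len(labels))), labels)
print(round(score, 3), singleton_score)
```

The second call shows why purity should not be used alone: a partition into singletons scores perfectly without saying anything useful about the labels.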
  • 91. Mutual Information • Mutual information (MI): the amount of information that two random variables share – By knowing one of the variables, it measures the amount of uncertainty reduced regarding the other: $MI = \sum_{h \in H}\sum_{l \in L}\frac{n_{h,l}}{n}\log\frac{n \cdot n_{h,l}}{n_h n_l}$ • $L$ and $H$ are the labels and the found communities • $n_h$ and $n_l$ are the number of data points in community $h$ and with label $l$, respectively • $n_{h,l}$ is the number of nodes in community $h$ and with label $l$, and $n$ is the number of nodes
  • 92. Normalizing Mutual Information (NMI) • Mutual information (MI) is unbounded • To address this issue, we can normalize MI • How? We know that $MI \le \min(H(L), H(H))$ • $H(\cdot)$ is the entropy function, e.g., $H(L) = -\sum_{l \in L}\frac{n_l}{n}\log\frac{n_l}{n}$
  • 93. Normalized Mutual Information: $NMI = \frac{MI}{\sqrt{H(L)\,H(H)}}$. We can also define it as $NMI = \frac{2\,MI}{H(H) + H(L)}$. Note that $MI \le \frac{1}{2}(H(H) + H(L))$
  • 94. Normalized Mutual Information: $NMI = \frac{\sum_{h \in H}\sum_{l \in L} n_{h,l}\log\frac{n \cdot n_{h,l}}{n_h n_l}}{\sqrt{\left(\sum_{h \in H} n_h \log\frac{n_h}{n}\right)\left(\sum_{l \in L} n_l \log\frac{n_l}{n}\right)}}$ • NMI values close to one indicate high similarity between the communities found and the labels • Values close to zero indicate high dissimilarity between them • $l$ and $h$ are known (labeled) and found communities, respectively • $n_h$ and $n_l$ are the number of members in community $h$ and with label $l$, respectively • $n_{h,l}$ is the number of members in community $h$ that are labeled $l$ • $n$ is the size of the dataset
  • 95. Normalized Mutual Information: Example. Found communities (H): [1,1,1,1,1,1, 2,2,2,2,2,2,2,2]. Actual labels (L): [2,1,1,1,1,1, 2,2,2,2,2,2,1,1]. $n = 14$. $n_h$: $h = 1$: 6; $h = 2$: 8. $n_l$: $l = 1$: 7; $l = 2$: 7. $n_{h,l}$: ($h = 1$, $l = 1$): 5; ($h = 1$, $l = 2$): 1; ($h = 2$, $l = 1$): 2; ($h = 2$, $l = 2$): 6
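The counts in this example can be reproduced, and the NMI evaluated, with a short Python sketch. It uses the normalization by the geometric mean of the two entropies, $MI / \sqrt{H(L)\,H(H)}$; the function name `nmi` and the choice to return 0.0 when either entropy is zero are assumptions made here.

```python
import math
from collections import Counter


def nmi(found, labels):
    """Mutual information normalized by sqrt(H(found) * H(labels))."""
    n = len(found)
    n_h = Counter(found)                 # community sizes
    n_l = Counter(labels)                # label sizes
    n_hl = Counter(zip(found, labels))   # joint counts (only non-zero cells)
    mi = sum(c / n * math.log(n * c / (n_h[h] * n_l[l]))
             for (h, l), c in n_hl.items())
    ent_h = -sum(c / n * math.log(c / n) for c in n_h.values())
    ent_l = -sum(c / n * math.log(c / n) for c in n_l.values())
    if ent_h == 0 or ent_l == 0:
        return 0.0  # degenerate case; the convention here is an assumption
    return mi / math.sqrt(ent_h * ent_l)


found = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
labels = [2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1]

joint = Counter(zip(found, labels))
print(joint[(1, 1)], joint[(1, 2)], joint[(2, 1)], joint[(2, 2)])
score = nmi(found, labels)
print(round(score, 3))
```

Because the same logarithm base appears in the numerator and the denominator, the result does not depend on whether natural or base-2 logarithms are used.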
  • 96. Evaluation without Ground Truth • Evaluation with semantics – A simple way of analyzing detected communities is to analyze other attributes (posts, profile information, content generated, etc.) of community members to see whether the members are coherent – The coherency is often checked via human subjects, or through labor markets such as Amazon Mechanical Turk – To help analyze these communities, one can use word frequencies: by generating a list of frequent keywords for each community, human subjects determine whether these keywords represent a coherent topic • Evaluation using clustering quality measures – Use clustering quality measures such as the sum of squared errors (SSE) – Run two or more community detection algorithms, compare the results, and pick the algorithm with the better quality measure

Editor's notes

  1. Community is “a group of densely interconnected nodes” or, alternatively, “a group of topologically similar links”. Ref: Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, "Communities and Hierarchical Organization of Links in Complex Networks", submitted, arxiv:0903.3178.
  2. Explicitly vs. implicitly formed groups
  3. Explicitly vs. implicitly formed groups
  4. From Lei’s Slides
  5. A is an adjacency matrix
  6. From Lei’s slides must be changed
  7. The 1st eigenvector is discarded
  8. From Lei Tang’s Book and Slides
  9. Source: Leskovec et al. KDD 2005
  10. TP – the intersection of the two oval shapes; TN – the rectangle minus the two oval shapes; FP – the circle minus the blue part; FN – the blue oval minus the circle. Accuracy = (TP+TN)/(TP+FP+FN+TN); Precision = TP/(TP+FP); Recall = TP/(TP+FN)
  11. TP+FP = C(6,2) + C(8,2) = 15+28 = 43; FP counts the wrong pairs within each cluster; FN counts the similar pairs but wrongly put into different clusters; TN counts dissimilar pairs in different clusters
  12. N – the total number of data points.