Mining and analyzing social media hicss 45 tutorial – part 2
1. Mining and Analyzing Social Media
HICSS 45 Tutorial – Part 2
Dave King
January 4, 2012
2. Agenda: This is how the slides are
organized
• Part 1
– Introduction – Bio, Resources, Social Media
– Data Mining – Processes and Example
– Text Mining – General Processes and Example
– Predicting the Future – The Portmanteaus
• Part 2
– Sentiment Analysis
– Social Network Analysis - Introduction
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
3. Sentiment Analysis:
What are your customers thinking?
Every hour of every day they share their opinions, issues, thoughts and
sentiments about brands, products, services and companies (on line).
4. Sentiment Analysis:
Some Survey Data
Cone Communications:
http://www.coneinc.com/2011coneonlineinfluencetrendtracker
5. Sentiment Analysis:
Some Payoffs
Marketing Service Products
Message Response Issues and Focus
A form of Automated Text Categorization (ATC)
7. Sentiment Analysis:
Some Examples
Cycling Community Responds
@BicyclingMag
@BikePortland
@clevercycle
@cyclingreporter
GM runs Ad on 10/17/11
8. Sentiment Analysis:
Some Examples
Key Areas of Concern:
• Break in online link to Mint.com
• Actionable Service Breaks
• Outrage over “$50 limit on debit card transactions”
9. Sentiment Analysis:
Defined
Text mining that classifies subjective opinions in text into
categories like “positive” or “negative”, extracting various forms
of attitudinal information: sentiment, opinion, mood, and
emotion. Also called Voice of the Customer (VOC) or Opinion
Mining.
11. Sentiment Analysis:
Types
• Sentiment Classification – document
level, classified as positive or negative
• Feature-based opinion – sentence
level, determines which aspects of an
object people like or dislike
• Comparative sentence and
relationship mining – sentence level
comparisons of one object against
another (to determine which is better
than the other)
12. Sentiment Analysis:
Types
• From one type to the next (classification, features,
comparisons), it becomes more complex to identify
and extract the information.
• Once extracted, standard text mining techniques
can be used to classify and compare the opinions
• Simple techniques (like naïve Bayesian) often
produce strong results (e.g. 80+% accuracy)
13. Sentiment Analysis:
Assumption
An Opinion Lexicon that Expresses State
Polar, Opinion-Bearing, and Sentiment Words and Phrases
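One way to read this assumption: if a lexicon of polar words is available, a crude sentiment score is the number of positive hits minus the number of negative hits. The sketch below assumes a tiny hand-made lexicon; the word lists and function names are illustrative only and do not come from any published lexicon.

```python
import re

# Hypothetical mini-lexicon; a real opinion lexicon has thousands of entries.
POSITIVE = {"good", "great", "brilliant", "love", "fun"}
NEGATIVE = {"bad", "terrible", "sick", "sucks", "jumbled"}

def lexicon_score(text):
    """Return (#positive words - #negative words) found in text."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg

def lexicon_label(text):
    """Map the score to a coarse label; ties are reported as neutral."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A message such as "Great, today sucks" scores zero here, which illustrates why pure word counting struggles with sarcasm and mixed wording.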
14. Sentiment Analysis:
How do you know if it is “+” or “-”?
plot : two teen couples go to a church party , drink and then drive .
they get into an accident .
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
what's the deal ?
watch the movie and " sorta " find out . . .
critique : a mind-xxx movie for the teen generation that touches on a very cool idea , but
presents it in a very bad package .
which is what makes this review an even harder one to write , since i generally applaud films
which attempt to break the mold , mess with your head and such ( lost highway & memento ) ,
but there are good and bad ways of making all types of films , and these folks just didn't
snag this one correctly .
they seem to have taken this pretty neat concept , but executed it terribly .
so what are the problems with the movie ?
well , its main problem is that it's simply too jumbled .
having not seen , " who framed roger rabbit " in over 10 years , and not remembering much besides
that i liked it then , i decided to rent it recently .
watching it i was struck by just how brilliant a film it is .
aside from the fact that it's a milestone in animation in movies ( it's the first film to combine real
actors and cartoon characters , have them interact , and make it convincingly real ) and a great
entertainment it's also quite an effective comedy/mystery .
while the plot may be somewhat familiar the characters are original , especially baby herman , and
watching them together is a lot of fun .
…
`who framed roger rabbit' is a rare film .
one that not only presented a great challenge to the filmmakers but one that can be enjoyed
by the whole family ( although some very young viewers may be a little scared by judge doom ) .
do yourself a favor and rent it , `p-p-p-p-please . "
17. Sentiment Analysis:
Doing Simple Sentiment Analysis
General Problem
Collection of Text Documents → Automated Process for Classifying (???) → Small Set of Predetermined Categories (1, 2, 3, …, n)
18. Sentiment Analysis:
Doing Simple Sentiment Analysis
General Answer
Collection of Text Documents → Automated Process for Classifying → Small Set of Predetermined Categories (1, 2, 3, …, n)
Automated process: inductive, supervised machine learning classification process and algorithm
19. Sentiment Analysis:
Doing Simple Sentiment Analysis
Training Process
Real-World Text Data: Documents with known Classification
→ Document Consolidation
→ Establish the Corpus
→ Corpus Refinement (Token, Stem, Stop…)
→ Feature Selection & Weighting
→ Term-Doc-Matrix*
→ Classification Algorithm (Train, Test, Validate)
→ Predetermined Categories (1, 2, 3, …, n)
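The corpus refinement and term-document matrix steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stop list is a toy list, `crude_stem` is a made-up suffix stripper (not the Porter algorithm), and all function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop list; real pipelines use fuller resources.
STOP_WORDS = {"a", "an", "the", "is", "to", "be", "i", "it", "and", "of"}

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Toy stemmer: strip a few common suffixes from longer words."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def refine(text):
    """Token -> Stop -> Stem, in that order."""
    return [crude_stem(w) for w in tokenize(text) if w not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    counts = [Counter(refine(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows

vocab, rows = term_document_matrix(["I feel great now", "Feeling sick today"])
```

Each row of the matrix is the feature vector a classifier would be trained on; weighting schemes such as TF-IDF would replace the raw counts.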
21. Sentiment Analysis:
Doing Simple Sentiment Analysis
Twitter Statistics
• ~200M registered users.
• ~50M users login every day
• Over 400K new users per day.
• 400 million unique visitors per month.
• 55% use their phone to tweet.
• Average 200 million tweets a day.
• 600 million search queries per day
• 75% of traffic from 3rd Party Apps
• 60% of tweets from 3rd Party Apps
22. Sentiment Analysis:
Doing Simple Sentiment Analysis
Problem Features
• Each tweet <= 140 characters
(avg. 10-15 words/message)
• Heavy presence of non-alpha
symbols, abbreviations,
misspellings and slang
• Tweets often include retweets
(original tweet repeated)
• In spite of this – Tweets have
proven to be an interesting text
mining source (warts and all)
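These features suggest a normalization pass before any mining. A minimal sketch, assuming we want to discard the retweeted portion, @mentions, URLs and non-alphabetic symbols; the function name and the exact rules are illustrative choices, not a standard.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip retweeted text, URLs, mentions and symbols."""
    text = re.split(r"\bRT\s+@", text)[0]       # keep only text before a retweet marker
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"@\w+", " ", text)            # drop @mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # drop digits, punctuation, emoticons
    return " ".join(text.lower().split())
```

Note that a tweet consisting only of a retweet ("RT @user ...") is reduced to an empty string by this rule, and that stripping emoticons is only appropriate when they are not being used as labels.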
23. Sentiment Analysis:
Doing Simple Sentiment Analysis
• Zambonini, D. “Self-Improving Bayesian Sentiment Analysis for Twitter.” August 27, 2010. danzambonini.com/self-improving-bayesian-sentiment-analysis-for-twitter
• Kalafatis, T. “The Sentiment on US Economy from Twitter.” October 2009. lifeanalytics.blogspot.com/2009/10/sentiment-on-us-economy-from-twitter.html
• Pak, A. and P. Paroubek. “Twitter as a Corpus for Sentiment Analysis and Opinion Mining.” In Proceedings of the Seventh International Conference on Language Resources and Evaluation. May 2010. lrec-conf.org/proceedings/lrec2010/slides/385.pdf
• Sood, S. and L. Vasserman. “ESSE: Exploring Mood on the Web.” August 2009. lcs.pomona.edu/people/files/SoodCV.pdf
• Go, A. et al. “Twitter Sentiment Classification using Distant Supervision.” 2009. stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf
• Agarwal, A. et al. “Sentiment Analysis of Twitter Data.” 2011. www1.ccls.columbia.edu/~beck/pubs/lsm2011_full.pdf
25. Sentiment Analysis:
Doing Simple Sentiment Analysis
• Twitter used to get a total of 3 billion requests a day via its API
• API Calls for Public Tweets
– http://search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=1
– http://api.twitter.com/1/trends/current.json?exclude=hashtags
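The first call above is a query string built from a search phrase, a results-per-page count and a page number. The sketch below only constructs that URL and does not send it, since this version 1 search endpoint is no longer available; `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Historical endpoint taken from the slide; this sketch never contacts it.
SEARCH_ENDPOINT = "http://search.twitter.com/search.json"

def build_search_url(query, results_per_page=100, page=1):
    """Percent-encode the query and assemble the search URL."""
    params = {"q": query, "rpp": results_per_page, "page": page}
    return SEARCH_ENDPOINT + "?" + urlencode(params)

url = build_search_url(":) feel feeling")
```

`urlencode` also escapes the closing parenthesis (as `%29`), which is equivalent to the literal `)` shown on the slide.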
26. Sentiment Analysis:
Doing Simple Sentiment Analysis
Sample Tweet from API call
{u'iso_language_code': u'en',
 u'to_user_name': None,
 u'to_user_id_str': None,
 u'from_user_id_str': u'59862385',
 u'text': u"Lol i feel ya!!RT @Sweet_Sun_Shine: @joshaustin13 everything's up, its the weekend baby!!!! :) and I plan on enjoying; how are you feeling?",
 u'from_user_name': u'B.Resilient\ue50c\ue50c\ue50c',
 u'profile_image_url': u'http://a3.twimg.com/profile_images/1650184586/joshaustin13_normal.jpg',
 u'id': 145274459127955456L,
 u'to_user': None,
 u'source': u'<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>',
 u'id_str': u'145274459127955456',
 u'from_user': u'joshaustin13',
 u'from_user_id': 59862385,
 u'to_user_id': None,
 u'geo': None,
 u'created_at': u'Fri, 09 Dec 2011 22:51:44 +0000',
 u'metadata': {u'result_type': u'recent'}
}
27. Sentiment Analysis:
Doing Simple Sentiment Analysis
“Twitter Sentiment Classification
using Distant Supervision” (2009)
• Utilizes presence of emoticons “:)” and
“:(” to serve as surrogates for
classification as positive and
negative sentiment statements
• To construct the term-document
matrix relies on a list of positive and
negative key words from Twittratr,
counting number of key words that
appear in each tweet.
• 180K tweets collected for training
purposes between April and June
2009
• 80%+ accuracy in classification
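The emoticon-as-label idea can be sketched as follows. This is an illustration of the general approach only, not the paper's code: the function names are hypothetical, and skipping tweets that contain both emoticons and removing the emoticon from the stored text are design choices made for this sketch.

```python
def emoticon_label(tweet):
    """Return 'positive', 'negative', or None, using emoticons as noisy labels."""
    has_happy = ":)" in tweet
    has_sad = ":(" in tweet
    if has_happy == has_sad:      # neither emoticon, or both: ambiguous, skip
        return None
    return "positive" if has_happy else "negative"

def build_training_set(tweets):
    """Keep labelled tweets, removing the emoticon so it cannot leak into the features."""
    labelled = []
    for tweet in tweets:
        label = emoticon_label(tweet)
        if label is not None:
            text = tweet.replace(":)", "").replace(":(", "").strip()
            labelled.append((text, label))
    return labelled

data = build_training_set([
    "its the weekend baby :)",
    "I feel sick today :(",
    "no emoticon here",
    "mixed feelings :) :(",
])
```

The payoff is that no human annotation is needed: the training set grows as fast as tweets can be collected.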
28. Sentiment Analysis:
Doing Simple Sentiment Analysis
HF = Happy Face, SF = Sad Face
Counts     Type    List    Set
Words      HF      8354    2169
           SF      7702    1996
           Total   16056   3469
Alpha      HF      5917    1094
           SF      5433    1055
           Total   11350   1169
Stop       HF      3425    992
           SF      3325    953
           Total   6750    1563
Stem       HF      3425    895
           SF      3325    850
           Total   6750    1375
Stem w/o   HF      2618    894
           SF      2516    849
           Total   5134    1374
29. Sentiment Analysis:
Doing Simple Sentiment Analysis
Happy Face Sad Face
30. Sentiment Analysis:
Doing Simple Sentiment Analysis
P(H | D) = P(D | H) * P(H) / P(D)
H is the hypothesis and D is the data.
P(H) is the prior probability of H: the probability that H is
correct before the data D are seen.
P(D | H) is the conditional probability of seeing the data D
given that the hypothesis H is true. This conditional
probability is called the likelihood.
P(D) is the marginal probability of D.
P(H | D) is the posterior probability: the probability that
the hypothesis is true, given the data and the previous
state of belief about the hypothesis.
Thomas Bayes
31. Sentiment Analysis:
Doing Simple Sentiment Analysis
Training Set
Message                      Category
Love is great                Positive
I feel great now             Positive
I feel sick today            Negative
Great, today sucks           Negative
Today is going to be good    Positive
P(Positive | Tweet) compared to P(Negative | Tweet), for a tweet whose scored word W1 is “great”:
P(Pos | Tweet) = P(Pos) * P(W1 | Pos) / P(Tweet)
P(Pos | Tweet) ∝ P(Pos) * P(great | Pos) = (3/5) * (2/3) = .4
P(Neg | Tweet) = P(Neg) * P(W1 | Neg) / P(Tweet)
P(Neg | Tweet) ∝ P(Neg) * P(great | Neg) = (2/5) * (1/2) = .2
P(Tweet) is the same in both expressions, so it is dropped when the two are compared.
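The values .4 and .2 are scores rather than true posteriors, because the common denominator P(Tweet) has been left out. Assuming Positive and Negative are the only two classes, P(Tweet) is the sum of the two scores, so dividing by that sum recovers proper probabilities. A small sketch (the helper name `normalize` is ours):

```python
def normalize(scores):
    """Turn unnormalised scores P(class) * P(word | class) into posteriors."""
    total = sum(scores.values())   # plays the role of P(Tweet)
    return {label: score / total for label, score in scores.items()}

# Scores from the slide for a tweet containing the word "great".
scores = {"Positive": (3 / 5) * (2 / 3), "Negative": (2 / 5) * (1 / 2)}
posterior = normalize(scores)
```

Normalizing does not change which class wins; it only makes the numbers interpretable as probabilities (here roughly two to one in favor of Positive).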
32. Spam Detection:
Naïve Bayesian Classifier
Training Set
Message                      Category
Love is great                Positive
I feel great now             Positive
I feel sick today            Negative
Great, today sucks           Negative
Today is going to be good    Positive
P(Positive | Tweet) compared to P(Negative | Tweet), for a tweet with the words “today” and “sucks”:
P(Neg | Tweet) ∝ P(Neg) * P(W1 | Neg) * P(W2 | Neg) * ...
P(Neg | Tweet) ∝ P(Neg) * P(today | Neg) * P(sucks | Neg)
P(Neg | Tweet) ∝ .4 * 1 * .5 = .2
P(Pos | Tweet) ∝ P(Pos) * P(W1 | Pos) * P(W2 | Pos) * ...
P(Pos | Tweet) ∝ P(Pos) * P(today | Pos) * P(sucks | Pos)
P(Pos | Tweet) ∝ .6 * .33 * 0 = 0
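The hand calculation can be reproduced in code. This sketch estimates P(word | class) as the fraction of that class's training messages containing the word, which is what the slide's numbers imply, and applies no smoothing; the function names are hypothetical.

```python
import re
from collections import defaultdict

TRAINING = [
    ("Love is great", "Positive"),
    ("I feel great now", "Positive"),
    ("I feel sick today", "Negative"),
    ("Great, today sucks", "Negative"),
    ("Today is going to be good", "Positive"),
]

def words(text):
    """Set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def train(examples):
    """Count messages per class and, per class, messages containing each word."""
    class_docs = defaultdict(int)
    word_docs = defaultdict(lambda: defaultdict(int))
    for text, label in examples:
        class_docs[label] += 1
        for w in words(text):
            word_docs[label][w] += 1
    return class_docs, word_docs

def score(text, label, class_docs, word_docs):
    """P(label) times the product of P(word | label) over the words in text."""
    result = class_docs[label] / sum(class_docs.values())
    for w in words(text):
        result *= word_docs[label][w] / class_docs[label]
    return result

class_docs, word_docs = train(TRAINING)
neg = score("today sucks", "Negative", class_docs, word_docs)
pos = score("today sucks", "Positive", class_docs, word_docs)
```

The zero for the Positive class shows a known weakness: a single unseen word wipes out the whole product. Add-one (Laplace) smoothing is a common remedy, though it is not part of the calculation shown on the slide.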
34. What is this number?
4.74
35. Does this help?
Frigyes Karinthy Stanley Milgram
6
John Guare Duncan Watts
36. Six Degrees of Separation
A fascinating game grew out of this discussion. One of us suggested performing
the following experiment to prove that the population of the Earth is closer
together now than they have ever been before. We should select any person from
the 1.5 billion inhabitants of the Earth—anyone, anywhere at all. He bet us that,
using no more than five individuals, one of whom is a personal acquaintance, he
could contact the selected individual using nothing except the network of
personal acquaintances.
Frigyes Karinthy, Chains, 1929
A 1 2 3 4 5
37. Sample Metric
From Social Network Analysis
4.74
Average Distance between
Facebook Members
39. Social Network Analysis:
Another Type of Analysis
Which Blogs are Similar?
        Word1  Word2  Word3  …  WordM
Blog1   1      0      0      …  1
Blog2   0      0      1      …  0
Blog3   0      1      0      …  1
…       …      …      …      …  …
BlogN   0      0      0      …  1
Cluster Analysis (e.g. K-Means)
For a detailed description: http://www.slideshare.net/daveking63/text-mining-and-analytics-v6-p2
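To make the clustering step concrete, here is a bare-bones k-means over rows of a blog-by-word matrix. It is a sketch under simplifying assumptions: the matrix is invented, the first k rows seed the centroids (real implementations use random or smarter seeding), and a fixed number of iterations replaces a convergence check.

```python
def kmeans(rows, k, iterations=10):
    """Plain k-means on equal-length numeric rows; returns a cluster index per row."""
    centroids = [list(r) for r in rows[:k]]
    assignment = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        for i, row in enumerate(rows):
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
            assignment[i] = dists.index(min(dists))
        # Move each centroid to the mean of the rows assigned to it.
        for j in range(k):
            members = [rows[i] for i in range(len(rows)) if assignment[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical blog-by-word matrix (1 = the word appears in the blog).
blogs = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]
clusters = kmeans(blogs, k=2)
```

Blogs that share vocabulary end up with the same cluster index, which is the sense of "similar" used on this slide.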
40. Social Network Analysis:
Definitions
Network – Collection of things and their
linkages to one another.
Social Network – Collection of humans, roles,
groups, and/or institutions and their social
relationships with one another.
Social Network Analysis (SNA) – Application of
Graph Theory or Network Science to the study of
social relationships and connections.
41. Social Network Analysis:
Early Efforts
Jacob Moreno: Sociometry and the Sociogram
42. Social Network Analysis:
Definitions
“Ten years ago, the field of
Social Network Analysis
was a scientific backwater.
We were the misfits,
rejected from both
mainstream sociology and
mainstream computer
science.”
44. Social Network Analysis:
So What Happened?
Small Data: Manually collected → Flat-files / in-memory computation
Medium Data: Data snapshots from APIs → SQL Databases
Big Data: Real-time social media data → Big Data Approaches
45. Social Network Analysis:
When things were simpler …
N=26
2005
N=80
46. Social Network Analysis:
and then …
Growth in Social Media
N~1400 Access to SM Network Data
Availability of Open Source Tools
N~3.5K
2011
N~90K
47. Social Network Analysis:
and now …
N=20M
N=80K
N = 721M
48. Social Network Analysis:
Key Elements
Graph or Network: The set [V, E, f] of vertices/nodes, edges/links and the relationship/function connecting them.
Vertices or Nodes: The “things”
Edges or Links: The “relationships”
Diagram: a graph with vertices A, B, C and D, with one Vertex (Node) and one Edge (Link) labeled.
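In code, a graph of this kind is often stored as an adjacency structure: each vertex maps to the set of vertices it shares an edge with. A minimal sketch for the undirected case; the vertex names follow the diagram, but the edge list is an assumed example.

```python
def build_graph(vertices, edges):
    """Adjacency-set representation of an undirected graph (V, E)."""
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)   # an undirected edge is stored in both directions
        adjacency[b].add(a)
    return adjacency

# Assumed edge list for illustration only.
graph = build_graph("ABCD", [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")])
```

Looking up `graph["B"]` then answers "who is B directly connected to?" in constant time, which is the basic operation behind most of the metrics that follow.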
49. Social Network Analysis:
Types of Edges or Links
Undirected, Unweighted: Facebook Friends
Directed, Unweighted: Twitter Followers
Undirected, Weighted: Facebook Friends
Directed, Weighted: Email Network
Each type is drawn on nodes A, B and C; the weighted diagrams carry the edge weights 100, 60, 5, 70, 20 and 10.
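The four edge types map naturally onto four small data structures. The names and weights below are made-up examples, not values read off the slide.

```python
# Undirected, unweighted (mutual friendship): store each pair once, unordered.
friends = {frozenset(("A", "B")), frozenset(("A", "C"))}

# Directed, unweighted (follower -> followed): order matters.
follows = {("A", "B"), ("C", "A")}

# Undirected, weighted (e.g. interaction counts between friends).
friend_strength = {frozenset(("A", "B")): 5, frozenset(("A", "C")): 10}

# Directed, weighted (e.g. number of emails from sender to receiver).
emails_sent = {("A", "B"): 100, ("B", "A"): 60}

def are_friends(x, y):
    """Symmetric lookup: the order of x and y does not matter."""
    return frozenset((x, y)) in friends

def follows_user(x, y):
    """Asymmetric lookup: x follows y does not imply y follows x."""
    return (x, y) in follows
```

The choice between `frozenset` and tuple keys is what encodes direction: a frozenset ignores order, a tuple preserves it.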
50. Social Network Analysis:
Types of Networks
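Unimodal: one type of node (P1, P2, P3)
Bimodal: two types of node (P1, P2 and E1)
Multiplex: several types of link among the same nodes (P1, P2, P3 linked by Follows, Replies To, Mentions)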
52. Social Network Analysis:
Node Metrics (Centrality)
Degree
– Definition: Number of edges or links. In-degree: links in. Out-degree: links out.
– Interpretation: How connected is a node? How many people can this person reach directly?
– Reasoning: Higher probability of receiving and transmitting information flows in the network. Nodes considered to have influence over a larger number of nodes and/or are capable of communicating quickly with the nodes in their neighborhood.
Betweenness
– Definition: Number of times node or vertex lies on shortest path between 2 nodes divided by number of all the shortest paths.
– Interpretation: How important is a node in terms of connecting other nodes? How likely is this person to be the most direct route between two people in the network?
– Reasoning: Degree to which node controls flow of information in the network. Those with high betweenness function as brokers. Useful where a network is vulnerable.
Closeness
– Definition: 1 over the average distance between a node and every other node in the network.
– Interpretation: How easily can a node reach other nodes? How fast can this person reach everyone in the network?
– Reasoning: Measure of reach. Importance based on how close a node is located with respect to every other node in the network. Nodes able to reach most or be reached by most all other nodes in the network through geodesic paths.
Eigenvector
– Definition: Proportional to the sum of the eigenvector centralities of all the nodes directly connected to it.
– Interpretation: How important, central, or influential are a node’s neighbors? How well is this person connected to other well-connected people?
– Reasoning: Evaluates a player's popularity. Identifies centers of large cliques. Node with more connections to higher scoring nodes is more important.
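Two of these measures, degree and closeness, are simple enough to compute directly from an adjacency structure. The sketch below assumes an undirected, connected graph stored as a dict of neighbor sets; betweenness and eigenvector centrality need more machinery and are left out. The star network is a hypothetical example.

```python
from collections import deque

def distances_from(graph, source):
    """Breadth-first search: shortest path length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def degree_centrality(graph, node, normalised=False):
    """Number of links; optionally divided by the n - 1 links that are possible."""
    degree = len(graph[node])
    return degree / (len(graph) - 1) if normalised else degree

def closeness_centrality(graph, node):
    """1 over the average distance to every other node (assumes a connected graph)."""
    dist = distances_from(graph, node)
    return (len(graph) - 1) / sum(dist.values())

# Hypothetical 4-node star: hub H linked to X, Y and Z.
star = {"H": {"X", "Y", "Z"}, "X": {"H"}, "Y": {"H"}, "Z": {"H"}}
```

In the star, the hub reaches everyone in one step and scores the maximum on both measures, while each spoke must route through the hub.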
53. Social Network Analysis:
Centrality – Who is most important?
Figure: network diagram of nodes A through S, with callouts marking the top node for each measure: Deg (J), Betw (H), Close (I), Eigen (D).
Node  Degree  Normed Degree  Betweenness  Closeness  Eigen Vector
A     3       0.17           0.00         0.29       0.29
B     4       0.22           0.01         0.30       0.36
C     2       0.11           0.03         0.35       0.18
D     6       0.33           0.04         0.31       0.46
E     3       0.17           0.00         0.29       0.30
F     4       0.22           0.11         0.36       0.35
G     5       0.28           0.19         0.37       0.43
H     5       0.28           0.58         0.45       0.28
I     4       0.22           0.53         0.46       0.13
J     7       0.39           0.43         0.43       0.12
K     3       0.17           0.00         0.32       0.06
L     3       0.17           0.01         0.33       0.05
M     3       0.17           0.21         0.33       0.04
N     3       0.17           0.03         0.38       0.07
O     2       0.11           0.00         0.31       0.05
P     3       0.17           0.03         0.38       0.08
Q     2       0.11           0.11         0.26       0.01
R     1       0.06           0.00         0.32       0.07
S     1       0.06           0.00         0.21       0.00
54. Social Network Analysis:
Cohesion – How well connected?
Density
– Definition: Ratio of the number of edges in the network over the total number of possible edges between all pairs of nodes.
– Interpretation: How well connected is the overall network?
– Reasoning: Perfectly connected network is called a "clique" and has a density of 1.
Average Degree
– Definition: Average number of links each node or vertex has.
– Interpretation: How well connected are the nodes on average?
– Reasoning: The higher the average, the better connected the members are.
Average Path Length (Distance)
– Definition: Average number of edges or links between any two nodes (along the shortest path).
– Interpretation: On average, how far apart are any two nodes?
– Reasoning: This is synonymous with the "degrees of separation" in a network.
Diameter
– Definition: Longest (shortest path) between any two nodes.
– Interpretation: At most, how long will it take to reach any node in the network? Sparse networks usually have greater diameters.
– Reasoning: Measure of the reach of the network.
Clustering
– Definition: A node's clustering coefficient is the density of its 1.5 degree egocentric network (ratio of connections among ego's alters). For the entire network it is the average of all the coefficients for the individual nodes.
– Interpretation: What proportion of ego's alters are connected? More technically, how many nodes form triangular subgraphs with their adjacent nodes?
– Reasoning: Measures certain aspects of "cliquishness." Proportion of your friends that are also friends with each other. Another way to measure is to determine (in an undirected graph) the ratio of the number of times that two links emanating from the same node are also linked.
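The cohesion measures follow directly from their definitions. A minimal sketch, assuming an undirected, connected graph stored as a dict of neighbor sets; the four-node example graph is hypothetical, and clustering is reported as undefined (None) for nodes with fewer than two neighbors.

```python
from collections import deque
from itertools import combinations

def bfs(graph, source):
    """Shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def cohesion(graph):
    """Density, average degree, average path length and diameter."""
    n = len(graph)
    edges = sum(len(nbs) for nbs in graph.values()) / 2
    path_lengths = [d for v in graph for u, d in bfs(graph, v).items() if u != v]
    return {
        "density": edges / (n * (n - 1) / 2),
        "average_degree": 2 * edges / n,
        "average_path_length": sum(path_lengths) / len(path_lengths),
        "diameter": max(path_lengths),
    }

def clustering(graph, node):
    """Share of pairs of a node's neighbours that are themselves linked."""
    nbs = list(graph[node])
    if len(nbs) < 2:
        return None  # undefined for nodes with fewer than two neighbours
    linked = sum(1 for a, b in combinations(nbs, 2) if b in graph[a])
    return linked / (len(nbs) * (len(nbs) - 1) / 2)

# Hypothetical graph: triangle A-B-C with a pendant node D attached to C.
g = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
stats = cohesion(g)
```

Running all-pairs breadth-first search is fine for small graphs like this one; networks with millions of nodes need sampling or approximate methods.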
55. Social Network Analysis:
Network Metrics (Centralization)
Measure: Definition
Degree Centralization: Variation in the degrees of vertices divided by the maximum degree variation that is possible in a network of the same size
Betweenness Centralization: Variation in the betweenness centrality of vertices divided by the maximum variation in betweenness centrality scores possible in a network of the same size
Closeness Centralization: Variation in the closeness centrality of vertices divided by the maximum variation in closeness centrality scores possible in a network of the same size
Eigenvector Centralization: Variation in the eigenvector centrality of vertices divided by the maximum variation in eigenvector centrality scores possible in a network of the same size
1. Variation is the summed absolute differences between centrality scores of the vertices and the maximum centrality score among them.
2. Network is more centralized if the vertices vary more with respect to their centrality.
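For degree, the definition above can be written out in a few lines. The sketch assumes an undirected graph without self-loops or duplicate edges and at least three nodes; in that setting the maximum possible variation is (n - 1)(n - 2), reached by a star network. Both example graphs are hypothetical.

```python
def degree_centralization(graph):
    """Summed gap to the maximum degree, divided by the largest gap possible."""
    n = len(graph)
    degrees = [len(nbs) for nbs in graph.values()]
    variation = sum(max(degrees) - d for d in degrees)
    max_variation = (n - 1) * (n - 2)   # attained by a star network
    return variation / max_variation

# A star is maximally centralized; a ring, where every degree is equal, is not centralized at all.
star = {"H": {"X", "Y", "Z"}, "X": {"H"}, "Y": {"H"}, "Z": {"H"}}
ring = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
```

The other three centralization measures follow the same pattern with a different centrality score and a different theoretical maximum.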
56. Social Network Analysis:
Cohesion – How well connected?
Figure: network diagram of nodes A through S.
Measure                      Value
Average Degree               3.37
Density                      0.19
Average Distance             3.06
Diameter                     8
Degree Centralization        0.22
Betweenness Centralization   0.48
Closeness Centralization     0.27
Eigenvector Centralization   0.56
Clustering Coefficient       0.43

Node  Clustering
A     0.67
B     0.67
C     0.00
D     0.40
E     1.00
F     0.50
G     0.50
H     0.10
I     0.33
J     0.29
K     0.67
L     0.67
M     0.33
N     0.67
O     1.00
P     0.67
Q     0.00
R     NA
S     NA
57. Social Network Analysis:
Some General Tendencies
• Small diameters and small average path
lengths
• High clustering coefficients relative to
random processes
• Rate of clustering among the higher-
degree nodes decreases with degree
• Fat tailed degree distributions relative to
random processes
• Hard to find networks that actually follow
a strict power law
• Positive assortativity and high degrees of
homophily at least in social networks
58. Social Network Analysis:
Is it really a small world?
http://www.touchgraph.com/navigator
59. Social Network Analysis:
Is it really a small world?
Ego Steps 1 1
Friends 1 50 100
FoF 2 2,500 10,000
FoFoF 3 125,000 1,000,000
FoFoFoF 4 6,250,000 100,000,000
FoFoFoFoF 5 312,500,000 10,000,000,000
FoFoFoFoFoF 6 15,625,000,000 1,000,000,000,000
The naïve view
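The table is plain exponentiation: if everyone has the same number of friends and no two people ever share a friend, the number of people at a given step is friends raised to the power of steps. A short sketch (the function name is ours):

```python
def naive_reach(friends_per_person, steps):
    """People at exactly `steps` hops, assuming no overlap between friend circles."""
    return friends_per_person ** steps

reach_50 = [naive_reach(50, s) for s in range(7)]
reach_100 = [naive_reach(100, s) for s in range(7)]
```

The view is naive because real friend circles overlap heavily, so the true reach grows far more slowly than these figures suggest.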
60. Social Network Analysis:
Is it really a small world?
61. Social Network Analysis:
Is it really a small world?
The emergence of online social networking services
over the past decade has revolutionized how social
scientists study the structure of human relationships [1].
As individuals bring their social relations online, the
focal point of the internet is evolving from being a
network of documents to being a network of people,
and previously invisible social structures are being
captured at tremendous scale and with
unprecedented detail.
62. Social Network Analysis:
Study Population
Active* Global US
Members 721M 149M
Friends 68.7B 15.9B
Aver. Friends 190 214
Total Pop 6.9B 260M
Accessed within 28 days of May ’11
At least one friend
Over 13 years of age
63. Social Network Analysis:
Degree Distribution
N = 721M
F = 69B
Encouraged
Up to 20
Median
~ 99
64. Social Network Analysis:
Distance Distribution
Average Average
4.7 4.3
World 92% 99.6%
US 96% 99.7%
65. Social Network Analysis:
Connected Components
2000 Members
99.91% of Members
Connected component – a set of individuals in which each pair of
individuals is connected by at least one path through the network
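Finding connected components is a graph traversal: start at an unvisited node, collect everything reachable from it, and repeat. A minimal sketch for an undirected graph stored as a dict of neighbor sets; the six-node example is invented.

```python
def connected_components(graph):
    """Return a list of components (sets of nodes), largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = {start}
        stack = [start]
        while stack:                      # depth-first walk from the start node
            node = stack.pop()
            for nb in graph[node]:
                if nb not in component:
                    component.add(nb)
                    stack.append(nb)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical network: a chain A-B-C, a pair D-E, and an isolated node F.
g = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": {"E"}, "E": {"D"}, "F": set()}
parts = connected_components(g)
share_in_largest = len(parts[0]) / len(g)
```

The share of members in the largest component is the kind of figure quoted on this slide.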
66. Social Network Analysis:
Cohesion
You’re friends with a significant fraction of your friends’ friends: 14% for 100
Feld: your friends have more friends than you (same with sex partners): 100 friends – 28K unique FoFs; 40K non-unique FoFs
68. Social Network Analysis:
Birds of a Feather - Homophily
84% in the same country
69. Social Network Analysis:
“Revolution 2.0 will not be Televised”
Tweet Rate – Feb. 24-25, 2011, Tahrir Square
It will be Tweeted & Retweeted
70. Social Network Analysis:
“Revolution 2.0 will not be Televised”
1% Feed from the two day period
Nodes = 25178 Links = 32471
It will be Tweeted & Retweeted
71. Social Network Analysis:
Network Features
Average Degree = 2.58 Average Distance = 5.40
Min Degree = 0 Min Distance = 1
Max Degree = 729 Diameter = 22
Measure Value
Density 0.0001
Degree Centralization 0.029
Betweenness Centralization 0.076
Closeness Centralization WC
Eigenvector Centralization 0.724
Clustering Coefficient 0.0045
Number of Components 3122
Size of Largest Component 17762
% in Largest Component 70.50%
It will be Tweeted & Retweeted
73. Social Network Analysis:
Node and Network Metrics
It will be Tweeted & Retweeted
74. Social Network Analysis:
Egocentric Analysis
Measure                      Ghonim   Bieber
Vertices                     730      14
Edges                        970      13
Density                      0.004    0.140
Average Degree               2.660    1.860
Average Distance             1.990    1.860
Diameter                     2.000    2.000
Degree Centralization        0.999    1.000
Betweenness Centralization   0.995    1.000
Closeness Centralization     0.999    1.000
EigenVector Centralization   0.990    1.740
Cluster Coefficient          0.003    0.000
Number of Components         1        1
Size of Largest Component    730      14
% in Largest Component       100%     100%
It will be Tweeted & Retweeted
75. Social Network Analysis:
Subcommunities of US Political Blogs
• Single day snapshot of a
snowball sample of
political blogs (N=1490)
• Manually assigned as
Liberal or Conservative
• Focus on blogrolls and
front page citations
• A primary question:
Cyberbalkanization?
79. Social Networks:
Political Blogs - Metrics
Measure Liberal Conservative Total
Density 0.02 0.03 0.01
No Components 188 107 268
Largest Comp 569 569 1222
Largest Comp% 75.10 84.97 82.01
Min Deg 0 0 0
Max Deg 305 296 351
Aver Deg 19.26 21.42 22.44
Deg Central 0.38 0.38 0.22
Diameter 6 7 8
Aver Dist 2.51 2.51 2.74
Betw Cent 0.10 0.16 0.06
EigVect Cent 0.23 0.26 0.22
Clust Coeff 0.31 0.20 0.22
80. Social Network Analysis:
Political Blogs – Empirical Comparisons
Network types: Blog (Political); Citation (Biology, Economics, Math, Physics); Twitter (UK Journalists, Egypt Twitter)
Measure                                  Political  Biology    Economics  Math     Physics  UK Journalists  Egypt Twitter
Number of Nodes                          1,490      1,520,521  81,217     253,339  52,909   523             25,178
Average Degree                           22.4       15.5       1.7        3.9      9.3      88              3
Average Path Length                      2.74       4.9        9.5        7.6      6.2      1.88            5.4
Diameter of the Largest Component        8          24         29         27       20       4               22
Overall Clustering                       0.22       0.09       0.16       0.15     0.45     0.41            0.004
Fraction of Nodes in Largest Component   0.82       0.92       0.41       0.82     0.85     0.99            0.7
81. Social Network Analysis:
Political Blogs – Model Comparisons
Each model column gives the 2.5% and 97.5% values.
Measure                                  Political Blogs  Bernoulli      Deg Conditional  Small World    Pref Attachment
                                                          2.5%   97.5%   2.5%   97.5%     2.5%   97.5%   2.5%   97.5%
Number of Components                     268              1      1       1      1         1      1       96     134
Fraction of Nodes in Largest Component   0.82             100    100     100    100       100    100     0.91   0.94
Diameter of the Largest Component        8                4      4       7      9         4      5       7      9
Average Path Length                      2.74             2.61   2.63    3.29   3.36      2.98   3.01    3.07   3.14
Overall Clustering                       0.226            0.017  0.018   0.029  0.031     0.355  0.372   0.095  0.109
Betweenness                              0.065            0.002  0.003   0.010  0.021     0.003  0.004   0.038  0.064
82. Social Network Analysis:
Bernoulli
• Fixed number of nodes and lines
– Sets density and average degree = density * (n-1)
• Assigns lines to each pair of nodes independently with fixed probabilities
– Each line a random binary variable
• Produces Poisson degree distribution
• Small diameter
– ln(#nodes)/ln(aver.degree)
• Low clustering
– Average degree/(#nodes-1)
• Large component
– E.g. At aver.degree of 1.5 ~50% in largest component

Measure                      Bernoulli
Vertices                     1000
Edges                        11149
Density                      0.022
Average Degree               22.900
Average Distance             2.570
Diameter                     4
Degree Centralization        0.016
Betweenness Centralization   0.003
Closeness Centralization     0.066
EigenVector Centralization   0.034
Cluster Coefficient          0.021
Number of Components         1
Size of Largest Component    1000
% in Largest Component       100%
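A Bernoulli random graph can be generated by flipping one biased coin per pair of nodes. The sketch below implements that independent-probability variant (a fixed probability p rather than a fixed number of lines); the function name, the graph size and the seed are illustrative choices, and this is not the generator behind the slide's table.

```python
import random
from itertools import combinations

def bernoulli_graph(n, p, seed=None):
    """Link each of the n*(n-1)/2 node pairs independently with probability p."""
    rng = random.Random(seed)
    graph = {v: set() for v in range(n)}
    for a, b in combinations(range(n), 2):
        if rng.random() < p:          # each line is an independent binary variable
            graph[a].add(b)
            graph[b].add(a)
    return graph

n = 200
g = bernoulli_graph(n, 0.05, seed=1)
edges = sum(len(nbs) for nbs in g.values()) // 2
density = edges / (n * (n - 1) / 2)
average_degree = 2 * edges / n
```

The identity average degree = density * (n - 1) holds exactly for any such graph, while the density itself only approximates p.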
83. Social Network Analysis:
Small World
• Fixed number of nodes and the number (r) of nearby neighbors to which each vertex is linked
– Implies strong transitive ties
– Ensures higher clustering ~ (3r-3)/(4r-2)
• Probabilistically rewires selected lines from one vertex to another (each line and vertex has equal probability of being selected).
– Ensures low average path length

Measure                      Small World
Vertices                     1000
Edges                        11000
Density                      0.022
Average Degree               22.000
Average Distance             3.070
Diameter                     5.000
Degree Centralization        0.007
Betweenness Centralization   0.007
Closeness Centralization     0.079
EigenVector Centralization   0.031
Cluster Coefficient          0.514
Number of Components         1
Size of Largest Component    1000
% in Largest Component       100%
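The two steps, a regular ring of near neighbors followed by random rewiring, can be sketched as below. Here r is read as the number of neighbors on each side of a vertex, the reading under which the unrewired ring's clustering equals (3r-3)/(4r-2). The rewiring rule is a simplified variant written for this illustration, not necessarily the procedure used for the slide's table.

```python
import random
from itertools import combinations

def ring_lattice(n, r):
    """Link each vertex to its r nearest neighbours on each side of a ring."""
    graph = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, r + 1):
            w = (v + step) % n
            graph[v].add(w)
            graph[w].add(v)
    return graph

def rewire(graph, p, seed=None):
    """With probability p, move the far end of each line to a random new vertex."""
    rng = random.Random(seed)
    n = len(graph)
    for a in range(n):
        for b in sorted(graph[a]):
            if a < b and rng.random() < p:
                candidates = [c for c in range(n) if c != a and c not in graph[a]]
                if candidates:
                    c = rng.choice(candidates)
                    graph[a].discard(b)
                    graph[b].discard(a)
                    graph[a].add(c)
                    graph[c].add(a)
    return graph

def average_clustering(graph):
    """Mean over all nodes of the share of linked neighbour pairs (0 if fewer than 2)."""
    total = 0.0
    for nbs in graph.values():
        k = len(nbs)
        if k >= 2:
            links = sum(1 for x, y in combinations(nbs, 2) if y in graph[x])
            total += links / (k * (k - 1) / 2)
    return total / len(graph)

lattice_clustering = average_clustering(ring_lattice(20, 2))
rewired = rewire(ring_lattice(20, 2), 0.2, seed=3)
```

Rewiring keeps the number of lines fixed while creating a few long-range shortcuts, which is what shortens the average path.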
84. Social Network Analysis:
Preferential Attachment
• Vast majority of new nodes link to nodes with proportionately higher degrees
– “Rich get richer”
• Probabilistic part involves the selection of vertices for new lines (e.g. end vertex for new line is proportional to degree of end vertex)
• Tend to have a small world structure
• Exhibit long-tailed degree distributions
– Right-hand tail of the distribution follows a “scale-free” power-law distribution
– Log-log graph is a straight line
– Ensures low average path length

Measure                      Preferential
Vertices                     1000
Edges                        11242
Density                      0.019
Average Degree               18.770
Average Distance             2.690
Diameter                     7
Degree Centralization        0.154
Betweenness Centralization   0.037
Closeness Centralization     -
EigenVector Centralization   0.188
Cluster Coefficient          0.192
Number of Components         92
Size of Largest Component    908
% in Largest Component       91%
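The "rich get richer" rule can be sketched as a growth process: each new node links to m existing nodes, picked with probability proportional to their current degree. This is a Barabási–Albert-style illustration with assumed parameters; it always yields a single component, so it is not the generator behind the table above, which reports 92 components.

```python
import random

def preferential_attachment(n, m, seed=None):
    """Grow a graph to n nodes; each new node links to m degree-weighted targets."""
    rng = random.Random(seed)
    # Start from a small complete graph on m + 1 nodes.
    graph = {v: set(u for u in range(m + 1) if u != v) for v in range(m + 1)}
    # Each node appears in `ends` once per line it touches, so a uniform draw
    # from `ends` is a degree-proportional draw over nodes.
    ends = [v for v in graph for _ in graph[v]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        graph[new] = set(targets)
        for t in targets:
            graph[t].add(new)
            ends.extend([t, new])
    return graph

g = preferential_attachment(300, 2, seed=7)
degrees = sorted((len(nbs) for nbs in g.values()), reverse=True)
edges = sum(degrees) // 2
```

Sorting the degrees shows the long tail: a handful of early nodes collect many links while most nodes stay near the minimum of m.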