Analyzing social media may be a daunting task, given its overwhelming size and messy, unstructured nature. Further, for those new to analyzing social behavior in online systems, there are any number of pitfalls that make it challenging to find the meaning in the mess. The goal of this session is to provide practical tips for collecting and analyzing social media data.
2. Agenda
Introductions
Overview
Lesson scenarios with real data
Usage analysis: Predictors of coming back
Social network analysis: Finding who you like
Content analysis: Relationships, cliques, and their conversation
Focus on tools and tips, and on special considerations when examining social data
4. SHELLY FARNHAM: INDUSTRY RESEARCH
Specialize in social technologies
Social networks, community, identity, mobile social
Early stage innovation
Extremely rapid R&D cycle
study, brainstorm, design, prototype, deploy,
evaluate (repeat)
Convergent evaluation methodologies: usage
analysis, interviews, questionnaires
Career
PhD in Social Psychology from UW
7 years Microsoft Research
Virtual Worlds, Social Computing, Community Technologies
4 years startup world
Waggle Labs (consulting), Pathable
2 Years Yahoo!
FUSE Labs, Microsoft Research
Personal Map
5. EMRE KICIMAN
Specialize in social data analytics
Social media, social networks, search
Methods
Machine learning
Information extraction, entity recognition from social
data
Prototyping
Career
Ph.D. and M.S. in computer science from Stanford
University
B.S. in Electrical Engineering and Computer
Currently at Internet Services Research Center,
Microsoft Research
6. ANALYSIS THROUGHOUT R&D CYCLE
[Chart: Importance of information in selecting chat partner. Labels: Rank, Rating, Similarity, Interacts with friends, Ratings by friends. Axis ticks: 0 to 7.]
10. SOCIAL MEDIA ANALYSIS
Common types
Usage analysis: behaviors, interactions
Network analysis: patterns in networks (sets of
pair-wise connections)
Content analysis: semantics, sentiment of
conversational content
Common steps
Step 1. Getting started: defining questions
Step 2. Processing data: extraction, cleaning,
summarization
Step 3. High level analysis: inference
11. CASE STUDY: USAGE ANALYSIS
So.cl usage analysis as case study scenario, lessons learned apply
to other forms of social media and other forms of analysis
So.cl is an experimental web site
that allows people to connect around
their interests by integrating search
tools with social networking.
How important are social
interactions in encouraging users to
become engaged with an interest
network?
12. SO.CL
reimagining search as social from the ground up
search +
sharing +
networking
= informal
discovery
and learning
History:
Oct 2011:
Pre-release deployment study
Dec 2011:
Private, invitation-only beta
May 2012:
removed invitation restrictions
Nov 2012:
over 300K registered users,
13K active per month
Try it now! http://www.so.cl
13. INTEREST
NETWORK
GOALS
Find others around
common interests
Be inspired by new
interests
Learn from each other
through these shared
interests
14. HOW IT
WORKS
Search & Post
Feed Filters
Feed
People
Try it now! http://www.so.cl – use facsumm tag
15. POST BUILDING
Search (Bing)
Filter Results
Post Builder
Results
Experience:
Step 1: Perform search
Step 2: Click on items
in results to add
to post
Step 3: Add a message
Step 4: Tag
18. DEFINING RESEARCH QUESTION
Amount of data overwhelming – the more
defined your question, the easier the analysis
What real world problem are you trying to
explore?
Avoid pitfall of technology for technology’s sake
What argument do you want to be able to
make?
State your problem as a hypothesis
19. CASE SCENARIO:
Real world problem:
Help people learn online
Argument want to make:
People are more motivated to explore new interests via
social media than via search alone because of the
opportunity to connect with others.
Hypothesis:
If people receive a social response when they first join
So.cl they are more likely to become engaged.
20. OPERATIONALIZING
CONSTRUCTS
Operationalize = to make measurable
Always review related literature for best practices
How do you measure…
Friendship? Similarity? Interest? Trend?
Conversation? Community? Engagement?
Can you operationalize with existing data, or do
you need to generate more?
21. CASE SCENARIO:
Hypothesis:
If people receive a social response when they first join So.cl they are more
likely to become engaged.
Measuring social/behavioral constructs:
When first join
First session = time of first action to time of last action prior to an hour of inactivity
Social responses
Follows user, likes user’s post(s), comments on user’s post(s)
Engagement = coming back
A second session = any action occurs 60 minutes or more after first session
Restating hypothesis:
If people receive follows, likes, and comments in their first session, they are
more likely to come back for a second session
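The session definitions above can be sketched in code. This is a minimal illustration, assuming each logged action is a (user_id, timestamp) pair; the function names and the sample data are hypothetical and are not So.cl's actual instrumentation.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=60)  # inactivity gap that ends a session (from the slide)

def split_sessions(timestamps, gap=SESSION_GAP):
    """Group one user's action times into sessions separated by >= gap of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # gap reached: start a new session
    return sessions

def came_back(actions):
    """Map user_id -> True if the user has a second session."""
    by_user = defaultdict(list)
    for user_id, ts in actions:
        by_user[user_id].append(ts)
    return {u: len(split_sessions(t)) >= 2 for u, t in by_user.items()}

# Hypothetical sample data
t0 = datetime(2012, 10, 1, 9, 0)
actions = [
    ("alice", t0),
    ("alice", t0 + timedelta(minutes=5)),
    ("alice", t0 + timedelta(hours=30)),   # returns more than 60 minutes later
    ("bob", t0),
    ("bob", t0 + timedelta(minutes=59)),   # same session, never returns
]
returned = came_back(actions)
```

With the engagement measure reduced to a per-user boolean, it can be used directly as the outcome variable in later analysis.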
23. COLLECTING DATA
Existing tools
APIs (Twitter, Foursquare, Yelp)
Web analytics (Google Analytics)
Write crawlers
Writing your own instrumentation system
e.g. log each call to server, query parameters
29. COMMON INSTRUMENTATION SCHEMA
Actions table
One row per meaningful action
Filter out non-meaningful, non-user generated actions
30. COMMON INSTRUMENTATION SCHEMA
Content table(s):
One row per content item, with text, URL, etc. of that item
e.g. messages, pictures shared, likes, tags
31. COMMON INSTRUMENTATION SCHEMA
Across tables, with social systems:
Instrument the social target (PersonA responds to PersonB)
Instrument the parent item (e.g., Comment A, Comment B, and Comment C are responses to parent item PostB)
In other words, instrument who is interacting with whom, and in what context
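One way to picture the schema described on these slides is as two SQL tables. The sketch below uses an in-memory SQLite database; all table names, column names, and sample rows are assumptions made for illustration, not the schema used by any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content (
    content_id   INTEGER PRIMARY KEY,
    content_type TEXT NOT NULL,      -- e.g. post, comment, like, tag
    author_id    TEXT NOT NULL,
    parent_id    INTEGER REFERENCES content(content_id),  -- parent item, if any
    body         TEXT,
    url          TEXT
);
CREATE TABLE actions (
    action_id      INTEGER PRIMARY KEY,
    actor_id       TEXT NOT NULL,    -- who acted
    action_type    TEXT NOT NULL,    -- e.g. post, comment, like, follow
    target_user_id TEXT,             -- social target: whom the action was directed at
    content_id     INTEGER REFERENCES content(content_id),
    created_at     TEXT NOT NULL     -- ISO 8601 timestamp
);
""")

# Hypothetical rows: PersonB posts; PersonA comments on that post.
conn.execute("INSERT INTO content VALUES (1, 'post', 'PersonB', NULL, 'Water clocks', NULL)")
conn.execute("INSERT INTO content VALUES (2, 'comment', 'PersonA', 1, 'Nice find', NULL)")
conn.execute("INSERT INTO actions VALUES (1, 'PersonB', 'post', NULL, 1, '2012-10-01T09:00:00')")
conn.execute("INSERT INTO actions VALUES (2, 'PersonA', 'comment', 'PersonB', 2, '2012-10-01T09:05:00')")

# Who responded to whom, and in the context of which parent item?
responses = conn.execute("""
    SELECT a.actor_id, a.target_user_id, c.parent_id
    FROM actions a JOIN content c ON a.content_id = c.content_id
    WHERE a.target_user_id IS NOT NULL
""").fetchall()
```

Recording the social target and the parent item at logging time avoids having to reconstruct who-responded-to-whom from content afterwards.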
32. REDUCING LARGE DATA
Filters
Time span, type of person, type of actions
Sampling
Random selection
Snowball sampling, to get a complete picture of a person’s
social experience
Consider your research questions, how you
want to generalize
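The two sampling approaches can be sketched as follows. This is a simplified illustration: the interaction network, the function names, and the number of waves are all hypothetical.

```python
import random

def random_sample(user_ids, k, seed=0):
    """Simple random sample of k users (seeded for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(sorted(user_ids), k)

def snowball_sample(seeds, neighbors, waves=1):
    """Start from seed users and add everyone they interacted with, wave by wave."""
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(waves):
        frontier = {n for u in frontier for n in neighbors.get(u, ())} - sampled
        sampled |= frontier
    return sampled

# Hypothetical interaction network: user -> users they interacted with
neighbors = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b", "e"}, "e": {"d"}}
one_wave = snowball_sample({"a"}, neighbors, waves=1)
two_waves = snowball_sample({"a"}, neighbors, waves=2)
picked = random_sample(neighbors.keys(), k=2)
```

A random sample supports generalizing to the population of users, while a snowball sample keeps each sampled person's social neighborhood intact at the cost of over-representing well-connected people.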
33. FILTERING & SAMPLING
Filtered out administrators/community
managers
New users only
Date range: Sept 28 to Oct 13
100% sample for that time span: 2462
people
34. SYSTEMATIC BIASES IN SOCIAL SYSTEMS
If you want to understand your “typical”
users, keep in mind that you will generally find:
A large percent never become active or
return; these “lookiloos” can unduly bias
averages
Common reporting format:
X% performed Y behavior, of those averaged Z
times each
5% commented on a post their first session,
averaging 5 times each
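The reporting format above can be computed in a few lines. The sketch below uses hypothetical counts chosen to reproduce the slide's example figures; the function name is an assumption.

```python
def report(counts_per_user, behavior):
    """Summarize as: X% performed the behavior; of those, averaged Z times each."""
    n_users = len(counts_per_user)
    doers = [c for c in counts_per_user.values() if c > 0]
    pct = 100.0 * len(doers) / n_users if n_users else 0.0
    mean_among_doers = sum(doers) / len(doers) if doers else 0.0
    return f"{pct:.0f}% {behavior}, averaging {mean_among_doers:.1f} times each"

# Hypothetical counts of first-session comments per user (most users do nothing)
comments = {f"user{i}": 0 for i in range(19)}
comments["user19"] = 5
summary = report(comments, "commented on a post their first session")
naive_mean = sum(comments.values()) / len(comments)  # dominated by the zeros
```

Here the naive mean over all users is 0.25 comments, which describes nobody; splitting the statistic into "how many did it" and "how much, among those who did" is more informative.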
35. OUTLIERS
Filtered out 13 outliers: people with z > 4 in number of
actions (among those who did more than sign in)
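A z-score filter of this kind might look like the sketch below. It reads the slide as computing z-scores only over users who did more than sign in; that reading, the function name, and the sample data are assumptions.

```python
from statistics import mean, pstdev

def filter_outliers(action_counts, z_max=4.0):
    """Drop users whose action count lies more than z_max SDs above the mean.

    Only users who did more than sign in (count > 0) enter the computation.
    """
    active = {u: c for u, c in action_counts.items() if c > 0}
    mu = mean(active.values())
    sigma = pstdev(active.values())
    if sigma == 0:
        return active, {}
    kept, removed = {}, {}
    for user, count in active.items():
        z = (count - mu) / sigma
        (removed if z > z_max else kept)[user] = count
    return kept, removed

# Hypothetical data: 99 ordinary users and one hyper-active account
counts = {f"user{i}": 1 + i % 10 for i in range(99)}
counts["bot"] = 1000
kept, removed = filter_outliers(counts)
```

Because extreme values inflate the standard deviation themselves, it can be worth re-checking the distribution after one pass of filtering.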
36. SYSTEMATIC BIASES IN SOCIAL SYSTEMS
A small percent are “hyper-active” users (avid users,
spammers, trolls, administrators) who can
unduly bias averages
Remove outliers
A substantial percent are consumers but not
producers (“lurkers”); often there is no signal for
lurkers
Consult literature, related work for estimates (So.cl: about 75%
lurkers)
Custom instrumentation, logging sign-ins
Web analytics for clicks
37. PLAYING WITH YOUR DATA
Very important to spend time examining data
Descriptives, Frequencies, Correlations, Graphs
Use tool that easily generates graphs, correlations
Does it make sense? If not, really chase it down. Often
a bug or misinterpretation of data.
38. AGGREGATIONS
Aggregation: merging down for summarization
What is your level of analysis?
Person, group, network
Content types
If person is unit of analysis, aggregate measures to
the person level
E.g. in SPSS: one line per person
It is very important to have the appropriate unit of analysis, to avoid bias in
statistics
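Aggregating an action log down to one line per person can be sketched like this. The log format (user_id, action_type) and the function name are hypothetical.

```python
from collections import Counter, defaultdict

def aggregate_by_person(action_rows):
    """Collapse an action log (one row per action) to one row per person."""
    per_person = defaultdict(Counter)
    for user_id, action_type in action_rows:
        per_person[user_id][action_type] += 1
    action_types = sorted({a for _, a in action_rows})
    return [
        {"user_id": u, **{a: per_person[u][a] for a in action_types}}
        for u in sorted(per_person)
    ]

# Hypothetical action log
log = [("alice", "post"), ("alice", "like"), ("alice", "like"), ("bob", "search")]
rows = aggregate_by_person(log)
```

Each resulting row has the same columns, with zeros for behaviors a person never performed, which is the shape most statistics packages expect.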
40. DESCRIPTIVES OF ACTIVE SESSIONS
Active session = a time of
activity (public), with 60
minute gap of no activity
before or after
91% of users
only one active session
On average,
34.6 hours apart
First session,
1.6 minutes
41. DESCRIPTIVES OF ACTIONS
Actions in First Session
8% created a post in their
first session; of those,
averaged 1.5 times each
42. DESCRIPTIVES OF COMING BACK
9.1% came back
for another active
session
(~25% including
inactive)
On average, 35
hours later
43. IN THE FIRST SESSION
How often is user the target of social behavior?
23% received some response up to 2nd session
->3% if did not create a post, 37% if did create a post
Response *During* First Session
Response *in Between* 1st and 2nd Sessions
46. PREDICTORS OF COMING BACK
Social responses inspire people to return to
site, especially if occurring during first
session
[Chart group sizes: N = 2273, N = 179, N = 1942, N = 510]
Social responses to user: following, commenting on post, liking post, liking comment,
riffing
47. WHICH RESPONSE MATTERS
Logistic Regression, Any Response Predicts Coming Back
Predictor                          B      S.E.   Sig.
Created post first session         .71    .20    .000
Response1: during first session    1.12   .21    .000
Response2: after first session     .60    .17    .000
Logistic Regression, Which Predicts Coming Back
Predictor                     B      Sig.
Created post first session    .95    .000
Followed                      .92    .003
Commented On                  .38    ns
Post Liked                    .87    .02
Comment Liked                 -.09   ns
Messaged                      -.09   ns
Riffed                        .00    ns
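To read the B values in practical terms, a logistic regression coefficient can be converted to an odds ratio with exp(B). The sketch below applies this to the three coefficients of the "Any Response" model, assuming they are unstandardized log-odds coefficients as typically reported by statistics packages; the dictionary keys are paraphrased labels.

```python
import math

# Coefficients (B) reported for the "Any Response Predicts Coming Back" model
coefficients = {
    "Created post first session": 0.71,
    "Response during first session": 1.12,
    "Response after first session": 0.60,
}

# exp(B) is the multiplicative change in the odds of coming back
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

Under that assumption, receiving a response during the first session multiplies the odds of coming back by roughly 3, holding the other predictors constant.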
48. IDENTIFYING SUBGROUPS
Factor Analysis for Associated Behaviors:
Three types of usage: creating, socializing, browsing
Principal components, varimax rotation [meaning forced to be orthogonal]
Component Matrix
Behavior             Creators   Socialites   Browsers
% Variance           32%        12%          9%
Created post         .86        .17          .10
Added item to post   .83        .08          -.06
Searched             .81        .03          .17
Commented            .36        .64          .09
Liked post           .15        .58          .13
Messaged             -.09       .50          -.08
Viewed person        .22        .47          .48
Invited              .01        -.16         .63
Followed             -.03       .10          .37
[Loadings for Navigated to All, Liked comment, and Joined party are scattered in the extraction and cannot be assigned to columns. As extracted: Navigated to All .51, .37; remaining values .80, .06, .32, .53, .09, .68, .17]
Factors about equally predict if user comes back
Regression Coefficients
              Beta   t      Sig
Creating      .14    5.28   .000
Socializing   .07    2.61   .000
Browsing      .19    7.20   .000
Browsing stronger predictor of overall activity level
Regression Coefficients
              Beta   t      Sig
Creating      0.20   7.89   0.00
Socializing   0.17   6.58   0.00
Browsing      0.29   9.07   0.00
50. Case Scenario 2:
illustrating network analysis
Real world problem:
help people find and learn from others who share their interests
online
Argument want to make:
people do not just care about content around their interests, they
want to develop friendships with others who share their interests
Hypothesis:
The more tags two people have in common, the more they
will interact
Design implication:
Recommendations based on common overlapping tags
51. PROCESSING NETWORK DATA
Common format:
EntityA   EntityB   measure
EntityB   EntityF   measure
EntityB   EntityC   measure
EntityD   EntityG   measure
Units of analysis:
Edges
Nodes/vertices
Clusters, networks
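An edge list in this format maps directly onto the three units of analysis. The sketch below loads a small edge list and derives node degrees and clusters (connected components); the entity pairings and the numeric measures are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical edge list: (EntityA, EntityB, measure)
edges = [
    ("EntityA", "EntityB", 3.0),
    ("EntityB", "EntityF", 1.0),
    ("EntityB", "EntityC", 2.0),
    ("EntityD", "EntityG", 5.0),
]

adjacency = defaultdict(dict)
for a, b, weight in edges:
    adjacency[a][b] = weight
    adjacency[b][a] = weight   # treat edges as undirected in this sketch

# Node-level measure: number of distinct neighbors
degree = {node: len(nbrs) for node, nbrs in adjacency.items()}

def connected_components(adj):
    """Return the clusters of nodes reachable from one another."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(n for n in adj[node] if n not in component)
        seen |= component
        components.append(component)
    return components

clusters = connected_components(adjacency)
```

Dedicated tools such as NodeXL compute these and richer measures; the point here is only that edges, nodes, and clusters are all derivable from the same pair-wise table.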
52. OPERATIONALIZING CONNECTION
How would you
measure…
Similar interests?
Friendship?
Information flow?
Asymmetrical?
Often some form of
co-occurrence
http://www.touchgraph.com/assets/navigator/help2/module_3_3.html
53. NORMALIZATION
Adjusting values measured on different scales to a notionally common scale
Allows the comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences
[Diagram labels: Mary, Jim, Bob]
• Mary has 400 friends
• Jim has 200 friends
• Bob has 50 friends
• Mary and Jim have 100 overlapping friends
• Mary and Bob have 50 overlapping friends
• How similar are they?
• Who’s more similar?
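The Mary, Jim, and Bob example can be worked through numerically. The slide does not say which normalization to use, so the sketch below compares two common choices, the Jaccard index and the overlap coefficient; both are illustrative assumptions rather than the presenters' stated method.

```python
def jaccard(size_a, size_b, overlap):
    """Shared friends divided by the size of the union of both friend sets."""
    return overlap / (size_a + size_b - overlap)

def overlap_coefficient(size_a, size_b, overlap):
    """Shared friends divided by the size of the smaller friend set."""
    return overlap / min(size_a, size_b)

# Figures from the slide
friends = {"Mary": 400, "Jim": 200, "Bob": 50}
shared = {("Mary", "Jim"): 100, ("Mary", "Bob"): 50}

scores = {
    pair: {
        "raw": n,
        "jaccard": jaccard(friends[pair[0]], friends[pair[1]], n),
        "overlap": overlap_coefficient(friends[pair[0]], friends[pair[1]], n),
    }
    for pair, n in shared.items()
}
```

By raw count and by Jaccard (0.2 versus 0.125), Mary and Jim look more similar. By the overlap coefficient (0.5 versus 1.0), Mary and Bob do, because every one of Bob's friends is also Mary's. The answer to "who's more similar?" depends on the normalization chosen.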
54. CASE STUDY:
Real world problem:
Help people find people like them online
Argument want to make:
Interests you share and tag online are good indicator of
what you are like
Hypothesis:
People are more interested in receiving
recommendations of whom to befriend based on
overlapping tags than recommendations of random others in the system
57. NETWORK ANALYSIS (NODEXL)
Playing with data, learned:
Not all tagging is a good indicator of what you are like; the
tags on your posts are, whether or not you added them
The most common tags are not very meaningful, but unique
overlapping tags are: the importance of normalization
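One way to make common tags count less than unique ones is to weight each shared tag by its rarity. The sketch below uses an inverse-frequency weight, log(N / users with the tag), borrowed from information retrieval; this is a swapped-in illustration, not the normalization used in the study, and the users and tags are hypothetical.

```python
import math
from collections import Counter

def tag_weights(user_tags):
    """Weight each tag by log(N / number of users with the tag); rare tags score higher."""
    n_users = len(user_tags)
    usage = Counter(tag for tags in user_tags.values() for tag in set(tags))
    return {tag: math.log(n_users / count) for tag, count in usage.items()}

def weighted_overlap(tags_a, tags_b, weights):
    """Sum the weights of the tags two users share."""
    return sum(weights[t] for t in set(tags_a) & set(tags_b))

# Hypothetical users and post tags
user_tags = {
    "u1": {"funny", "electronic arts"},
    "u2": {"funny", "electronic arts"},
    "u3": {"funny", "cats"},
    "u4": {"funny"},
}
weights = tag_weights(user_tags)
score_12 = weighted_overlap(user_tags["u1"], user_tags["u2"], weights)
score_13 = weighted_overlap(user_tags["u1"], user_tags["u3"], weights)
```

A tag that everyone uses gets weight zero, so u1 and u3, who share only that tag, score 0, while u1 and u2 score higher because they also share a rarer tag.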
59. Douglas Wray - http://instagr.am/p/nm695/ @ThreeShipsMedia
60. Outline
What’s in social media? (donuts)
Extracting relationships and their context
Using context with higher-level analyses
61. Do people really talk about donuts?
1 week of tweets mentioning “donut” or
“doughnuts”
Week of Feb 6-12, 2012.
Matched ~180k messages
Train entity tagger for food and for restaurants
(no disambiguation or canonicalization)
Let’s see what we find…
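The keyword-matching step can be sketched with a regular expression. The pattern below accepts singular and plural forms of both spellings, which is an assumption about how the matching was done; the messages are hypothetical.

```python
import re

# Matches "donut(s)" or "doughnut(s)" as whole words, case-insensitively
DONUT_PATTERN = re.compile(r"\b(?:donut|doughnut)s?\b", re.IGNORECASE)

def mentions_donut(text):
    """Return True if the text mentions a donut under either spelling."""
    return DONUT_PATTERN.search(text) is not None

# Hypothetical messages
messages = [
    "Grabbing a donut before work",
    "Best DOUGHNUTS in town!",
    "Do not disturb",
    "#donuts for breakfast",
]
matched = [m for m in messages if mentions_donut(m)]
```

Keyword matching only selects candidate messages; deciding whether a match is really about food or a restaurant is left to the entity tagger.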
65. Beyond donuts…
Drugs, diseases, and contagions
Paul and Dredze 2011; Sadilek, Kautz and
Silenzio 2012.
Crises, disasters, and wars
Starbird et al. 2010; Al-Ani, Mark & Semaan
2010; Monroy-Hernandez et al. 2012
Public Sentiment
Political and election indices, market insights
Everyday life
67. Stage 1: Feature extraction
“I had fun hiking Tiger Mountain last weekend” – Alice said
on Monday, at 10am
Location: Tiger Mountain
Mood: Happy
Activity: Hiking
Name: Alice
Gender: Female
Post Time: Mon 10am
Activity Time: {Sat-Sun}
68. Stage 2(A) Build a hyper-graph
representation
“I had fun hiking Tiger Mountain last weekend” – Alice said on
Monday, at 10am
Location: Tiger Mountain
Gender: Female
Mood: Happy
Activity: Hiking
Name: Alice
Post Time: Mon 10am
Activity Time: {Sat-Sun}
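A hyper-graph of this kind can be represented by treating each message as one hyperedge over its extracted feature nodes. The sketch below is a minimal data-structure illustration; the second message, the identifiers, and the helper function are hypothetical and do not reflect the presenters' implementation.

```python
from collections import defaultdict
from itertools import combinations

# Each message becomes one hyperedge: the set of (feature, value) nodes it mentions.
hyperedges = {
    "msg1": frozenset({
        ("Location", "Tiger Mountain"),
        ("Mood", "Happy"),
        ("Activity", "Hiking"),
        ("Name", "Alice"),
        ("Gender", "Female"),
        ("Post Time", "Mon 10am"),
        ("Activity Time", "Sat-Sun"),
    }),
    # Hypothetical second message, added only to show how co-occurrence accumulates
    "msg2": frozenset({
        ("Location", "Tiger Mountain"),
        ("Activity", "Hiking"),
        ("Mood", "Tired"),
    }),
}

# Index: node -> hyperedges containing it
node_index = defaultdict(set)
for edge_id, nodes in hyperedges.items():
    for node in nodes:
        node_index[node].add(edge_id)

# Pairwise co-occurrence counts between nodes
cooccurrence = defaultdict(int)
for nodes in hyperedges.values():
    for a, b in combinations(sorted(nodes), 2):
        cooccurrence[(a, b)] += 1

def context_of(node_a, node_b):
    """Other nodes appearing in messages that mention both node_a and node_b."""
    shared_edges = node_index[node_a] & node_index[node_b]
    context = set()
    for edge_id in shared_edges:
        context |= hyperedges[edge_id]
    return context - {node_a, node_b}

hiking_context = context_of(("Location", "Tiger Mountain"), ("Activity", "Hiking"))
```

Keeping the full hyperedge, rather than only pair-wise counts, is what lets a relationship such as Hiking and Tiger Mountain be reported together with its context (mood, time, who said it).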
72. Demo to show example relationships &
contexts from several domains
73. Using context with high-level
analyses
Current
Clustering
Neighborhood discovery
Network centrality
Context of discussion provides
74. Demo to show example contexts for pseudocliques and network centrality
75. CONCLUSIONS
Define research questions early to help focus analysis
Many special considerations with social media data
Operationalizing social constructs
Attention to lookiloos, hyperactives, lurkers who bias outcomes
Different types of users = different behaviors
Different contexts meaningfully impact conversation
Processing data = simplification, getting meaningful
measures summarized at appropriate level of analysis
Format your data and plug it into an appropriate tool to enable
you to play with your data a *lot*
Important for debugging, finding patterns
Great tools available for leveraging social
media to describe, predict behaviors
Which brings us to so.cl.
In the past year, at FUSE Labs we’ve been working on an experimental application called so.cl to explore some of these issues;
How might we combine the capacity for searching the internet, with really lightweight sharing in the context of a social network,
To enable informal learning
Through so.cl, our goal is to help users discover new interests, connect with others around common interests, and be inspired to learn more
For example, on the right, you can see I have clicked on the tag electronic arts, an interest of mine,
and found this post from Nathan about a water clock – which is inspiring.
To give you an idea of the experience, here is what I see when I log in;
The core of the experience is through this activity feed in the middle;
where I see posts made by everyone in so.cl, or just the people I follow;
Or posts about things I find interesting;
These posts are made out of searching the internet, selecting images and web sites you want to share;
Built in a very lightweight way from my searching;
Big picture:
Learning about the real world through social media
Social media: largest fine-grained record of human activity ever