Tools and Tips for Analyzing Social Media Data

Analyzing Social Media Systems

CHI Course 2013
Shelly Farnham, Emre Kiciman
FUSE Labs & Internet Services Research Center, Microsoft Research

Agenda
 Introductions
 Overview
 Lesson scenarios with real data
 Usage analysis: Predictors of coming back
 Social network analysis: Finding who you like
 Content analysis: Relationships, cliques, and their conversation

 Focus on tools and tips, and on special considerations when examining social data

MAKING MEANING OUT OF THE MESS

SHELLY FARNHAM: INDUSTRY RESEARCH
 Specialize in social technologies
 Social networks, community, identity, mobile social

 Early-stage innovation
 Extremely rapid R&D cycle: study, brainstorm, design, prototype, deploy, evaluate (repeat)
 Convergent evaluation methodologies: usage analysis, interviews, questionnaires

 Career
 PhD in Social Psychology from UW
 7 years Microsoft Research: Virtual Worlds, Social Computing, Community Technologies
 4 years startup world: Waggle Labs (consulting), Pathable
 2 years Yahoo!
 FUSE Labs, Microsoft Research

Personal Map

EMRE KICIMAN
 Specialize in social data analytics
 Social media, social networks, search

 Methods
 Machine learning
 Information extraction, entity recognition from social data
 Prototyping

 Career
 Ph.D. and M.S. in computer science from Stanford University
 B.S. in Electrical Engineering and Computer
 Currently at Internet Services Research Center, Microsoft Research

ANALYSIS THROUGHOUT R&D CYCLE

[Figure: Importance of information in selecting chat partner. Scale 0 to 7; labels: Rank, Rating, Similarity, Interacts with friends, Ratings by friends]

USAGE ANALYSIS
Do social responses matter in driving engagement?

SOCIAL MEDIA ANALYSIS
 Common types
 Usage analysis: behaviors, interactions
 Network analysis: patterns in networks (sets of pair-wise connections)
 Content analysis: semantics, sentiment of conversational content

 Common steps
 Step 1. Getting started: defining questions
 Step 2. Processing data: extraction, cleaning, summarization
 Step 3. High level analysis: inference

CASE STUDY: USAGE ANALYSIS
So.cl usage analysis serves as the case study scenario; the lessons learned apply to other forms of social media and other forms of analysis.

So.cl is an experimental web site that allows people to connect around their interests by integrating search tools with social networking.
How important are social interactions in encouraging users to become engaged with an interest network?

SO.CL

Reimagining search as social from the ground up

search + sharing + networking = informal discovery and learning

History:
Oct 2011: Pre-release deployment study
Dec 2011: Private, invitation-only beta
May 2012: Removed invitation restrictions
Nov 2012: Over 300K registered users, 13K active per month

Try it now! http://www.so.cl

INTEREST NETWORK GOALS
 Find others around common interests
 Be inspired by new interests
 Learn from each other through these shared interests

HOW IT WORKS

[Screenshot labels: Search & Post, Feed Filters, Feed, People]
Try it now! http://www.so.cl – use facsumm tag

POST BUILDING

[Screenshot labels: Search (Bing), Filter Results, Post Builder, Results]
Experience:
Step 1: Perform search
Step 2: Click on items in results to add to post
Step 3: Add a message
Step 4: Tag


DEFINING RESEARCH QUESTION
 The amount of data is overwhelming: the more defined your question, the easier the analysis
 What real-world problem are you trying to explore?
Avoid the pitfall of technology for technology’s sake

 What argument do you want to be able to make?
 State your problem as a hypothesis

CASE SCENARIO:
 Real world problem:
Help people learn online

 Argument want to make:
People are more motivated to explore new interests via
social media than via search alone because of the
opportunity to connect with others.

 Hypothesis:
If people receive a social response when they first join
So.cl they are more likely to become engaged.

OPERATIONALIZING CONSTRUCTS
 Operationalize = to make measurable
 Always review related literature for best practices
 How do you measure…
Friendship? Similarity? Interest? Trend?
Conversation? Community? Engagement?

 Can you operationalize with existing data, or do you need to generate more?

CASE SCENARIO:
 Hypothesis:
If people receive a social response when they first join So.cl they are more
likely to become engaged.

 Measuring social/behavioral constructs:
 When first join
First session = time of first action to time of last action prior to an hour of inactivity

 Social responses
Follows user, likes user’s post(s), comments on user’s post(s)

 Engagement = coming back
A second session = any action occurs 60 minutes or more after first session

 Restating hypothesis:
 If people receive follows, likes, and comments in their first session, they are more likely to come back for a second session
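The session definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the timestamps are hypothetical, the function names are made up for this sketch, and a gap of exactly 60 minutes is treated as starting a new session, following the "60 minutes or more" wording.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=60)  # inactivity that ends a session


def split_sessions(timestamps, gap=SESSION_GAP):
    """Group one user's action timestamps into sessions.

    A new session starts when an action occurs `gap` or more after
    the previous action.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # first action, or the gap was reached
    return sessions


def came_back(timestamps):
    """Engagement as operationalized above: a second session exists."""
    return len(split_sessions(timestamps)) >= 2


def got_response_in_first_session(action_times, response_times):
    """True if any follow, like, or comment arrived during the first session."""
    sessions = split_sessions(action_times)
    if not sessions:
        return False
    start, end = sessions[0][0], sessions[0][-1]
    return any(start <= r <= end for r in response_times)


# Hypothetical timestamps for one user (not real So.cl data).
t0 = datetime(2012, 10, 1, 9, 0)
actions = [t0, t0 + timedelta(minutes=2), t0 + timedelta(hours=30)]
responses = [t0 + timedelta(minutes=1)]
```

With one boolean per user for "response in first session" and one for "came back", the restated hypothesis becomes a comparison of return rates between the two groups.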

COLLECTING DATA
 Existing tools
 APIs (Twitter, Foursquare, Yelp)
 Web analytics (Google Analytics)

 Write crawlers
 Writing your own instrumentation system
e.g. log each call to server, query parameters

RAW INSTRUMENTATION
 Tendency to collect everything
 Incomprehensible, incoherent mess
 Prone to bugs

INSTRUMENTATION
 Convert to human readable

Always look at your raw data: play with it, ask yourself if it makes sense, and test!

COMMON INSTRUMENTATION SCHEMA
 Users table
 One row per user

 Actions table
 One row per meaningful action
 Filter out non-meaningful, non-user generated actions

 Content table(s):
 One row per content item, with text, URL, etc. of that item
e.g. messages, pictures shared, likes, tags

 Across tables, with social systems
 Instrument the social target (PersonA responds to PersonB)
 Instrument the parent item (e.g., Comment A, Comment B, and Comment C are responses to parent item PostB)
 In other words, instrument who is interacting with whom, and in what context
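One way to picture the three tables is as Python dataclasses. The field names below are assumptions chosen for illustration, not the actual So.cl schema. The point is that each action row carries both the social target and the parent item.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class User:                 # users table: one row per user
    user_id: str
    joined_at: datetime


@dataclass
class ContentItem:          # content table: one row per content item
    item_id: str
    author_id: str
    kind: str               # e.g. "post", "comment"
    text: str = ""
    parent_item_id: Optional[str] = None   # e.g. the post a comment answers


@dataclass
class Action:               # actions table: one row per meaningful action
    actor_id: str           # who acted
    action_type: str        # e.g. "post", "comment", "like", "follow"
    timestamp: datetime
    target_user_id: Optional[str] = None   # social target (whom)
    item_id: Optional[str] = None          # content item created or acted on
    parent_item_id: Optional[str] = None   # context (responding to what)


def responses_to(user_id, actions):
    """Actions by other people that are directed at `user_id`."""
    return [a for a in actions
            if a.target_user_id == user_id and a.actor_id != user_id]


# Hypothetical log: PersonB posts, PersonA comments on that post.
now = datetime(2012, 10, 1, 9, 0)
log = [
    Action("personB", "post", now, item_id="postB"),
    Action("personA", "comment", now, target_user_id="personB",
           item_id="commentA", parent_item_id="postB"),
]
```

Because the target and parent are logged with each action, "who responded to whom" is a simple filter rather than a join reconstructed after the fact.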

REDUCING LARGE DATA
 Filters
 Time span, type of person, type of actions

 Sampling
 Random selection
 Snowball sampling, to get a complete picture of a person’s social experience

 Consider your research questions and how you want to generalize
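The two sampling strategies can be contrasted in a short sketch. The network below is hypothetical, and "snowball" is interpreted here as starting from seed users and adding the people they interact with, wave by wave, which is one common reading of the term.

```python
import random


def random_sample(user_ids, k, seed=0):
    """Simple random selection of k users (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(sorted(user_ids), k)


def snowball_sample(seeds, neighbors, waves=1):
    """Start from seed users and add their connections, wave by wave.

    `neighbors` maps a user id to the set of users they interact with.
    """
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(waves):
        next_frontier = set()
        for user in frontier:
            next_frontier |= neighbors.get(user, set()) - sampled
        sampled |= next_frontier
        frontier = next_frontier
    return sampled


# Hypothetical interaction network.
network = {
    "ann": {"ben", "cat"},
    "ben": {"ann", "dan"},
    "cat": {"ann"},
    "dan": {"ben", "eve"},
    "eve": {"dan"},
}
```

A random sample generalizes to the user population but cuts through people's social circles; a snowball sample keeps each seed's surroundings intact at the cost of representativeness.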

FILTERING & SAMPLING
 Filtered out administrators/community managers
 New users only
 Date range: Sept 28 to Oct 13
 100% sample for that time span: 2462 people

SYSTEMATIC BIASES IN SOCIAL SYSTEMS

 If you want to understand your “typical” users, keep in mind that you will generally find:
 A large percentage never become active or return; these “lookiloos” can unduly bias averages
Common reporting format:
X% performed Y behavior; of those, averaged Z times each
5% commented on a post in their first session, averaging 5 times each
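A minimal sketch of the reporting format, using made-up counts: the percentage of users who did the behavior at all is reported separately from the average among those who did, so that the many users with zero actions do not drag the average down.

```python
def summarize_behavior(counts_per_user):
    """Return (percent who did it, mean among those who did, naive mean).

    `counts_per_user` maps every sampled user, including those who
    never acted, to how many times they performed the behavior.
    """
    total = len(counts_per_user)
    doers = [n for n in counts_per_user.values() if n > 0]
    percent = 100.0 * len(doers) / total if total else 0.0
    mean_among_doers = sum(doers) / len(doers) if doers else 0.0
    mean_over_everyone = sum(counts_per_user.values()) / total if total else 0.0
    return percent, mean_among_doers, mean_over_everyone


# Hypothetical comment counts for four users.
comments = {"u1": 0, "u2": 0, "u3": 0, "u4": 10}
percent, mean_among_doers, mean_over_everyone = summarize_behavior(comments)
```

Here the naive mean over everyone describes nobody in the sample, while the two-part report describes both groups.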

OUTLIERS
 Filtered out 13 outliers with z > 4 in number of actions (among people who did more than sign in)
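A z-score filter of this kind might look like the sketch below. It is not the authors' procedure: the data are invented, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standard scores using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def drop_outliers(action_counts, threshold=4.0):
    """Keep users whose number of actions has z <= threshold."""
    users = list(action_counts)
    zs = z_scores([action_counts[u] for u in users])
    return {u: action_counts[u] for u, z in zip(users, zs) if z <= threshold}


# Hypothetical action counts: 30 ordinary users and one hyper-active user.
counts = {f"user{i}": 5 for i in range(30)}
counts["heavy"] = 500
kept = drop_outliers(counts)
```

Only users who did more than sign in would be passed to the filter, matching the condition stated above.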

SYSTEMATIC BIASES IN SOCIAL SYSTEMS
 A small percentage are “hyper-active” users (avid users, spammers, trolls, administrators) who can unduly bias averages
 Remove outliers

 A substantial percentage are consumers but not producers (“lurkers”), and there is often no signal for lurkers
 Consult literature and related work for estimates; on So.cl, about 75% are lurkers
 Custom instrumentation, logging sign-ins
 Web analytics for clicks

PLAYING WITH YOUR DATA
 Very important to spend time examining data
 Descriptives, frequencies, correlations, graphs
 Use a tool that easily generates graphs and correlations
 Does it make sense? If not, really chase it down. Often it is a bug or a misinterpretation of the data.

AGGREGATIONS
 Aggregation: merging down for summarization
 What is your level of analysis?
 Person, group, network
 Content types

 If person is the unit of analysis, aggregate measures to the person level
 E.g. in SPSS: one line per person
 Very important to have the appropriate unit of analysis, to avoid bias in statistics
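Aggregating to the person level can be sketched as collapsing an action log into one row per user. The log format, a list of (user, action type) pairs, is an assumption made for this example.

```python
from collections import Counter, defaultdict


def aggregate_by_person(actions):
    """Collapse an action log to one row per person.

    `actions` is an iterable of (user_id, action_type) pairs.
    Returns {user_id: Counter mapping action type to count}.
    """
    rows = defaultdict(Counter)
    for user_id, action_type in actions:
        rows[user_id][action_type] += 1
    return dict(rows)


# Hypothetical action log.
log = [
    ("u1", "post"), ("u1", "like"), ("u1", "like"),
    ("u2", "search"),
]
rows = aggregate_by_person(log)
```

Statistics computed over the raw action rows would weight "u1" three times as heavily as "u2"; computed over `rows`, each person counts once.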

DESCRIPTIVES OF ACTIVE SESSIONS
 Active session = a period of (public) activity, with a 60-minute gap of no activity before or after
 91% of users had only one active session
 On average, 34.6 hours apart
 First session: 1.6 minutes

DESCRIPTIVES OF ACTIONS
[Figure: Actions in First Session]

8% created a post in their first session; of those, averaged 1.5 times each

DESCRIPTIVES OF COMING BACK
 9.1% came back for another active session (~25% including inactive)
 On average, 35 hours later

IN THE FIRST SESSION
 How often is the user the target of social behavior?
 23% received some response up to 2nd session
-> 3% if did not create a post, 37% if did create a post

[Figure panels: Response *During* First Session; Response *in Between* 1st and 2nd Sessions]

PRELIMINARY CORRELATIONS
 Always ask: does this pattern make sense?

PREDICTORS OF COMING BACK
 Social responses inspire people to return to the site, especially if occurring during the first session

[Figure: group sizes shown: N = 2273, N = 179, N = 1942, N = 510]

Social responses to user: following, commenting on post, liking post, liking comment, riffing

WHICH RESPONSE MATTERS
Logistic Regression, Any Response Predicts Coming Back
                                   B      S.E.   Sig.
Created post first session         .71    .20    .000
Response1: during first session    1.12   .21    .000
Response2: after first session     .60    .17    .000

Logistic Regression, Which Predicts Coming Back
                                   B      Sig.
Created post first session         .95    .000
Followed                           .92    .003
Commented On                       .38    ns
Post Liked                         .87    .02
Comment Liked                      -.09   ns
Messaged                           -.09   ns
Riffed                             .00    ns
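To read the B values reported above, it can help to convert them to odds ratios. The sketch below assumes B is the usual logistic regression coefficient (a change in log-odds with the other predictors held constant), so that exp(B) is the multiplicative change in the odds of coming back. Only the three B values from the "Any Response" table are used.

```python
import math

# B coefficients copied from the "Any Response Predicts Coming Back" table.
coefficients = {
    "Created post first session": 0.71,
    "Response during first session": 1.12,
    "Response after first session": 0.60,
}


def odds_ratio(b):
    """exp(B): factor by which the odds change when the predictor is present."""
    return math.exp(b)


odds_ratios = {name: round(odds_ratio(b), 2) for name, b in coefficients.items()}
```

Under that assumption, a response during the first session corresponds to roughly three times the odds of returning, compared with roughly two times for creating a post or for a response arriving after the first session.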

IDENTIFYING SUBGROUPS
Factor Analysis for Associated Behaviors:
Three types of usage – creating, socializing, browsing

Component Matrix
Principal components, varimax rotation [meaning components are forced to be orthogonal]

                      Creators   Socialites   Browsers
% Variance:           32%        12%          9%
Created post          .86        .17          .10
Added item to post    .83        .08          -.06
Searched              .81        .03          .17
Commented             .36        .64          .09
Liked post            .15        .58          .13
Messaged              -.09       .50          -.08
Viewed person         .22        .47          .48
Invited               .01        -.16         .63
Followed              -.03       .10          .37

[Rows not recoverable from the slide layout: Navigated to All (.51, .37, one value missing), Liked comment, Joined party. Unassigned loadings: .80, .06, .32, .53, .09, .68, .17]

Factors about equally predict if user comes back
Regression Coefficients
               Beta   t      Sig
Creating       .14    5.28   .000
Socializing    .07    2.61   .000
Browsing       .19    7.20   .000

Browsing stronger predictor of overall activity level
Regression Coefficients
               Beta   t      Sig
Creating       0.20   7.89   0.00
Socializing    0.17   6.58   0.00
Browsing       0.29   9.07   0.00

Case Scenario 2:
illustrating network analysis
Real world problem:
Help people find and learn from others who share their interests online

Argument want to make:
People do not just care about content around their interests; they want to develop friendships with others who share their interests

Hypothesis:
The more tags people have in common, the more they will interact with each other

Design implication:
Recommendations based on overlapping tags

PROCESSING NETWORK DATA
 Common format (one row per edge):
EntityA    EntityB    measure
EntityB    EntityC    measure
EntityB    EntityD    measure
EntityF    EntityG    measure

 Units of analysis:
Edges
Nodes/vertices
Clusters, networks
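The edge-list format maps directly onto the three units of analysis. The sketch below uses the entity names from the slide with invented measure values, and treats edges as undirected, which is an assumption: edges give the adjacency map, nodes give degree, and connected components stand in for clusters.

```python
from collections import defaultdict

# Edge list: (entity_a, entity_b, measure). Measure values are hypothetical.
edges = [
    ("EntityA", "EntityB", 3.0),
    ("EntityB", "EntityC", 1.0),
    ("EntityB", "EntityD", 2.0),
    ("EntityF", "EntityG", 5.0),
]


def build_adjacency(edge_list):
    """Undirected weighted adjacency map: node -> {neighbor: measure}."""
    adjacency = defaultdict(dict)
    for a, b, measure in edge_list:
        adjacency[a][b] = measure
        adjacency[b][a] = measure
    return dict(adjacency)


def degree(adjacency):
    """Node-level measure: number of distinct neighbors."""
    return {node: len(nbrs) for node, nbrs in adjacency.items()}


def connected_components(adjacency):
    """Cluster-level view: groups of nodes reachable from one another."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node].keys() - component)
        seen |= component
        components.append(component)
    return components


adjacency = build_adjacency(edges)
```

Tools such as NodeXL consume the same kind of edge list, so this format is a reasonable target for the data-processing step.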

OPERATIONALIZING CONNECTION
 How would you measure…
 Similar interests? Friendship? Information flow?
 Asymmetrical?

 Often some form of co-occurrence

http://www.touchgraph.com/assets/navigator/help2/module_3_3.html

NORMALIZATION
 Adjusting values measured on different scales to a notionally common scale
 Allows the comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences

Example: Mary, Jim, Bob
• Mary has 400 friends
• Jim has 200 friends
• Bob has 50 friends
• Mary and Jim have 100 overlapping friends
• Mary and Bob have 50 overlapping friends
• How similar are they?
• Who’s more similar?
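The Mary, Jim, and Bob numbers can be worked through with two common normalizations. Neither is named on the slide; Jaccard similarity and the overlap coefficient are used here as illustrative choices.

```python
def jaccard(size_a, size_b, overlap):
    """Overlap divided by the size of the union of the two friend lists."""
    return overlap / (size_a + size_b - overlap)


def overlap_coefficient(size_a, size_b, overlap):
    """Overlap divided by the size of the smaller friend list."""
    return overlap / min(size_a, size_b)


# Friend counts and overlaps from the example above.
mary_jim_jaccard = jaccard(400, 200, 100)                # 100 / 500 = 0.2
mary_bob_jaccard = jaccard(400, 50, 50)                  # 50 / 400 = 0.125
mary_jim_overlap = overlap_coefficient(400, 200, 100)    # 100 / 200 = 0.5
mary_bob_overlap = overlap_coefficient(400, 50, 50)      # 50 / 50 = 1.0
```

Raw overlap and Jaccard both say Jim is closer to Mary, while the overlap coefficient says Bob is, because every one of Bob's friends is also Mary's. Which answer is right depends on what "similar" is supposed to mean for the research question.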

CASE STUDY:

 Real world problem:
Help people find people like them online

 Argument want to make:
Interests you share and tag online are good indicator of
what you are like

 Hypothesis:
People are more interested in receiving recommendations of whom to befriend based on overlapping tags than recommendations of random others in the system

CONNECTION VIA OVERLAPPING TAGS

NETWORK ANALYSIS (NODEXL)
 Playing with data, learned:
 Not all tagging is a good indicator of what you are like; the tags on your posts are, whether or not you add them
 The most common tags are not very meaningful; unique overlapping tags are -> importance of normalization
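One possible way to down-weight common tags is an inverse-frequency weight, so that a shared rare tag counts for more than a shared popular one. This is an assumption for illustration, not necessarily the normalization used in the study, and the users and tags are invented.

```python
import math
from collections import Counter


def tag_weights(tags_by_user):
    """Inverse-frequency weight per tag: log(users / users with the tag)."""
    n_users = len(tags_by_user)
    users_per_tag = Counter(
        tag for tags in tags_by_user.values() for tag in set(tags)
    )
    return {tag: math.log(n_users / count)
            for tag, count in users_per_tag.items()}


def weighted_overlap(user_a, user_b, tags_by_user, weights):
    """Sum of weights over the tags two users share."""
    shared = set(tags_by_user[user_a]) & set(tags_by_user[user_b])
    return sum(weights[tag] for tag in shared)


# Hypothetical users and tags.
tags_by_user = {
    "u1": {"funny", "letterpress"},
    "u2": {"funny", "letterpress"},
    "u3": {"funny", "cats"},
    "u4": {"funny"},
}
weights = tag_weights(tags_by_user)
```

A tag used by everyone gets weight zero, so "u1" and "u3", who share only "funny", score no overlap, while "u1" and "u2" score on the rarer "letterpress".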

Douglas Wray - http://instagr.am/p/nm695/ @ThreeShipsMedia

Outline
 What’s in social media? (donuts)
 Extracting relationships and their context
 Using context with higher-level analyses

Do people really talk about donuts?
 1 week of tweets mentioning “donut” or “doughnuts”
 Week of Feb 6-12, 2012.
 Matched ~180k messages

 Train entity tagger for food and for restaurants
 (no disambiguation or canonicalization)

 Let’s see what we find…

What do people drink with donuts?

What kind of donuts do people eat?

Beyond donuts…
 Drugs, diseases, and contagions
 Paul and Dredze 2011; Sadilek, Kautz and Silenzio 2012.

 Crises, disasters, and wars
 Starbird et al. 2010; Al-Ani, Mark & Semaan 2010; Monroy-Hernandez et al. 2012

 Public Sentiment
 Political and election indices, market insights

 Everyday life

Stage 1: Feature extraction
“I had fun hiking Tiger Mountain last weekend” – Alice said on Monday, at 10am

Location         Tiger Mountain
Mood             Happy
Activity         Hiking
Name             Alice
Gender           Female
Post Time        Mon 10am
Activity Time    {Sat-Sun}

Stage 2(A) Build a hyper-graph representation
“I had fun hiking Tiger Mountain last weekend” – Alice said on Monday, at 10am

[Figure: hyper-graph. The message connects the nodes Location: Tiger Mountain, Gender: Female, Mood: Happy, Activity: Hiking, Name: Alice, Post Time: Mon 10am, Activity Time: {Sat-Sun}. Additional nodes shown: Gender: Male, Name: Bob, Post Time: Fri 3pm]
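A hyper-graph of this kind can be sketched with plain dictionaries: each message is one hyperedge connecting all the (domain, value) features extracted from it. The first message follows the slide. For the second, only Bob, Male, and Fri 3pm appear on the slide; giving it the same location and activity is an assumption made for the example.

```python
from collections import defaultdict

# One hyperedge per message; nodes are (domain, value) feature pairs.
messages = {
    "msg1": {("Location", "Tiger Mountain"), ("Mood", "Happy"),
             ("Activity", "Hiking"), ("Name", "Alice"),
             ("Gender", "Female"), ("Post Time", "Mon 10am"),
             ("Activity Time", "Sat-Sun")},
    # Location and Activity for msg2 are assumed, not taken from the slide.
    "msg2": {("Location", "Tiger Mountain"), ("Activity", "Hiking"),
             ("Name", "Bob"), ("Gender", "Male"),
             ("Post Time", "Fri 3pm")},
}


def build_index(hyperedges):
    """Map each feature node to the set of hyperedges that contain it."""
    index = defaultdict(set)
    for edge_id, features in hyperedges.items():
        for feature in features:
            index[feature].add(edge_id)
    return dict(index)


def edges_with(index, *features):
    """Hyperedges (messages) that contain all of the given features."""
    sets = [index.get(f, set()) for f in features]
    return set.intersection(*sets) if sets else set()


index = build_index(messages)
hiking_at_tiger = edges_with(
    index, ("Location", "Tiger Mountain"), ("Activity", "Hiking")
)
```

Unlike a pairwise edge list, the hyperedge keeps every feature of a message together, so any combination of domains can be queried later.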

Stage 2(B) Projection
• Reduce graph to key domains
• Statistical distributions of other domains provide key context

[Figure: projected graph. Location: Tiger Mountain and Activity: Hiking, with Gender: Male and Gender: Female as context]
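Projection can be sketched as grouping feature records by the key domains and keeping a count distribution of one context domain per group. The records below are hypothetical, and the function name and signature are invented for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical feature records, one dict per message.
records = [
    {"Location": "Tiger Mountain", "Activity": "Hiking", "Gender": "Female"},
    {"Location": "Tiger Mountain", "Activity": "Hiking", "Gender": "Male"},
    {"Location": "Tiger Mountain", "Activity": "Hiking", "Gender": "Female"},
]


def project(records, key_domains, context_domain):
    """Reduce records to the key domains.

    For each combination of key-domain values, keep the distribution
    of the context domain over the messages that produced it.
    """
    projected = defaultdict(Counter)
    for record in records:
        if all(d in record for d in key_domains) and context_domain in record:
            key = tuple(record[d] for d in key_domains)
            projected[key][record[context_domain]] += 1
    return dict(projected)


result = project(records, ("Location", "Activity"), "Gender")
```

The projected relationship (Tiger Mountain, Hiking) stays a single edge, and the gender counts attached to it are the context that later analyses can draw on.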

Demo to show example relationships & contexts from several domains

Using context with high-level analyses
Current
 Clustering
 Neighborhood discovery
 Network centrality
 Context of discussion provides

Demo to show example contexts for pseudocliques and network centrality

CONCLUSIONS
 Define research questions early to help focus analysis
 Many special considerations with social media data
 Operationalizing social constructs
 Attention to lookiloos, hyperactives, and lurkers, who bias outcomes
 Different types of users = different behaviors
 Different contexts meaningfully impact conversation

 Processing data = simplification, getting meaningful measures summarized at the appropriate level of analysis
 Format your data and plug it into an appropriate tool so you can play with your data a *lot*
 Important for debugging, finding patterns

 Great tools available for leveraging social media to describe and predict behaviors

CONTACT INFO
Shelly Farnham, Researcher (@shellyshelly; shellyfa@microsoft.com)
Emre Kiciman, Researcher (emrek@microsoft.com)

QUESTIONS
