Mining and analyzing social media hicss 45 tutorial – part 1

Mining and Analyzing Social Media
HICSS 45 Tutorial – Part 1
Dave King
January 4, 2012

Agenda: This is how the slides are
organized
• Part 1
– Introduction – Bio, Resources, Social Media
– Data Mining – Processes and Example
– Text Mining – General Processes and Example
– Predicting the Future – The Portmanteaus
• Part 2
– Sentiment Analysis
– Social Network Analysis - Introduction

2
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL

Biography: Dave King

• Currently, EVP of Product Development
and Management at JDA Software
• 30 years in enterprise package
software business
• 15 years as university professor
• 14 years as Co-Chair of the Internet &
Digital Economy Track (HICSS)
• Long time interest in various aspects of
E-Commerce & Business Intelligence
• Tutorial topic primarily reflects a
personal interest and tangentially a
job(s) related interest.

Personal Experiences with
Analytics
• Taught applied statistics, math modeling & mathematical sociology
• In software R&D for 30 years
– Optimization in the 80s
– Natural Language Frontends
• NLI Query & CMU Robotics Lab
– EIS Competitive Analysis
• Dow Jones and Reuters
• Verity Topics
• NewsAlert
– InXight’s Hyperbolic Tree
– Supply Chain Analytics
• In the case of text analysis and its practical application, audiences
have often been small, bewildered, and fleeting


Mining and Analytics Resources


Mining and Analytics Resources:
Web Sites, Online Books & Tutorials
• DM/Blog -- abbottanalytics.blogspot.com
• DM/Blog – blog.data-miners.com
• DM/Blog -- bx.businessweek.com/data-mining/blogs
• DM/Blog -- bytemining.com
• DM/Blog – data-mining.alltop.com
• DM/Blog -- dataminingblog.com
• DM/Blog – dataminingdownunder.com
• DM/Blog -- datamining.typepad.com
• DM/Blog -- datawrangling.com
• DM/Blog -- timmanns.blogspot.com
• DM/General -- kdnuggets.com
• DM/General -- mydatamine.com
• DM/General -- the-data-mine.com
• DM/Online Book -- chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm
• DM/Tutorial -- autonlab.org/tutorials/

• TA/General -- social.textanalyticsnews.com
• TA/General -- textanalysis.info
• TM/Blog -- blogs.sas.com/text-mining
• TM/Blog -- lingpipe-blog.com
• TM/Blog -- texttechnologies.com
• TM & TA/Blog -- informationweek.com/authors/showAuthor.jhtml?authorID=1331
• TA Tutorial -- slideshare.net/SethGrimes/text-analytics-overview-2011
• TM & DM/Online Book -- statsoft.com/textbook/text-mining/
• TM & DM/Tutorial -- alias-i.com/lingpipe/demos/tutorial/db/read-me.html
• TM Tutorial -- scienceforseo.com/tutorials/text-mining-tutorial
• TM/Wiki -- textanalytics.wikidot.com
• SNA/Blog – iq.harvard.edu/blog/netgov/2011/10/
• SNA/Blog – thenetworkthinkers.com
• SNA/Blog – blog.echen.me/tag/social-network-analysis/
• SNA/Blog – lithosphere.lithium.com/t5/user/viewprofilepage/user-id/151
• SNA/Tutorial -- cs.stanford.edu/people/jure/icml09networks/

• DA/Blog – dataists.com
• DA/Blog – drewconway.com
• Visualization/Blog – abeautifulwww.com/
• Visualization/Blog – benfry.com/writing/
• Visualization/Blog -- blog.blprnt.com
• Visualization/Blog – chrisharrison.net/index.php/visualization.com
• Visualization/Blog – datavisualization.ch/
• Visualization/Blog – eagereyes.com
• Visualization/Blog – informationandvisualization.de/
• Visualization/Blog – infosthetics.com
• Visualization/Blog – junkcharts.typepad.com/junk_charts/
• Visualization/Blog – neoformix.com
• Visualization/Blog – perpetualedge.com/blog
• Visualization/Blog – processing.org
• Visualization/Blog – visualcomplexity.com
• Visualization/Blog – well-formed-data.net/

11

Social Media Defined
Marta Kagan


Social Media Defined: …Sort of …

13

Social Media Defined:
Actually, it’s 33 Definitions
1. Media for social interaction, using highly accessible and scalable communication techniques.
2. Various user-driven (inbound marketing) channels (e.g., Facebook, Twitter, blogs, YouTube).
3. Most transparent, engaging and interactive form of public relations
4. What we do and say together, worldwide, to communicate in all directions at any time, by any possible (digital) means.
5. New marketing tool that allows you to get to know your customers and prospects in ways that were previously not possible.
6. Platforms that enable the interactive web by engaging users to participate in, comment on and create content as means of communicating
7. Consists of any online platform or channel for user generated content.
8. Digital content and interaction that is created by and between people.
9. Shift in how we get our information. Social media allows us to network, to find people with like interests, and to meet people who can become friends or customers.
10. Platforms for interaction and relationships, not content and ads.
11. Online platforms and locations that provide a way for people to participate in these conversations.
12. People’s conversations and actions online that can be mined by advertisers for insights but not coerced to pass along marketing messages.
13. Tools, services, and communication facilitating connection between peers with common interests.
14. Online technologies and practices that people use to share content, opinions, insights, experiences, perspectives, and media themselves.
15. Ever-growing and evolving collection of online tools and toys, platforms and applications that enable all of us to interact with and share information. Increasingly, it’s both the connective tissue and neural net of the Web.
16. Reflection of conversations happening every day, whether at the supermarket, a bar, the train, the watercooler or the playground.
17. Online text, pictures, videos and links, shared amongst people and organizations.
18. Not one thing. It’s five distinct things:
19. Digital, content-based communications based on the interactions enabled by a plethora of web technologies
20. Collection of online platforms and tools that people use to share content, profiles, opinions, insights, experiences, perspectives and media itself, facilitating conversations and interactions online between groups of people.
21. Platform/tools.
22. Act of connecting on social media platforms.
23. How businesses join the conversation in an authentic and transparent way to build relationships.
24. The notion that social media is about the technology that facilitates individuals and groups of people to connect and interact, create and share.
25. Any of a number of individual web-based applications aggregating users who are able to conduct one-to-one and one-to-many two-way conversations.
26. Media channel that relies on listening and conversation, as opposed to a monologue, to get your point across, make a connection and build a relationship.
27. Social media is all about leveraging online tools that promote sharing and conversations, which ultimately lead to engagement with current and future customers and influencers in your target market.
28. Social media: Evolution, Revolution and Contribution - by the ability of everybody to share and contribute as a publisher
29. Social media is communication channels or tools used to store, aggregate, share, discuss or deliver information within online communities.
30. Social Media is simply another arrow to be shot in a company’s marketing quiver.
31. Social media platforms make it easier to share information, usually online.
32. Any object or tool that connects people in dialogue or interaction - in person, in print, or online.
33. Wild, Wild West of Marketing, with brands, businesses, and organizations jostling with individuals to make news, friends, connections and build communities in the virtual space.
14

Social Media Defined: If a Picture isn’t
worth a 1000 words, then …

15

Social Media Defined

Online technologies and practices
for social interaction

enabling the sharing of opinions, insights,
experiences, perspectives and media itself

16

Social Media Defined: Categories

17

Social Media Defined:
Unanimous Agreement
Marta Kagan

18

Social Media is Huge: Users
Marta Kagan

750 Million: Facebook

200 Million: Twitter
100 Million: LinkedIn
19

Social Media is Huge!
Marta Kagan

If Facebook
were a country,
it would be the
3rd largest in
the world

20

Social Media Data:
Research Opportunity

“Every day, Twitter
generates more
social network
data than the
entire field of SNA
possessed 10
years ago.”

21

Social Media is Huge:
Usage and Content

Name (Symbol)    10^N   Name (Symbol)    Value
kilobyte (kB)    3      kibibyte (KiB)   2^10 = 1.024 × 10^3
megabyte (MB)    6      mebibyte (MiB)   2^20 ≈ 1.049 × 10^6
gigabyte (GB)    9      gibibyte (GiB)   2^30 ≈ 1.074 × 10^9
terabyte (TB)    12     tebibyte (TiB)   2^40 ≈ 1.100 × 10^12
petabyte (PB)    15     pebibyte (PiB)   2^50 ≈ 1.126 × 10^15
exabyte (EB)     18     exbibyte (EiB)   2^60 ≈ 1.153 × 10^18
zettabyte (ZB)   21     zebibyte (ZiB)   2^70 ≈ 1.181 × 10^21
yottabyte (YB)   24     yobibyte (YiB)   2^80 ≈ 1.209 × 10^24
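
The unit table can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes the usual convention that decimal prefixes step by 10^3 and binary prefixes by 2^10, and the function name prefix_table is made up for this example.

```python
# Decimal (SI) prefixes step by powers of 1000; binary (IEC) prefixes by powers of 1024.
PREFIXES = [
    ("kilo", "kibi"), ("mega", "mebi"), ("giga", "gibi"), ("tera", "tebi"),
    ("peta", "pebi"), ("exa", "exbi"), ("zetta", "zebi"), ("yotta", "yobi"),
]


def prefix_table():
    """Return (si_name, decimal_exponent, iec_name, binary_exponent, ratio) rows."""
    rows = []
    for k, (si, iec) in enumerate(PREFIXES, start=1):
        decimal = 10 ** (3 * k)   # e.g. 10^3 for kilo
        binary = 2 ** (10 * k)    # e.g. 2^10 for kibi
        rows.append((si, 3 * k, iec, 10 * k, binary / decimal))
    return rows


for si, dec_exp, iec, bin_exp, ratio in prefix_table():
    print(f"{si}byte = 10^{dec_exp}   {iec}byte = 2^{bin_exp} = {ratio:.3f} x 10^{dec_exp}")
```

The ratio column grows with each step, which is why a yobibyte is about 21% larger than a yottabyte while a kibibyte is only 2.4% larger than a kilobyte.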

22

Social Media Data:
Part of a Bigger Picture

23

Social Media Data:
Ways in which big data is creating value

• Makes information
transparent and usable at
much higher frequency.
• Provides more transactional
data in digital form, that can
be used to improve
performance across the
board.
• Allows ever-narrower
segmentation of customers to
tailor products or services.
• Improves decision-making
through sophisticated analytics.
• Improves the development of
the next generation of
products and services

24

Data Mining: Defined

Discovering meaningful
patterns from large data
sets using pattern
recognition technologies.

25

Data Mining: CRISP-DM
Cross-Industry Standard Process for Data Mining

[Diagram] CRISP-DM cycle: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment

[Diagram] Data Preparation: Real-World Data -> Data Consolidation -> Data Cleaning -> Data Transformation -> Data Reduction -> Well-Formed Data
26

Data Mining:
General Data Assumptions

Structured
Transformed
Well-Formed

27

Data Mining: Example

Affinity Analysis

28

1. Market Basket Analysis: Items for Sale:

Apples Bananas Cherries

2. Possible Transactions: With one item or a collection of items selected as
the Driver or Independent Variable
No   X      Y        No   X      Y
1    A      B        7    C      A
2    A      C        8    C      B
3    A      B,C      9    C      A,B
4    B      A        10   A,B    C
5    B      C        11   A,C    B
6    B      A,C      12   B,C    A

3. Objective is to empirically determine those groups of items that occur
frequently together in a set of transactions, producing a set of rules of the
form X -> Y.


Transactions (items per basket):

Transaction ID   Items
1                Apple, Banana, Cherry
2                Apple
3                Banana, Cherry
4                Banana, Cherry
5                Apple, Banana
6                Apple, Banana
7                Apple, Cherry
8                Apple, Banana
9                Apple, Banana, Cherry
10               Apple, Banana

The same transactions as a binary matrix:

Transaction ID   Apple   Banana   Cherry
1                1       1        1
2                1       0        0
3                0       1        1
4                0       1        1
5                1       1        0
6                1       1        0
7                1       0        1
8                1       1        0
9                1       1        1
10               1       1        0
Sum              8       8        5

Standard Market Basket Measures:

Support: Rule’s coverage (% match antecedents)
N(X & Y)/N(T) Example: N(A & B)/N(T) = 2/7 = 29%

Confidence: Rule’s predictive ability (% consequent | antecedent)
N(X & Y)/N(X) Example: N(A & B)/N(A) = 2/4 = 50%

Lift: Predictive improvement (ratio of observed support for X & Y to the support expected if X & Y were independent)
S(XuY)/(S(X)S(Y)) Example: (2/7)/((4/7)(5/7)) = .7 or 70%

30


Rule selection usually based on minimum support & confidence

Parameters
Min. Support      40%
Min. Confidence   75%

No   X     Y     N(XuY)  N(T)  S(XuY)  N(X)  Conf  N(Y)  S(X)  S(Y)  Lift   Rule
1    A     B     6       10    60%     8     75%   8     80%   80%   94%    Ok
2    A     C     3       10    30%     8     38%   5     80%   50%   75%
3    A     B,C   2       10    20%     8     25%   4     80%   40%   63%
4    B     A     6       10    60%     8     75%   8     80%   80%   94%    Ok
5    B     C     4       10    40%     8     50%   5     80%   50%   100%
6    B     A,C   2       10    20%     8     25%   3     80%   30%   83%
7    C     A     3       10    30%     5     60%   8     50%   80%   75%
8    C     B     4       10    40%     5     80%   8     50%   80%   100%   Ok
9    C     A,B   2       10    20%     5     40%   6     50%   60%   67%
10   A,B   C     2       10    20%     6     33%   5     60%   50%   67%
11   A,C   B     2       10    20%     3     67%   8     30%   80%   83%
12   B,C   A     2       10    20%     4     50%   8     40%   80%   63%
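
The three measures are easy to compute directly from the ten baskets. The following is a minimal sketch, not the tooling used for the slides: the function names are invented for illustration, and lift is taken strictly as S(XuY)/(S(X)S(Y)), which makes it symmetric in X and Y. Exact fractions are used so the threshold comparisons are not affected by floating-point rounding.

```python
from fractions import Fraction

# The ten baskets from the slide, as sets of item initials (A=Apple, B=Banana, C=Cherry).
BASKETS = [
    {"A", "B", "C"}, {"A"}, {"B", "C"}, {"B", "C"}, {"A", "B"},
    {"A", "B"}, {"A", "C"}, {"A", "B"}, {"A", "B", "C"}, {"A", "B"},
]

# Selection thresholds from the slide (40% support, 75% confidence).
MIN_SUPPORT = Fraction(40, 100)
MIN_CONFIDENCE = Fraction(75, 100)


def support(items, baskets=BASKETS):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return Fraction(sum(items <= basket for basket in baskets), len(baskets))


def rule_measures(x, y, baskets=BASKETS):
    """Return (support, confidence, lift) for the rule X -> Y."""
    s_xy = support(set(x) | set(y), baskets)
    s_x = support(x, baskets)
    s_y = support(y, baskets)
    return s_xy, s_xy / s_x, s_xy / (s_x * s_y)


for x, y in [("A", "B"), ("B", "A"), ("C", "B"), ("AC", "B")]:
    s, c, lift = rule_measures(x, y)
    keep = s >= MIN_SUPPORT and c >= MIN_CONFIDENCE
    print(f"{x} -> {y}: support={float(s):.0%} confidence={float(c):.0%} "
          f"lift={float(lift):.0%} {'Ok' if keep else ''}")
```

A lift above 100% means X and Y co-occur more often than independence would predict; below 100% means less often.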

31

Data Mining:
Simple Example
But, what if the baskets were described in the
following manner:
– Jane bought a handful of maraschinos and a couple of
granny smiths.
– Harold purchased a bag of appls and 2 bananas.
– Bill paid for a pound of cherries but decided not to buy
the three durians because of their odor.
How could we automate the analysis?


Social Media Data:

33

Social Media Data: Commonality?

34

Text Mining: Defined

Using data mining to discover patterns
in a collection of documents
35

Text Mining:
CRISP-Like Processes
[Diagram] CRISP-like cycle: Business Understanding, Document Understanding, Document Preparation, Modeling, Evaluation, Deployment

[Diagram] Document Preparation: Real-World Text Data -> Document Consolidation -> Establish the Corpus Documents -> Corpus Refinement (Token, Stem, Stop…) -> Feature Selection & Weighting -> Term-Doc-Matrix*
36

Text Mining Process:
Sample Corpora
• Brown Corpus – first million word corpus compiled in 60s at
Brown U., 500 samples across 15 genres, each ~2000 words with
POS tags (Lancaster-Oslo-Bergen Corpus – British equivalent)
• Linguistic Data Consortium Treebanks – collections of manually
tagged and parsed sentences (tree structures) from a variety of
sources (includes well-known Penn Treebank collection)
• Reuters 21578, RCV1 & V2, TRC2 -- collections of (1000s of)
Reuters English & multi-lingual news stories classified into topics and
grouped into training & test sets
• Pang & Lee’s Sentiment Analysis – 1000 positive and 1000
negative movie reviews
• MEDLINE – An extensive collection of articles and abstracts
(18M+) used in a variety of biomedical and linguistic text mining
applications
• WordNet® -- large lexical database of English grouped into sets of
cognitive synonyms (synsets) and interlinked by means of
conceptual-semantic and lexical relations.
• 20 Newsgroups -- collection of approximately 20,000 newsgroup
documents, partitioned (nearly) evenly across 20 different
newsgroups each representing a different topic.


Text Mining Process:
Corpus Refinement
Common representation of tokens within and between documents

Tokenization -> Normalize -> Eliminate Stop Words -> Stemming

• Tokenization —Parse the text to generate terms. Sophisticated
analyzers can also extract phrases from the text.
• Normalize — Convert them to lowercase.
• Eliminate stop words — Eliminate terms that appear very often (e.g.
the, and, …).
• Stemming — Convert the terms into their stemmed form—remove
plurals and different word forms (e.g. achieve, achieves, achieved –
achiev) [note: word about synonyms – WordNet Synset]
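
The four refinement steps can be sketched in plain Python. This is a toy illustration under stated assumptions: the stop-word list is a tiny made-up sample, and naive_stem is a crude suffix stripper written for this example, not the Porter stemmer or any library routine.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists hold 100+ words).
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Split text into word-like tokens (letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop very common words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in ("ed", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def refine(text):
    """Tokenize -> normalize -> eliminate stop words -> stem."""
    return [naive_stem(t) for t in remove_stop_words(normalize(tokenize(text)))]


print(refine("The team achieves and achieved the goals"))
```

On the slide's own example, achieve, achieves and achieved all collapse to achiev, so they count as one term in later steps.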


Text Mining:
Feature Extraction & Weighting

Feature Extraction: “Bag of Words, Terms or Tokens”

Vector Representation -> Word, Term, Token or Pairs-Triplets x Doc Matrix

        Token1   Token2   Token3   Token4   …
Doc1    1        2        2        4
Doc2    4        2        3        0
Doc3    1        1        1        0
Doc4    1        1        1        2
…

Words or Tokens are attributes and documents are examples
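
A bag-of-words matrix of this shape can be built with a counter per document. The sketch below assumes the documents are already tokenized; the function name doc_term_matrix and the sample tokens are illustrative only.

```python
from collections import Counter


def doc_term_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, matrix), one row per document."""
    vocab = sorted({token for doc in docs for token in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequencies for this document
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


docs = [
    ["feel", "better", "today"],
    ["feel", "feel", "sad"],
    ["better", "today", "today"],
]
vocab, matrix = doc_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Word order is discarded: each document is reduced to a vector of counts over the shared vocabulary.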


Text Mining:
Transforming Frequencies
• Binary Frequencies: tf =1 for tf>0; otherwise 0
• Term Frequencies: tf(i,j)/Sum of tf(i,j) in Doc K
• Log Frequencies: 1 + log(tf) for tf>0; otherwise 0
• Normalized Frequencies: Divide each frequency by SQRT
of Sum of Squares of the frequencies within the vector
(column)
• Term Frequency–Inverse Document Frequency
– TF * IDF
– Inverse Document Frequency: log(N/(1+D)) where N is total
number of docs and D is number with term
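
The transformations listed above can be written as small functions over a document-term matrix (rows are documents). This is a sketch under assumptions: natural logarithms are used because the slide does not name a base, and the function names are invented here. Note that with IDF defined as log(N/(1+D)), a term that appears in nearly every document gets a weight of zero or below.

```python
import math


def binary(tf):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0


def log_freq(tf):
    """1 + log(tf) for tf > 0, otherwise 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def relative(row):
    """Divide each count by the total number of tokens in the document."""
    total = sum(row)
    return [tf / total if total else 0.0 for tf in row]


def l2_normalize(vector):
    """Divide each value by the square root of the sum of squares."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm if norm else 0.0 for v in vector]


def idf(matrix):
    """log(N / (1 + D)) per term: N documents in total, D of them containing the term."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    doc_counts = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [math.log(n_docs / (1 + d)) for d in doc_counts]


def tf_idf(matrix):
    """Weight each raw term frequency by the term's inverse document frequency."""
    weights = idf(matrix)
    return [[tf * w for tf, w in zip(row, weights)] for row in matrix]


matrix = [[1, 1, 0, 1], [0, 2, 1, 0], [1, 0, 0, 2]]
for row in tf_idf(matrix):
    print([round(v, 3) for v in row])
```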


Text Mining: Simple Example

Listening Post is an art installation by Mark
Hansen and Ben Rubin that culls text
fragments in real time from thousands of
unrestricted Internet chat rooms, bulletin
boards and other public forums.
41


42


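[Diagram] We Feel Fine data collection: every 10 mins, blogs are scanned for sentences containing “I feel” or “I’m feeling”; 15-20K feelings per day; each contains 1 of 5000 pre-determined feelings.

Fields stored per feeling: sentence, imageid, feeling, posttime, postdate, posturl, gender, born, country, state, city, lat, lon, conditions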

43


Query

http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=Sentence&postdate=2010-11-25&limit=500

API Result

<?xml version="1.0" ?>
<feelings>
<feeling imageid="-mZmybPrOGTZ+xukpcU7jg"
feeling="better"
sentence="i feel almost 100 better aside from that weird sandy feeling in my throat"
posttime="1321633467"
postdate=2010-11-25="0"
posturl="http://jenngreenleaf.blogspot.com/2011/11/im-coming-down-with-cold-or-am-i.html"
gender="0" country="united states"
state="maine" city="richmond"
lat="44.091522" lon="-69.801787"
conditions="4" />
…
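
A query like the one above can be assembled and its XML result parsed with the Python standard library. The sketch below does not contact the service (it may no longer be available); it parses a shortened, hand-made sample instead. The function names and the sample string are assumptions for illustration; only the URL and parameter names come from the slide.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://api.wefeelfine.org:8080/ShowFeelings"  # endpoint shown on the slide


def build_query(postdate, limit=500, returnfields="Sentence"):
    """Assemble the query URL with the parameters used on the slide."""
    params = {
        "display": "xml",
        "returnfields": returnfields,
        "postdate": postdate,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


# Shortened, hand-made sample response (not real API output).
SAMPLE = """<?xml version="1.0" ?>
<feelings>
  <feeling feeling="better" sentence="i feel almost 100 better"
           country="united states" state="maine" />
</feelings>"""


def parse_feelings(xml_text):
    """Return one dict of attributes per <feeling> element."""
    root = ET.fromstring(xml_text)
    return [dict(element.attrib) for element in root.iter("feeling")]


print(build_query("2010-11-25"))
for feeling in parse_feelings(SAMPLE):
    print(feeling["feeling"], "|", feeling["sentence"])
```

Each feeling arrives as a flat set of attributes, so a list of dicts is enough to feed the later tokenizing steps.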

44


• i'm done believing you don't know what i'm feeling
• i feel so out of place
• i'm feeling healthy
• i never feel down when i'm with her
• i love the feeling
• i feel like i've been run over by a truck
• i feel so positive today
• i feel like a poor man's pin up girl

45


• Input String (128925 chars; 24282 spaces)
– "i have found to be helpful especially during those times when i am feeling
discouraged\ni have a 50km commute and just the lack of the sense of freedom that
driving brings just leaves me feeling scared\ni seem to be feeling better mostly…"
• Tokenize (26465 tokens)
– ['i', 'have', 'found', 'to', 'be', 'helpful', 'especially', 'during', 'those', 'times', 'when', 'i',
'am', 'feeling', 'discouraged', 'i', 'have', 'a', '50km', 'commute', 'and', 'just', 'the', 'lack',
'of', 'the', 'sense', 'of', 'freedom', 'that', 'driving', 'brings', 'just', 'leaves', 'me', 'feeling',
'scared', 'i', 'feel', 'noone', 'know', 'if', 'you', 'were', 'me', 'you', 'will', 'feel', 'the', 'same',
'way', …]
• Set of Tokens (3045 distinct tokens)
– ["'", "'believe", "'d", "'en", "'encoding", "'feedlinks", "'forever", "'gets", "'http",
"'ismobile", "'isprivate", "'item", "'languagedirection", "'ll", "'locale", "'ltr", "'m",
"'mefaked", "'mobileclass", "'mr", "'no", "'okay", "'on", "'pagetitle", "'pagetype", "'re",
"'s", "'t", "'toned", "'url", "'us", "'utf", "'ve", "'yes", '0', '034', '039', '0aeverytime', '0d',
'10', '100', '101',…]



Corpus                     Word Length   Sentence Length   Lexical Diversity
We Feel Fine               4             17                8
Gutenberg Corpus
  Austen-persuasion.txt    4             23                16
  Bible-kjv.txt            4             33                79
  Blake-poems.txt          4             18                5
  Carroll-alice.txt        4             16                12
  Melville-moby.txt        4             24                15
  Milton-paradise.txt      4             52                15
  Shakespeare-caesar.txt   4             12                8
  Shakespeare-hamlet.txt   4             13                7


• Eliminate Stopwords (175 words - 'a', 'about', 'above', 'after', …)
– Set of tokens (12827) with stopwords eliminated ['ab', 'abit', 'able', 'abs',
'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished',
'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action',
'activities', 'activity', 'actually', 'acura', 'add', …]
– Content (11896 or 45% of tokens not stopwords – 4053 with tokens
starting with apostrophes and #s eliminated )
• Stemming
– Stemmed tokens (11896) ['abdomen', 'abdul', 'abil', 'abl', 'abrupt', 'absolut',
'abstract', 'academ', 'accept', 'accid', 'accomplish', 'accur', 'accus', 'accustom',
'achi', 'achiev', 'acknowledg', 'across', 'action', 'activ', …]
– Set of tokens in stemmed content(2283) ['abdomen', 'abdul', 'abil', 'abl', 'abrupt',
'absolut', 'abstract', 'academ', 'accept', 'accid', 'accomplish', 'accur', 'accus',
'accustom', 'achi', 'achiev',…]



Document-Term Matrix
Sum 416 94 90 89 83 80 80 76 76 75 … 16 16 16 16 16 16 16 16 16
Sum WeFeel like know time go think better way get good love … hear didn place almost comfort everyon sinc babi actual
3 comment1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 comment2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment3 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 comment4 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 comment5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment6 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 comment7 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
7 comment8 2 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment9 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 comment10 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
… …
2 comment1490 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment1491 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
6 comment1492 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
3 comment1493 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 comment1494 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 comment1495 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 comment1496 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment1497 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment1498 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 comment1499 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

50


Madness Murmurings Montage

Mobs Metrics Mounds

Prediction

Collective, macroscopic
trends which can be
scientifically inferred by
harnessing publicly
accessible data from
the Internet.

52

Prediction: Characteristics

Public
Practical
Big

53

Prediction: Sources

Easily accessible digital traces:
What we surf
Whom we “friend”
What we say
Where we go
What we buy
How we play

54

Prediction: Sample Studies

55

Prediction: Sample Studies

Infodemiology
Nowcasting
Culturomics

56

Prediction: Infodemiology

Information + Epidemiology:

Science of distribution and
determinants of information
in an electronic medium,
specifically the Internet, or
in a population, with the
ultimate aim to inform public
health and public policy
Coined by Gunther Eysenbach, Univ. of Toronto

57

A Major Application - Practical

58

A Major Application - Practical

Regional, Weekly Syndromic Surveillance

59

An Alternative Approach

Text Mining of Worldwide Newswires, Web Sites
and Various Offline Reports

60

Utilizing Aggregate Search Data

Monitoring and analyzing
queries from Internet search
engines or peoples' status
updates on microblogs for
syndromic surveillance to
predict disease outbreaks

61


62


63


Standard Linear Prediction Model

Y(t) = b0 + b1 * Y(t-n) + b2 * X + b3 * S + e
– Y(t) – dependent variable at time t (standard publicly available measure)
– Y(t-n) – dependent variable at time t - n (standard publicly available measure)
– X – traditional, publicly available explanatory variable
– S – aggregate search index or social media frequency count
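
A model of this form can be fitted by ordinary least squares. The sketch below is a pure-Python illustration on synthetic, noise-free data, so the coefficients it recovers are known in advance; all series values, function names and the lag of 1 are assumptions made for the example and do not come from any of the studies cited.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares fit via the normal equations; returns [b0, b1, ...]."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def lagged_design(series, explanatory, search_index, lag=1):
    """Build predictor rows [Y(t-lag), X(t), S(t)] and targets Y(t)."""
    rows, target = [], []
    for t in range(lag, len(series)):
        rows.append([series[t - lag], explanatory[t], search_index[t]])
        target.append(series[t])
    return rows, target


# Synthetic data generated from known coefficients (b0=0.5, b1=0.3, b2=0.2, b3=0.4).
explanatory = [1, 3, 2, 5, 4, 6, 8, 7]
search_index = [2, 1, 4, 3, 6, 5, 7, 9]
series = [1.0]
for t in range(1, len(explanatory)):
    series.append(0.5 + 0.3 * series[t - 1] + 0.2 * explanatory[t] + 0.4 * search_index[t])

rows, target = lagged_design(series, explanatory, search_index)
coeffs = fit_ols(rows, target)
print([round(c, 3) for c in coeffs])
```

With real data the interesting question is whether b3 adds predictive power beyond the lagged measure and the traditional variable.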

64

“Detecting Influenza Epidemics Using Search
Engine Query Data” (Ginsberg et al.), 2/19/09

• Aggregated historical logs of search queries
from 2003-2008 into weekly time series
• Logit(P) = b0 + b1 * logit(Q) + e
– P – percentage of ILI physician visits
– Q – query fraction for the 45 highest-scoring influenza queries
• r is between .80 and .96 for 9 regions
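
The logit model above is a simple linear regression on transformed values. The sketch below illustrates the mechanics on synthetic numbers only; the query fractions, the coefficients used to generate them, and the function names are assumptions for this example, and P and Q are expressed as proportions between 0 and 1.

```python
import math


def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Inverse of logit: map log-odds back to a proportion."""
    return 1 / (1 + math.exp(-z))


def fit_simple(xs, ys):
    """Least-squares fit of y = b0 + b1 * x for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b1 * mean_x, b1


def fit_ili_model(query_fractions, ili_proportions):
    """Fit logit(P) = b0 + b1 * logit(Q)."""
    return fit_simple([logit(q) for q in query_fractions],
                      [logit(p) for p in ili_proportions])


def predict(b0, b1, query_fraction):
    """Predicted ILI proportion for a given query fraction."""
    return inv_logit(b0 + b1 * logit(query_fraction))


# Synthetic weekly data generated from b0=1.0, b1=0.9 (illustrative values only).
query_fractions = [0.001, 0.002, 0.004, 0.008]
ili_proportions = [inv_logit(1.0 + 0.9 * logit(q)) for q in query_fractions]

b0, b1 = fit_ili_model(query_fractions, ili_proportions)
print(round(b0, 3), round(b1, 3), round(predict(b0, b1, 0.003), 4))
```

The logit transform keeps predictions inside (0, 1), which a plain linear fit on percentages would not guarantee.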

65


http://www.google.org/flutrends/about/how.html

66


67

A Similar Application

http://www.google.org/denguetrends/
68

Utilizing Tweets

?

69

Utilizing Tweets

70

Utilizing Tweets
“Nowcasting Events from the Social Web
with Statistical Learning,” Lampos and
Cristianini, ACM IS&T, 9/11

• Text analysis of 50M tweets for 3 regions of UK
from 6/09-4/10 (303 days)
• HPA weekly reports of GP consultations with ILI
diagnosis correlated with number of “hybrid
grams”
• Average “r” of .911
71

A Major Application – Text Analysis

Corpus: 50M Tweets, 3 Region UK, 6/09-4/10

Corpus Refinement: Tokens, Lower Case, Stop Words, Stems

Feature Selection: 1-Grams, 2-Grams, Hybrid Grams, N-Gram Freqs
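
Counting 1-grams and 2-grams over refined tokens can be sketched as follows. This is an illustration only: the sample tokens are invented, and pooling both n-gram sizes into one counter is an assumed reading of the "hybrid" feature set, not a description of the authors' code.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(docs, sizes=(1, 2)):
    """Pooled frequency counts of every n-gram size in `sizes` across all documents."""
    counts = Counter()
    for tokens in docs:
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts


docs = [
    ["sore", "throat", "fever"],
    ["fever", "sore", "throat"],
]
counts = ngram_frequencies(docs)
print(counts.most_common(4))
```

Candidate features with very low counts can then be dropped before model fitting.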

72

Utilizing Tweets
Discarded when n<50

BoLasso - Bootstrap LASSO (least absolute shrinkage and selection operator)
73

Utilizing Tweets

74

Utilizing Tweets

75

Prediction:

Now + Forecasting:

Predicting the present
by analyzing large
volumes of data that
can be used to
"forecast" current
events for which
official analysis has
not been released
76

Prediction: Nowcasting
Weather Envy

Within the next 6 hours …
77

Prediction:
Sample Studies with Search
• Song, Pan, Ng (Apr-10)
– Dependent Variables: Weekly Hotel Bookings in Charleston, SC
– Explanatory Variables: Indexed Search Volumes from Google Trends/Insights, Jan 2008-Aug 2009
– Model: Log of Room Nights for Log of Search Volumes - Charleston, Travel Charleston, Charleston Hotels, Charleston Restaurants, Charleston Tourism
– Results: Tested various statistical models; all gave reasonable forecasts. Best fit model was Autoregressive Distributed Lag (ADLM) with a lag period of 6 weeks.

• Kholodilin, Podstawski, Siliverstovs (Apr-10)
– Dependent Variables: Year-on-Year Growth Rate of Monthly US Real Private Consumption, ALFRED db of Fed Rsrv of St. Louis
– Explanatory Variables: 220 Google Trend/Insights Search terms related to Priv Consumption reduced to 10 principal components for monthly periods from Jan 2005 to Dec 2009
– Model: Y-o-Y monthly URPC growth rates for 3 sets of regressors -- Sentiment (consumer sentiment and confidence); Financial (short term and long term interest rates and S&P 500); Query (combinations of principal components of query terms)
– Results: Query term principal components outperform standard Sentiment and Financial Indicators. A combination of two of the factors works best -- those related to mobility and health care consumption.

• Choi, Varian (Apr-09)
– Dependent Variables: US Census Bureau Advance Monthly Retail Sales (general and specific) and Travel (Visitor arrival in Hong Kong)
– Explanatory Variables: Google Trend/Insight query indices for categories and subcategories related to retail sales (general and specific) and related to Travel
– Model: Google Trend indices for query subcategories related to (log values) of overall monthly retail trade (NAICS categories), automotive sales, home sales and travel.
– Results: Simple seasonal AR models and fixed-effects models that include relevant Google Trend variables tend to outperform models that exclude these variables. In some cases small gains, in others substantial.

• McLaren, Shanbhogue (Q2-11)
– Dependent Variables: Official monthly unemployment data and housing price growth in the UK from June 2004-Jan 2011
– Explanatory Variables: Google Trend/Insight query indexes for the term "Job Seekers Allowance (JSA)" for unemployment and "Estate Agents" for housing
– Model: For unemployment, linear AR model with query term, claimant count, and GfK consumer confid. as exp vars; for housing price growth with query term, Home Builders and Royal Instit. of Chartered Surveyors price growth balances as exp vars.
– Results: For unemployment forecasts, claimant count strongest followed by query term. For housing prices, the query term was much stronger than HBF and RICS data.

78

Prediction:
Sample Studies with Social Media
• Asur, Huberman (Mar-10)
– Dependent Variables: Box-office revenues for (24) movies
– Explanatory Variables: Promotion tweets-retweets for a particular movie, tweet rates for particular movie per hour, ratio of positive to negative sentiments for the movie
– Model: Regression of 1st weekend box office revenues by promotional tweets-retweets, by tweet rates vs. Hollywood Stock Exchange prices, and 2nd weekend revenues by tweet rates and the sentiment ratio.
– Results: Promotional tweets are weakly correlated with 1st weekend revs. Tweet rates are very strongly correlated (min .9) and a stronger predictor than HSX. Finally, tweet rates are strongly correlated with 2nd weekend revenues and sentiments improve the forecasts slightly.

• Gruhl, Guha, Kumar, Novak, Tomkins (Aug-05)
– Dependent Variables: Amazon Sales Rank for 2340 bestselling books in 4 month period (Jul 2004-Aug 2004) and spikes in these sales ranks
– Explanatory Variables: Number of mentions of the book/author in over 300K blogs whose postings were maintained by IBM's WebFountain project (over 200K postings/day)
– Model: Cross correlation of time series for sales rank and mentions.
– Results: While sales rank is a poor predictor of the change in sales rankings, a prior spike in mentions predicts quite well a future spike in sales rank.

• Sadikov, Parameswaran, Venetis (Aug-09)
– Dependent Variables: Movie critic ranking, user ranking, 2008 gross sales, weekly box office sales (weeks 1-5)
– Explanatory Variables: Basic features that count movie references in blogs, count movie references taking into account ranking and indegree of the blogs where they appear, consider only references made within a time window before or after a movie release date, features that consider positive sentiment; and combinations of these. References based on spinn3r.com blog data set 11/07-11/08
– Model: Linear regression for weekly rankings and sales data by blog references and sentiment.
– Results: Minimal correlation between rankings and references and sentiment. Strong correlation between references and gross sales but weak with sentiment. Strongest relationships with timing of references in weeks after release.

79

Prediction: Any Guesses?

80

Prediction: Idiom, a Sculpture of
10s of 1000s of Books

81

Prediction: It comes in many
Shapes but not Sizes

Omphalos Book Cell

Matej Krén
Gravity Mixer
82

Prediction: Culturomics

Culture + Genomics:

Application of high-
throughput data
collection and analysis
to the study of human
culture.

83

Prediction: Culturomics

“Quantitative Analysis of Culture Using
Millions of Digitized Books,” Science, 12/16/10.
84

Prediction: Culturomics 2.0

http://www.youtube.com/watch?v=61qn7S9NCOs

85

Culturomics 2.0: Forecasting Large-Scale Human
Behavior Using Global News Media Tone in Time
and Space, Kalev Leetaru, 9/11

• The tone of real-time consciousness reflected in the media can
be used to forecast broad social behavior.
• Combined three massive news archives totaling more than 100
million articles worldwide to explore the global consciousness
of the news media.
• Employs a large shared-memory supercomputer (University of
Tennessee SGI Altix supercomputer Nautilus with 1024
processors and 4-TB of memory)
• Using the tone and location of the reports, (claims to have)
predicted the outcome of the Arab Spring and the location of Bin
Laden within a radius of 125 miles

86

Based on Carbon Capture Report

87

Based on Carbon Capture Report

88

Features of Stories or Tweets
• Tone/Positivity/Negativity. Ratio of + to - tone
(-100 to 100)
• Polarity. Emotional charge (0 to 100)
• Activity. Intensity of "active language" (0 to 100)
• Personalization. Degree to which the writer
attempts to bring the reader into the fold (0 to
100)
• Questions/Exclamations. Tweet tone indicators of
non-word items
• Geocoding. Location of story content
89

Features of Stories or Tweets

100M Articles from the:
• New York Times (1945-05)
• Sum. of Wrld Brdcasts (1979-10)
• Google News articles (2006-11)

Nautilus Supercomputer: Sentiment Mining, Geocoding, Entity Extraction

Geocoding, Feature Scores

2.4 Petabyte Network with over 10M entities

90

Predicting Unrest

91

NY Times View of Tone
http://contentanalysis.ichass.illinois.edu/Culturomics20/nyt-movie-
1000x1000.gif

92

SWB View of Tone
http://contentanalysis.ichass.illinois.edu/Culturomics20/swb-movie-
1000x1000.gif

93
