1. Mining and Analyzing Social Media
HICSS 45 Tutorial – Part 1
Dave King
January 4, 2012
2. Agenda: This is how the slides are
organized
• Part 1
– Introduction – Bio, Resources, Social Media
– Data Mining – Processes and Example
– Text Mining – General Processes and Example
– Predicting the Future – The Portmanteaus
• Part 2
– Sentiment Analysis
– Social Network Analysis - Introduction
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
3. Biography: Dave King
• Currently, EVP of Product Development
and Management at JDA Software
• 30 years in enterprise package
software business
• 15 years as university professor
• 14 years as Co-Chair of the Internet &
Digital Economy Track (HICSS)
• Long time interest in various aspects of
E-Commerce & Business Intelligence
• Tutorial topic primarily reflects a
personal interest and tangentially a
job(s) related interest.
4. Personal Experiences with
Analytics
• Taught applied statistics, math modeling & mathematical sociology
• In software R&D for 30 years
– Optimization in the 80s
– Natural Language Frontends
• NLI Query & CMU Robotics Lab
– EIS Competitive Analysis
• Dow Jones and Reuters
• Verity Topics
• NewsAlert
– InXight’s Hyperbolic Tree
– Supply Chain Analytics
• In the case of text analysis and its practical application, audiences
have often been small, bewildered, and fleeting
5. Mining and Analytics Resources
12. Social Media Defined
Marta Kagan
13. Social Media Defined: …Sort of …
14. Social Media Defined:
Actually, it’s 33 Definitions
1. Media for social interaction, using highly accessible and scalable communication techniques.
2. Various user-driven (inbound marketing) channels (e.g., Facebook, Twitter, blogs, YouTube).
3. Most transparent, engaging and interactive form of public relations
4. What we do and say together, worldwide, to communicate in all direction at any time, by any possible (digital) means.
5. New marketing tool that allows you to get to know your customers and prospects in ways that were previously not possible.
6. Platforms that enable the interactive web by engaging users to participate in, comment on and create content as means of communicating
7. Consists of any online platform or channel for user generated content.
8. Digital content and interaction that is created by and between people.
9. Shift in how we get our information. Social media allows us to network, to find people with like interests, and to meet people who can become friends or customers.
10. Platforms for interaction and relationships, not content and ads.
11. Online platforms and locations that provide a way for people to participate in these conversations.
12. People’s conversations and actions online that can be mined by advertisers for insights but not coerced to pass along marketing messages.
13. Tools, services, and communication facilitating connection between peers with common interests.
14. Online technologies and practices that people use to share content, opinions, insights, experiences, perspectives, and media themselves.
15. Ever-growing and evolving collection of online tools and toys, platforms and applications that enable all of us to interact with and share information. Increasingly, it’s both the connective tissue and neural net of the Web.
16. Reflection of conversations happening every day, whether at the supermarket, a bar, the train, the watercooler or the playground.
17. Online text, pictures, videos and links, shared amongst people and organizations.
18. Not one thing. It’s five distinct things:
19. Digital, content-based communications based on the interactions enabled by a plethora of web technologies
20. Collection of online platforms and tools that people use to share content, profiles, opinions, insights, experiences, perspectives and media itself, facilitating conversations and interactions online between groups of people.
21. Platform/tools.
22. Act of connecting on social media platforms.
23. How businesses join the conversation in an authentic and transparent way to build relationships.
24. The notion that social media is about the technology that facilitates individuals and groups of people to connect and interact, create and share.
25. Any of a number of individual web-based applications aggregating users who are able to conduct one-to-one and one-to-many two-way conversations.
26. Media channel that relies on listening and conversation, as opposed to a monologue, to get your point across, make a connection and build a relationship.
27. Social media is all about leveraging online tools that promote sharing and conversations, which ultimately lead to engagement with current and future customers and influencers in your target market.
28. Social media: Evolution, Revolution and Contribution - by the ability of everybody to share and contribute as a publisher
29. Social media is communication channels or tools used to store, aggregate, share, discuss or deliver information within online communities.
30. Social Media is simply another arrow to be shot in a company’s marketing quiver.
31. Social media platforms make it easier to share information, usually online.
32. Any object or tool that connects people in dialogue or interaction, whether in person, in print, or online.
33. Wild, Wild West of Marketing, with brands, businesses, and organizations jostling with individuals to make news, friends, connections and build communities in the virtual space.
15. Social Media Defined: If a Picture isn’t
worth a 1000 words, then …
16. Social Media Defined
Online technologies and practices
for social interaction
enabling the sharing of opinions, insights,
experiences, perspectives and media itself
17. Social Media Defined: Categories
19. Social Media is Huge: Users
Marta Kagan
750 Million: Facebook
200 Million: Twitter
100 Million: LinkedIn
20. Social Media is Huge!
Marta Kagan
If Facebook
were a country,
it would be the
3rd largest in
the world
21. Social Media Data:
Research Opportunity
“Every day, Twitter
generates more
social network
data than the
entire field of SNA
possessed 10
years ago.”
23. Social Media Data:
Part of a Bigger Picture
24. Social Media Data:
Ways in which big data is creating value
• Makes information
transparent and usable at
much higher frequency.
• Provides more transactional
data in digital form, that can
be used to improve
performance across the
board.
• Allows ever-narrower
segmentation of customers to
tailor products or services.
• Improves decision-making
through sophisticated analytics.
• Improves the development of
the next generation of
products and services
25. Data Mining: Defined
Discovering meaningful
patterns from large data
sets using pattern
recognition technologies.
26. Data Mining: CRISP-DM
Cross-Industry Standard Process for Data Mining
• Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
• Data Preparation: Real-World Data -> Data Consolidation -> Data Cleaning -> Data Transformation -> Data Reduction -> Well-Formed Data
27. Data Mining:
General Data Assumptions
Structured
Transformed
Well-Formed
28. Data Mining: Example
Affinity Analysis
29. Data Mining: Example
1. Market Basket Analysis: Items for Sale:
Apples Bananas Cherries
2. Possible Transactions: With one item or a collection of items selected as
the Driver or Independent Variable
No   X      Y
1    A      B
2    A      C
3    A      B, C
4    B      A
5    B      C
6    B      A, C
7    C      A
8    C      B
9    C      A, B
10   A, B   C
11   A, C   B
12   B, C   A
3. Objective is to empirically determine those groups of items that occur
frequently together in a set of transactions, producing a set of rules of the
form X -> Y.
30. Data Mining: Example
Transaction ID   Items
1                Apple, Banana, Cherry
2                Apple
3                Banana, Cherry
4                Banana, Cherry
5                Apple, Banana
6                Apple, Banana
7                Apple, Cherry
8                Apple, Banana
9                Apple, Banana, Cherry
10               Apple, Banana

Transaction ID   Apple   Banana   Cherry
1                1       1        1
2                1       0        0
3                0       1        1
4                0       1        1
5                1       1        0
6                1       1        0
7                1       0        1
8                1       1        0
9                1       1        1
10               1       1        0
Sum              8       8        5

Standard Market Basket Measures:
• Support: Rule’s coverage (% match antecedents)
– N(X & Y)/N(T)  Example: N(A & B)/N(T) = 2/7 = 29%
• Confidence: Rule’s predictive ability (% consequent | antecedent)
– N(X & Y)/N(X)  Example: N(A & B)/N(A) = 2/4 = 50%
• Lift: Predictive improvement (ratio of observed support for X & Y to the support expected if X & Y were independent)
– S(XuY)/(S(X)S(Y))  Example: (2/7)/((4/7)(5/7)) = .7 or 70%
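The three measures can be sketched in Python against the ten baskets listed on this slide. This is a minimal illustration only: the names `BASKETS`, `count`, `support`, `confidence` and `lift` are assumptions made for the example and do not come from the tutorial.

```python
# Ten transactions from the slide, as sets of items (A=Apple, B=Banana, C=Cherry).
BASKETS = [
    {"A", "B", "C"}, {"A"}, {"B", "C"}, {"B", "C"}, {"A", "B"},
    {"A", "B"}, {"A", "C"}, {"A", "B"}, {"A", "B", "C"}, {"A", "B"},
]


def count(items, baskets=BASKETS):
    """N(items): number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if set(items) <= basket)


def support(x, y, baskets=BASKETS):
    """S(X u Y) = N(X & Y) / N(T)."""
    return count(set(x) | set(y), baskets) / len(baskets)


def confidence(x, y, baskets=BASKETS):
    """N(X & Y) / N(X)."""
    return count(set(x) | set(y), baskets) / count(x, baskets)


def lift(x, y, baskets=BASKETS):
    """S(X u Y) / (S(X) * S(Y))."""
    n = len(baskets)
    s_x = count(x, baskets) / n
    s_y = count(y, baskets) / n
    return support(x, y, baskets) / (s_x * s_y)


if __name__ == "__main__":
    for x, y in [({"A"}, {"B"}), ({"C"}, {"B"})]:
        print(sorted(x), "->", sorted(y),
              round(support(x, y), 2), round(confidence(x, y), 2), round(lift(x, y), 2))
```

A lift near 1 suggests that X and Y co-occur about as often as independence would predict; values well above 1 indicate a positive association.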
31. Data Mining: Example
Rule selection usually based on minimum support & confidence
Parameters: Min. Support 40%, Min. Confidence 75%
Lift = S(XuY)/(S(X)S(Y))
No   X      Y      N(XuY)  N(T)  S(XuY)  N(X)  Conf  N(Y)  S(X)  S(Y)  Lift   Rule
1    A      B      6       10    60%     8     75%   8     80%   80%   94%    Ok
2    A      C      3       10    30%     8     38%   5     80%   50%   75%
3    A      B, C   2       10    20%     8     25%   4     80%   40%   63%
4    B      A      6       10    60%     8     75%   8     80%   80%   94%    Ok
5    B      C      4       10    40%     8     50%   5     80%   50%   100%
6    B      A, C   2       10    20%     8     25%   3     80%   30%   83%
7    C      A      3       10    30%     5     60%   8     50%   80%   75%
8    C      B      4       10    40%     5     80%   8     50%   80%   100%   Ok
9    C      A, B   2       10    20%     5     40%   6     50%   60%   67%
10   A, B   C      2       10    20%     6     33%   5     60%   50%   67%
11   A, C   B      2       10    20%     3     67%   8     30%   80%   83%
12   B, C   A      2       10    20%     4     50%   8     40%   80%   63%
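Rule selection by minimum support and confidence can be sketched as follows. This is an illustrative brute-force enumeration that is only practical for a handful of items, not an Apriori implementation; `candidate_rules` and `select_rules` are hypothetical names introduced for the example.

```python
from itertools import combinations

# Ten transactions from the earlier slide (A=Apple, B=Banana, C=Cherry).
BASKETS = [
    {"A", "B", "C"}, {"A"}, {"B", "C"}, {"B", "C"}, {"A", "B"},
    {"A", "B"}, {"A", "C"}, {"A", "B"}, {"A", "B", "C"}, {"A", "B"},
]
ITEMS = sorted(set().union(*BASKETS))


def count(items):
    """Number of baskets containing every item in `items`."""
    return sum(1 for basket in BASKETS if items <= basket)


def candidate_rules():
    """Yield every rule X -> Y where X and Y are non-empty, disjoint item sets."""
    subsets = [frozenset(c)
               for size in range(1, len(ITEMS))
               for c in combinations(ITEMS, size)]
    for x in subsets:
        for y in subsets:
            if not x & y:
                yield x, y


def select_rules(min_support=0.40, min_confidence=0.75):
    """Keep the rules that meet both thresholds."""
    n = len(BASKETS)
    selected = []
    for x, y in candidate_rules():
        both = count(x | y)
        supp = both / n
        conf = both / count(x)
        if supp >= min_support and conf >= min_confidence:
            selected.append((sorted(x), sorted(y), supp, conf))
    return selected


if __name__ == "__main__":
    for x, y, supp, conf in select_rules():
        print(x, "->", y, f"support={supp:.0%}", f"confidence={conf:.0%}")
```

With three items the enumeration yields the same twelve candidate rules as the slide, and the 40% / 75% thresholds leave A -> B, B -> A and C -> B.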
32. Data Mining:
Simple Example
But, what if the baskets were described in the
following manner:
– Jane bought a handful of maraschinos and a couple of
granny smiths.
– Harold purchased a bag of appls and 2 bananas.
– Bill paid for a pound of cherries but decided not to buy
the three durians because of their odor.
How could we automate the analysis?
33. Social Media Data:
34. Social Media Data: Commonality?
35. Text Mining: Defined
Using data mining to discover patterns
in a collection of documents
36. Text Mining:
CRISP-Like Processes
• Phases: Business Understanding, Document Understanding, Document Preparation, Modeling, Evaluation, Deployment
• Document Preparation: Real-World Text Data -> Document Consolidation -> Establish the Corpus -> Documents -> Corpus Refinement (Token, Stem, Stop…) -> Feature Selection & Weighting -> Term-Doc-Matrix*
37. Text Mining Process:
Sample Corpora
• Brown Corpus – first million word corpus compiled in 60s at
Brown U., 500 samples across 15 genres, each ~2000 words with
POS tags (Lancaster-Oslo-Bergen Corpus – British equivalent)
• Linguistic Data Consortium Treebanks – collections of manually
tagged and parsed (tree-structured) sentences from a variety of
sources (includes well-known Penn Treebank collection)
• Reuters 21578, RCV1 & V2, TRC2 -- collections of (1000s of)
Reuters English & multi-lingual news stories classified into topics and
grouped into training & test sets
• Pang & Lee’s Sentiment Analysis – 1000 positive and 1000
negative movie reviews
• MEDLINE – An extensive collection of articles and abstracts
(18M+) used in a variety of biomedical and linguistic text mining
applications
• WordNet® -- large lexical database of English grouped into sets of
cognitive synonyms (synsets) and interlinked by means of
conceptual-semantic and lexical relations.
• 20 Newsgroups -- collection of approximately 20,000 newsgroup
documents, partitioned (nearly) evenly across 20 different
newsgroups each representing a different topic.
38. Text Mining Process:
Corpus Refinement
Common representation of tokens within and between documents
Tokenization -> Normalize -> Eliminate Stop Words -> Stemming
• Tokenization – Parse the text to generate terms. Sophisticated
analyzers can also extract phrases from the text.
• Normalize – Convert the terms to lowercase.
• Eliminate stop words – Eliminate terms that appear very often (e.g.
the, and, …).
• Stemming – Convert the terms into their stemmed form by removing
plurals and different word forms (e.g. achieve, achieves, achieved ->
achiev) [note: word about synonyms – WordNet Synset]
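The four refinement steps can be sketched in plain Python. This is a toy illustration under stated assumptions: `STOP_WORDS` is a tiny hypothetical list, and `toy_stem` is a naive suffix stripper written for this example, not the Porter algorithm or any library stemmer.

```python
import re

# Assumption: a tiny illustrative stop list; real lists are much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Split text into word tokens, keeping apostrophes inside words."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def normalize(tokens):
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Drop very frequent function words."""
    return [token for token in tokens if token not in STOP_WORDS]


def toy_stem(token):
    """Naive suffix stripping (NOT Porter): remove one common ending,
    as long as at least three characters remain."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def refine(text):
    """Tokenize -> normalize -> eliminate stop words -> stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [toy_stem(token) for token in tokens]


if __name__ == "__main__":
    print(refine("She achieves and achieved; to achieve is the goal"))
```

In practice a tested stemmer and stop list from an NLP toolkit would replace the toy pieces; the order of the steps is the point of the sketch.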
39. Text Mining:
Feature Extraction & Weighting
Feature Extraction: “Bag of Words, Terms or Tokens”
Vector Representation -> Word, Term, Token or Pairs-Triplets x Doc Matrix
       Token1  Token2  Token3  Token4  …
Doc1   1       2       2       4
Doc2   4       2       3       0
Doc3   1       1       1       0
Doc4   1       1       1       2
…
Words or Tokens are attributes and documents are examples
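A bag-of-words document-by-term matrix of this kind can be built with a few lines of Python. This is a minimal sketch: the whitespace tokenizer and the function name `doc_term_matrix` are assumptions made for the example.

```python
from collections import Counter


def doc_term_matrix(docs):
    """Return (vocabulary, rows); rows[d][t] is the count of term t in document d.
    Assumption: documents are tokenized by lowercasing and splitting on whitespace."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, rows


if __name__ == "__main__":
    vocab, matrix = doc_term_matrix(["i feel fine", "i feel so so good", "feel good"])
    print(vocab)
    for row in matrix:
        print(row)
```

Each row is one document (an example) and each column one token (an attribute), which is the orientation the frequency transformations on the next slide assume.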
40. Text Mining:
Transforming Frequencies
• Binary Frequencies: tf =1 for tf>0; otherwise 0
• Term Frequencies: tf(i,j) / Sum of tf(k,j) over all terms k in Doc j
• Log Frequencies: 1 + log(tf) for tf>0; otherwise 0
• Normalized Frequencies: Divide each frequency by SQRT
of Sum of Squares of the frequencies within the vector
(column)
• Term Frequency–Inverse Document Frequency
– TF * IDF
– Inverse Document Frequency: log(N/(1+D)) where N is total
number of docs and D is number with term
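The transformations listed above can be sketched in Python. Two assumptions are made for the example: the logarithm is the natural log (the slide does not state a base), and the matrix is oriented with documents as rows and terms as columns. All function names are illustrative.

```python
import math


def binary_tf(tf):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0


def relative_tf(doc_counts):
    """Each count divided by the total number of term occurrences in the document."""
    total = sum(doc_counts)
    return [tf / total for tf in doc_counts] if total else list(doc_counts)


def log_tf(tf):
    """1 + log(tf) for tf > 0, otherwise 0 (natural log assumed)."""
    return 1 + math.log(tf) if tf > 0 else 0


def l2_normalize(vector):
    """Divide each frequency by the square root of the sum of squares."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


def idf(n_docs, n_docs_with_term):
    """log(N / (1 + D)), as written on the slide."""
    return math.log(n_docs / (1 + n_docs_with_term))


def tf_idf(matrix):
    """Apply TF * IDF to a documents-by-terms count matrix."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[t] > 0) for t in range(n_terms)]
    return [[row[t] * idf(n_docs, doc_freq[t]) for t in range(n_terms)] for row in matrix]


if __name__ == "__main__":
    counts = [[1, 2, 2, 4], [4, 2, 3, 0], [1, 1, 1, 0], [1, 1, 1, 2]]
    print(relative_tf(counts[0]))
    print(tf_idf(counts)[0])
```

Note that with the 1 + D smoothing a term that appears in every document gets a slightly negative IDF, so such terms are pushed below zero rather than to zero.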
41. Text Mining: Simple Example
Listening Post is an art installation by Mark
Hansen and Ben Rubin that culls text
fragments in real time from thousands of
unrestricted Internet chat rooms, bulletin
boards and other public forums.
42. Text Mining: Simple Example
43. Text Mining: Simple Example
• Blogs: “I feel”, “I’m feeling”
• Every 10 Mins
• 15-20K Feelings Per Day
• Contains 1 of 5000 Pre-Determined Feelings
• Fields: sentence, imageid, feeling, posttime, postdate, posturl, gender, born, country, state, city, lat, lon, conditions
44. Text Mining: Simple Example
Query API:
http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=Sentence&postdate=2010-11-25&limit=500
Result:
<?xml version="1.0" ?>
<feelings>
<feeling imageid="-mZmybPrOGTZ+xukpcU7jg"
feeling="better"
sentence="i feel almost 100 better aside from that weird sandy feeling in my throat"
posttime="1321633467"
postdate="2010-11-25"
posturl="http://jenngreenleaf.blogspot.com/2011/11/im-coming-down-with-cold-or-am-i.html"
gender="0" country="united states"
state="maine" city="richmond"
lat="44.091522" lon="-69.801787"
conditions="4" />
…
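Building the query string and reading the result can be sketched with the Python standard library. The endpoint and parameter names are taken from the slide; the helper names `build_query_url` and `parse_feelings` are hypothetical, the service may no longer be reachable, and the sketch deliberately makes no network call.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint as shown on the slide; its current availability is not assumed.
BASE_URL = "http://api.wefeelfine.org:8080/ShowFeelings"


def build_query_url(postdate, limit=500, returnfields="Sentence"):
    """Assemble the query URL from the parameters shown on the slide."""
    params = {
        "display": "xml",
        "returnfields": returnfields,
        "postdate": postdate,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


def parse_feelings(xml_text):
    """Return one dict of attributes per <feeling> element."""
    root = ET.fromstring(xml_text)
    return [dict(element.attrib) for element in root.iter("feeling")]


if __name__ == "__main__":
    print(build_query_url("2010-11-25"))
    sample = ('<?xml version="1.0" ?><feelings>'
              '<feeling feeling="better" sentence="i feel almost 100 better" city="richmond" />'
              '</feelings>')
    for record in parse_feelings(sample):
        print(record["feeling"], "|", record["sentence"])
```

The parsed `sentence` values are the raw text that the tokenization and refinement steps then operate on.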
45. Text Mining: Simple Example
• i'm done believing you don't know what i'm feeling
• i feel so out of place
• i'm feeling healthy
• i never feel down when i'm with her
• i love the feeling
• i feel like i've been run over by a truck
• i feel so positive today
• i feel like a poor man's pin up girl
46. Text Mining: Simple Example
• Input String (128925 chars; 24282 spaces)
– "i have found to be helpful especially during those times when i am feeling
discouraged\ni have a 50km commute and just the lack of the sense of freedom that
driving brings just leaves me feeling scared\ni seem to be feeling better mostly…"
• Tokenize (26465 tokens)
– ['i', 'have', 'found', 'to', 'be', 'helpful', 'especially', 'during', 'those', 'times', 'when', 'i',
'am', 'feeling', 'discouraged', 'i', 'have', 'a', '50km', 'commute', 'and', 'just', 'the', 'lack',
'of', 'the', 'sense', 'of', 'freedom', 'that', 'driving', 'brings', 'just', 'leaves', 'me', 'feeling',
'scared', 'i', 'feel', 'noone', 'know', 'if', 'you', 'were', 'me', 'you', 'will', 'feel', 'the', 'same',
'way', …]
• Set of Tokens (3045 distinct tokens)
– ["'", "'believe", "'d", "'en", "'encoding", "'feedlinks", "'forever", "'gets", "'http",
"'ismobile", "'isprivate", "'item", "'languagedirection", "'ll", "'locale", "'ltr", "'m",
"'mefaked", "'mobileclass", "'mr", "'no", "'okay", "'on", "'pagetitle", "'pagetype", "'re",
"'s", "'t", "'toned", "'url", "'us", "'utf", "'ve", "'yes", '0', '034', '039', '0aeverytime', '0d',
'10', '100', '101',…]
47. Text Mining: Simple Example
Corpus Word Length Sentence Length Lexical Diversity
We Feel Fine 4 17 8
Gutenberg Corpus
Austen-persuasion.txt 4 23 16
Bible-kjv.txt 4 33 79
Blake-poems.txt 4 18 5
Carroll-alice.txt 4 16 12
Melville-moby.txt 4 24 15
Milton-paradise.txt 4 52 15
Shakespeare-caesar.txt 4 12 8
Shakespeare-hamlet.txt 4 13 7
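The three columns can be sketched as simple ratios. This assumes the definitions commonly used for such corpus summaries: characters per word, words per sentence, and total tokens per distinct token, each truncated to an integer. The regex-based sentence and word splitting and the name `corpus_stats` are simplifications introduced for the example.

```python
import re


def corpus_stats(text):
    """Return (average word length, average sentence length, lexical diversity),
    each truncated to an integer.
    Assumptions: sentences end at '.', '!', '?' or a newline, and words are
    runs of letters and apostrophes."""
    sentences = [s for s in re.split(r"[.!?\n]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    avg_word_length = sum(len(word) for word in words) / len(words)
    avg_sentence_length = len(words) / len(sentences)
    lexical_diversity = len(words) / len(set(words))
    return int(avg_word_length), int(avg_sentence_length), int(lexical_diversity)


if __name__ == "__main__":
    print(corpus_stats("i feel fine\ni feel so so good\n"))
```

Under this definition, the 26465 tokens and 3045 distinct tokens reported for the feelings sample give 26465/3045, roughly 8.7, which truncates to the lexical diversity of 8 shown for We Feel Fine.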
48. Text Mining: Simple Example
• Eliminate Stopwords (175 words - 'a', 'about', 'above', 'after', …)
– Set of tokens (12827) with stopwords eliminated ['ab', 'abit', 'able', 'abs',
'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished',
'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action',
'activities', 'activity', 'actually', 'acura', 'add', …]
– Content (11896 or 45% of tokens not stopwords – 4053 with tokens
starting with apostrophes and #s eliminated )
• Stemming
– Stemmed tokens (11896) ['abdomen', 'abdul', 'abil', 'abl', 'abrupt', 'absolut',
'abstract', 'academ', 'accept', 'accid', 'accomplish', 'accur', 'accus', 'accustom',
'achi', 'achiev', 'acknowledg', 'across', 'action', 'activ', …]
– Set of tokens in stemmed content(2283) ['abdomen', 'abdul', 'abil', 'abl', 'abrupt',
'absolut', 'abstract', 'academ', 'accept', 'accid', 'accomplish', 'accur', 'accus',
'accustom', 'achi', 'achiev',…]
49. Text Mining: Simple Example
51. Text Mining: Simple Example
Madness Murmurings Montage
Mobs Metrics Mounds
52. Prediction
Collective, macroscopic
trends which can be
scientifically inferred by
harnessing publicly
accessible data from
the Internet.
54. Prediction: Sources
Easily accessible digital traces:
What we surf
Whom we “friend”
What we say
Where we go
What we buy
How we play
57. Prediction: Infodemiology
Information + Epidemiology:
Science of distribution and
determinants of information
in an electronic medium,
specifically the Internet, or
in a population, with the
ultimate aim to inform public
health and public policy
Coined by Gunther Eysenbach, Univ. of Toronto
59. Prediction: Infodemiology
A Major Application - Practical
Regional, Weekly Syndromic Surveillance
60. Prediction: Infodemiology
An Alternative Approach
Text Mining of Worldwide Newswires, Web Sites
and Various Offline Reports
61. Prediction: Infodemiology
Utilizing Aggregate Search Data
Monitoring and analyzing
queries from Internet search
engines or people's status
updates on microblogs for
syndromic surveillance to
predict disease outbreaks
64. Prediction: Infodemiology
Utilizing Aggregate Search Data
Standard Linear Prediction Model:
Dependent Variable at Time t = b0
  + b1 (Dependent Variable at Time t - n)
  + b2 (Traditional, Publicly Available Explanatory Variable)
  + b3 (Aggregate Search Index or Social Media Freq. Count)
  + e
The dependent variable at both times is the Standard Publicly Available Measure.
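Fitting a model of this form by ordinary least squares can be sketched in plain Python. Everything named here is an assumption for illustration: `y` stands for the dependent variable, `x` for the traditional explanatory variable, `s` for the search or social media series, and the normal-equations solver is a teaching device, not a recommendation over a statistics package.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, targets):
    """Least squares via the normal equations; an intercept column is added here."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)


def build_rows(y, x, s, lag=1):
    """Pair each y[t] with its regressors (y[t - lag], x[t], s[t])."""
    rows = [(y[t - lag], x[t], s[t]) for t in range(lag, len(y))]
    return rows, y[lag:]


if __name__ == "__main__":
    # Synthetic, noise-free series used only to show that the fit recovers b0..b3.
    x = [(2 * t) % 5 for t in range(40)]
    s = [(3 * t) % 4 for t in range(40)]
    y = [1.0]
    for t in range(1, 40):
        y.append(1 + 0.5 * y[t - 1] + 2 * x[t] + 3 * s[t])
    rows, targets = build_rows(y, x, s)
    print([round(b, 6) for b in fit_ols(rows, targets)])
```

The question the studies that follow ask is whether b3 adds predictive value beyond the lagged measure and the traditional explanatory variable.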
65. Prediction: Infodemiology
Utilizing Aggregate Search Data
“Detecting Influenza Epidemics Using Search
Engine Query Data” (Ginsberg et al.), 2/19/09
• Aggregating historical logs of search queries
from 2003-2008, computing weekly time series
• Logit(P) = b0 + b1 * logit(Q) + e
– P – percentage of ILI physician visits
– Q – query fraction for the 45 highest-scoring influenza queries
• r is between .80-.96 for 9 regions
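The logit-linear form can be sketched as a simple least-squares line fitted on logit-transformed values. This is an illustration of the equation on the slide only, not the authors' estimation pipeline; the function names and the synthetic numbers are assumptions.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Map log-odds back to a proportion."""
    return 1 / (1 + math.exp(-z))


def fit_logit_line(q, p):
    """Fit logit(P) = b0 + b1 * logit(Q) by simple least squares."""
    xs = [logit(v) for v in q]
    ys = [logit(v) for v in p]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
          / sum((a - mean_x) ** 2 for a in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(q, b0, b1):
    """Predicted ILI visit percentage (as a proportion) for query fraction q."""
    return inv_logit(b0 + b1 * logit(q))


if __name__ == "__main__":
    # Synthetic query fractions and visit proportions, for illustration only.
    q = [0.001, 0.002, 0.004, 0.008]
    p = [inv_logit(-0.5 + 1.2 * logit(v)) for v in q]
    b0, b1 = fit_logit_line(q, p)
    print(round(b0, 4), round(b1, 4), round(predict(0.004, b0, b1), 6))
```

Working on the logit scale keeps predictions inside the 0 to 1 range, which a straight line fitted to raw proportions would not guarantee.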
71. Prediction: Infodemiology
Utilizing Tweets
“Nowcasting Events from the Social Web
with Statistical Learning,” Lampos and
Cristianini, ACM IS&T, 9/11
• Text analysis of 50M tweets for 3 regions of UK
from 6/09-4/10 (303 days)
• HPA weekly reports of GP consultations with ILI
diagnosis correlated with number of “hybrid
grams”
• Average “r” of .911
72. Prediction: Infodemiology
A Major Application – Text Analysis
• Corpus: 50M Tweets, 3 Region UK, 6/09-4/10
• Corpus Refinement: Tokens, Lower Case, Stop Words, Stems
• Feature Selection: 1-Grams, 2-Grams, Hybrid Grams, N-Gram Freqs
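Counting 1-grams and 2-grams over refined tokens can be sketched as below. The reading of "hybrid" as the pooled set of 1-grams and 2-grams is an assumption made for this example, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hybrid_gram_counts(tokens):
    """Frequencies over the pooled 1-grams and 2-grams (assumed meaning of 'hybrid')."""
    return Counter(ngrams(tokens, 1) + ngrams(tokens, 2))


if __name__ == "__main__":
    tokens = ["feel", "sick", "feel", "sick", "flu"]
    for gram, freq in hybrid_gram_counts(tokens).most_common():
        print(gram, freq)
```

Per-period frequencies of such n-grams are the kind of features that can then be correlated with the weekly consultation rates.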
73. Prediction: Infodemiology
Utilizing Tweets
Discarded when n<50
BoLasso - Bootstrap LASSO (least absolute shrinkage and selection operator)
76. Prediction: Nowcasting
Now + Forecasting:
Predicting the present
by analyzing large
volumes of data that
can be used to
"forecast" current
events for which
official analysis has
not been released
78. Prediction:
Sample Studies with Search
• Song, Pan, Ng (Apr-10)
– Dependent Variables: Weekly Hotel Bookings in Charleston, SC
– Explanatory Variables: Indexed Search Volumes from Google Trends/Insights, Jan 2008-Aug 2009
– Model: Log of Room Nights for Log of Search Volumes - Charleston, Travel Charleston, Charleston Hotels, Charleston Restaurants, Charleston Tourism
– Results: Test various statistical models; all gave reasonable forecasts. Best fit model was Autoregressive Distributed Lag (ADLM) with a lag period of 6 weeks.
• Kholodilin, Podstawski, Siliverstovs (Apr-10)
– Dependent Variables: Year-on-Year Growth Rate of Monthly US Real Private Consumption, ALFRED db of Fed Rsrv of St. Louis
– Explanatory Variables: 220 Google Trend/Insights Search terms related to Priv Consumption reduced to 10 principal components for monthly periods from Jan 2005 to Dec 2009
– Model: Y-o-Y monthly URPC growth rates for 3 sets of regressors -- Sentiment (consumer sentiment and confidence); Financial (short term and long term interest rates and S&P 500); Query (combinations of principal components of query terms)
– Results: Query term principal components outperform standard Sentiment and Financial Indicators. A combination of two of the factors works best -- those related to mobility and health care consumption.
• Choi, Varian (Apr-09)
– Dependent Variables: US Census Bureau Advance Monthly Retail Sales (general and specific) and Travel (Visitor arrival in Hong Kong)
– Explanatory Variables: Google Trend/Insight query indices for categories and subcategories related to retail sales (general and specific) and related to Travel
– Model: Google Trend indices for query subcategories related to (log values) of overall monthly retail trade (NAICS categories), automotive sales, home sales and travel.
– Results: Simple seasonal AR models and fixed-effects models that include relevant Google Trend variables tend to outperform models that exclude these variables. In some cases small gains, in others substantial.
• McLaren, Shanbhogue (Q2-11)
– Dependent Variables: Official monthly unemployment data and housing price growth in the UK from June 2004-Jan 2011
– Explanatory Variables: Google Trend/Insight query indexes for the term "Job Seekers Allowance (JSA)" for unemployment and "Estate Agents" for housing
– Model: For unemployment, linear AR model with query term, claimant count, and GfK consumer confid. as exp vars; for housing price growth with query term, Home Builders and Royal Instit. of Chartered Surveyors price growth balances as exp vars.
– Results: For unemployment forecasts, claimant count strongest followed by query term. For housing prices, the query term was much stronger than HBF and RICS data.
79. Prediction:
Sample Studies with Social Media
• Asur, Huberman (Mar-10)
– Dependent Variables: Box-office revenues for (24) movies
– Explanatory Variables: Promotion tweets-retweets for a particular movie, tweet rates for particular movie per hour, ratio of positive to negative sentiments for the movie
– Model: Regression of 1st weekend box office revenues by promotional tweets-retweets, by tweet rates vs. Hollywood Stock Exchange prices, and 2nd weekend revenues by tweet rates and the sentiment ratio.
– Results: Promotional tweets are weakly correlated with 1st weekend revs. Tweet rates are very strongly correlated (min .9) and a stronger predictor than HSX. Finally, tweet rates are strongly correlated with 2nd weekend revenues and sentiments improve the forecasts slightly.
• Gruhl, Guha, Kumar, Novak, Tomkins (Aug-05)
– Dependent Variables: Amazon Sales Rank for 2340 bestselling books in 4 month period (Jul 2004-Aug 2004) and spikes in these sales ranks
– Explanatory Variables: Number of mentions of the book/author in over 300K blogs whose postings were maintained by IBM's WebFountain project (over 200K postings/day)
– Model: Cross correlation of time series for sales rank and mentions.
– Results: While sales rank is a poor predictor of the change in sales rankings, a prior spike in mentions predicts quite well a future spike in sales rank.
• Sadikov, Parameswaran, Venetis (Aug-09)
– Dependent Variables: Movie critic ranking, user ranking, 2008 gross sales, weekly box office sales (weeks 1-5)
– Explanatory Variables: Basic features that count movie references in blogs, count movie references taking into account ranking and indegree of the blogs where they appear, consider only references made within a time window before or after a movie release date, features that consider positive sentiment; and combinations of these. References based on spinn3r.com blog data set 11/07-11/08
– Model: Linear regression for weekly rankings and sales data by blog references and sentiment.
– Results: Minimal correlation between rankings and references and sentiment. Strong correlation between references and gross sales but weak with sentiment. Strongest relationships with timing of references in weeks after release.
81. Prediction: Idiom, a Sculpture of
10s of 1000s of Books
82. Prediction: It comes in many
Shapes but not Sizes
Omphalos Book Cell
Matej Krén
Gravity Mixer
83. Prediction: Culturomics
Culture + Genomics:
Application of high-
throughput data
collection and analysis
to the study of human
culture.
84. Prediction: Culturomics
“Quantitative Analysis of Culture Using
Millions of Digitized Books,” Science, 12/16/10.
86. Prediction: Culturomics 2.0
Culturomics 2.0: Forecasting Large-Scale Human
Behavior Using Global News Media Tone in Time
and Space, Kalev Leetaru, 9/11
• The tone of real-time consciousness reflected in the media can
be used to forecast broad social behavior.
• Combined three massive news archives totaling more than 100
million articles worldwide to explore the global consciousness
of the news media.
• Employs a large shared-memory supercomputer (University of
Tennessee SGI Altix supercomputer Nautilus with 1024
processors and 4-TB of memory)
• Using the tone and location of the reports, (claims to have)
predicted the outcome of the Arab Spring and the location of Bin
Laden within radius of 125 miles
89. Prediction: Culturomics 2.0
Features of Stories or Tweets
• Tone/Positivity/Negativity. Ratio of + to - tone
(-100 to 100)
• Polarity. Emotional charge (0 to 100)
• Activity. Intensity of "active language" (0 to 100)
• Personalization. Degree to which the writer
attempts to bring the reader into the fold (0 to
100)
• Questions/Exclamations. Tweet tone indicators of
non-word items
• Geocoding. Location of story content
90. Prediction: Culturomics 2.0
Features of Stories or Tweets
• 100M Articles from the: New York Times (1945-05), Sum. of Wrld Brdcasts (1979-10), Google News articles (2006-11)
• Nautilus Supercomputer: Sentiment Mining, Geocoding, Entity Extraction
• Geocoding, Feature Scores
• 2.4 Petabyte Network with over 10M entities
92. Prediction: Culturomics 2.0
NY Times View of Tone
http://contentanalysis.ichass.illinois.edu/Culturomics20/nyt-movie-1000x1000.gif
93. Prediction: Culturomics 2.0
SWB View of Tone
http://contentanalysis.ichass.illinois.edu/Culturomics20/swb-movie-1000x1000.gif