Turning Unstructured Content into
Kernels of Ideas
Jason Kessler
Data Scientist, CDK Global
Data Day Seattle, 2016
@jasonkessler
www.jasonkessler.com
Code used in talk:
https://github.com/JasonKessler/scattertext
1. Case studies on mining documents for knowledge
2. Old, new, and obscure algorithms and tools
Customer-Written
Product Reviews
Good Ad Content
Naïve Approach: Indicators of Positive Sentiment
"If you ask a Subaru owner what they think of their car, more times than not
they'll tell you they love it,"
-Alan Bethke, director of marketing communications for Subaru of
America (via Adweek)
But does positivity
imply engagement?
Finding Engaging Content
Example Review Appearing on a 3rd Party Automotive Site
Car Reviewed: Chevy Cruze
Rating: 4.4/5 Stars
Text: …I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze…
# of users who read review: 20
# who went on to visit a Chevy dealer’s website: 15
Review Engagement Rate: 15/20 = 75%
Median Review Engagement Rate: 22%
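The engagement-rate arithmetic on this slide can be sketched in a few lines of Python. Only the Cruze example (20 readers, 15 dealer-site visitors) comes from the slide; the function name and the other (readers, visitors) pairs are hypothetical and exist purely for illustration.

```python
from statistics import median

def engagement_rate(readers, dealer_site_visitors):
    """Share of a review's readers who went on to visit a dealer's website."""
    if readers == 0:
        raise ValueError("a review with no readers has no engagement rate")
    return dealer_site_visitors / readers

# The Cruze review from the slide: 20 readers, 15 of whom visited a dealer site.
cruze_rate = engagement_rate(20, 15)

# Hypothetical (readers, visitors) pairs for other reviews, for illustration only.
other_reviews = [(40, 8), (10, 2), (50, 11)]
rates = [cruze_rate] + [engagement_rate(r, v) for r, v in other_reviews]

# The median is the baseline each review's rate is compared against.
median_rate = median(rates)
```

A review whose rate sits well above the median, like the Cruze example, is a candidate for "engaging" content.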
Are words and
phrases more
precise predictors?
Sentiment vs. Persuasiveness: SUV-Specific
Positive Sentiment | High Engagement
Love | Comfortable
Comfortable | Front [Seats]
Features | Acceleration
Solid | Free [Car Wash, Oil Change]
Amazing | Quiet
Negative Sentiment | Low Engagement
Transmission | Money [spend my, save]
Problem | Features
Issue | Dealership
Dealership | Amazing
Times | Build Quality [typically positive]
• The worst thing to say when reviewing an SUV:
“I saved money and got all
these amazing features!”
Milquetoast positivity
Persuasiveness
High Sentiment Terms
Love
Awesome
Fantastic
Handled
Perfect
Engagement Terms
Blind (spot, alert)
Contexts from high
engagement reviews
- “The techno safety features
(blind spot, lane alert, etc)
are reason for buying car...”
- “Side blind Zone Alert is
truly wonderful…”
- …
Can better science improve messaging?
Engagement Terms
Blind
White (paint, diamond)
Contexts
- “White with cornsilk interior.”
- “My wife fell in love with the
Equinox in White Diamond”
- “The white diamond paint is
to die for”
Can better science improve messaging?
Can better science improve messaging?
Engagement Terms
Blind
White
Climate (geography, a/c)
Contexts
- “Love the front wheel drive
in this northern Minn.
Climate”
- “We do live in a cold climate
(Ontario)”
- …climate control…
Just recently, VW has produced very similar
commercials.
Finding useful class-
differentiating terms
is complicated.
But simple methods
work well.
OKCupid: How do gender and ethnicity affect self-presentation on online dating profiles?
Christian Rudder: http://blog.okcupid.com/index.php/page/7/
Which words and phrases statistically distinguish ethnic groups and genders?
hobos
almond
butter 100 Years of
Solitude
Bikram yoga
Source: Christian Rudder. Dataclysm. 2014.
Ranking with everyone else
The smaller the distance from the top left, the
higher the association with white men
High distance: but
ignore K-Pop
Low distance: white guys
disproportionately like Phish
Source: Christian Rudder. Dataclysm. 2014.
my blue eyes
The Rudder Score:
$$\left\lVert \langle 1, 0 \rangle - \langle \Phi(\#(\text{word}, \neg\text{category})),\ \Phi(\#(\text{word}, \text{category})) \rangle \right\rVert_2$$
The percentile function Φ ranks counts along a scale of [0, 1].
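A minimal sketch of the Rudder score, under stated assumptions. It measures each term's Euclidean distance from the corner where the in-category percentile is 1 and the everyone-else percentile is 0 (the "top left" of the plot). The coordinate order, the dense-rank definition of Φ, the function names, and the toy counts are all assumptions made for illustration, not taken from Rudder or from scattertext.

```python
import math

def percentile_ranks(counts):
    """Map each term's count onto [0, 1] by dense rank (tied counts share a rank)."""
    distinct = sorted(set(counts.values()))
    if len(distinct) == 1:
        return {term: 0.0 for term in counts}
    position = {value: i / (len(distinct) - 1) for i, value in enumerate(distinct)}
    return {term: position[count] for term, count in counts.items()}

def rudder_scores(category_counts, other_counts):
    """Distance of each term from the ideal corner; smaller means more characteristic.

    Both dictionaries must contain the same terms.
    """
    cat_pct = percentile_ranks(category_counts)
    other_pct = percentile_ranks(other_counts)
    return {
        term: math.hypot(1.0 - cat_pct[term], 0.0 - other_pct[term])
        for term in category_counts
    }

# Hypothetical counts: usage inside the category vs. usage by everyone else.
category_counts = {"phish": 30, "the": 500, "kpop": 1}
other_counts = {"phish": 2, "the": 900, "kpop": 40}
scores = rudder_scores(category_counts, other_counts)
best_term = min(scores, key=scores.get)
```

In this toy vocabulary "the" is frequent everywhere, so it sits far from the corner even though its in-category percentile is the highest.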
Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
Words and phrases that distinguish white men.
OKCupid: How do ethnicities’ self-presentation differ
on a dating site?
Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
Words and phrases that distinguish Latino men.
Explanation
OKCupid: How do ethnicities’ self-presentation differ
on a dating site?
How can we make money with this?
• Genre of insurance or investment ads
– Montage of important events in the life of a person.
• With these phrase sets, the ads practically write themselves:
• What if you wanted to target Latino men?
– Grows up boxing
– Meets girlfriend salsa dancing
– Becomes a Marine
– Tells a joke at his wedding
– Etc…
Word clouds.
The good:
• Word clouds force you
to hunt for the most
impactful terms
• You end up examining
the long tail in the
process
• Compactly represent a
lot of phrases
Lyle Ungar 2013 AAAI Tutorial
Above: words found in Facebook status message predictive of women authors.
Lyle Ungar 2013 AAAI Tutorial
Above: words found in Facebook status message predictive of male authors.
The bad:
• “Mullets of the Internet”
--Jeffrey Zeldman
• Longer phrases are more prominent
• Ranking is unclear
• Does size indicate
higher frequency?
Word clouds done
right.
NYT: 2012 Political Convention Word Use by Party
Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
2012 Political Convention Word Use by Party
Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
Predicting a movie’s
revenue from its
reviews.
- Data:
- 1,718 movie reviews from 2005-2009, drawn from 7 different publications
(e.g., Austin Chronicle, NY Times, etc.)
- Various movie metadata like rating and director
- Gross revenue
- Task:
- Predict revenue from text, couched as a regression problem
- Regressor used: Elastic Net
- l1 and l2 penalized linear regression
- 2009 reviews were held-out as test data
- Features:
- Ngrams: unigrams, bigrams and trigrams
- Dependency relation triples: <dependent, relation, head>
- Versions of features labeled for each publication (i.e. domain)
- “Ent. Weekly: comedy_for”, “Variety: comedy_for”
- Essentially the same algo as Daume III (2007)
- Performed better than naïve baseline, but worse than metadata
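The per-publication feature copies listed above follow the feature-augmentation idea from Daumé III (2007): keep a shared version of every feature and add a copy tagged with its domain. A minimal sketch, where the function name and the "domain: feature" naming scheme are illustrative assumptions modeled on the slide's examples:

```python
def augment_with_domain(features, domain):
    """Return the shared features plus a domain-tagged copy of each one.

    features maps feature name -> value; domain is the publication name.
    """
    augmented = dict(features)  # shared versions, visible to every domain
    for name, value in features.items():
        augmented[f"{domain}: {name}"] = value  # domain-specific version
    return augmented

# A review from Variety that contains the bigram feature "comedy_for".
variety_features = augment_with_domain({"comedy_for": 1}, "Variety")
```

A linear model trained on the augmented features can learn one weight for "comedy_for" in general and a separate correction for how Variety uses it.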
Predicting Box-Office Revenue From Movie Reviews
(Joshi et al., 2010)
Joshi et al. Movie Reviews and Revenues: An Experiment in Text Regression. NAACL 2010
Daume III. Frustratingly Easy
Domain Adaptation. ACL 2007.
Predicting Box-Office Revenue From Movie Reviews
Joshi et al. Movie Reviews and Revenues: An Experiment in Text Regression. NAACL 2010
manually labeled feature
categories
Feature weight
(“Weight ($M)”) in
linear model
indicates how much
features are “worth”
in millions of dollars.
The learned
coefficients.
• The corpus used in Joshi et al. 2010 is freely available.
• Can we use the Rudder algorithm to find interesting associated
terms? How does it compare?
– Rudder algorithm requires two or more classes.
– We can partition the dataset into high and low revenue partitions.
• High being movies in the upper third of revenue, low in the bottom third
– Find words that are associated with high vs. low (throwing out the
middle third) and vice versa
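The partitioning step above can be sketched as follows. The function name and the movie titles and revenues are hypothetical; the only thing taken from the slide is the rule of keeping the top and bottom thirds and discarding the middle.

```python
def split_by_revenue_thirds(movies):
    """Split (title, revenue) pairs into high and low classes, dropping the middle third."""
    ranked = sorted(movies, key=lambda movie: movie[1])
    third = len(ranked) // 3
    low = ranked[:third]
    high = ranked[len(ranked) - third:]
    return high, low

# Hypothetical movies with revenue in millions of dollars.
movies = [("A", 1), ("B", 2), ("C", 3), ("D", 4), ("E", 5), ("F", 6)]
high, low = split_by_revenue_thirds(movies)
```

The two resulting groups play the role of the two classes that the Rudder algorithm needs.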
Univariate approach to predicting revenue from text
• Observation definition is really important!
– Recall that the same movie may have multiple reviews.
– We can treat an observation as
• a single review
• a single movie (concatenation of all reviews)
• word frequencies: # observations words appear in
– The response variable remains the same– movie revenue
Univariate approach to predicting revenue from text
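To make the observation choice concrete, here is a small sketch that counts how many observations each word appears in, once at the review level and once at the movie level. The function names, whitespace tokenization, and sample reviews are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def document_frequencies(documents):
    """Count how many documents each word appears in (not how often it occurs)."""
    frequencies = Counter()
    for text in documents:
        frequencies.update(set(text.lower().split()))
    return frequencies

def review_level(reviews):
    """Each review is its own observation."""
    return [text for _, text in reviews]

def movie_level(reviews):
    """All reviews of the same movie are concatenated into one observation."""
    by_movie = defaultdict(list)
    for movie, text in reviews:
        by_movie[movie].append(text)
    return [" ".join(texts) for texts in by_movie.values()]

# Hypothetical (movie, review text) pairs.
reviews = [
    ("Movie A", "batman returns"),
    ("Movie A", "batman again"),
    ("Movie B", "quiet family drama"),
]
review_df = document_frequencies(review_level(reviews))
movie_df = document_frequencies(movie_level(reviews))
```

A heavily reviewed movie can dominate review-level counts, while movie-level counts give each film one vote.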
Top 5 high revenue terms (Rudder algorithm)
Review-level observations | Movie-level observations
Batman | Computer generated
Borat | Superhero
Rodriguez | The franchise
Wahlberg | Comic book
Comic book | Popcorn
Univariate approach to predicting revenue-category
from text
Top 5 (all movies)
Computer generated
Superhero
The franchise
Comic book
Popcorn
Bottom 5 (all movies)
exclusively
[Phone number]
Festival
Tribeca
With English
Failed to produce term associations around
content ratings (e.g., “g-rated” or “adult”).
Rating is strongly correlated to revenue.
Let’s look exclusively at PG-13 movies.
Only PG-13-rated movies
Selected Top Terms
Franchise
Computer generated
Installment
The first two
The ultimate
Selected Bottom Terms
A friend
Her mother
Parent
One day
Siblings
Top terms are very similar. Franchises
and sequels are very successful.
Bottom terms are interesting!
Movies about friendship or family
dynamics don’t seem to perform well!
Idea generation tools can also be idea
rejection tools.
- Spiderman 15 >> PG-13 family
melodrama.
Corpus selection is important in getting
actionable, interpretable results!
Deep dive on term-
importance
algorithms
The code for these experiments is available: https://github.com/JasonKessler/scattertext/
• 138k words total. By modern standards, a pretty small corpus.
– 78k words spoken by Democrats, and 58k by Republicans
– The Brown Corpus (1961) is 1 million words
– But often you have only small amounts of data
• We can compare the word frequencies against a list of English
word frequencies.
• We’ll look at three methods to do this:
– Statistical: Fisher’s Exact Test
– Geometric: Rudder Scores
– Information retrieval: F-Score
What terms are characteristic of the 2012 conventions?
• 1. Hypothesis testing: comparing word and class frequencies in a 2x2
contingency table using Fisher’s Exact test.
What terms are characteristic of the 2012 conventions?
 | Convention Speeches | General English Frequencies
“Romney” count | 570 | 695,398
~“Romney” count | 137,966 - 570 | 369 billion - 695,398
Fisher’s Exact test:
Odds ratio: 3508.61
P: 0.0
- Numerical issues can cause p-values to be zero, wreaking havoc
on Bonferroni corrections.
- Rank by p-values, and break ties on odds ratios.
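One way to sidestep p-values that underflow to zero is to work with log p-values. The sketch below is a one-sided Fisher's exact test written with only the standard library. It is an illustrative assumption, not the implementation used in the talk. The "fight for" counts come from the slides, and the 78k and 58k totals are the rounded word counts quoted earlier, so the result is approximate.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, natural log of the p-value).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    log_total = log_choose(n, col1)
    # Hypergeometric log-probabilities of every table at least as extreme as observed.
    log_terms = [
        log_choose(row1, k) + log_choose(n - row1, col1 - k) - log_total
        for k in range(a, min(row1, col1) + 1)
    ]
    # Log-sum-exp keeps the sum finite even when each term underflows as a float.
    peak = max(log_terms)
    log_p = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return odds_ratio, log_p

# Rows: "fight for" vs. every other term. Columns: Democratic vs. Republican speeches.
odds_ratio, log_p = fisher_exact_greater(108, 7, 78_000 - 108, 58_000 - 7)
```

Sorting terms by log p-value and then by odds ratio gives the same ranking as the slide describes, without the ties created when many p-values round to 0.0.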
• Frequencies in convention
dataset are listed next to term
• Fisher is sorted by p-value, then
OR (mostly p = 0 here)
• Fisher’s test selects
characteristic terms, but includes
stop words
• The Rudder score selects for
terms not seen elsewhere, like
names, neologisms and foreign
words
What terms are characteristic of the 2012 conventions?
Top Fisher Terms Top Rudder Terms
obama (702) obamacare* (33)
barack (248) romneys* (8)
romney (570) yohana* (3)
mitt (501) nevadan* (2)
republican (228) solyndra* (2)
president (1049) breth* (2)
thank (366) outeducate* (2)
america (689) vouchercare* (2)
u (529) bendiga* (2)
american (653) gjelsvik* (2)
be (4298) carlette* (1)
we (2464) valora* (1)
* term was not found in background corpus
• Characteristic terms appear a lot in convention speeches
– In other words, have a high recall-- a high P(term | convention
speech)
– However, so do a lot of stop words
• If you encounter a characteristic term, there’s a good chance
you’re reading a convention speech
– Have a high precision– a high P(convention speech|term)
– However, this may be true for random low-frequency words
• Because of the two caveats, both probabilities need to be “high”.
• F-score = HarmonicMean(P(term|category), P(category|term))
Intuition behind F-Score
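The F-score definition above can be sketched directly from counts. The function names are illustrative assumptions. The 37:0 counts for "auto" appear later in the deck, and the 78k Democratic total is the rounded figure quoted in the talk.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0.0 when both are zero)."""
    return 2 * a * b / (a + b) if a + b else 0.0

def f_score(term_in_category, term_in_other, category_total):
    """F-score of a term for a category, from raw counts."""
    # Precision: P(category | term).
    precision = term_in_category / (term_in_category + term_in_other)
    # Recall: P(term | category).
    recall = term_in_category / category_total
    return harmonic_mean(precision, recall)

# "auto": 37 Democratic uses, 0 Republican uses, roughly 78k Democratic words.
auto_f = f_score(37, 0, 78_000)

# A frequent stop word with precision 0.57 and recall 0.04.
stop_word_f = harmonic_mean(0.57, 0.04)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a term must do reasonably well on both precision and recall to score highly.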
F-score looks pretty good here
Top Fisher Terms Top Rudder Terms Top F-score Terms
obama (702) obamacare* (33) obama (702)
barack (248) romneys* (8) romney (570)
romney (570) yohana* (3) mitt (501)
mitt (501) nevadan* (2) barack (248)
republican (228) solyndra* (2) republican (228)
president (1049) breth* (2) bless (132)
thank (366) outeducate* (2) president (1049)
america (689) vouchercare* (2) democrat (127)
u (529) bendiga* (2) elect (102)
american (653) gjelsvik* (2) biden (46)
be (4298) carlette* (1) governor (172)
we (2464) valora* (1) obamacare* (33)
A problem with F-
Score
Using F-Score to Find Differences between
Democratic and Republican Speakers
F-Score
the (3399:2531)
and (2707:2232)
to (2337:1664)
be (2326:1972)
a (1596:1346)
of (1565:1372)
that (1400:1051)
we (1318:1146)
• When class-label distributions are about equal, term
recalls become far lower than term precisions.
• P(the|dem) ~= P(the) ~= 0.04
• P(dem|the) ~= P(dem) ~= 0.57
• F-score = 0.073 (> 0.039)
• P(dem|auto) = 1.00
• P(auto|dem) ~= 0.0005
• F-score = 0.0009
• We can correct for this without having to mess with β
Top democratic speech predictors
(dem:rep frequencies)
A new modification to F-Score
Top democratic speech predictors (dem:rep frequencies)
F-Score | F-Score Scaled (NormCDF)
the (3399:2531) | fight for (108:7)
and (2707:2232) | auto (37:0)
to (2337:1664) | fair (47:3)
be (2326:1972) | insurance company (28:0)
a (1596:1346) | america forward (28:0)
of (1565:1372) | middle class (149:18)
that (1400:1051) | president barack (47:4)
we (1318:1146) | auto industry (24:0)
• Solution: normalize probabilities
Let $\Phi(x) = \mathrm{Percentile}(x)$ or $\mathrm{NormalCDF}(x, \mu_x, \sigma_x^2)$.
$$\text{F-score}_{\text{scaled}} = \frac{2\,\Phi(P(\text{term}\mid\text{category}))\,\Phi(P(\text{category}\mid\text{term}))}{\Phi(P(\text{term}\mid\text{category})) + \Phi(P(\text{category}\mid\text{term}))}$$
• Φ (P(dem|auto)) ~= 0.99 (avg. pctl)
• Φ (P(auto|dem)) ~= 0.66
• F-score = 0.78
• Φ (P(dem|the)) ~= 0.19
• Φ (P(the|dem)) ~= 1.0
• F-score = 0.33
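A minimal sketch of the scaling step, using the rank-percentile option for Φ rather than the normal CDF. The function names, the dense-rank tie handling, and the choice of a five-term vocabulary are assumptions. The dem:rep counts are taken from the slides, but a vocabulary this small cannot reproduce the full-corpus ranking: "auto" has the lowest recall here, whereas the long tail of rare terms in the real corpus puts its recall percentile near 0.66.

```python
def percentile(values):
    """Dense-rank each value onto [0, 1]; tied values share a rank."""
    distinct = sorted(set(values.values()))
    span = max(len(distinct) - 1, 1)
    rank = {value: i / span for i, value in enumerate(distinct)}
    return {term: rank[value] for term, value in values.items()}

def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0

def f_scores(counts, scaled):
    """counts maps term -> (count in category, count in the other category)."""
    category_total = sum(cat for cat, _ in counts.values())
    recall = {t: cat / category_total for t, (cat, _) in counts.items()}
    precision = {t: cat / (cat + other) for t, (cat, other) in counts.items()}
    if scaled:
        # Put both probabilities on a comparable [0, 1] scale before averaging.
        recall, precision = percentile(recall), percentile(precision)
    return {t: harmonic_mean(precision[t], recall[t]) for t in counts}

# dem:rep counts from the slides, restricted to five terms for illustration.
counts = {
    "the": (3399, 2531),
    "we": (1318, 1146),
    "middle class": (149, 18),
    "fight for": (108, 7),
    "auto": (37, 0),
}
plain = f_scores(counts, scaled=False)
scaled = f_scores(counts, scaled=True)
top_plain = max(plain, key=plain.get)
top_scaled = max(scaled, key=scaled.get)
```

Even in this toy setting the unscaled score is topped by a stop word, while the scaled score favors a term that is both reasonably frequent and strongly Democratic.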
I’ve written software to create the plots of term frequencies, similar to Rudder (2014).
This shows Democratic vs. Republican language, colored with Scaled F-Scores.
The code is available at https://github.com/JasonKessler/scattertext/
Colored with L1-penalized logistic regression coefficients
The code is available at https://github.com/JasonKessler/scattertext/
CDK Global’s
Language
Visualization
Tool
• Suppose you are selling a car to a typical person, how would you
describe the car’s performance?
• Should you say
– This car has 162 ft-lbs of torque.
– OR
– People who bought this car tell us it makes passing on two lane roads
easy.
• Having an idea generation (and rejection) tool makes this very
easy.
Informing dealer talk tracks.
• Corpus and document selection are important
– Documents: movie-level instead of review-level
– Corpus: rating-specific
• Use a variety of approaches
– Univariate and multivariate approaches can highlight different terms
• Visualize your data set– the pre-alpha text-to-ideas package can
let you do that. https://github.com/JasonKessler/text-to-ideas/
• More phrase context is better than less
• Add a narrative behind your analysis, no matter how speculative.
Recommendations
What you learn from
corpora is often
more important
than classification
accuracy.
• Thank you!
• Check out https://github.com/JasonKessler/scattertext/ for code
• $ pip install scattertext
• We’re hiring
– talk to me or, if you can’t, go to CDKJobs.com
– http://www.jasonkessler.com to contact me
• Special thanks to the CDK Global Data Science team, including
Michael Mabale (thoughts on word clouds), Iris Laband, Peter
Kahn, Michael Eggerling, Kyle Lo, Chris Mills, Phil
Burger, Dengyao Mo, and many others.
Acknowledgements
Questions?
• Data Scientist
• UI/UX Development
& Design
• Software Engineer –
all levels
• Product Manager
Is this
you?
• Find “Jobs by
Category”
• Click Technology
• Have your
Resume ready
• Click “Apply”!
Head to
CDKJobs.com
-or-
talk to me
@jasonkessler
Rank | F-Score-Percentile | Fisher | LogReg-L1 | LogReg-L2 | Rudder
1 | fight for (108:7) | president obama (351:78) | president obama (351:78) | obama (537:165) | auto (37:0)
2 | auto (37:0) | obama (537:165) | barack (202:46) | president obama (351:78) | insurance company (28:0)
3 | fair (47:3) | president (746:303) | cut (118:24) | president (746:303) | america forward (28:0)
4 | insurance company (28:0) | middle class (149:18) | worker (54:9) | for (1020:542) | auto industry (24:0)
5 | america forward (28:0) | fight for (108:7) | all of (61:23) | barack (202:46) | pell (23:0)
6 | middle class (149:18) | class (162:25) | charlotte (28:3) | forward (105:16) | wealthy (23:0)
7 | president barack (47:4) | middle (164:27) | today 's (12:3) | fight (165:31) | pell grant (23:0)
8 | auto industry (24:0) | barack (202:46) | so that (33:9) | fight for (108:7) | last week (22:0)
9 | pell grant (23:0) | fight (165:31) | forward (105:16) | class (162:25) | the wealthy (21:0)
10 | pell (23:0) | forward (105:16) | that ' (236:91) | barack obama (164:45) | romney say (21:0)
11 | wealthy (23:0) | for (1020:542) | make (299:200) | middle class (149:18) | millionaire (20:0)
The top words ranked by each method (dem:rep freqs)
This is a Rudder diagram colored with Rudder scores.
The code is available at https://github.com/JasonKessler/text-to-ideas/

 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 

Turning Unstructured Content into Kernels of Ideas

  • 1. Turning Unstructured Content into Kernels of Ideas Jason Kessler Data Scientist, CDK Global Data Day Seattle, 2016 @jasonkessler www.jasonkessler.com Code used in talk: https://github.com/JasonKessler/scattertext
  • 2. 1. Case studies on mining documents for knowledge 2. Old, new, and obscure algorithms and tools
  • 4. Naïve Approach: Indicators of Positive Sentiment "If you ask a Subaru owner what they think of their car, more times than not they'll tell you they love it," -Alan Bethke, director of marketing communications for Subaru of America (via Adweek)
  • 9. Finding Engaging Content
      Example review appearing on a 3rd-party automotive site
      Car reviewed: Chevy Cruze
      Rating: 4.4/5 stars
      Text: …I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze…
      # of users who read review: 20
      # who went on to visit a Chevy dealer’s website: 15
      Review engagement rate: 15/20 = 75%
      Median review engagement rate: 22%
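The engagement-rate arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function name and the extra (readers, visits) pairs are made up, and only the Cruze figures (20 readers, 15 dealer-site visits) come from the slide.

```python
from statistics import median

def engagement_rate(readers, dealer_visits):
    """Fraction of a review's readers who later visited a dealer's website."""
    if readers == 0:
        raise ValueError("a review with no readers has no defined engagement rate")
    return dealer_visits / readers

# The Cruze review from the slide: 20 readers, 15 dealer-site visits.
cruze_rate = engagement_rate(20, 15)

# Hypothetical (readers, visits) pairs for other reviews, invented for illustration.
other_reviews = [(40, 8), (10, 2), (50, 11)]
rates = [engagement_rate(r, v) for r, v in other_reviews] + [cruze_rate]

print(f"Cruze review: {cruze_rate:.0%}")
print(f"Median over this toy sample: {median(rates):.1%}")
```

Comparing one review's rate against the median is what makes 75% stand out on the slide, where the median is 22%.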
  • 10.
  • 11. Are words and phrases more precise predictors?
  • 14. Sentiment vs. Persuasiveness: SUV-Specific
      Positive sentiment: Love, Comfortable, Features, Solid, Amazing
      High engagement: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet
      Negative sentiment: Transmission, Problem, Issue, Dealership, Times
      Low engagement: Money [spend my, save], Features, Dealership, Amazing, Build Quality [typically positive]
  • 15. Sentiment vs. Persuasiveness: SUV-Specific
      Negative sentiment: Transmission, Problem, Issue, Dealership, Times
      Low engagement: Money [spend my, save], Features, Dealership, Amazing, Build Quality [typically positive]
      • The worst thing to say when reviewing an SUV: “I saved money and got all these amazing features!”
  • 18. Engagement terms: Blind (spot, alert)
      Contexts from high engagement reviews:
      - “The techno safety features (blind spot, lane alert, etc) are reason for buying car...”
      - “Side blind Zone Alert is truly wonderful…”
      - …
  • 19.
  • 20. Can better science improve messaging?
      Engagement terms: Blind; White (paint, diamond)
      Contexts:
      - “White with cornsilk interior.”
      - “My wife fell in love with the Equinox in White Diamond”
      - “The white diamond paint is to die for”
  • 21. Can better science improve messaging?
  • 22. Can better science improve messaging?
      Engagement terms: Blind; White; Climate (geography, a/c)
      Contexts:
      - “Love the front wheel drive in this northern Minn. Climate”
      - “We do live in a cold climate (Ontario)”
      - …climate control…
  • 23.
  • 24. Just recently, VW has produced very similar commercials.
  • 25. Finding useful class-differentiating terms is complicated. But simple methods work well.
  • 26. OKCupid: How do gender and ethnicity affect self-presentation on online dating profiles? Christian Rudder: http://blog.okcupid.com/index.php/page/7/ Which words and phrases statistically distinguish ethnic groups and genders? Examples: hobos, almond butter, 100 Years of Solitude, Bikram yoga
  • 27. Source: Christian Rudder. Dataclysm. 2014.
      Ranking with everyone else
      The smaller the distance from the top left, the higher the association with white men.
      High distance: but ignore K-Pop
      Low distance: white guys disproportionately like Phish
  • 28. Source: Christian Rudder. Dataclysm. 2014. my blue eyes
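  • 29. The Rudder Score: $\left\lVert \langle 0, 1 \rangle - \langle \Phi(\#(\text{word}, \neg\text{category})),\ \Phi(\#(\text{word}, \text{category})) \rangle \right\rVert_2$
      The percentile function $\Phi$ ranks counts along a scale of [0, 1]. The point $\langle 0, 1 \rangle$ is the top-left corner of the plot: rare everywhere else, common in the category. A smaller score means a stronger association with the category.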
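As a rough illustration of the Rudder score, the sketch below ranks each term's count inside and outside a category on a [0, 1] scale and measures the Euclidean distance to the top-left corner of the plot (out-of-category percentile 0, in-category percentile 1). The percentile definition used here (share of terms with a count no larger than this one) is an assumption, and the three terms and their counts are invented.

```python
import math
from bisect import bisect_right

def percentile_ranks(counts):
    """Map each count to the fraction of counts that are <= it, a value in (0, 1]."""
    ordered = sorted(counts)
    return [bisect_right(ordered, c) / len(counts) for c in counts]

def rudder_scores(in_category, out_of_category):
    """Distance of each term from the ideal point: percentile 1 inside, 0 outside."""
    phi_in = percentile_ranks(in_category)
    phi_out = percentile_ranks(out_of_category)
    return [math.hypot(0.0 - x, 1.0 - y) for x, y in zip(phi_out, phi_in)]

# Invented counts: one category-specific term, one foreign to it, one stop word.
terms = ["phish", "k-pop", "the"]
in_category = [50, 1, 1000]
out_of_category = [2, 80, 5000]

scores = dict(zip(terms, rudder_scores(in_category, out_of_category)))
for term, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{term}: {score:.3f}")
```

Lower scores come first, so the category-specific term leads the list and the stop word, frequent on both sides, trails it.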
  • 30. OKCupid: How does self-presentation differ across ethnicities on a dating site? Words and phrases that distinguish white men. Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
  • 31. OKCupid: How does self-presentation differ across ethnicities on a dating site? Words and phrases that distinguish Latino men. Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
  • 32. How can we make money with this?
      • Genre of insurance or investment ads
        - Montage of important events in the life of a person.
      • With these phrase sets, the ads practically write themselves.
      • What if you wanted to target Latino men?
        - Grows up boxing
        - Meets girlfriend salsa dancing
        - Becomes a Marine
        - Tells a joke at his wedding
        - Etc.
  • 34. Above: words found in Facebook status messages predictive of women authors. (Lyle Ungar, 2013 AAAI Tutorial)
      The good:
      • Word clouds force you to hunt for the most impactful terms
      • You end up examining the long tail in the process
      • Compactly represent a lot of phrases
  • 35. Above: words found in Facebook status messages predictive of male authors. (Lyle Ungar, 2013 AAAI Tutorial)
      The bad:
      • “Mullets of the Internet” (Jeffrey Zeldman)
      • Longer phrases are more prominent
      • Ranking is unclear
      • Does size indicate higher frequency?
  • 37. NYT: 2012 Political Convention Word Use by Party Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
  • 38. 2012 Political Convention Word Use by Party Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
  • 39. Predicting a movie’s revenue from its reviews.
  • 40. Predicting Box-Office Revenue From Movie Reviews (Joshi et al., 2010)
      - Data:
        - 1,718 movie reviews from 2005-2009, from 7 different publications (e.g., Austin Chronicle, NY Times, etc.)
        - Various movie metadata like rating and director
        - Gross revenue
      - Task:
        - Predict revenue from text, couched as a regression problem
        - Regressor used: Elastic Net (l1- and l2-penalized linear regression)
        - 2009 reviews were held out as test data
      - Features:
        - Ngrams: unigrams, bigrams and trigrams
        - Dependency relation triples: <dependent, relation, head>
        - Versions of features labeled for each publication (i.e., domain)
          - “Ent. Weekly: comedy_for”, “Variety: comedy_for”
          - Essentially the same algorithm as Daume III (2007)
      - Performed better than naïve baseline, but worse than metadata
      Joshi et al. Movie Reviews and Revenues: An Experiment in Text Regression. NAACL 2010.
      Daume III. Frustratingly Easy Domain Adaptation. ACL 2007.
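The publication-labeled features can be illustrated with the feature-augmentation trick from Daume III (2007): every feature keeps a shared copy and gains a copy prefixed with its domain, so a linear model can learn a general weight plus a per-publication adjustment. The sketch below is a minimal reading of that idea, not the code used by Joshi et al. The function name is hypothetical; the "Publication: feature" naming follows the examples on the slide.

```python
def augment_with_domain(features, domain):
    """Return the shared features plus a domain-labeled copy of each one."""
    augmented = dict(features)  # shared copies, used by every publication
    for name, value in features.items():
        augmented[f"{domain}: {name}"] = value  # publication-specific copy
    return augmented

# One review's n-gram counts, reviewed in two different publications.
review_features = {"comedy_for": 1, "the_franchise": 2}
variety = augment_with_domain(review_features, "Variety")
ent_weekly = augment_with_domain(review_features, "Ent. Weekly")

print(sorted(variety))
print(sorted(ent_weekly))
```

The augmented dictionaries can then be vectorized and passed to any penalized linear regressor; the l1 penalty tends to zero out publication-specific copies that add nothing beyond the shared weight.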
  • 41. Predicting Box-Office Revenue From Movie Reviews
      Feature weight (“Weight ($M)”), the learned coefficient in the linear model, indicates how much a feature is “worth” in millions of dollars. Feature categories were manually labeled.
      Joshi et al. Movie Reviews and Revenues: An Experiment in Text Regression. NAACL 2010.
  • 42. Univariate approach to predicting revenue from text
      • The corpus used in Joshi et al. 2010 is freely available.
      • Can we use the Rudder algorithm to find interesting associated terms? How does it compare?
        - The Rudder algorithm requires two or more classes.
        - We can partition the dataset into high- and low-revenue partitions.
          • High being movies in the upper third of revenue, low in the bottom third
        - Find words that are associated with high vs. low revenue (throwing out the middle third) and vice versa
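The high/low partition described on this slide might look like the following sketch: sort movies by revenue, keep the top and bottom thirds as the two classes, and discard the middle. The titles and revenue figures are placeholders, and how the original experiment handled ties or leftover movies is not stated in the talk.

```python
def split_high_low(movies):
    """Split (title, revenue) pairs into top and bottom thirds, dropping the middle."""
    ranked = sorted(movies, key=lambda movie: movie[1])  # ascending revenue
    third = len(ranked) // 3
    low = ranked[:third]
    high = ranked[len(ranked) - third:]
    return high, low

# Placeholder titles and revenues (in $M), invented for illustration.
movies = [
    ("movie_a", 1.0), ("movie_b", 2.0), ("movie_c", 3.0),
    ("movie_d", 4.0), ("movie_e", 5.0), ("movie_f", 6.0),
]
high, low = split_high_low(movies)
print("high:", [title for title, _ in high])
print("low:", [title for title, _ in low])
```

Dropping the middle third sharpens the contrast between the two classes at the cost of a third of the data.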
  • 45. Univariate approach to predicting revenue from text
      • Observation definition is really important!
        - Recall that the same movie may have multiple reviews.
        - We can treat an observation as
          • a single review
          • a single movie (concatenation of all reviews)
          • word frequencies: # observations words appear in
        - The response variable remains the same: movie revenue
      Top 5 high-revenue terms (Rudder algorithm)
      Review-level observations: Batman, Borat, Rodriguez, Wahlberg, Comic book
      Movie-level observations: Computer generated, Superhero, The franchise, Comic book, Popcorn
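To see why the observation definition matters, the sketch below counts how many observations each word appears in, first with one observation per review and then with one per movie (all of a movie's reviews concatenated). The reviews are invented and the whitespace tokenizer is a deliberate simplification.

```python
from collections import defaultdict

def document_frequency(observations):
    """Count, for each term, the number of observations it appears in."""
    frequency = defaultdict(int)
    for text in observations:
        for term in set(text.lower().split()):
            frequency[term] += 1
    return dict(frequency)

def movie_level(reviews):
    """Concatenate all reviews of the same movie into a single observation."""
    by_movie = defaultdict(list)
    for movie, text in reviews:
        by_movie[movie].append(text)
    return [" ".join(texts) for texts in by_movie.values()]

# Invented reviews: one heavily reviewed movie and one with a single review.
reviews = [
    ("movie_a", "batman returns again"),
    ("movie_a", "batman and more batman"),
    ("movie_b", "a quiet family drama"),
]
review_df = document_frequency(text for _, text in reviews)
movie_df = document_frequency(movie_level(reviews))
print("review-level:", review_df["batman"], "movie-level:", movie_df["batman"])
```

A title word from one heavily reviewed movie is counted once per review at the review level but only once at the movie level, which fits the slide's review-level list being dominated by names like Batman and Borat.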
  • 46. Univariate approach to predicting revenue-category from text
      Top 5 (all movies): Computer generated, Superhero, The franchise, Comic book, Popcorn
      Bottom 5 (all movies): exclusively, [Phone number], Festival, Tribeca, With English
      Failed to produce term associations around content ratings (e.g., “g-rated” or “adult”).
      Rating is strongly correlated with revenue.
      Let’s look exclusively at PG-13 movies.
  • 47. Only PG-13-rated movies
      Selected top terms: Franchise, Computer generated, Installment, The first two, The ultimate
      Selected bottom terms: A friend, Her mother, Parent, One day, Siblings
      Top terms are very similar. Franchises and sequels are very successful.
      Bottom terms are interesting! Movies about friendship or family dynamics don’t seem to perform well!
      Idea generation tools can also be idea rejection tools.
        - Spiderman 15 >> PG-13 family melodrama.
      Corpus selection is important in getting actionable, interpretable results!
  • 48. Deep dive on term-importance algorithms
  • 49. What terms are characteristic of the 2012 conventions?
      The code for these experiments is available: https://github.com/JasonKessler/scattertext/
      • 138k words total. By modern standards, a pretty small corpus.
        - 78k words spoken by Democrats, and 58k by Republicans
        - The Brown Corpus (1961) is 1 million words
        - But often you have small amounts of data
      • We can compare the word frequencies against a list of English word frequencies.
      • We’ll look at three methods to do this:
        - Statistical: Fisher’s Exact Test
        - Geometric: Rudder Scores
        - Information retrieval: F-Score
  • 50. What terms are characteristic of the 2012 conventions?
      1. Hypothesis testing: compare word and class frequencies in a 2x2 contingency table using Fisher’s Exact Test.
      “Romney” count: 570 (convention speeches), 695,398 (general English frequencies)
      Non-“Romney” count: 137,966 - 570 (convention speeches), 369 billion - 695,398 (general English frequencies)
      Fisher’s Exact Test: odds ratio 3508.61, p = 0.0
      - Numerical issues can cause p-values to be zero, wreaking havoc on Bonferroni corrections.
      - Rank by p-values, and break ties on odds ratios.
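For intuition about what the test computes, here is a small standard-library sketch of a one-sided Fisher's exact test on a 2x2 table, working in log space to avoid overflow. It is not the implementation used in the talk, and it is demonstrated on a tiny textbook-style table, not the convention counts. The function names are illustrative.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def odds_ratio(a, b, c, d):
    """Sample odds ratio of the table [[a, b], [c, d]]; infinite if b * c == 0."""
    return math.inf if b * c == 0 else (a * d) / (b * c)

def fisher_one_sided(a, b, c, d):
    """P(top-left cell >= a) under the hypergeometric null with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    log_denominator = log_comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        log_pmf = log_comb(row1, k) + log_comb(n - row1, col1 - k) - log_denominator
        p += math.exp(log_pmf)
    return min(p, 1.0)

# Tiny illustrative table: rows are term / not term, columns are corpus / background.
table = (3, 1, 1, 3)
print("odds ratio:", odds_ratio(*table))
print("one-sided p:", round(fisher_one_sided(*table), 4))
```

With counts as lopsided as the “Romney” table, individual terms of this sum underflow double precision, which is one way a reported p-value ends up as exactly 0.0.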
  • 51. What terms are characteristic of the 2012 conventions?
      • Frequencies in the convention dataset are listed next to each term
      • Fisher is sorted by p-value, then odds ratio (mostly p = 0 here)
      • Fisher’s test selects characteristic terms, but includes stop words
      • The Rudder score selects for terms not seen elsewhere, like names, neologisms and foreign words
      Top Fisher terms: obama (702), barack (248), romney (570), mitt (501), republican (228), president (1049), thank (366), america (689), u (529), american (653), be (4298), we (2464)
      Top Rudder terms: obamacare* (33), romneys* (8), yohana* (3), nevadan* (2), solyndra* (2), breth* (2), outeducate* (2), vouchercare* (2), bendiga* (2), gjelsvik* (2), carlette* (1), valora* (1)
      * term was not found in background corpus
  • 52. Intuition behind F-Score
      • Characteristic terms appear a lot in convention speeches
        - In other words, they have a high recall: a high P(term | convention speech)
        - However, so do a lot of stop words
      • If you encounter a characteristic term, there’s a good chance you’re reading a convention speech
        - They have a high precision: a high P(convention speech | term)
        - However, this may be true for random low-frequency words
      • Because of the two caveats, both probabilities need to be “high”.
      • F-score = HarmonicMean(P(term | category), P(category | term))
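The F-score above can be computed directly from three counts. The sketch below assumes recall is the term's share of all words in the category and precision is the category's share of all occurrences of the term. The function name and the example counts are illustrative only.

```python
def term_f_score(term_in_category, term_elsewhere, category_total):
    """Harmonic mean of P(term | category) and P(category | term)."""
    occurrences = term_in_category + term_elsewhere
    if category_total == 0 or occurrences == 0:
        return 0.0
    recall = term_in_category / category_total   # P(term | category)
    precision = term_in_category / occurrences   # P(category | term)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Illustrative counts: the term appears 50 times in a 1,000-word category
# and 50 times elsewhere, so recall is 0.05 and precision is 0.5.
example = term_f_score(50, 50, 1000)
print(round(example, 4))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a term needs both a reasonable recall and a reasonable precision to score well.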
  • 53. F-score looks pretty good here
      Top Fisher terms: obama (702), barack (248), romney (570), mitt (501), republican (228), president (1049), thank (366), america (689), u (529), american (653), be (4298), we (2464)
      Top Rudder terms: obamacare* (33), romneys* (8), yohana* (3), nevadan* (2), solyndra* (2), breth* (2), outeducate* (2), vouchercare* (2), bendiga* (2), gjelsvik* (2), carlette* (1), valora* (1)
      Top F-score terms: obama (702), romney (570), mitt (501), barack (248), republican (228), bless (132), president (1049), democrat (127), elect (102), biden (46), governor (172), obamacare* (33)
  • 54. A problem with F-Score
  • 57. Using F-Score to Find Differences between Democratic and Republican Speakers
      Top Democratic speech predictors by F-Score (dem:rep frequencies): the (3399:2531), and (2707:2232), to (2337:1664), be (2326:1972), a (1596:1346), of (1565:1372), that (1400:1051), we (1318:1146)
      • When class-label distributions are about equal, term recalls become far lower than term precisions.
      • P(the|dem) ~= P(the) ~= 0.04
      • P(dem|the) ~= P(dem) ~= 0.57
      • F-score = 0.073 (> 0.039)
      • P(dem|auto) = 1.00
      • P(auto|dem) ~= 0.0005
      • F-score = 0.0009
      • We can correct for this without having to mess with β
  • 59. A new modification to F-Score
      Top Democratic speech predictors (dem:rep frequencies)
      F-Score: the (3399:2531), and (2707:2232), to (2337:1664), be (2326:1972), a (1596:1346), of (1565:1372), that (1400:1051), we (1318:1146)
      F-Score Scaled (NormCDF): fight for (108:7), auto (37:0), fair (47:3), insurance company (28:0), america forward (28:0), middle class (149:18), president barack (47:4), auto industry (24:0)
      Let $\Phi(x) = \mathrm{Percentile}(x)$ or a $\mathrm{NormalCDF}(x, \mu_x, \sigma_x^2)$
      $\text{F-score}_{\text{scaled}} = \dfrac{2\,\Phi(P(\text{term} \mid \text{category}))\,\Phi(P(\text{category} \mid \text{term}))}{\Phi(P(\text{term} \mid \text{category})) + \Phi(P(\text{category} \mid \text{term}))}$
      • Solution: normalize probabilities
      • Φ(P(dem|auto)) ~= 0.99 (avg. pctl)
      • Φ(P(auto|dem)) ~= 0.66
      • F-score = 0.78
      • Φ(P(dem|the)) ~= 0.19
      • Φ(P(the|dem)) ~= 1.0
      • F-score = 0.33
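The scaling step can be sketched as follows, under the NormalCDF reading of Φ: each probability is replaced by its normal CDF value, using the mean and standard deviation of that probability across all terms, before the harmonic mean is taken. This is a toy reconstruction, not scattertext's implementation. It uses only five terms with counts taken from the slides, and it normalizes recall by the total of those five terms, not by the full corpus size, so the resulting numbers will not match the slide.

```python
import math
from statistics import mean, pstdev

def normal_cdf_scale(values):
    """Map values to (0, 1) via a normal CDF fitted to their mean and std."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.5 for _ in values]
    return [0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0)))) for v in values]

def harmonic_mean(a, b):
    return 0.0 if a + b == 0 else 2.0 * a * b / (a + b)

def f_scores(category_counts, other_counts, scaled):
    """Plain or scaled F-score per term, from in-category and out-of-category counts."""
    total = sum(category_counts)
    recall = [c / total for c in category_counts]  # P(term | category)
    precision = [c / (c + o) for c, o in zip(category_counts, other_counts)]  # P(category | term)
    if scaled:
        recall, precision = normal_cdf_scale(recall), normal_cdf_scale(precision)
    return [harmonic_mean(r, p) for r, p in zip(recall, precision)]

# dem:rep counts for five terms, taken from the slides.
terms = ["the", "and", "auto", "fight for", "middle class"]
dem = [3399, 2707, 37, 108, 149]
rep = [2531, 2232, 0, 7, 18]

plain = dict(zip(terms, f_scores(dem, rep, scaled=False)))
scaled = dict(zip(terms, f_scores(dem, rep, scaled=True)))
print("plain ranking: ", sorted(terms, key=plain.get, reverse=True))
print("scaled ranking:", sorted(terms, key=scaled.get, reverse=True))
```

Even on this tiny vocabulary the plain F-score ranks “the” above “auto”, while the scaled version reverses that order, which is the effect the slide describes.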
  • 60. I’ve written software to create the plots of term frequencies, similar to Rudder (2014). This shows Democratic vs. Republican language, colored with Scaled F-Scores. The code is available at https://github.com/JasonKessler/scattertext/
  • 61. Colored with L1-penalized logistic regression coefficients. The code is available at https://github.com/JasonKessler/scattertext/
  • 63.
  • 64.
  • 65. Informing dealer talk tracks.
      • Suppose you are selling a car to a typical person. How would you describe the car’s performance?
      • Should you say
        - This car has 162 ft-lbs of torque.
        - OR
        - People who bought this car tell us it makes passing on two-lane roads easy.
      • Having an idea generation (and rejection) tool makes this very easy.
  • 66. Recommendations
      • Corpus and document selection are important
        - Documents: movie-level instead of review-level
        - Corpus: rating-specific
      • Use a variety of approaches
        - Univariate and multivariate approaches can highlight different terms
      • Visualize your data set. The pre-alpha text-to-ideas package can let you do that: https://github.com/JasonKessler/text-to-ideas/
      • More phrase context is better than less
      • Add a narrative behind your analysis, no matter how speculative.
  • 67. What you learn from corpora is often more important than classification accuracy.
  • 68. Acknowledgements
      • Thank you!
      • Check out https://github.com/JasonKessler/scattertext/ for code
      • $ pip install scattertext
      • We’re hiring: talk to me or, if you can’t, go to CDKJobs.com
        - http://www.jasonkessler.com to contact me
      • Special thanks to the CDK Global Data Science team, including Michael Mabale (thoughts on word clouds), Iris Laband, Peter Kahn, Michael Eggerling, Kyle Lo, Chris Mills, Phil Burger, Dengyao Mo, and many others.
  • 69. Questions? • Data Scientist • UI/UX Development & Design • Software Engineer – all levels • Product Manager Is this you? • Find “Jobs by Category” • Click Technology • Have your Resume ready • Click “Apply”! Head to CDKJobs.com -or- talk to me @jasonkessler
  • 70. The top words ranked by each method (dem:rep freqs), ranks 1 to 11
      F-Score-Percentile: fight for (108:7), auto (37:0), fair (47:3), insurance company (28:0), america forward (28:0), middle class (149:18), president barack (47:4), auto industry (24:0), pell grant (23:0), pell (23:0), wealthy (23:0)
      Fisher: president obama (351:78), obama (537:165), president (746:303), middle class (149:18), fight for (108:7), class (162:25), middle (164:27), barack (202:46), fight (165:31), forward (105:16), for (1020:542)
      LogReg-L1: president obama (351:78), barack (202:46), cut (118:24), worker (54:9), all of (61:23), charlotte (28:3), today 's (12:3), so that (33:9), forward (105:16), that ' (236:91), make (299:200)
      LogReg-L2: obama (537:165), president obama (351:78), president (746:303), for (1020:542), barack (202:46), forward (105:16), fight (165:31), fight for (108:7), class (162:25), barack obama (164:45), middle class (149:18)
      Rudder: auto (37:0), insurance company (28:0), america forward (28:0), auto industry (24:0), pell (23:0), wealthy (23:0), pell grant (23:0), last week (22:0), the wealthy (21:0), romney say (21:0), millionaire (20:0)
  • 71. This is a Rudder diagram colored with Rudder scores. The code is available at https://github.com/JasonKessler/text-to-ideas/

Editor's Notes

  1. Source: http://www.adweek.com/news/advertising-branding/how-subaru-fell-love-and-never-looked-back-148475 "If you ask a Subaru owner what they think of their car, more times than not they'll tell you they love it," said Alan Bethke, director of marketing communications for Subaru of America. "It was always in front of us, but never utilized in the marketing."
  2. Break up slide
  3. Break up slide
  4. Break up slide
  5. Three and four star reviews may be more credible.
  6. Source: http://www.adweek.com/news/advertising-branding/how-subaru-fell-love-and-never-looked-back-148475
  7. Edit examples to ensure that blind spot alert system
  8. Source: http://www.adweek.com/news/advertising-branding/how-subaru-fell-love-and-never-looked-back-148475
  9. Excuse the poor photoshopping.
  10. Excuse the poor photoshopping.
  11. You’re left to your own devices to try and make sense of the differences.
  12. Theater specific terms talk about independent theaters
  13. This example uses percentiles
  14. This example uses percentiles