Presented at the September 2014 Sports Analytics Innovation Enterprise conference in San Francisco, the presentation covers fan engagement using text mining. Three teams were used to demonstrate basic text mining concepts applied to fan engagement. The analysis was performed on the Cleveland Indians, Los Angeles Dodgers and New York Yankees using the R statistics software.
Quantifying Fan Engagement using Social Media
1. SOCIAL MEDIA ANALYTICS TO QUANTIFY FAN ENGAGEMENT
DR. ROBERT BAKER
TED KWARTLER
Get a more complete profile of your fans to inform business decisions and improve ROI calculations.
2. AGENDA
Basics
Where are the fans?
Who are the fans?
What are fans talking about?
How do the fans feel towards the team?
What is the point of all this?
3. A FAN’S EXPERIENCE
If only there had been social media, the Yankees could have profiled my experience.
5. SOCIAL MEDIA ANALYTICS REQUIRES TEXT MINING
Before text mining. After text mining.
Text mining lets you “drink from a fire hose” of information and distill useful meaning.
6. WHAT IS TEXT MINING?
Text mining is an emerging technology that can be used to augment existing data by making unstructured text available for analysis and decision making.
Unstructured natural language texts (surveys, tweets, articles, emails, blogs, reviews, texts)
→ Apply standard and domain specific rules
→ Organized into Document Term Matrix (DTM) or Term Document Matrix (TDM)
→ Insight & Recommendation
7. EXAMPLE UNSTRUCTURED TEXT SOURCES
Many sources including emails, forum posts,
tweets, books, pdfs, reviews, transcripts etc.
Unstructured natural
language texts
杜兰特和詹姆斯谁才是当今联盟的头牌?这是最近很火热的话题。一方
面杜兰特高居得分榜首位,在MVP权力榜上也雄踞第一;另一方面詹姆
斯带领热火一切为了三连冠,比赛沉稳...
Had my first experience at TD Garden when my Bulls came to play the Celtics. Being someone with an out of
state license living in Boston, I usually carry my passport anyway, but I had a friend in town and wanted to clear up this ID
controversy I read so much about in the rules.
8. EXAMPLE PRE-PROCESSING STEPS
In a “bag of words” text mining methodology the corpus must be cleaned. Cleaning often means making items lower case and removing punctuation, numbers and extra whitespace. In some cases domain specific rules are applied (e.g. removing “RT” for retweet).
Apply standard and domain specific rules in R (or other software e.g. Python NLTK):
1. Make all text lower case
2. For Twitter, remove “RT” for retweet
3. Remove symbols like “@”
4. Remove punctuation
5. Remove numbers
6. Remove URLs e.g. http://www.espn.com
7. Remove extra whitespace
8. Remove “stopwords”
9. Others as needed depending on
Cleaned Version:
no doubt derek jeter makes
my top all time with babe lou
yankee clipper mick
杜兰特和詹姆斯谁才是当今联盟的头
牌?这是最近很火热的话题。一方面杜兰特
高居得分榜首位,在MVP权力榜上也雄踞第
一;另一方面詹姆斯带领热火一切为了三连
冠,比赛沉稳...
Translated Version:
Durant and James, who is the league's first
card today? This is a very hot topic recently.
On the one hand Durant highest scoring top
position in the standings MVP authority also
ranked first; on the other hand, James led the
Heat everything for three consecutive years,
the race calm ...
Cleaned Version:
durant james who league first card today very
hot topic recently on one hand durant highest
scoring top position standings mvp authority
ranked first other hand james led heat
everything three consecutive years race calm
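The cleaning steps on this slide can be sketched in Python (the deck itself used R). This is a minimal illustration under stated assumptions, not the presenters' script: the `clean_text` helper, the tiny stop word set and the sample tweet are all invented for the example. URLs are stripped before punctuation so that they are removed whole.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "for", "on", "at", "it"}


def clean_text(text, stopwords=STOPWORDS):
    """Apply the slide's cleaning rules to one document."""
    text = text.lower()                                   # 1. lower case
    text = re.sub(r"https?://\S+", " ", text)             # 6. URLs (before punctuation)
    text = re.sub(r"\brt\b", " ", text)                   # 2. "RT" retweet marker
    # 3. and 4. symbols such as "@" and all ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)                      # 5. numbers
    tokens = text.split()                                 # 7. split() drops extra whitespace
    return " ".join(t for t in tokens if t not in stopwords)  # 8. stop words


# Hypothetical tweet, made up for illustration.
sample = "RT @Yankees: No doubt, Derek Jeter is in my top 5! http://www.espn.com"
cleaned = clean_text(sample)
print(cleaned)
```

The order of the steps matters: removing punctuation first would leave URL fragments such as "httpwwwespncom" in the corpus.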
9. DATA ORGANIZATION
Once cleaned, the documents and terms are organized into large matrices.
These are often very sparse and may contain tens of thousands of data points.
Attributes may be single words or tokens of two or more words.
Organized into a Document Term Matrix or a Term Document Matrix.
Tweet_1: no doubt derek jeter makes my top all time with babe lou yankee clipper mick
Sina_1: durant james who league first card today very hot topic recently on one hand durant highest scoring top position standings mvp authority ranked first other hand james led heat everything three consecutive years race calm ...
Document Term Matrix
Document | no | doubt | derek | jeter | top | durant | james | termN
Tweet_1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
Sina_1 | 0 | 0 | 0 | 0 | 1 | 2 | 2 | 1
docN | … | … | … | … | … | … | … | …
Term Document Matrix
Term | Tweet_1 | Sina_1 | docN
no | 1 | 0 | …
doubt | 1 | 0 | …
jeter | 1 | 0 | …
top | 1 | 1 | …
termN | 0 | 1 | …
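As a rough sketch of how the two matrices relate, the following Python builds a document term matrix from the two cleaned examples and transposes it into a term document matrix. The function names are hypothetical and the matrices are plain dictionaries of lists; a real corpus would normally use a sparse matrix type.

```python
from collections import Counter

# The two cleaned documents from the slides.
docs = {
    "Tweet_1": "no doubt derek jeter makes my top all time with babe lou "
               "yankee clipper mick",
    "Sina_1": "durant james who league first card today very hot topic recently "
              "on one hand durant highest scoring top position standings mvp "
              "authority ranked first other hand james led heat everything three "
              "consecutive years race calm",
}


def document_term_matrix(docs):
    """Rows are documents, columns are terms, cells are term counts."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    terms = sorted(set().union(*counts.values()))
    dtm = {name: [c[t] for t in terms] for name, c in counts.items()}
    return terms, dtm


def term_document_matrix(terms, dtm):
    """Transpose of the DTM: rows are terms, columns are documents."""
    names = list(dtm)
    return {t: [dtm[n][i] for n in names] for i, t in enumerate(terms)}


terms, dtm = document_term_matrix(docs)
tdm = term_document_matrix(terms, dtm)
print(tdm["top"], tdm["durant"], tdm["jeter"])
```

Most cells are zero even with two documents, which is why these matrices become very sparse at scale.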
14.
Team | Total Followers | Sample | Bing API Geo-Located | Median Distance to Stadium
Dodgers | ~540K | First 10K | 2,854 | 1,372 miles
Indians | ~225K | First 10K | 3,774 | 319 miles
Yankees | ~1.18M | First 10K | 1,335 | 713 miles
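A median distance to the stadium can be computed from geo-located followers with the haversine great-circle formula. The sketch below is an assumption about how such a figure could be derived, not the presenters' method: the stadium coordinates are approximate, the follower coordinates are hypothetical, and the geocoding step (the Bing API in the deck) is assumed to have already happened.

```python
import math
import statistics

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def median_distance(stadium, followers):
    """Median distance from a stadium to a list of (lat, lon) follower points."""
    return statistics.median(haversine_miles(*stadium, *f) for f in followers)


# Approximate stadium location and hypothetical follower locations.
dodger_stadium = (34.07, -118.24)
followers = [(34.05, -118.24), (32.72, -117.16), (40.71, -74.01)]
print(round(median_distance(dodger_stadium, followers)), "miles")
```

The median is less sensitive than the mean to a few followers located on another continent.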
15. WHO ARE THE FANS?
COMMON DEMOGRAPHIC EXTRACTION
16. From Twitter locations to zip code then demographic data.
Sample of 3262 of 10k Followers Geo-located IDs
Zip | City | Population | Avg house value | Income below poverty | Total businesses | Total households
91766 | Pomona, CA | 71,599 | $142,800 | 15.4% | 803 |
93301 | Bakersfield, CA | 12,248 | $109,600 | 20.4% | 1,438 |
91606 | North Hollywood, CA | 44,958 | $170,100 | 15.4% | 622 | 14,903
WE CAN GET MORE GRANULAR.
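Going from follower locations to demographic data is essentially a lookup keyed on zip code. The sketch below assumes the followers have already been resolved to zip codes; the follower list and the `profile_followers` helper are hypothetical, and the demographic values are a subset of the table on this slide.

```python
from collections import Counter

# Subset of the slide's demographic table, keyed by zip code.
zip_demographics = {
    "91766": {"city": "Pomona, CA", "population": 71599, "avg_house_value": 142800},
    "93301": {"city": "Bakersfield, CA", "population": 12248, "avg_house_value": 109600},
    "91606": {"city": "North Hollywood, CA", "population": 44958, "avg_house_value": 170100},
}

# Hypothetical zip codes resolved from follower profiles; "00000" has no match.
follower_zips = ["91766", "91606", "91766", "93301", "00000"]


def profile_followers(zips, demographics):
    """Count followers per known zip code and attach that zip's demographics."""
    counts = Counter(z for z in zips if z in demographics)
    return [{"zip": z, "followers": n, **demographics[z]}
            for z, n in counts.most_common()]


profile = profile_followers(follower_zips, zip_demographics)
for row in profile:
    print(row)
```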
17. Sample of 3775 of 10k Followers Geo-located IDs
Zip | City | Population | Avg house value | Income below poverty | Total businesses | Total households
44107 | Lakewood, OH | 52,244 | $117,900 | 16.4% | 945 | 25,333
44139 | Solon, OH | 24,356 | $215,700 | 16.4% | 1,155 | 8,693
44304 | Akron, OH | 5,916 | $56,300 | 13.0% | 172 | 1,637
WE CAN GET MORE GRANULAR.
From Twitter locations to zip code then demographic data.
18. Sample of 1335 of 10k Followers Geo-located IDs
Zip | City | Population | Avg house value | Income below poverty | Total businesses | Total households
10462 | Bronx, NY | 75,784 | $192,600 | 27.9% | 1,002 | 29,855
14223 | Buffalo, NY | 22,665 | $85,700 | 13.9% | 328 | 9,832
75060 | Irving, TX | 45,980 | $83,300 | 17.2% | 503 |
WE CAN GET MORE GRANULAR.
From Twitter locations to zip code then demographic data.
19. FURTHER INSIGHTS OF ZIP 91766, POMONA CA
At the zip code and metropolitan area level there are many dimensions that may aid in fan segmentation and marketing.
• Ranked #1 Drought Riskiest Cities
• Ranked #15 Riskiest for Identity Theft
• Ranked #5 Most Irritation Prone City
• Ranked #8 Healthiest
• Ranked #13 Best City for Teleworking
• Ranked #6 Most Single City
Chart legends:
Population: White, Black, Hispanic, Asian, Hawaiian, Indian, Other
Gender: male, female
Households: total.households, house w/child
Immigration: Mexico, El Salvador, Philippines, Guatemala, Korea, China, Vietnam, Iran
Sources:
http://www.census.gov
http://emergency.cdc.gov/snaps/data/39/39153.htm
http://www.bestplaces.net/rankings/zip-code/ohio/akron/44304
20. FURTHER INSIGHTS OF ZIP 44304, AKRON OH
At the zip code and metropolitan area level there are many dimensions that may aid in fan segmentation and marketing.
• Ranked #1 Best City for Thanksgiving
• Ranked #4 Best Cities for Teleworking
• Ranked #25 America’s Best Cities for Dating
• Ranked #64 Most Popular City for the Holidays
• Ranked #73 America’s Most Stressful Cities
• Ranked #140 2005 Best Places to Live
Chart legends:
Population: White, Black, Asian, Hawaiian, Indian, Other
Gender: male, female
Households: total.households, house w/child
Immigration: India, Germany, Yugoslavia, UK, Italy, Canada, China, other
Sources:
http://www.census.gov
http://emergency.cdc.gov/snaps/data/39/39153.htm
http://www.bestplaces.net/rankings/zip-code/ohio/akron/44304
21. FURTHER INSIGHTS OF ZIP 10462, BRONX NY
At the zip code and metropolitan area level there are many dimensions that may aid in fan segmentation and marketing.
• Ranked #2 Least Crime for Large Metro Area
• Ranked #2 Sleepless Cities 2011
• Ranked #3 Most Single Cities
• Ranked #9 Most Irritation Prone Cities
• Ranked #14 Healthiest Cities
• Ranked #28 Most Playful Cities
Chart legends:
Population: White, Black, Hispanic, Asian, Hawaiian, Indian, Other
Gender: male, female
Households: total.households, house w/child
Immigration: Dominican, Jamaica, Mexico, Guyana, Ecuador, Caribbean, Honduras, Ghana
Sources:
http://www.census.gov
http://emergency.cdc.gov/snaps/data/39/39153.htm
http://www.bestplaces.net/rankings/zip-code/ohio/akron/44304
22. WHAT ARE THE FANS TALKING ABOUT?
INTERESTING TOPICS AND NAMED ENTITY RECOGNITION
23. 1.1K Tweets
• Free Twitter API
• Tweets mentioning “Indians”
• 7/31 & 8/1
• “Tokenize” single words into unique two word groups
• Trade mentions
• Masterson to Cardinals for Ramsey
• Cabrera to Nationals for Walters
• Throwback jerseys for KC Royals game
• Mariners game attendees 7/31
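Tokenizing into two word groups (bigrams) means pairing each word with the word that follows it. A minimal Python sketch, with hypothetical cleaned tweets standing in for the sampled data:

```python
from collections import Counter


def bigrams(text):
    """Return the overlapping two word groups in a cleaned document."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Hypothetical cleaned tweets, invented for illustration.
tweets = [
    "indians trade masterson to cardinals",
    "masterson to cardinals for ramsey",
    "indians wear throwback jerseys",
]

counts = Counter(bg for tweet in tweets for bg in bigrams(tweet))
print(counts.most_common(3))
```

Counting bigrams rather than single words is what lets a phrase such as "throwback jerseys" surface as one topic instead of two unrelated terms.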
24. DIFFERENCES OF WORD CLOUDS: SIMPLE WORD CLOUD, COMPARISON CLOUD, COMMON CLOUD AND POLARIZED CLOUD
[Diagrams of two texts, text1 and text2, illustrating each cloud type]
Simple Word Cloud
Commonality & Polarized Cloud
Comparison Cloud
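One way to read the difference between the cloud types: a simple cloud uses the word counts of a single text, a commonality cloud draws on words that both texts share, and a comparison cloud draws on words that set the texts apart. The sketch below is a simplification based on raw counts and set membership, not the algorithm of any particular plotting package; the function names and sample texts are hypothetical.

```python
from collections import Counter


def common_terms(text1, text2):
    """Words present in both texts, weighted by the smaller of the two counts."""
    c1, c2 = Counter(text1.split()), Counter(text2.split())
    return {t: min(c1[t], c2[t]) for t in c1.keys() & c2.keys()}


def distinct_terms(text1, text2):
    """Words that appear in only one of the two texts, with their counts."""
    c1, c2 = Counter(text1.split()), Counter(text2.split())
    only1 = {t: n for t, n in c1.items() if t not in c2}
    only2 = {t: n for t, n in c2.items() if t not in c1}
    return only1, only2


# Hypothetical cleaned posts about two teams.
text1 = "trade deadline red sox trade"
text2 = "trade deadline kershaw kimmel"

print(common_terms(text1, text2))
print(distinct_terms(text1, text2))
```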
25. 12K Tweets
• Includes a mix of free API access and paid full fire hose API access over 48 distinct hours
• Sampling occurred August 1 and August 13
• Tweets mentioning “Dodgers” most often discussed:
• Clayton Kershaw’s appearance on Jimmy Kimmel Live
• FCC Chairman’s letter to Time Warner CEO about the Dodgers’ TV channel
26. 2K Spanish Tweets
• Free Twitter API Spanish language search over 48 distinct hours
• Sampling occurred July 29 and August 12
• Spanish-language tweets mentioning “Dodgers” most often discussed:
• The AP story of Dan Haren beating the Braves
• Vin Scully retiring was a smaller topic, although present
Example:
Dodgers vencen a Bravos con 2 jonrones de Kemp http://t.co/9U7xiIPOdo #noticias
Dodgers beat Braves with 2 homers by Kemp http://t.co/9U7xiIPOdo #news
27. 235 Blogs: Treemap Sentiment
• July 29-July 31
• Grouping is by correlated topic modeling
• Color is sentiment
• Area is blog length
• Takeaways:
• Babe Ruth’s birthday is shared with Laurence Fishburne, born in Augusta, Georgia; this picked up blogs mentioning “birthdays on this date”
• Eli Manning wants to remember advice of Derek Jeter
• Pending trade deadline
• ESPNNewYork writer Wallace Matthews
• Game recaps
28. Dissimilar Words
• Full FB Firehose of public posts
• Sampling occurred
• Dodgers: July 29 – July 31
• Yankees: July 28 – July 31
• FB mentions of Dodgers and Yankees tagged as English
• Marketing posts about the red New York Yankees World Series edition fitted cap that Spike Lee requested
29. Words in Common
• Full FB Firehose of public posts
• Sampling occurred
• Dodgers: July 29 – July 31
• Yankees: July 28 – July 31
• FB mentions of Dodgers and Yankees tagged as English
• As expected, trades to improve the season as the deadline approached were mentioned for both teams
30. COMPARATIVE ANALYSIS – BIGRAMS IN COMMON
• Full FB Firehose of public posts
• Sampling occurred
• Dodgers: Jul 29 – Jul 31
• Yankees: Jul 28 – Jul 31
• FB mentions of Dodgers and Yankees tagged as English
[Chart label: “red sox”, Equal Mentions]
32. EXAMPLE POLARITY SCORING IN TWITTER
Zipf’s Law: there are many words in natural language, but there is a steep decline in everyday usage. Word frequency follows a predictable distribution.
[Chart: Top 100 Word Usage from 3M Tweets. Horizontal axis: word rank, 1 to 100. Vertical axis: usage count, 0 to 900,000.]
Top two words in English spoken language are “the” and “be”. Top two words in Twitter are “RT” and “I”. However, the power distribution is similar and follows Zipf’s law.
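In its idealized form, Zipf's law says a word's frequency is roughly proportional to 1/rank, so the second most common word appears about half as often as the first. The sketch below ranks tokens by frequency and places that idealized prediction next to the observed count; the `rank_frequency` helper and the toy token list are assumptions for illustration, not the 3M tweet data set.

```python
from collections import Counter


def rank_frequency(tokens, top_n=100):
    """Return (rank, word, observed count, idealized Zipf count) for the top words."""
    ranked = Counter(tokens).most_common(top_n)
    top_count = ranked[0][1]
    return [(rank, word, count, top_count / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]


# Toy corpus constructed to follow the 1/rank pattern exactly.
tokens = ["rt"] * 6 + ["i"] * 3 + ["the"] * 2

table = rank_frequency(tokens)
for rank, word, observed, predicted in table:
    print(rank, word, observed, predicted)
```

On real text the fit is only approximate, which is why the curve is usually inspected on a chart such as the one on this slide.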
33. SENTIMENT POLARITY ANALYSIS
Surprise is a sentiment.
Hit by a bus! – Negative polarity but surprising.
Won the lottery! – Positive polarity but still surprising.
Use the University of Pittsburgh’s MPQA Lexicon & Illocution Inc’s 10K top Twitter words.
Keyword scanning for polarity: the R script scans for 3546 positive words and 5701 negative words. It adds positive words and subtracts negative ones. The final score represents the polarity of the social interaction.
• I loathe the Tigers. -1
• I love Lou Whitaker. He was the best. +2
• I like the Tigers but dislike going to the stadium. 0
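The keyword scanning described here can be sketched in a few lines of Python (the deck's version was an R script). The two word sets below are tiny hypothetical stand-ins for the 3546 positive and 5701 negative lexicon entries, chosen only to cover the three example sentences.

```python
import re

# Hypothetical miniature lexicons; the real ones hold thousands of words.
POSITIVE = {"love", "best", "like"}
NEGATIVE = {"loathe", "dislike"}


def polarity(text, positive=POSITIVE, negative=NEGATIVE):
    """Add 1 for each positive word and subtract 1 for each negative word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)


examples = [
    "I loathe the Tigers.",
    "I love Lou Whitaker. He was the best.",
    "I like the Tigers but dislike going to the stadium.",
]
scores = [polarity(e) for e in examples]
print(scores)
```

Because the score only counts keywords, a mixed sentence such as the third example nets out to zero even though it contains a clear complaint.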
37. IN COMPARISON…
hey..yankees....can ya score some runs?!
indians activate murphy from disabled list http://t.co/bqliintwsf
dodgers rhp josh beckett won't return this season
Team | Tweets >= 1 | Tweets <= -1 | Total w/o 0 | % positive
Yankees | 280 | 406 | 686 | 41%
Indians | 290 | 456 | 746 | 39%
Dodgers | 448 | 1,226 | 1,674 | 27%
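The % positive column is the share of positive tweets among tweets with a non-zero polarity score; neutral tweets are left out of the denominator. A short check of that arithmetic in Python, using the counts from the table (the helper name is hypothetical):

```python
# (tweets scoring >= 1, tweets scoring <= -1) per team, from the table.
counts = {
    "Yankees": (280, 406),
    "Indians": (290, 456),
    "Dodgers": (448, 1226),
}


def percent_positive(positive, negative):
    """Positive tweets as a rounded percentage of all non-neutral tweets."""
    return round(100 * positive / (positive + negative))


shares = {team: percent_positive(*c) for team, c in counts.items()}
print(shares)
```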
38. WHAT IS THE POINT OF ALL THIS?
TARGETED MARKETING EFFORTS, EVANGELISTS, REFINED SEGMENTATION, MEDIA MIX MODELING
LEADING TO ROI
39. EXAMPLE IDENTIFY EVANGELISTS, INFLUENCERS & DETRACTORS
• When engaging on social media it
is important to note the clout of
followers in terms of status updates,
and followers
• Running sentiment analysis on
updates/posts adds context to the
voice of the customer
• Appending other data allows for
additional segmentation, and
differentiated customer
experiences e.g. my Yankee story
10K Indians Followers less 138 outliers
40. MEDIA MIX MODELING FOR SOCIAL MEDIA ROI
• In lieu of actual merchandise sales data and marketing spend, Amazon Sales Rank was tracked hourly from 4/1 to 8/31
• Relative measure of sales against other “Sports and Outdoors” category items
• Lower number is better
41. DODGER CAP AVERAGE HOURLY SALES RANK PER DAY
[Chart: average hourly Amazon sales rank per day, 1-Apr to 26-Aug. Vertical axis: 0 to 4,500.]
Seen as a time series, Amazon sales rank is not stationary. Overall the Dodger cap’s sales rank has an increasing trend despite the team being successful on the field, and it has some periodicity based on day of week.
42. Time Series Decomposition
• Econometric forecasting time series decomposition (TSD) was used in an attempt to isolate social media impact and understand sales rank patterns
• Trend is likely the impact of baseball season excitement, then waning as attention moves to other sports
• Seasonal may be the impact of retail day-of-the-week cycles
• Random is left as the dependent variable in the media mix GLM
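A classical additive decomposition splits a series into trend, seasonal and random parts so that series = trend + seasonal + random. The sketch below is one simple way to do that for daily data with a weekly cycle; it is an assumption about the general technique, not the presenters' R code, and the synthetic series is made up. It uses a centered moving average, so it expects an odd period such as 7.

```python
def decompose(series, period=7):
    """Additive decomposition using a centered moving average (odd period only)."""
    n, half = len(series), period // 2
    # Trend: centered moving average; undefined at the first and last `half` points.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = sum(series[i - half:i + half + 1]) / period
    # Seasonal: average detrended value for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    means = [sum(b) / len(b) for b in buckets]
    center = sum(means) / period  # shift so the seasonal effects sum to zero
    seasonal = [means[i % period] - center for i in range(n)]
    # Random: whatever trend and seasonal do not explain.
    random = [None if trend[i] is None else series[i] - trend[i] - seasonal[i]
              for i in range(n)]
    return trend, seasonal, random


# Synthetic daily sales rank: linear trend plus a weekly pattern that sums to zero.
weekly = [3, -1, -2, 0, 1, -3, 2]
series = [100 + 2 * t + weekly[t % 7] for t in range(28)]

trend, seasonal, random = decompose(series)
print(trend[3], seasonal[:7], random[3])
```

With real sales rank data the random component would not vanish as it does here; that leftover is the part the deck carries forward into the media mix model.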
43. Tweets to Decomposed Amazon Sales Rank
• Correlation is only -0.08.
• Given that the tweets are examined against ‘random’ or unexplained data, the relationship may still be relevant.
• As this is proxy data for sales of a single item, results are not conclusive.
[Scatter plot. Horizontal axis: 0 to 100. Vertical axis: -1,000 to 1,000.]
*removed dates with missing data
44. Tweets to Average Daily Amazon Sales Rank
• Much stronger correlation: -0.24
• Leads one to believe the more a team tweets, the lower the sales rank
• As this is proxy data for sales of a single item, results are not conclusive
[Scatter plot. Horizontal axis: 0 to 100. Vertical axis: 0 to 4,500.]
*removed dates with missing data
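The correlations quoted on these two slides are presumably Pearson correlation coefficients between daily tweet counts and sales rank; that is an assumption, as the deck does not name the measure. A minimal Python sketch with hypothetical daily values:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: tweets per day and average daily sales rank.
tweets_per_day = [5, 12, 8, 20, 15]
daily_rank = [3200, 2900, 3100, 2500, 2800]

r = pearson(tweets_per_day, daily_rank)
print(round(r, 2))
```

A negative value means that days with more tweets tend to have a lower sales rank, and a lower rank is the better outcome.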
45. Media mix modeling
*removed dates with missing data
• Given the likely relationship:
• Set up a GLM using marketing efforts and media spend as inputs, with the dependent variable being revenue, ticket sales, merchandise sales etc.
• The coefficients of the inputs illustrate the impact of each channel’s marketing spend, leading you to ROI
Example:
$$f(\text{sales}) = \beta_0 + \beta_1\,\text{social.media.spend} + \beta_2\,\text{traditional.mktg.spend} + \beta_3\,\text{team.performance} + \dots + \beta_n + \epsilon$$
The goal is increased model lift and accuracy by incorporating social media spend. The coefficient of the variable demonstrates the impact. This will allow you to calculate an ROI of social spend.
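The simplest case of such a model is ordinary least squares, which is a GLM with a normal error distribution and an identity link. The sketch below fits one with the normal equations in plain Python. The data are synthetic and constructed so that the true coefficients are known (10, 3 and 1.5); the helper names and the two input channels are assumptions for illustration, and a real analysis would use a statistics package and report standard errors.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [intercept, coefficient per input] from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic spend per period: (social media spend, traditional marketing spend).
spend = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
sales = [10 + 3 * s + 1.5 * t for s, t in spend]

beta0, beta_social, beta_traditional = fit_ols(spend, sales)
print(round(beta0, 3), round(beta_social, 3), round(beta_traditional, 3))
```

Here the social media coefficient reads as the extra sales associated with one more unit of social spend, holding the other input fixed, which is the quantity an ROI calculation would build on.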
46. FURTHER INFO
Want example R scripts for the visuals?
www.sportsanalytics.org starting 9/15
Editor's notes
Misses Amplifiers, negations and emoticons
Missed example:
Wicked good.
Don’t have cancer.
Twitter is unique in frequency e.g. “I” is the #2 most frequent word yet in all other English usages ranks #10, RT =#1, the=#1
Twitter=not English! “smh” , “jk”, “lmao” & “gr8” are words in twitter
~3M Tweets
April 2013 to July 2013
~600K unigrams
Example done in R using Illocution Inc Twitter Data to create unigram lexicon appended to University of Illinois at Chicago positive/negative lexicon
Lexicon from top 10K words scored as positive, neutral, negative
Multi-Perspective Question Answering (MPQA) subjectivity lexicon from the University of Pittsburgh
Added Twitter peculiarities from Illocution Inc’s Twitter frequency analysis
Total is
5701 negative words
3546 positive words