Applying machine learning and natural language processing tools to better predict article virality for BuzzFeed: a data science capstone project.
WHAT IS BUZZFEED?
“BuzzFeed is a cross-platform, global network for news and entertainment that generates seven billion views each month. BuzzFeed creates and distributes content for a global audience and utilizes proprietary technology to continuously test, learn and optimize.” (BuzzFeed website)
● More than 7 billion monthly global content views
● More than 200M monthly unique visitors to BuzzFeed.com
● 11 international editions including US, UK, Germany, Español, France, Spain, India, Canada, Mexico, Brazil, Australia and Japan
(BuzzFeed website)
PROBLEM
➔ There is good money to be made from consistently generating popular content on the internet.
➔ A significant portion (20%-30%) of BuzzFeed’s articles generate very little traffic.
Hypothesis
We believe there may be a correlation between the language associated with an article (title, description, tags, etc.) and how likely it is to go viral.
We also believe that this likelihood varies with the country in which an article is read.
WHY DOES IT MATTER?
➔ BuzzFeed could hypothetically save money and improve user experience by informing content decisions with the topics that consistently draw readership.
OUR APPROACH
➔ Visualization to help identify underlying themes in a given dataset through three lenses: the title, the content of the article itself, or the tags ascribed to it by the author.
➔ Title Generator to suggest topics and themes based on recent social media trends, guiding the editorial staff toward content likely to generate significant online traffic.
➔ Given a sufficient number of articles and trending topics, we believe the output of a reasonable title generator can be fed into a predictor to help assess its potential virality.
INGESTION
“You need to start pulling
data, like, now.”
- Ben Bengfort, 1st Day of Class
➔ The project required us to gather data from 5 separate public APIs
➔ Before anything else, it was necessary to automate the process of querying the APIs
➔ Set up an Ubuntu instance on Amazon Web Services’ Elastic Compute Cloud (EC2)
➔ Ran a Python script hourly (via crontab) to capture .json files on a server-side WORM store -- 5 calls/hour, one each for Australia, Canada, India, the UK and the US (a minimal sketch of the job follows below)
Data collection began: May 18, 2016. Data collection ended: Aug 31, 2016.
Total raw data size in WORM: 1.16 GB. Number of records pulled: 330,000
(25 articles/hr for each of 5 countries for ~100 days)
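A minimal sketch of the hourly pull; the feed URL, edition codes and output directory here are illustrative assumptions, not the exact values the project used:

import json
import time
from pathlib import Path

import requests

EDITIONS = ["en-au", "en-ca", "en-in", "en-uk", "en-us"]   # assumed edition codes
API_URL = "https://www.buzzfeed.com/api/v2/feeds/index"    # placeholder endpoint
OUT_DIR = Path("/data/worm")                               # write-once (WORM) store

def pull_once() -> None:
    """One run = 5 API calls, one per country, each saved as a timestamped .json file."""
    stamp = time.strftime("%Y%m%d%H")
    for edition in EDITIONS:
        resp = requests.get(API_URL, params={"country": edition}, timeout=30)
        resp.raise_for_status()
        (OUT_DIR / f"{edition}_{stamp}.json").write_text(json.dumps(resp.json()))

if __name__ == "__main__":
    # crontab entry (assumed): 0 * * * * /usr/bin/python3 /home/ubuntu/ingest.py
    pull_once()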
WRANGLING
➔ Clean Raw Data
◆ Remove tags, images and other content outside the scope of our analysis
◆ Used the insight from this step to drop irrelevant variables and identify gaps that could be accounted for
➔ Understand Target Variable (Measure of Virality)
◆ Added a frequency column to capture how long each article was “persisting”, as a measure of virality
◆ Understand the accuracy and applicability of the Number of Impressions provided in the data
➔ Capture all instances, features and target variables in a Postgres table to use downstream in the pipeline (a loading sketch follows below)
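A minimal wrangling sketch: flatten the hourly snapshots, derive frequency (how many hourly snapshots an article persisted in) and load the result into Postgres. Field names such as "buzzes" and "impressions", the connection string, and the table name are illustrative assumptions:

import json
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

rows = []
for path in Path("/data/worm").glob("*.json"):
    edition, _ = path.stem.split("_", 1)
    for buzz in json.loads(path.read_text()).get("buzzes", []):   # assumed field name
        rows.append({
            "id": buzz.get("id"),
            "country": edition,
            "title": buzz.get("title"),
            "description": buzz.get("description"),
            "impressions": buzz.get("impressions"),
        })

df = pd.DataFrame(rows)
# Frequency = number of hourly snapshots in which an article appears for a country.
df["frequency"] = df.groupby(["id", "country"])["id"].transform("count")
df = df.drop_duplicates(subset=["id", "country"])

# Persist instances, features and target inputs for the downstream pipeline steps.
engine = create_engine("postgresql://user:password@localhost:5432/buzzfeed")
df.to_sql("articles", engine, if_exists="replace", index=False)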
WHAT DOES THE DATA LOOK LIKE?
[Chart: breakdown of collected records by country. Labels: Australia, Canada, India, UK, US. Values: 9%, 5%, 7%, 17%, 62%]
ANALYSIS
➔ Word Clouds
◆ What terms “jump out”? (example sketch below)
➔ Natural Language Toolkit
◆ What sorts of analysis can we run on our textual data?
➔ scikit-learn
◆ What can machine learning models help us predict?
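A minimal word-cloud sketch; the third-party wordcloud package is an assumption (the slides do not name the library used), and the titles below are illustrative stand-ins for the full corpus:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

titles = " ".join([
    "27 Times Mindy Kaling Was Just Too Relatable On Twitter",
    "Which Game Of Thrones Character Are You",
])
cloud = WordCloud(background_color="white", max_words=100).generate(titles)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()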
TOP TERMS
Tags (top 10 by country):

Rank  Australia  Canada    India      UK         US
1     game       canada    social     quiz       test
2     thrones    canadian  news       british    quiz
3     australia  news      india      uk         food
4     season     social    bollywood  food       recipes
5     6          quiz      indian     trivia     you
6     fan        animals   twitter    twitter    funny
7     twitter    twitter   desi       you        news
8     quiz       funny     khan       funny      social
9     stark      lol       stories    celebrity  summer
10    hot        food      women      00s        music
● The United States, United Kingdom, and Canada share the most similar top tags (as well as titles), while Australia and India have more distinct preferences.
● Articles about Game of Thrones - and television in general - fare better on BuzzFeed Australia.
● “Women/woman” only appears on the top list for India, perhaps reflective of its readership.
● Twitter does well across all five groups - evidence of the popularity of listicles (“27 Times Mindy Kaling Was Just Too Relatable On Twitter”).
TITLE GENERATOR
➔ Generated a corpus of all the unique titles from the API pulls
➔ Natural Language Toolkit: Trigram Collocation Finder & Trigram Association Metrics
➔ Grabbed the most likely subsequent words using likelihood ratios (sketched below)
➔ Introduced minor stochasticity to prevent it from always producing the same titles
➔ Notable Examples:
◆ “Canada Goose Is Most Calories”
◆ “You More Hilary Duff or Lohan?”
◆ “What Game of Thrones Fan if You Guess We Thrones”
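A minimal sketch of the trigram approach, assuming the unique titles have been written to a titles.txt corpus file; the file name and the seed bigram are illustrative:

import random

from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

# One title per line; simple whitespace tokenization keeps the sketch dependency-free.
with open("titles.txt") as f:
    tokens = f.read().lower().split()

# Score every trigram in the corpus by its likelihood ratio.
finder = TrigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(TrigramAssocMeasures.likelihood_ratio)

# Index trigrams by their first two words so titles can be chained word by word.
successors = {}
for (w1, w2, w3), score in scored:
    successors.setdefault((w1, w2), []).append((score, w3))

def generate_title(seed, max_words=8):
    """Chain likely next words from a seed bigram, with minor stochasticity."""
    words = list(seed)
    while len(words) < max_words:
        candidates = sorted(successors.get(tuple(words[-2:]), []), reverse=True)[:5]
        if not candidates:
            break
        # Pick randomly among the top-scoring candidates so repeated calls vary.
        words.append(random.choice(candidates)[1])
    return " ".join(words).title()

print(generate_title(("you", "more")))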
FEATURE SELECTION: WHAT FEATURES ARE THE MOST TELLING? - HYPOTHESIS
CATEGORY: SOME SIGNAL
There are 140+ categories on BuzzFeed. Is there a relationship between the categories and virality?
METAVALUE: TOO BROAD - NO SIGNAL
How many keywords are there? What is the relationship between virality and certain keywords?
➔ Each “Buzz” had 36 data points
◆ Some of these data points were standardized
◆ Some of them were not
➔ A significant number of these data points did not contain any signal
➔ Other than category, the only fields that contained signal held text/words from the article itself:
◆ Description, Title, Primary Keywords
◆ Tags, containing phrases and words
TARGET
MEASURE OF VIRALITY
IMPRESSIONS
Number of times an article is viewed
FREQUENCY
Number of hours an article stays
on a country’s BuzzFeed page.
➔ Impressions: An inaccurate and aggregated measure in the snapshot
➔ Frequency: Another measure, but not always aligned with the corresponding impressions provided in the instance
➔ Some function f(Impressions, Frequency) worked
➔ Needed to use that function to identify classes
➔ Log transformation to account for wide variability and a skewed distribution, as follows (sketched in code below):
Virality = log(Impressions * Frequency)
Non-Viral: Virality < mean - standard deviation
Viral: Virality >= mean - standard deviation
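A minimal sketch of the target construction, assuming a DataFrame with impressions and frequency columns (the column names are illustrative):

import numpy as np
import pandas as pd

def label_virality(df: pd.DataFrame) -> pd.DataFrame:
    """Log-transform impressions * frequency, then threshold one std below the mean."""
    out = df.copy()
    # Log transform tames the wide, right-skewed distribution.
    out["virality"] = np.log(out["impressions"] * out["frequency"])
    threshold = out["virality"].mean() - out["virality"].std()
    # Viral = 1 when virality >= mean - standard deviation, else 0 (non-viral).
    out["viral"] = (out["virality"] >= threshold).astype(int)
    return out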
FEATURE ENGINEERING
ATTEMPTED OBVIOUS ONES
STOP WORDS OR COMMON WORDS COULD HAVE HELPED
➔ Title Length: Fairly constant and not a good indicator.
➔ Lists vs. Non-Lists: Contrary to our hypothesis, no such correlation in the data.
➔ Words in tags: To retain the context in the tags, we used individual phrases as provided (simulated n-grams) and individual words (1-grams); see the sketch below.
➔ Low Document Frequency: No positive impact on predictability.
➔ High Document Frequency: Negative impact on the model’s predictability.
➔ Stop Words OR Common Words: Did not attempt this due to time constraints.
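A sketch of the two tag treatments described above; the example tag lists and the 10% document-frequency cut-offs are illustrative:

from sklearn.feature_extraction.text import CountVectorizer

article_tags = [
    ["game of thrones", "season 6", "quiz"],
    ["quiz", "food", "recipes"],
]

# (a) Whole phrases as provided ("simulated n-grams"): each tag string is one feature.
phrase_vect = CountVectorizer(analyzer=lambda tags: tags)
phrase_vect.fit(article_tags)
print(sorted(phrase_vect.vocabulary_))

# (b) Individual words (1-grams), trimmed by minimum/maximum document frequency.
word_vect = CountVectorizer(ngram_range=(1, 1), min_df=0.1, max_df=0.9)
word_vect.fit([" ".join(tags) for tags in article_tags])
print(sorted(word_vect.vocabulary_))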
MODELING WITH SCIKIT-LEARN
Multinomial Naive Bayes and Logistic Regression:
Feature Selection: For each instance, we used all the text contained in Title, Description, Category, Primary Keywords and Phrases in Tags.
Document Frequency: Maximum and minimum document frequency, in increments of 10%... no impact.
vect = CountVectorizer()
Output number of features in vect: 70,000+ features
Model Selection: For both models, we did 12-fold cross-validation as follows:
skf = StratifiedKFold(n_splits=12, shuffle=True)
for train, test in skf.split(X, y): …
Another cross-validation for both Multinomial NB and Logistic Regression, as follows (a fuller pipeline sketch appears below):
cross_val_score(pipe, X, y, cv=12, scoring='accuracy').mean()
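A fuller pipeline sketch combining the pieces above; X (the concatenated text fields per article) and y (the viral/non-viral labels) are assumed inputs, and the parameter values are illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def evaluate_models(X, y, folds=12):
    """Mean cross-validated accuracy for each model over the same bag-of-words features."""
    results = {}
    for name, clf in [("MultinomialNB", MultinomialNB()),
                      ("LogisticRegression", LogisticRegression(max_iter=1000))]:
        pipe = Pipeline([("vect", CountVectorizer()), ("clf", clf)])
        results[name] = cross_val_score(pipe, X, y, cv=folds, scoring="accuracy").mean()
    return results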
ROOM FOR IMPROVEMENT
➔ BuzzFeed’s public API does not share the whole story -- include data points from other sources
➔ Limiting the focus to English-speaking countries restricted our ability to see the impact of cultural context outside of the US content-engine’s orbit.
➔ With more time, we might apply a better methodology to the Title Generator
➔ With more time, we might stand up the user-facing web application and capture user data to improve the model and generate better recommendations