Team BuzzFeed: Project Presentation

12. Sep 2016

  1. BUZZ FEEDER: FINDING OUT THE TRENDS BEHIND WHAT’S TRENDING
  2. TEAM ➔ Anurag Khaitan ➔ Josh Erb ➔ Walter Tyrna
  3. CONTEXT
  4. WHAT IS BUZZFEED? “BuzzFeed is a cross-platform, global network for news and entertainment that generates seven billion views each month. BuzzFeed creates and distributes content for a global audience and utilizes proprietary technology to continuously test, learn and optimize.” (buzzfeed website) ● More than 7 billion monthly global content views ● More than 200M monthly unique visitors to BuzzFeed.com ● 11 international editions including US, UK, Germany, Espanol, France, Spain, India, Canada, Mexico, Brazil, Australia and Japan (buzzfeed website)
  5. PROBLEM ➔ There is good money to be made from consistently generating popular content on the internet. ➔ A significant portion (20%-30%) of BuzzFeed’s articles generate very little traffic.
  6. HYPOTHESIS ➔ We believe there may be a correlation between the language associated with an article (title, description, tags, etc.) and how likely it is to go viral. ➔ We also believe that this likelihood depends on the country in which the article appears.
  7. WHY DOES IT MATTER? ➔ BuzzFeed could hypothetically save money and improve user experience by letting the topics that consistently draw readership inform what content it produces.
  8. OUR APPROACH ➔ Visualization to help identify underlying themes in a given dataset through three lenses: the title, the content of the article itself, or the tags ascribed to it by the author. ➔ Title Generator to suggest topics and themes based on recent social media trends, guiding the editing staff toward content that is likely to generate significant online traffic. ➔ Given a sufficient number of articles and trending topics in our data, we believe that the output of a reasonable title generator can be fed into a predictor to help assess its potential virality.
  9. OUR EXPERIENCE
  10. INGESTION “You need to start pulling data, like, now.” - Ben Bengfort, 1st Day of Class ➔ Project required us to gather data from 5 separate public APIs ➔ Before anything else, it was necessary to automate the process of querying the APIs ➔ Set up an Ubuntu instance on Amazon Web Services’ Elastic Compute Cloud (EC2) ➔ Ran a Python script hourly (crontab) to capture .json files in a server-side WORM store: 5 calls/hour, one each for Australia, Canada, India, UK and US ● Data collection began: May 18, 2016 ● Data collection ended: Aug 31, 2016 ● Total raw data size in WORM: 1.16GB ● Number of records pulled: 330,000 (25 articles/hr each for 5 countries for 100 days)
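The hourly capture step could look roughly like the sketch below. The slides do not name the API endpoints, so the request itself is left to a hypothetical `fetch` callable; the function names, country codes, and directory layout are also assumptions made for illustration. Opening files in exclusive-create mode is one simple way to approximate write-once storage.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Country codes are illustrative; the slides only name the five countries.
COUNTRIES = ["au", "ca", "in", "uk", "us"]


def snapshot_path(root, country, when):
    """Build one file path per country per hour, e.g. root/us/2016-05-18T14.json."""
    return Path(root) / country / (when.strftime("%Y-%m-%dT%H") + ".json")


def save_snapshot(root, country, payload, when=None):
    """Write one API response to disk, refusing to overwrite an existing file."""
    when = when or datetime.now(timezone.utc)
    path = snapshot_path(root, country, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if the file already exists (write once).
    with open(path, "x", encoding="utf-8") as fh:
        json.dump(payload, fh)
    return path


def capture_all(root, fetch, when=None):
    """Call fetch(country) for each country and store each result.

    `fetch` is a hypothetical callable that wraps the real API request.
    """
    return [save_snapshot(root, c, fetch(c), when) for c in COUNTRIES]
```

An hourly crontab entry such as `0 * * * * python3 capture.py` (script name hypothetical) would then drive the collection.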
  11. ARCHITECTURE
  12. WRANGLING ➔ Clean Raw Data ◆ Remove tags, images and other content outside the scope of our analysis ◆ Used insight from this to drop irrelevant variables and identify gaps that could be accounted for ➔ Understand Target Variable (Measure of Virality) ◆ A frequency column to understand how each article was “persisting”, as a measure of virality ◆ Understand the accuracy and applicability of Number of Impressions provided in the data ➔ Capture all Instances, Features and Target Variables in Postgres Table to use downstream in the pipeline
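The frequency column described above can be sketched as a count of the hourly snapshots in which each article appears for a given country. The record shape used here, `(country, hour, article_ids)`, is an assumption for illustration and not the team's actual schema.

```python
from collections import Counter


def count_frequency(snapshots):
    """Count how many hourly snapshots each (country, article id) appears in.

    `snapshots` is an iterable of (country, hour, article_ids) tuples; this
    record shape is assumed for illustration.
    """
    freq = Counter()
    for country, _hour, article_ids in snapshots:
        # set() guards against an article listed twice in one snapshot.
        for article_id in set(article_ids):
            freq[(country, article_id)] += 1
    return freq
```

With hourly pulls, the resulting count approximates the number of hours an article persisted on a country's page.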
  13. WHAT DOES THE DATA LOOK LIKE? ● Australia: 9% ● Canada: 5% ● India: 7% ● UK: 17% ● US: 62%
  14. ANALYSIS ➔ Word Clouds ◆ What terms “jump out”? ➔ Natural Language Toolkit ◆ What sorts of analysis can we run on our textual data? ➔ scikit-learn ◆ What can machine learning models help us predict?
  15. TOP TERMS Top 10 tag terms by country, in rank order: ◆ Australia: game, thrones, australia, season, 6, fan, twitter, quiz, stark, hot ◆ Canada: canada, canadian, news, social, quiz, animals, twitter, funny, lol, food ◆ India: social, news, india, bollywood, indian, twitter, desi, khan, stories, women ◆ UK: quiz, british, uk, food, trivia, twitter, you, funny, celebrity, 00s ◆ US: test, quiz, food, recipes, you, funny, news, social, summer, music ● The United States, United Kingdom, and Canada share the most similar top tags (as well as titles) while Australia and India have more distinct preferences. ● Articles about Game of Thrones - and television in general - fare better in BuzzFeed Australia ● “Women/woman” only appears on the top list for India, perhaps reflective of readership ● Twitter does well across all five groups - evidence of the popularity of listicles (“27 Times Mindy Kaling Was Just Too Relatable On Twitter”)
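A ranking like the one above can be produced with a simple word count per country. This is only a sketch: the slides do not say how the top terms were computed, and the input shape (a mapping from country to a list of per-article tag lists) is assumed.

```python
from collections import Counter


def top_terms(tags_by_country, n=10):
    """Return the n most common lowercase words in each country's tags."""
    result = {}
    for country, tag_lists in tags_by_country.items():
        counts = Counter(
            word
            for tags in tag_lists      # one list of tags per article
            for tag in tags            # each tag may be a multi-word phrase
            for word in tag.lower().split()
        )
        result[country] = [word for word, _ in counts.most_common(n)]
    return result
```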
  16. WORDCLOUDS: Tags (one panel each for Australia, Canada, India, United Kingdom, United States)
  17. WORDCLOUDS: Titles (one panel each for Australia, Canada, India, United Kingdom, United States)
  18. TITLE GENERATOR ➔ Generated a corpus of all the unique titles from API pulls ➔ Natural Language Toolkit: TrigramCollocationFinder and TrigramAssocMeasures ➔ Picked the most likely subsequent words using likelihood ratios ➔ Introduced minor stochasticity to prevent it from always producing the same titles ➔ Notable Examples: ◆ “Canada Goose Is Most Calories” ◆ “You More Hilary Duff or Lohan?” ◆ “What Game of Thrones Fan if You Guess We Thrones”
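The generation loop can be illustrated with a standard-library sketch. Note the substitution: this version ranks next words by raw trigram counts rather than the NLTK likelihood-ratio scores the team used, and every function name and parameter here is hypothetical. It shows the general shape (look up the last two words, sample among the best continuations) and is not the team's implementation.

```python
import random
from collections import Counter, defaultdict


def build_trigram_model(titles):
    """Map each word pair to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for title in titles:
        words = title.split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            model[(w1, w2)][w3] += 1
    return model


def generate_title(model, start, max_words=10, top_k=3, rng=None):
    """Extend a two-word start by sampling among the top_k likeliest next words."""
    rng = rng or random.Random()
    words = list(start)
    while len(words) < max_words:
        followers = model.get((words[-2], words[-1]))
        if not followers:
            break  # no known continuation for this word pair
        choices, weights = zip(*followers.most_common(top_k))
        # Weighted sampling stands in for the "minor stochasticity" on the slide.
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)
```

Sampling among the top few candidates instead of always taking the best one is what keeps repeated runs from producing identical titles.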
  19. FEATURE SELECTION: WHICH FEATURES ARE THE MOST TELLING? (HYPOTHESIS) ● CATEGORY: SOME SIGNAL. There are 140+ categories on BuzzFeed. Is there a relationship between the categories and virality? ● METAVALUE: TOO BROAD - NO SIGNAL. How many keywords are there? What is the relationship between virality and certain keywords? ➔ Each “Buzz” had 36 data points ◆ Some of these data points were standardized ◆ Some of them were not ➔ A significant number of these data points did not contain any signal ➔ Other than category, the only fields that contained signal were those holding text from the article: ◆ Description, Title, Primary Keywords ◆ Tags, containing phrases and words
  20. TARGET: MEASURE OF VIRALITY ● IMPRESSIONS: Number of times an article is viewed ● FREQUENCY: Number of hours an article stays on a country’s BuzzFeed page ➔ Impressions: An inaccurate, aggregated measure in each snapshot ➔ Frequency: Another measure, but not always aligned with the impressions reported in the same instance ➔ Some function f(Impressions, Frequency) worked ➔ Needed to use the function to identify classes ➔ Log transformation to account for wide variability and a skewed distribution, as follows: Virality = log(Impressions * Frequency) ● Non-Viral: Virality < mean - standard deviation ● Viral: Virality >= mean - standard deviation
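The target definition above can be written out as a short sketch. Two details are assumptions because the slide does not specify them: the natural logarithm, and the population (rather than sample) standard deviation. Records with zero impressions or zero frequency would need to be filtered out first, since the log of zero is undefined.

```python
import math
import statistics


def virality_scores(records):
    """Compute log(impressions * frequency) for each (impressions, frequency) pair."""
    return [math.log(imp * freq) for imp, freq in records]


def label_viral(scores):
    """Label a score viral (True) when it is at least mean minus one standard deviation."""
    # Population standard deviation is an assumption; the slide does not specify.
    threshold = statistics.mean(scores) - statistics.pstdev(scores)
    return [s >= threshold for s in scores]
```

Because the cutoff sits one standard deviation below the mean, most articles in a roughly bell-shaped distribution would land in the Viral class.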
  21. FEATURE ENGINEERING: ATTEMPTED THE OBVIOUS ONES; STOP WORDS OR COMMON WORDS COULD HAVE HELPED ➔ Title Length: Fairly constant and not a good indicator. ➔ Lists vs. Non-Lists: Contrary to our hypothesis, no such correlation in the data. ➔ Words in tags: To retain the context in the tags, we used individual phrases as provided (simulated n-grams) and individual words (1-grams). ➔ Low Document Frequency: No positive impact on predictability. ➔ High Document Frequency: Negative impact on the predictability of the model. ➔ Stop Words or Common Words: Did not attempt due to time constraints.
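The "words in tags" feature can be illustrated as follows: each tag phrase is kept as a single token and also split into its individual words. Joining phrase words with underscores is an assumption made here so that the phrase survives later whitespace tokenization; the slides do not say how phrases were encoded.

```python
def tag_tokens(tags):
    """Return each tag phrase as one token plus its individual words."""
    tokens = []
    for tag in tags:
        phrase = tag.strip().lower()
        if not phrase:
            continue  # skip empty tags
        words = phrase.split()
        # Join multi-word phrases with underscores so they stay one token.
        tokens.append("_".join(words))
        if len(words) > 1:
            tokens.extend(words)
    return tokens
```

For example, the tag "Game of Thrones" contributes the phrase token `game_of_thrones` and the three single words.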
  22. MODELING WITH SCIKIT-LEARN ➔ Models: Multinomial Naive Bayes and Logistic Regression ➔ Feature Selection: For each instance, we used all the text contained in Title, Description, Category, Primary Keywords and Phrases in Tags. ➔ Document Frequency: Varied the maximum and minimum document frequency in increments of 10%; no impact. vect = CountVectorizer() Output: number of features in vect: 70,000 or more features ➔ Model Selection: For both models, we did 12-fold cross-validation as follows: skf = StratifiedKFold(y, n_folds=12, shuffle=True) for train, test in skf: … ➔ Another cross-validation for both Multinomial NB and Logistic Regression as follows: cross_val_score(pipe,X,y,cv=12,scoring='accuracy').mean()
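For readers on a current scikit-learn release, the cross-validation setup might be sketched as below. This uses the `sklearn.model_selection` API, where `StratifiedKFold` takes `n_splits` and the labels are passed at split time, instead of the older call signature shown on the slide. The function name, the `max_iter` setting, and the toy data are assumptions for illustration; the toy run uses 3 folds only because it has too few rows for 12.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def cv_accuracy(texts, labels, n_splits=12, seed=0):
    """Mean cross-validated accuracy for both models on raw text."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = {
        "Multinomial NB": MultinomialNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        # Vectorizing inside the pipeline fits the vocabulary on each
        # fold's training split only.
        pipe = make_pipeline(CountVectorizer(), model)
        scores[name] = cross_val_score(
            pipe, texts, labels, cv=skf, scoring="accuracy"
        ).mean()
    return scores


# Toy data invented for illustration; not from the project.
toy_texts = ["quiz food funny", "quiz trivia you", "food recipes quiz",
             "budget report tax", "tax policy report", "policy budget vote"] * 2
toy_labels = [1, 1, 1, 0, 0, 0] * 2
toy_scores = cv_accuracy(toy_texts, toy_labels, n_splits=3)
```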
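  23. MODEL RESULTS ● Accuracy: Multinomial NB 0.839620, Logistic Regression 0.865165 ● AUC: Multinomial NB 0.699976, Logistic Regression 0.677515 ● F1: Multinomial NB 0.904905, Logistic Regression 0.922518 ● Precision: Multinomial NB 0.908419, Logistic Regression 0.898182 ● Recall: Multinomial NB 0.901438, Logistic Regression 0.948231 CROSS-VALIDATION ACCURACY SCORES ● Multinomial Naive Bayes: 0.840168 ● Logistic Regression: 0.864645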
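As a reminder of how the reported metrics relate to one another, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The function name is hypothetical and the counts in the usage note are invented, since the slides do not give the underlying confusion matrices.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For instance, `classification_metrics(8, 2, 2, 8)` uses made-up counts in which all four metrics coincide. When one class dominates, F1 for that class can sit well above AUC, which may be part of what the table shows.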
  24. TOOLS
  25. NLTK Word Cloud
  26. WHAT COULD BE DONE BETTER?
  27. ROOM FOR IMPROVEMENT ➔ BuzzFeed’s public API does not share the whole story: include data points from other sources ➔ Limiting our focus to English-speaking countries limited our ability to see the impact of cultural context outside of the US content engine’s orbit ➔ With more time, we might apply a better methodology to the Title Generator ➔ With more time, we might stand up the user-facing web application and capture user data to improve the model and generate better recommendations
  28. QUESTIONS?