Viral or Bust!
Popularity Classification on News and Entertainment Media
The Data:
● 40,000+ articles scraped from mashable.com
● Scraped and pre-processed with attention to linguistic features of each article
● 56 resulting features to consider
The Data:
Among the 56 features, the main categories are:
● Words
● NLP
● Publication Time
● Digital Media Aspects
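As a rough illustration (not part of the original slides), the sketch below shows how a feature table like this could be loaded and inspected with pandas. The file name mashable_features.csv is a placeholder; the slides only state that scraping and pre-processing produced 56 features per article.

    # Minimal sketch of loading the pre-processed feature set.
    # "mashable_features.csv" is a hypothetical file name; the slides only
    # say that scraping and pre-processing produced 56 features per article.
    import pandas as pd

    df = pd.read_csv("mashable_features.csv")

    print(df.shape)                      # expect roughly 40,000+ rows
    print(df.dtypes.value_counts())      # quick look at feature types
    print(df.isna().sum().sum(), "missing values")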
Goal:
Create a model that will distinguish between popular and unpopular news
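The slides do not say how the popular/unpopular label was defined. A common approach for this kind of dataset is to threshold a raw share count; the sketch below assumes a hypothetical shares column and a median cutoff, purely for illustration.

    # Hypothetical sketch: deriving a binary popular/unpopular target.
    # The "shares" column and the median threshold are assumptions,
    # not details given in the slides.
    from sklearn.model_selection import train_test_split

    threshold = df["shares"].median()
    df["popular"] = (df["shares"] >= threshold).astype(int)

    X = df.drop(columns=["shares", "popular"])
    y = df["popular"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42
    )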
Exploring the Data:
Exploring the Data: Rate of +/- Words
Exploring the Data: +/- Polarity
Exploring the Data: Global Subjectivity
Exploring the Data: Self-reference Links
Exploring the Data: LDA Rank
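The exploration slides above compare feature distributions for popular and unpopular articles. A sketch of that kind of plot follows; the column names are guesses based on the slide titles, not the project's actual identifiers.

    # Sketch of exploratory histograms split by class, in the spirit of the
    # slides above. Column names are assumptions based on the slide titles.
    import matplotlib.pyplot as plt

    explore_cols = [
        "rate_positive_words",    # Rate of +/- Words
        "avg_positive_polarity",  # +/- Polarity
        "global_subjectivity",    # Global Subjectivity
        "num_self_hrefs",         # Self-reference Links
    ]

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, col in zip(axes.ravel(), explore_cols):
        for label, name in [(1, "popular"), (0, "unpopular")]:
            ax.hist(df.loc[df["popular"] == label, col],
                    bins=40, alpha=0.5, label=name)
        ax.set_title(col)
        ax.legend()
    plt.tight_layout()
    plt.show()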
Initial Analysis:
Model         Accuracy   Precision  Recall     F1
kNN           0.566000   0.594047   0.590866   0.592452
Naive Bayes   0.479654   0.623277   0.064094   0.116236
RandomForest  0.608804   0.640564   0.694331   0.666364
LogisticReg   0.591984   0.617579   0.668346   0.641960
SVC           0.533967   0.533928   1.000000   0.697104
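A table like this can be produced by fitting each model and scoring it on a held-out test set. The sketch below shows one way to do it; hyperparameters are left at scikit-learn defaults, since the slides do not specify them.

    # Sketch of the baseline comparison: fit five classifiers and report
    # accuracy, precision, recall and F1 on the test split.
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score)

    models = {
        "kNN": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),
        "RandomForest": RandomForestClassifier(random_state=42),
        "LogisticReg": LogisticRegression(max_iter=1000),
        "SVC": SVC(),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(f"{name:12s}"
              f" acc={accuracy_score(y_test, pred):.4f}"
              f" prec={precision_score(y_test, pred):.4f}"
              f" rec={recall_score(y_test, pred):.4f}"
              f" f1={f1_score(y_test, pred):.4f}")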
Feature Reduction
● Principal Component Analysis to find the distribution of variance in the data
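A minimal PCA sketch follows, assuming the features are standardized first (a step the slides do not state); the cumulative explained-variance curve shows how variance is spread across components.

    # Sketch: standardize the features and inspect how much variance each
    # principal component explains.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X_scaled = StandardScaler().fit_transform(X_train)
    pca = PCA().fit(X_scaled)

    cum_var = np.cumsum(pca.explained_variance_ratio_)
    plt.plot(range(1, len(cum_var) + 1), cum_var, marker=".")
    plt.xlabel("Number of components")
    plt.ylabel("Cumulative explained variance")
    plt.show()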
Feature Reduction
● Eliminated features below a variance threshold of 0.8
● Ran Randomized Search CV and Grid Search CV on Random Forest to find the ideal parameters
● Ran Grid Search CV with additional specified parameters and graphed features by importance (sketched below)
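One way to carry out these steps with scikit-learn's VarianceThreshold, RandomizedSearchCV and GridSearchCV is sketched below; the parameter ranges are illustrative assumptions, not the values used in the project.

    # Sketch of the feature-reduction and tuning workflow described above.
    # The 0.8 variance cutoff comes from the slide; the parameter grids are
    # illustrative assumptions.
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    selector = VarianceThreshold(threshold=0.8)
    X_train_red = selector.fit_transform(X_train)
    X_test_red = selector.transform(X_test)

    # Broad randomized search first, then a finer grid search around it.
    random_search = RandomizedSearchCV(
        RandomForestClassifier(random_state=42),
        param_distributions={"n_estimators": [100, 300, 500],
                             "max_depth": [None, 10, 20, 40],
                             "min_samples_leaf": [1, 2, 5]},
        n_iter=10, cv=5, random_state=42,
    )
    random_search.fit(X_train_red, y_train)

    grid_search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [random_search.best_params_["n_estimators"]],
                    "max_depth": [10, 20, 40],
                    "min_samples_leaf": [1, 2]},
        cv=5,
    )
    grid_search.fit(X_train_red, y_train)
    best_rf = grid_search.best_estimator_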
Most Important Features
Rank Feature
1 Average Keyword Score
2 Data Channel is Entertainment
3 Closeness to LDA topic 2
4 Average Token Length
5 Published on Weekend
6 Closeness to LDA topic 4
7 Data Channel is Technology
8 Max Keyword Score
9 Data Channel is World
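A ranking like this can be read directly from the tuned forest's feature_importances_; the sketch below assumes the selector and best_rf from the previous sketch.

    # Sketch: rank the retained features by importance from the tuned forest.
    import pandas as pd

    kept_cols = X_train.columns[selector.get_support()]
    importances = pd.Series(best_rf.feature_importances_, index=kept_cols)
    print(importances.sort_values(ascending=False).head(9))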
Final Results:
Model         Accuracy   Precision  Recall     F1
kNN           0.562236   0.581848   0.591262   0.566142
Naive Bayes   0.523288   0.660920   0.140122   0.231222
RandomForest  0.662240   0.662520   0.695117   0.668421
LogisticReg   0.614035   0.638523   0.566057   0.600111
SVC           0.531645   0.532263   1.000000   0.697104
Final Results and Findings:
Small but consistent gain in accuracy after feature reduction:
● The data was already well processed
● Correlation between features is minimal
Conclusion and Next Steps
● In spite of the difficulty of separating the classes, the selected model performed fairly well
● In the future, would like to rely less on sentiment analysis and focus more on word-vector correlations
