2. The Data:
-40,000+ articles scraped from mashable.com
-Scraped and pre-processed with attention to linguistic features of each article
-56 resulting features to consider
3. The Data:
The 56 features fall into these topic areas:
-Words
-NLP
-Publication Time
-Digital Media Aspects
13. Feature Reduction
● Eliminated features below a variance threshold (0.8)
● Ran Randomized Search CV and Grid Search CV on a Random Forest to find the ideal parameters
● Ran Grid Search CV with additional specified parameters and graphed features by importance
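The reduction-and-tuning steps above can be sketched with scikit-learn. This is an illustrative outline only: the parameter grids, sample counts, and the synthetic stand-in for the 56-feature article dataset are assumptions, not the project's actual setup.

```python
# Sketch of the feature-reduction and hyperparameter-search steps.
# NOTE: make_classification is a stand-in for the scraped article data;
# the parameter grids below are assumed, not the presenters' exact values.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Stand-in for the ~40,000-article, 56-feature dataset.
X, y = make_classification(n_samples=500, n_features=56, random_state=0)

# 1. Drop features whose variance falls below the 0.8 threshold.
selector = VarianceThreshold(threshold=0.8)
X_reduced = selector.fit_transform(X)

# 2. Broad randomized search to locate a promising parameter region.
rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [None, 5, 10]},
    n_iter=5, cv=3, random_state=0)
rand_search.fit(X_reduced, y)

# 3. Narrower grid search around the best randomized-search result.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [rand_search.best_params_["n_estimators"]],
                "max_depth": [rand_search.best_params_["max_depth"]]},
    cv=3)
grid_search.fit(X_reduced, y)
print(grid_search.best_params_)
```

Running the cheap randomized search first and the exhaustive grid search second keeps the total number of fits manageable.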
14. Most Important Features
Rank Feature
1 Average Keyword Score
2 Data Channel is Entertainment
3 Closeness to LDA topic 2
4 Average Token Length
5 Published on Weekend
6 Closeness to LDA topic 4
7 Data Channel is Technology
8 Max Keyword Score
9 Data Channel is World
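A ranking table like the one above can be produced from a fitted forest's impurity-based importances. The snippet below is a minimal sketch with placeholder feature names, not the project's actual data or columns.

```python
# Hypothetical sketch: rank features by random-forest importance.
# Feature names and data are placeholders, not the real article features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Sort features by impurity-based importance, highest first.
ranking = sorted(zip(names, forest.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for rank, (name, score) in enumerate(ranking, start=1):
    print(rank, name, round(score, 3))
```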
16. Final Results and Findings:
Small but consistent gain in accuracy:
● Data well-processed
● Correlation between features is minimal
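The "minimal correlation" claim can be checked with a pairwise correlation matrix. This sketch uses synthetic data purely to show the check; the real feature matrix is not reproduced here.

```python
# Minimal sketch: verify that pairwise feature correlation is low.
# The random matrix stands in for the processed article features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # stand-in feature matrix

corr = np.corrcoef(X, rowvar=False)       # 5x5 correlation matrix
off_diag = corr[~np.eye(5, dtype=bool)]   # drop the diagonal of 1s
print("max |correlation| between features:", np.abs(off_diag).max())
```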
17. Conclusion and Next Steps
● In spite of the difficulty in separating the data, the selected model performed fairly well
● In the future, would like to rely less on sentiment analysis and focus on word-vector correlations