2. The Data:
-40,000+ articles scraped from mashable.com
-Scraped and pre-processed with attention to linguistic features of each article
-56 resulting features to consider
3. The Data:
The 56 features fall into these topic areas:
-Words
-NLP
-Publication Time
-Digital Media Aspects
13. Feature Reduction
● Eliminated features below a variance threshold (0.8)
● Ran Randomized Search CV and Grid Search CV on a Random Forest to find the ideal parameters
● Ran Grid Search CV with additional specified parameters and graphed features by importance
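The reduction-and-tuning steps above can be sketched with scikit-learn. This is an illustrative outline only: the parameter grids, sample counts, and the synthetic stand-in for the 56-feature article dataset are assumptions, not the project's actual setup.

```python
# Sketch of the feature-reduction and hyperparameter-search steps.
# NOTE: make_classification is a stand-in for the scraped article data;
# the parameter grids below are assumed, not the presenters' exact values.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Stand-in for the ~40,000-article, 56-feature dataset.
X, y = make_classification(n_samples=500, n_features=56, random_state=0)

# 1. Drop features whose variance falls below the 0.8 threshold.
selector = VarianceThreshold(threshold=0.8)
X_reduced = selector.fit_transform(X)

# 2. Broad randomized search to locate a promising parameter region.
rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [None, 5, 10]},
    n_iter=5, cv=3, random_state=0)
rand_search.fit(X_reduced, y)

# 3. Narrower grid search around the best randomized-search result.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [rand_search.best_params_["n_estimators"]],
                "max_depth": [rand_search.best_params_["max_depth"]]},
    cv=3)
grid_search.fit(X_reduced, y)
print(grid_search.best_params_)
```

Running the cheap randomized search first and the exhaustive grid search second keeps the total number of fits manageable.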
14. Most Important Features
Rank Feature
1 Average Keyword Score
2 Data Channel is Entertainment
3 Closeness to LDA topic 2
4 Average Token Length
5 Published on Weekend
6 Closeness to LDA topic 4
7 Data Channel is Technology
8 Max Keyword Score
9 Data Channel is World
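A ranking table like the one above can be produced from a fitted forest's impurity-based importances. The snippet below is a minimal sketch with placeholder feature names, not the project's actual data or columns.

```python
# Hypothetical sketch: rank features by random-forest importance.
# Feature names and data are placeholders, not the real article features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Sort features by impurity-based importance, highest first.
ranking = sorted(zip(names, forest.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for rank, (name, score) in enumerate(ranking, start=1):
    print(rank, name, round(score, 3))
```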
16. Final Results and Findings:
Small but consistent gain in accuracy:
● Data well-processed
● Correlation between features is minimal
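The "minimal correlation" claim can be checked with a pairwise correlation matrix. This sketch uses synthetic data purely to show the check; the real feature matrix is not reproduced here.

```python
# Minimal sketch: verify that pairwise feature correlation is low.
# The random matrix stands in for the processed article features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # stand-in feature matrix

corr = np.corrcoef(X, rowvar=False)       # 5x5 correlation matrix
off_diag = corr[~np.eye(5, dtype=bool)]   # drop the diagonal of 1s
print("max |correlation| between features:", np.abs(off_diag).max())
```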
17. Conclusion and Next Steps
● In spite of the difficulty in separating the data, the selected model performed fairly well
● In the future, would like to rely less on sentiment analysis and focus on word-vector correlations