Predicting News Popularity by Mining Online Discussions
1. Predicting News Popularity by
Mining Online Discussions
Georgios Rizos, Symeon Papadopoulos and Yiannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)
SNOW/WWW 2016, April 12, 2016, Montréal, Québec, Canada.
2. Popularity
Operational definition ( $$$)
• Number of views
• Number of comments
• Number of users commenting
• Number of shares
• Number of thumbs up, thumbs down
….
• Bonus: Controversiality
#2
5. Comment Trees and User Graphs
#5
User X
I blah blah….
User Y
But bla bla….
User X
No!
User W
Are you kiddin?
User Z
Wat?
X
X
X
Z
Y
W
Y
WZ
A
Comment Tree
User Graph
User A story poster
A
14. Related Work: Post Popularity
News story discussion size prediction:
• Modelling the comment timestamp time-series
(Tatar et al., 2014)
• Feature set relevant to time of posting (hour, day),
entities mentioned, etc. (Tsagkias et al., 2009)
These do not leverage the graph structure!
Online forum thread analysis:
• Prediction of thread overall quality. Also includes
rudimentary graph features (Lee et al. 2014)
Only very simple comment tree features.
#14
15. Related Work: Post Diffusion
Twitter hashtag popularity prediction:
• Prediction based on adoption graph properties and
communities (Weng et al., 2014)
• Prediction based on geolocation and adoption graph
conductance (Bora et al., 2015)
Facebook post share count prediction:
• Prediction based on share graph, author and
temporal property features (Cheng et al., 2014)
Setting different than online discussion mining.
#15
16. Related Work: Discussion Mining
Uncovering other graph based qualities:
• Comment tree h-index hypothesized to be a proxy
for discussion controversiality (Gomez et al., 2008)
• A user-comment h-index variation hypothesized to
be a proxy for political discussion deliberation
(Gonzalez-Bailon et al., 2010)
• Share graph Wiener index shown to be a proxy of
quality/interestingness of the initial post via an SIS
diffusion simulation (Goel et al., 2015)
Indices not applied for popularity prediction!
#16
17. Comment Tree Features
#17
• Quantification of depth, width, bushy-ness, the
existence of multiple long threads and branching
complexity of comment tree structure.
18. User Graph Features
#18
• Quantification of user recurrence and branching
complexity of user graph structure.
20. Datasets
#20
• We used three datasets of news story posts and
online discussions.
• RedditNews dataset: Sample of posts in news-based
subreddits made in 2014 (thanks derp.institute!)
21. Evaluation: Model building
• Random Forest regression to handle inhomogeneous
features
• Prediction targets:
– Comment count: all datasets
– User count: eponymous users only; all datasets
– Score: #upvotes - #downvotes; RedditNews only
– Controversiality: #disagreements; RedditNews only
• Score and Controversiality are penalized for small
number of votes
• Different models built for different timepoints in the
discussion evolution corresponding to 1%-14% of the
stories lifetime (1% ~ 10 minutes)
#21
22. Evaluation: VS Simple Graph Features
#22
• The proposed graph features capture lead to better
prediction compared to rudimentary graph features.
• Large improvement when target is controversiality.
24. Evaluation: Feature Type Comparison
#24
• Comment tree features good for comment
prediction and user graph for user count.
• Integration of all feature types yields best results,
except for controversiality where all_graph is best.
26. Evaluation: Top-100 Controversial
#26
• Which stories will be the most controversial?
• We report the Jaccard Coefficient (x100) between
true top-100 and the top-100 predicted by the
prediction framework using the three feature sets.
27. Conclusion
• Key contributions
– Improved popularity prediction using lightweight graph-
based features
– Post controversiality prediction showed significant
improvement
• Future Work
– Leveraging other information modalities, such as text
– Investigate dependence on the topic category of a story or
the type of the post (e.g., text post or multimedia).
#27
28. Thank you!
• Resources:
Code: https://github.com/MKLab-ITI/news-popularity-prediction
Online demo: http://reveal-mklab.iti.gr/reveal/popularity
• Get in touch:
@sympap / papadop@iti.gr
@georgios_rizos/ georgerizos@iti.gr
#28
29. References (1/2)
• A. Tatar, P. Antoniadis, M. D. De Amorim, and S. Fdida. From
popularity prediction to ranking online news. Social Network
Analysis and Mining, 4(1):1–12, 2014.
• M. Tsagkias, W. Weerkamp, and M. De Rijke. Predicting the volume
of comments on online news stories. In Proceedings of the 18th
ACM Conference on Information and Knowledge Management,
pages 1765–1768. ACM, 2009.
• J. Lee, M. Yang, and H. Rim. Discovering high-Quality threaded
discussions in online forums. Journal of Computer Science and
Technology, 29(3):519–531, 2014.
• L. Weng, F. Menczer, and Y.-Y. Ahn. Predicting successful
memes using network and community structure. arXiv
preprint arXiv:1403.6199, 2014.
• S. Bora, H. Singh, A. Sen, A. Bagchi, and P. Singla. On the
role of conductance, geography and topology in predicting
hashtag virality. arXiv preprint arXiv:1504.05351, 2015.
#29
30. References (2/2)
• J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and
J. Leskovec. Can cascades be predicted? In Proceedings of
the 23rd international conference on World Wide Web,
pages 925–936, 2014.
• V. G´omez, A. Kaltenbrunner, and V. L´opez. Statistical
analysis of the social network and discussion threads in
slashdot. In Proceedings of the 17th intern. conference on
World Wide Web, pages 645–654. ACM, 2008.
• S. Gonzalez-Bailon, A. Kaltenbrunner, and R. E. Banchs. The
structure of political discussion networks: a model for the
analysis of online deliberation. Journal of Information
Technology, 25(2):230–243, 2010.
• S. Goel, A. Anderson, J. Hofman, and D.J. Watts. The structural
virality of online diffusion. Management Science, 62(1): 180–
196, 2015
#30
Hinweis der Redaktion
Identifying top online news stories very early after their posting is very important.
Recently, the NY Times have launched Blossom tool that recommends which of their already published stories will become viral on Facebook.
Funny stories or pictures may attract lots of upvotes but insightful stories may raise discussion.
The discussion structure can be analyzed in order to reveal more intricate qualities about the initial post, such as controversiality.
Web publishers
Little featured content real estate select only best performing content
Social media accounts select only content that is going to create engagement
NYT recently built the Blossom Slack bot to help decide which stories to post to social media
For relatively early time (5%), top controversial story predicted by all_graph: Gun deaths for U.S. officers rose by 56 percent in 2014: report.
Users link to related info
Others claim title is worded to evoke sensationalism
Others claim that civilian deaths are more important