The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded
as a straightforward, binary classification problem, the major
challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor. In this paper, we discuss a weakly supervised approach, which automatically collects a large-scale, but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that despite this noisy, inaccurately labeled dataset, it is possible to detect fake news with an F1 score of up to 0.9.
Weakly Supervised Learning for Fake News Detection on Twitter
1. 08/30/18 Stefan Helmstetter, Heiko Paulheim 1
Weakly Supervised Learning for Fake News
Detection on Twitter
Stefan Helmstetter, Heiko Paulheim
Motivation
• Social media...
– ...are an increasingly important source of information
– ...can be manipulated easily
Motivation
• Fake news detection: a straightforward machine learning problem
– Simplest case: two classes
– Researched for several decades
– Used, e.g., for spam filtering
Motivation
• Challenge
– The more training data, the better
– Mass labeling data is difficult (e.g., requires investigations)
• cf. spam filtering: labeling can be done “on the fly” by laymen
Approach
• We cannot easily tell a fake news tweet from a real one
• But we have information on fake and trustworthy sources
Approach
• Naive mass labeling:
– every tweet from a fake source is a fake tweet
– every tweet from a trustworthy source is a true tweet
• Our collection:
– 65 fake news sources
– 47 trustworthy news sources
– 401k tweets
• 111k fake news
• 291k real news
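The naive source-based labeling can be sketched in a few lines. The source handles below are hypothetical placeholders, not the 65 fake and 47 trustworthy sources actually collected:

```python
# Sketch of naive mass labeling by source trustworthiness.
# Tweets are (screen_name, text) pairs; the two source sets are
# hypothetical examples, not the paper's curated lists.

FAKE_SOURCES = {"fakenewsdaily", "hoaxwire"}     # hypothetical handles
TRUSTED_SOURCES = {"nytimes", "reuters"}         # hypothetical handles

def weak_label(screen_name):
    """Return 1 for a fake source, 0 for a trustworthy one, None if unknown."""
    name = screen_name.lower()
    if name in FAKE_SOURCES:
        return 1
    if name in TRUSTED_SOURCES:
        return 0
    return None

tweets = [("nytimes", "Senate passes budget bill."),
          ("hoaxwire", "Celebrity secretly replaced by clone!")]
labeled = [(text, weak_label(user)) for user, text in tweets
           if weak_label(user) is not None]
```

Every tweet inherits the label of its source, which is exactly why the labels are noisy: a trustworthy source may occasionally spread a false story, and vice versa.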
Approach
• Skew towards 2017
– time of crawling, limitations of Twitter API
– more real than fake news (intentionally!)
Approach
• Naive mass labeling:
– every tweet from a fake source is a fake tweet
– every tweet from a trustworthy source is a true tweet
Approach
• Mind the classification task
– if we train a classifier, we learn to identify tweets from untrustworthy sources
– not necessarily the same as fake news tweets
• Assumption
– the training dataset is large
– non-fake news are also covered by trustworthy sources
– trustworthy copies outnumber fake news ones
• incidental skew in the dataset
Approach
• Leaving that caveat aside, we use
– 53 user-level features
e.g., no. of followers, tweet frequency
– 69 tweet-level features
e.g., length, no. of hashtags, no. of URLs
– text features
as BoW (60k features) or doc2vec model (300 features)
– topic features
10-200 topics created using LDA
– eight features using sentiment and polarity analysis
• Classifiers
– Naive Bayes, Decision Trees, SVM, Neural Net (1 hidden layer),
Random Forest, xgboost
– Voting and weighted voting of the above
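A minimal sketch of two of these ingredients, using a few tweet-level surface features and scikit-learn's `VotingClassifier` for the (weighted) voting; the feature set, weights, and toy data are illustrative, not the paper's exact configuration:

```python
# Sketch: simple tweet-level surface features plus a weighted soft-voting
# ensemble over heterogeneous base classifiers. Toy data only.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def tweet_features(text):
    return [
        len(text),                                # tweet length
        text.count("#"),                          # no. of hashtags
        len(re.findall(r"https?://\S+", text)),   # no. of URLs
        text.count("!"),                          # exclamation marks
    ]

X = np.array([tweet_features(t) for t in [
    "Breaking!!! #shock http://t.co/x", "Senate passes budget bill.",
    "You won't believe this! #viral", "Court rules on appeal case.",
]])
y = np.array([1, 0, 1, 0])  # 1 = from fake source, 0 = trustworthy

# Weighted soft voting: average predicted probabilities, here giving
# the random forest twice the weight of the other two classifiers.
vote = VotingClassifier(
    estimators=[("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    voting="soft", weights=[1, 1, 2])
vote.fit(X, y)
```

The same pattern extends to the full feature space (user-level, BoW/doc2vec, LDA topics, sentiment) simply by widening the feature matrix.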
Evaluation
• Setting 1
– Cross validation on the training set
– Remember: actual target is trustworthiness of source
• Setting 2
– Validation against a gold standard
– Target here: trustworthiness of tweet
• Two variants each
– with and without user level features
– idea: judging tweets from known and unknown sources
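Setting 1 amounts to standard cross-validation with F1 scoring on the weakly labeled set; a sketch with random stand-in data, just to illustrate the evaluation call:

```python
# Sketch of Setting 1: stratified cross-validation on the weakly labeled
# training set, scored with F1. The random data is a stand-in; the real
# features and source-based labels come from the 401k collected tweets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # stand-in feature matrix
y = rng.integers(0, 2, size=200)      # stand-in source-based labels

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(
    clf, X, y, scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
```

Setting 2 differs only in the test data: the classifier is trained on the full weakly labeled set and evaluated against the manually labeled gold standard.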
Evaluation
• Setting 1
– Cross validation on the training set
– Remember: actual target is trustworthiness of source
• Results
– up to .78 without user level features
– up to .94 with user level features
– xgboost and voting work best
Evaluation
• Setting 2
– Validation against a gold standard
– Target here: trustworthiness of tweet
• Results
– up to .77 without user level features
– up to .89 with user level features
– neural net works best
• Observation:
– results are not much worse than for setting 1
– i.e.: source labels seem to be a suitable proxy for tweet labels
Evaluation
• Feature weighting by xgboost:
– most important features are user level features
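One way to obtain such a ranking, sketched here with scikit-learn's `GradientBoostingClassifier` as a dependency-free stand-in for xgboost; feature names and data are illustrative:

```python
# Sketch: ranking features by gradient-boosting importance.
# The label is constructed to depend mainly on the first (user-level)
# feature, so it should rank highest; all names are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
names = ["followers", "tweet_freq", "length", "n_hashtags"]
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

gb = GradientBoostingClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(names, gb.feature_importances_),
                key=lambda p: p[1], reverse=True)
```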
Evaluation
• Without user level features
– surface level features are strong
– content/topics are not too important
Conclusion
• Fake news detection is a straightforward classification task
– but training data is scarce
• Inexact mass-labeling can be done
– by using source instead of tweet labels
– collection of large-scale training data is easy
– automatic re-collection is possible
(e.g., for new topics, changed Twitter behavior)
• Results for tweet labeling
– not much worse than for source labeling
Weakly Supervised Learning for Fake News
Detection on Twitter
Stefan Helmstetter, Heiko Paulheim