The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded
as a straightforward, binary classification problem, the major
challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor. In this paper, we discuss a weakly supervised approach, which automatically collects a large-scale, but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that despite this noisy, inaccurately labeled dataset, it is possible to detect fake news with an F1 score of up to 0.9.
Weakly Supervised Learning for Fake News Detection on Twitter
1. 08/30/18 Stefan Helmstetter, Heiko Paulheim 1
Weakly Supervised Learning for Fake News
Detection on Twitter
Stefan Helmstetter, Heiko Paulheim
Motivation
• Social media...
– ...are an increasingly important source of information
– ...can be manipulated easily
Motivation
• Fake news detection: a straightforward machine learning problem
– Simplest case: two classes
– Researched for several decades
– Used, e.g., for spam filtering
Motivation
• Challenge
– The more training data, the better
– Mass labeling data is difficult (e.g., requires investigations)
• cf. spam filtering: labeling can be done “on the fly” by laymen
Approach
• We cannot easily tell a fake news tweet from a real one
• But we have information on fake and trustworthy sources
Approach
• Naive mass labeling:
– every tweet from a fake source is a fake tweet
– every tweet from a trustworthy source is a true tweet
• Our collection:
– 65 fake news sources
– 47 trustworthy news sources
– 401k tweets
• 111k fake news
• 291k real news
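The naive source-based labeling can be sketched in a few lines. The source handles below are hypothetical placeholders, not the 65 fake and 47 trustworthy sources actually collected:

```python
# Sketch of naive mass labeling by source trustworthiness.
# Tweets are (screen_name, text) pairs; the two source sets are
# hypothetical examples, not the paper's curated lists.

FAKE_SOURCES = {"fakenewsdaily", "hoaxwire"}     # hypothetical handles
TRUSTED_SOURCES = {"nytimes", "reuters"}         # hypothetical handles

def weak_label(screen_name):
    """Return 1 for a fake source, 0 for a trustworthy one, None if unknown."""
    name = screen_name.lower()
    if name in FAKE_SOURCES:
        return 1
    if name in TRUSTED_SOURCES:
        return 0
    return None

tweets = [("nytimes", "Senate passes budget bill."),
          ("hoaxwire", "Celebrity secretly replaced by clone!")]
labeled = [(text, weak_label(user)) for user, text in tweets
           if weak_label(user) is not None]
```

Every tweet inherits the label of its source, which is exactly why the labels are noisy: a trustworthy source may occasionally spread a false story, and vice versa.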
Approach
• Skew towards 2017
– time of crawling, limitations of Twitter API
– more real than fake news (intentionally!)
Approach
• Naive mass labeling:
– every tweet from a fake source is a fake tweet
– every tweet from a trustworthy source is a true tweet
Approach
• Mind the classification task
– if we train a classifier, we learn to identify tweets from untrustworthy sources
– not necessarily the same as fake news tweets
• Assumption
– the training dataset is large
– non-fake news are also covered by trustworthy sources
– trustworthy copies outnumber fake news ones
• incidental skew in the dataset
Approach
• Leaving that caveat aside, we use
– 53 user-level features
e.g., no. of followers, tweet frequency
– 69 tweet-level features
e.g., length, no. of hashtags, no. of URLs
– text features
as BoW (60k features) or doc2vec model (300 features)
– topic features
10-200 topics created using LDA
– eight features using sentiment and polarity analysis
• Classifiers
– Naive Bayes, Decision Trees, SVM, Neural Net (1 hidden layer),
Random Forest, xgboost
– Voting and weighted voting of the above
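A minimal sketch of two of these ingredients, using a few tweet-level surface features and scikit-learn's `VotingClassifier` for the (weighted) voting; the feature set, weights, and toy data are illustrative, not the paper's exact configuration:

```python
# Sketch: simple tweet-level surface features plus a weighted soft-voting
# ensemble over heterogeneous base classifiers. Toy data only.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def tweet_features(text):
    return [
        len(text),                                # tweet length
        text.count("#"),                          # no. of hashtags
        len(re.findall(r"https?://\S+", text)),   # no. of URLs
        text.count("!"),                          # exclamation marks
    ]

X = np.array([tweet_features(t) for t in [
    "Breaking!!! #shock http://t.co/x", "Senate passes budget bill.",
    "You won't believe this! #viral", "Court rules on appeal case.",
]])
y = np.array([1, 0, 1, 0])  # 1 = from fake source, 0 = trustworthy

# Weighted soft voting: average predicted probabilities, here giving
# the random forest twice the weight of the other two classifiers.
vote = VotingClassifier(
    estimators=[("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    voting="soft", weights=[1, 1, 2])
vote.fit(X, y)
```

The same pattern extends to the full feature space (user-level, BoW/doc2vec, LDA topics, sentiment) simply by widening the feature matrix.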
Evaluation
• Setting 1
– Cross validation on the training set
– Remember: actual target is trustworthiness of source
• Setting 2
– Validation against a gold standard
– Target here: trustworthiness of tweet
• Two variants each
– with and without user level features
– idea: judging tweets from known and unknown sources
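Setting 1 amounts to standard cross-validation with F1 scoring on the weakly labeled set; a sketch with random stand-in data, just to illustrate the evaluation call:

```python
# Sketch of Setting 1: stratified cross-validation on the weakly labeled
# training set, scored with F1. The random data is a stand-in; the real
# features and source-based labels come from the 401k collected tweets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # stand-in feature matrix
y = rng.integers(0, 2, size=200)      # stand-in source-based labels

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(
    clf, X, y, scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
```

Setting 2 differs only in the test data: the classifier is trained on the full weakly labeled set and evaluated against the manually labeled gold standard.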
Evaluation
• Setting 1
– Cross validation on the training set
– Remember: actual target is trustworthiness of source
• Results
– up to .78 without user level features
– up to .94 with user level features
– xgboost and voting work best
Evaluation
• Setting 2
– Validation against a gold standard
– Target here: trustworthiness of tweet
• Results
– up to .77 without user level features
– up to .89 with user level features
– neural net works best
• Observation:
– results are not much worse than for setting 1
– i.e.: source labels seem to be a suitable proxy for tweet labels
Evaluation
• Feature weighting by xgboost:
– most important features are user level features
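One way to obtain such a ranking, sketched here with scikit-learn's `GradientBoostingClassifier` as a dependency-free stand-in for xgboost; feature names and data are illustrative:

```python
# Sketch: ranking features by gradient-boosting importance.
# The label is constructed to depend mainly on the first (user-level)
# feature, so it should rank highest; all names are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
names = ["followers", "tweet_freq", "length", "n_hashtags"]
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

gb = GradientBoostingClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(names, gb.feature_importances_),
                key=lambda p: p[1], reverse=True)
```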
Evaluation
• Without user level features
– surface level features are strong
– content/topics are not too important
Conclusion
• Fake news detection is a straightforward classification task
– but training data is scarce
• Inexact mass-labeling can be done
– by using source instead of tweet labels
– collection of large-scale training data is easy
– automatic re-collection is possible
(e.g., for new topics, changed Twitter behavior)
• Results for tweet labeling
– not much worse than for source labeling
Weakly Supervised Learning for Fake News
Detection on Twitter
Stefan Helmstetter, Heiko Paulheim