SlideShare ist ein Scribd-Unternehmen logo
1 von 15
The Grammar of Truth and Lies
Using NLP to detect Fake News
Peter J Bleackley
Playful Technology Limited
peter.bleackley@playfultechnology.co.uk
The Problem
● “A lie can run around the world before the truth can get its
boots on.”
● Fake News spreads six times faster than real news on Twitter
● The spread of true and false news online, Sorush Vosougi,
Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp.
1146-1151, 9th
March 2018
● https://science.sciencemag.org/content/359/6380/1146
The Data
● “Getting Real about Fake News” Kaggle Dataset
● https://www.kaggle.com/mrisdal/fake-news
● 12999 articles from sites flagged as unreliable by the BS Detector
chrome extension
● Reuters-21578, Distribution 1.0 Corpus
● 10000 articles from Reuters Newswire, 1987
● http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
● Available from NLTK
Don’t Use Vocabulary!
● Potential for bias, especially as corpora are from different
time periods
● Difficult to generalise
● Could be reverse-engineered by a bad actor
Sentence structure features
● Perform Part of Speech tagging with TextBlob
● Concatenate tags to form a feature for each sentence
● “Pete Bleackley is a self-employed data scientist and
computational linguist.”
● 'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN'
● Very large, very sparse feature set
First model
● Train LSI model (Gensim) on sentence structure features
from whole dataset
● 70/30 split between training and test data
● Sentence structure features => LSI => Logistic Regression
(scikit-learn)
● https://www.kaggle.com/petebleackley/the-grammar-of-truth-an
Performance
● Precision 61%
● Recall 96%
● Accuracy 70%
● Matthews Correlation Coefficient 50%
● Precision measures our ability to catch the bad guys.
Sentiment analysis
● Used VADER model in NLTK
● Produces Positive, Negative and Neutral scores for each
sentence
● Sum over document
● Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
Sentence Structure + Sentiments
● Precision 74%
● Recall 90%
● Accuracy 81%
● Matthews 64%
● Slight improvement, but it looks like sentiment is doing
most of the work
Random Forests
Precision Recall Accuracy Matthews
Sentence
structure
83% 89% 86% 71%
Sentiments 75% 75% 78% 76%
Both 84% 89% 87% 76%
Understanding the models
● Out of 333264 sentence structure features, 298332 occur
only in a single document
● Out of 23000 documents, 11276 have no features in
common with others
● We need some denser features
Function words
● Pronouns, prepositions, conjunctions, auxilliaries
● Present in every document – most common words
● Usually discarded as “stopwords”...
● ...but useful for stylometric analysis, eg document
attribution
● NLTK stopwords corpus
New model
● Sentence structure features + function words => LSI =>
Logistic Regression
● Precision 90%
● Recall 96%
● Accuracy 93%
● Matthews 87%
What have we learnt?
● Grammatical and stylistic features can be used to
distinguish between real and fake news
● Good choice of features is the key to success
● Will this generalise to other sources?
See also...
● The (mis)informed citizen
● Alan Turing Institute project
● https://www.turing.ac.uk/research/research-projects/misinforme

Weitere ähnliche Inhalte

Ähnlich wie The Grammar of Truth and Lies

Grammar of truth and lies
Grammar of truth and liesGrammar of truth and lies
Grammar of truth and liesPeter Bleackley
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modellingQuinton Anderson
 
Word Cloud Plus with Will and Ray Poynter
Word Cloud Plus with Will and Ray PoynterWord Cloud Plus with Will and Ray Poynter
Word Cloud Plus with Will and Ray PoynterRay Poynter
 
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...Machine Learning Prague
 
A field guide to the Financial Times, Rhys Evans, Financial Times
A field guide to the Financial Times, Rhys Evans, Financial TimesA field guide to the Financial Times, Rhys Evans, Financial Times
A field guide to the Financial Times, Rhys Evans, Financial TimesNeo4j
 
Webinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with FusionWebinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with FusionLucidworks
 
Social media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATESocial media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATEDiana Maynard
 
Reanimating DevOps to Build Things that Work
Reanimating DevOps to Build Things that WorkReanimating DevOps to Build Things that Work
Reanimating DevOps to Build Things that WorkDevOpsDays Baltimore
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
How Machine Learning Works for Business
How Machine Learning Works for BusinessHow Machine Learning Works for Business
How Machine Learning Works for Business10x Nation
 
Troubleshooting and Optimizing Named Entity Resolution Systems in the Industry
Troubleshooting and Optimizing Named Entity Resolution Systems in the IndustryTroubleshooting and Optimizing Named Entity Resolution Systems in the Industry
Troubleshooting and Optimizing Named Entity Resolution Systems in the IndustryPanos Alexopoulos
 
OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)Paige Cruz
 
The agile forecast joe tristano southern fried agile 2018_ final
The agile forecast joe tristano  southern fried agile 2018_ finalThe agile forecast joe tristano  southern fried agile 2018_ final
The agile forecast joe tristano southern fried agile 2018_ finalJoe Tristano
 
Expertise on Demand - How machine learning puts the best-of-the-best at your ...
Expertise on Demand - How machine learning puts the best-of-the-best at your ...Expertise on Demand - How machine learning puts the best-of-the-best at your ...
Expertise on Demand - How machine learning puts the best-of-the-best at your ...10x Nation
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language ModelsLeon Dohmen
 

Ähnlich wie The Grammar of Truth and Lies (20)

Grammar of truth and lies
Grammar of truth and liesGrammar of truth and lies
Grammar of truth and lies
 
Entity Search Engine
Entity Search Engine Entity Search Engine
Entity Search Engine
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Passwords
PasswordsPasswords
Passwords
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modelling
 
Ml masterclass
Ml masterclassMl masterclass
Ml masterclass
 
Word Cloud Plus with Will and Ray Poynter
Word Cloud Plus with Will and Ray PoynterWord Cloud Plus with Will and Ray Poynter
Word Cloud Plus with Will and Ray Poynter
 
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
 
A field guide to the Financial Times, Rhys Evans, Financial Times
A field guide to the Financial Times, Rhys Evans, Financial TimesA field guide to the Financial Times, Rhys Evans, Financial Times
A field guide to the Financial Times, Rhys Evans, Financial Times
 
Vikrant data scientist
Vikrant data scientistVikrant data scientist
Vikrant data scientist
 
Webinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with FusionWebinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with Fusion
 
Social media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATESocial media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATE
 
Reanimating DevOps to Build Things that Work
Reanimating DevOps to Build Things that WorkReanimating DevOps to Build Things that Work
Reanimating DevOps to Build Things that Work
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
How Machine Learning Works for Business
How Machine Learning Works for BusinessHow Machine Learning Works for Business
How Machine Learning Works for Business
 
Troubleshooting and Optimizing Named Entity Resolution Systems in the Industry
Troubleshooting and Optimizing Named Entity Resolution Systems in the IndustryTroubleshooting and Optimizing Named Entity Resolution Systems in the Industry
Troubleshooting and Optimizing Named Entity Resolution Systems in the Industry
 
OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)
 
The agile forecast joe tristano southern fried agile 2018_ final
The agile forecast joe tristano  southern fried agile 2018_ finalThe agile forecast joe tristano  southern fried agile 2018_ final
The agile forecast joe tristano southern fried agile 2018_ final
 
Expertise on Demand - How machine learning puts the best-of-the-best at your ...
Expertise on Demand - How machine learning puts the best-of-the-best at your ...Expertise on Demand - How machine learning puts the best-of-the-best at your ...
Expertise on Demand - How machine learning puts the best-of-the-best at your ...
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
 

Kürzlich hochgeladen

定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 

Kürzlich hochgeladen (20)

定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

The Grammar of Truth and Lies

  • 1. The Grammar of Truth and Lies Using NLP to detect Fake News Peter J Bleackley Playful Technology Limited peter.bleackley@playfultechnology.co.uk
  • 2. The Problem ● “A lie can run around the world before the truth can get its boots on.” ● Fake News spreads six times faster than real news on Twitter ● The spread of true and false news online, Sorush Vosougi, Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp. 1146-1151, 9th March 2018 ● https://science.sciencemag.org/content/359/6380/1146
  • 3. The Data ● “Getting Real about Fake News” Kaggle Dataset ● https://www.kaggle.com/mrisdal/fake-news ● 12999 articles from sites flagged as unreliable by the BS Detector chrome extension ● Reuters-21578, Distribution 1.0 Corpus ● 10000 articles from Reuters Newswire, 1987 ● http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html ● Available from NLTK
  • 4. Don’t Use Vocabulary! ● Potential for bias, especially as corpora are from different time periods ● Difficult to generalise ● Could be reverse-engineered by a bad actor
  • 5. Sentence structure features ● Perform Part of Speech tagging with TextBlob ● Concatenate tags to form a feature for each sentence ● “Pete Bleackley is a self-employed data scientist and computational linguist.” ● 'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN' ● Very large, very sparse feature set
  • 6. First model ● Train LSI model (Gensim) on sentence structure features from whole dataset ● 70/30 split between training and test data ● Sentence structure features => LSI => Logistic Regression (scikit-learn) ● https://www.kaggle.com/petebleackley/the-grammar-of-truth-an
  • 7. Performance ● Precision 61% ● Recall 96% ● Accuracy 70% ● Matthews Correlation Coefficient 50% ● Precision measures our ability to catch the bad guys.
  • 8. Sentiment analysis ● Used VADER model in NLTK ● Produces Positive, Negative and Neutral scores for each sentence ● Sum over document ● Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
  • 9. Sentence Structure + Sentiments ● Precision 74% ● Recall 90% ● Accuracy 81% ● Matthews 64% ● Slight improvement, but it looks like sentiment is doing most of the work
  • 10. Random Forests Precision Recall Accuracy Matthews Sentence structure 83% 89% 86% 71% Sentiments 75% 75% 78% 76% Both 84% 89% 87% 76%
  • 11. Understanding the models ● Out of 333264 sentence structure features, 298332 occur only in a single document ● Out of 23000 documents, 11276 have no features in common with others ● We need some denser features
  • 12. Function words ● Pronouns, prepositions, conjunctions, auxilliaries ● Present in every document – most common words ● Usually discarded as “stopwords”... ● ...but useful for stylometric analysis, eg document attribution ● NLTK stopwords corpus
  • 13. New model ● Sentence structure features + function words => LSI => Logistic Regression ● Precision 90% ● Recall 96% ● Accuracy 93% ● Matthews 87%
  • 14. What have we learnt? ● Grammatical and stylistic features can be used to distinguish between real and fake news ● Good choice of features is the key to success ● Will this generalise to other sources?
  • 15. See also... ● The (mis)informed citizen ● Alan Turing Institute project ● https://www.turing.ac.uk/research/research-projects/misinforme