The Grammar of Truth and Lies
Using NLP to detect Fake News
Peter J Bleackley
Playful Technology Limited
peter.bleackley@playfultechnology.co.uk
The Problem
● “A lie can run around the world before the truth can get its
boots on.”
● Fake News spreads six times faster than real news on Twitter
● The spread of true and false news online, Soroush Vosoughi,
Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp.
1146-1151, 9th March 2018
● https://science.sciencemag.org/content/359/6380/1146
The Data
● “Getting Real about Fake News” Kaggle Dataset
● https://www.kaggle.com/mrisdal/fake-news
● 12999 articles from sites flagged as unreliable by the BS Detector
Chrome extension
● Reuters-21578, Distribution 1.0 Corpus
● 10000 articles from Reuters Newswire, 1987
● http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
● Available from NLTK
Don’t Use Vocabulary!
● Potential for bias, especially as corpora are from different
time periods
● Difficult to generalise
● Could be reverse-engineered by a bad actor
Sentence structure features
● Perform Part of Speech tagging with TextBlob
● Concatenate tags to form a feature for each sentence
● “Pete Bleackley is a self-employed data scientist and
computational linguist.”
● 'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN'
● Very large, very sparse feature set
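A minimal sketch of this feature construction. The (word, tag) pairs below are hard-coded to match the slide's example; in the talk they would come from TextBlob's `TextBlob(sentence).tags`:

```python
def sentence_structure_feature(tagged_sentence):
    """Concatenate the POS tags of one sentence into a single feature string."""
    return '_'.join(tag for word, tag in tagged_sentence)

# Hard-coded (word, tag) pairs matching the slide's example sentence;
# in practice they come from TextBlob: TextBlob(sentence).tags
tagged = [('Pete', 'NNP'), ('Bleackley', 'NNP'), ('is', 'VBZ'),
          ('a', 'DT'), ('self-employed', 'JJ'), ('data', 'NNS'),
          ('scientist', 'NN'), ('and', 'CC'),
          ('computational', 'JJ'), ('linguist', 'NN')]

print(sentence_structure_feature(tagged))  # NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN
```

Because every distinct tag sequence becomes its own feature, the feature space grows roughly with the number of distinct sentence shapes, which is why it ends up very large and very sparse.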
First model
● Train LSI model (Gensim) on sentence structure features
from whole dataset
● 70/30 split between training and test data
● Sentence structure features => LSI => Logistic Regression
(scikit-learn)
● https://www.kaggle.com/petebleackley/the-grammar-of-truth-an
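A hedged sketch of the same pipeline, using scikit-learn end to end: the talk uses Gensim's LsiModel, but TruncatedSVD on the term-document count matrix computes the same latent semantic indexing. The documents and labels here are toy stand-ins, not the talk's data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: each document is a space-separated bag of
# sentence-structure features (concatenated POS-tag strings)
docs = ['NNP_VBZ_DT_NN DT_NN_VBZ_JJ', 'NNP_NNP_VBZ_DT_NN DT_NN_VBZ_JJ',
        'NNP_VBZ_DT_NN NNP_NNP_VBZ_DT_NN', 'DT_NN_VBZ_JJ NNP_VBZ_DT_NN',
        'UH_NNP_VBZ JJ_JJ_NNS_VBP', 'UH_NNP_VBZ UH_UH_NNP',
        'JJ_JJ_NNS_VBP UH_NNP_VBZ', 'UH_UH_NNP JJ_JJ_NNS_VBP']
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # toy labels: 0 = real, 1 = fake

model = make_pipeline(
    CountVectorizer(token_pattern=r'\S+'),  # one token per structure feature
    TruncatedSVD(n_components=2),           # LSI = truncated SVD of the counts
    LogisticRegression())
model.fit(docs, labels)
print(model.score(docs, labels))
```

In the real experiment the LSI model is trained on the whole dataset and the classifier on a 70/30 train/test split, so the fit/score calls above would use separate document lists.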
Performance
● Precision 61%
● Recall 96%
● Accuracy 70%
● Matthews Correlation Coefficient 50%
● Recall measures our ability to catch the bad guys; precision
measures how often the articles we flag really are fake
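All four metrics can be reproduced with scikit-learn. The labels and predictions below are a small hypothetical example, not the talk's data:

```python
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             precision_score, recall_score)

# Hypothetical toy labels: 1 = fake, 0 = real
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]

print(precision_score(y_true, y_pred))    # flagged items that really are fake
print(recall_score(y_true, y_pred))       # fake items we actually caught
print(accuracy_score(y_true, y_pred))
print(matthews_corrcoef(y_true, y_pred))  # balanced single-number summary
```

The Matthews correlation coefficient is the most demanding of the four: it only rewards a classifier that does well on both classes at once.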
Sentiment analysis
● Used VADER model in NLTK
● Produces Positive, Negative and Neutral scores for each
sentence
● Sum over document
● Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
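A sketch of the document-level aggregation, assuming per-sentence scores shaped like the dicts returned by NLTK's `SentimentIntensityAnalyzer.polarity_scores`. The scores below are made up for illustration:

```python
def document_sentiment(sentence_scores):
    """Sum per-sentence positive/negative/neutral scores over a document."""
    totals = {'pos': 0.0, 'neg': 0.0, 'neu': 0.0}
    for scores in sentence_scores:
        for key in totals:
            totals[key] += scores[key]
    return totals

# Made-up per-sentence scores in VADER's output shape
scores = [{'pos': 0.4, 'neg': 0.1, 'neu': 0.5, 'compound': 0.6},
          {'pos': 0.0, 'neg': 0.7, 'neu': 0.3, 'compound': -0.8}]
print(document_sentiment(scores))
```

Summing rather than averaging keeps document length in the feature, which may itself carry signal.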
Sentence Structure + Sentiments
● Precision 74%
● Recall 90%
● Accuracy 81%
● Matthews 64%
● Slight improvement, but it looks like sentiment is doing
most of the work
Understanding the models
● Out of 333264 sentence structure features, 298332 occur
only in a single document
● Out of 23000 documents, 11276 have no features in
common with others
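The singleton count is a few lines of standard-library Python; the documents here are toy stand-ins:

```python
from collections import Counter

def singleton_features(documents):
    """Count features that occur in exactly one document."""
    doc_freq = Counter()
    for features in documents:
        doc_freq.update(set(features))  # count each feature once per document
    return sum(1 for count in doc_freq.values() if count == 1)

# Toy stand-in documents, each a list of structure features
docs = [['NNP_VBZ', 'DT_NN_VBZ'], ['DT_NN_VBZ', 'UH_NNP'], ['JJ_NNS_VBP']]
print(singleton_features(docs))  # NNP_VBZ, UH_NNP, JJ_NNS_VBP -> 3
```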
● We need some denser features
Function words
● Pronouns, prepositions, conjunctions, auxiliaries
● Present in every document – most common words
● Usually discarded as “stopwords”...
● ...but useful for stylometric analysis, e.g. document
attribution
● NLTK stopwords corpus
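A sketch of the function-word features, using a small hand-picked word list as a stand-in for NLTK's stopwords corpus (`nltk.corpus.stopwords.words('english')` in the talk):

```python
from collections import Counter

# Small hand-picked stand-in for NLTK's English stopwords list
FUNCTION_WORDS = {'the', 'a', 'an', 'and', 'but', 'of', 'in', 'on',
                  'is', 'was', 'he', 'she', 'it', 'they', 'to', 'that'}

def function_word_counts(tokens):
    """Count function-word occurrences: dense, content-neutral features."""
    return Counter(tok.lower() for tok in tokens
                   if tok.lower() in FUNCTION_WORDS)

tokens = 'The cat sat on the mat and it purred'.split()
print(function_word_counts(tokens))
```

Because every document contains function words, these counts are dense where the sentence-structure features are sparse, which is exactly the gap the previous slide identified.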
New model
● Sentence structure features + function words => LSI =>
Logistic Regression
● Precision 90%
● Recall 96%
● Accuracy 93%
● Matthews 87%
What have we learnt?
● Grammatical and stylistic features can be used to
distinguish between real and fake news
● Good choice of features is the key to success
● Will this generalise to other sources?
See also...
● The (mis)informed citizen
● Alan Turing Institute project
● https://www.turing.ac.uk/research/research-projects/misinforme