Fake News and Their Detection
Data Science and Big Data Analysis
Professor: Antonino Nocera
Team Name: 4V’s
Group members:
Arnold Fonkou
Vignesh Kumar Kembu
Ashina Nurkoo
Seyedkourosh Sajjadi
WELFake
The Fake News Detection (WELFake) dataset contains 72,134 news articles: 35,028 real and 37,106 fake.
This dataset is part of ongoing research on "Fake News Prediction on Social Media Website" within the doctoral program of Mr. Pawan Kumar Verma, and is partially supported by the ARTICONF project funded by the European Union's Horizon 2020 research and innovation program.
Columns:
- Serial number (starting from 0)
- Title (the news headline)
- Text (the news content)
- Label (0 = fake, 1 = real)
Architecture
Components: Data Stream, Ingestion (PySpark), Hadoop (HDFS, MapReduce), MongoDB, Sandbox, Analysis.
Ingestion
From CSV to JSON
Data Conversion
We converted the CSV file into JSON to be closer to a real-world data source; a minimal conversion sketch follows this list.
Reading Data
Using PySpark
We used the Spark DataFrame API to read our big data.
Saving to Hadoop
Write into Hadoop
We read from the DataFrame and then write it to Hadoop (HDFS).
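A minimal sketch of this conversion, assuming pandas is available (file names are illustrative, not the project's actual paths):

import pandas as pd

# Read the original CSV (illustrative file name) and write it out as a JSON array
# of records, which Spark can read back with option("multiline", "true").
df = pd.read_csv("WELFake_Dataset.csv")
df.to_json("project_data_sample.json", orient="records", indent=2)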
Reading Section
import findspark
findspark.init()
import pyspark
from pyspark.sql import *

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("PySpark Read JSON") \
    .getOrCreate()

# Reading a multiline JSON file
multiline_dataframe = spark.read.option("multiline", "true") \
    .json("project_data_sample.json")
multiline_dataframe.head()
Saving Section
multiline_dataframe.write.save('/usr/local/hadoop/user3/dsba1.json', format='json')
The data can then be read back and displayed as shown below:
sqlContext = SQLContext(spark)
df = sqlContext.read.format('json').load('/usr/local/hadoop/user3/dsba1.json')
df.show()
Hadoop
Components: HDFS and MapReduce.
Mapper (BoW Creation)
Read Lines
Input Data
The data is given as input
lines to the mapper.
Extract Text
Title and Text Extraction
After reading each line as
a JSON object, we
extract the title and the
text related to that piece
of news from it.
Tokenize
Word Extraction
We perform some data
cleaning and then we
extract every single word
from it.
Text Cleaning
import sys
import re
import json

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    # Keep only letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]+', '', text)
    return text
Tokenizing
def tokenize(text):
    if not isinstance(text, str):
        text = str(text)
    text = clean_text(text)
    text = text.lower()
    return text.split()
Execution
for line in sys.stdin:
    line = line.strip()
    try:
        json_obj = json.loads(line)
    except:
        continue
    title = json_obj.get("title", "")
    text = json_obj.get("text", "")
    title_words = tokenize(title)
    text_words = tokenize(text)
    for word in title_words + text_words:
        print(f"{word}\t1")
Reducer
Read Lines
Input Data
The data is given as input lines, each containing two elements: a word and a count.
Initialize Counter
Word and Count Extraction
After reading each line, we split it into a word and its associated count.
Create BoW
Dictionary
Create a dictionary and
add each word as the key
and its associated count
value as the value.
Counter Initialization
import sys
from collections import Counter
import json
bag_of_words = Counter()
Execution
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split("\t")
    except:
        continue
    count = int(count)
    bag_of_words[word] += count

with open('bow_data.json', 'w') as f:
    json.dump(bag_of_words, f)
Moving to MongoDb
import json
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017')
db = client['bow']
collection = db['bow_collection']
with open('bow_data.json', 'r') as f:
bow_data = json.load(f)
collection.insert_one(bow_data)
Performing MapReduce Operation
In the Terminal:
cat db.json | python3 bow_mapper.py | sort | python3 bow_reducer.py
HDFS
When dealing with big data, we can partition the dataset into a number of batches instead of saving it in a single file.
Instead of:
multiline_dataframe.write.save('/usr/local/hadoop/user3/dsba1.json', format='json')
Use:
partitioned_df = multiline_dataframe.repartition(4, "Unnamed: 0")
partitioned_df.write.save('/usr/local/hadoop/user3/dsba1.json', format='json')
partition_counts = partitioned_df.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(partition_counts)
[482, 480, 519, 519]
Create Database
Create a database to contain the data.
Import From Hadoop
Import the JSON file from Hadoop via PySpark.
View Data & Backup
View the data and, if it is inserted correctly, create a backup before starting the modifications.
Clean Data
Remove non-alphanumeric characters.
Display Modified Data
Display the modified content to view the changes.
MongoDB
Creating Database
Use an existing database or create a new one:
>use dsdb_dev
Viewing Data
>use dsdb_dev
>show collections
>db.fake_real_news.find()
>db.fake_real_news.aggregate([{$group : {_id: "$label", rest_number : {$sum : 1}}}])
Creating a Copy
In the Terminal:
mongodump --db dsdb_dev --collection fake_real_news --out /home/ds/Documents/
Importing From Hadoop
In the Terminal:
mongoimport --db dsdb_dev --collection fake_real_news --file /usr/local/hadoop/user3/dsba1.json/part-00000-d1623440-4fde-4b72-b87d-5943bec596d3-c000.json
Importing from Hadoop Using PySpark
df = spark.read.json("/usr/local/hadoop/user3/dsba1.json/part-00000-d1623440-4fde-4b72-b87d-5943bec596d3-c000.json")
sampled_df = df.sample(fraction=0.8, seed=42)

from pymongo import MongoClient
conn = MongoClient()
db = conn.dsdb_dev
collection = db['sampled_data']

json_data = sampled_df.toJSON().collect()
with open('sampled_data.json', 'w') as file:
    for line in json_data:
        file.write(line + '\n')

import json
with open('sampled_data.json') as file:
    data = file.readlines()
collection.insert_many([json.loads(line) for line in data])
Data Cleaning
>db.fake_real_news.aggregate([
  { '$project': { '_id': 1, 'Unnamed: 0': 1, 'label': 1, 'text': 1, 'title': 1 } }
]).forEach(function(doc) {
  if (doc.title) {
    var newTitle = doc.title.replace(/[^a-zA-Z0-9 ]/g, '');
    db.fake_real_news.update({ '_id': doc._id }, { '$set': { 'title': newTitle } });
  }
});
Modified Content Display
>db.fake_real_news.aggregate([
  { '$project': { '_id': 1, 'Unnamed: 0': 1, 'label': 1, 'text': 1, 'title': 1 } }
]);
The file is now ready for word occurrence counting,
which can be done using Jupyter Notebook and
PyMongo.
Backup Restoration
If needed, restore the initial file:
>db.fake_real_news.drop()
mongorestore --db dsdb_dev --collection fake_real_news /home/ds/Documents/dsdb_dev/fake_real_news.bson
Count the Number of Words
db.fake_real_news.aggregate([
    {
        '$match': {
            'label': "0"  # keep only documents whose 'label' field is "0" (fake news)
        }
    },
    {
        '$project': {
            # Split the lowercase version of the title field into an array of words
            'words': {'$split': [{'$toLower': '$title'}, ' ']}
        }
    },
    {
        '$unwind': '$words'  # One document per word
    },
    {
        '$group': {
            '_id': {'word': '$words'},  # Group by word
            'count': {'$sum': 1}
        }
    },
    {
        '$project': {
            # Return only the word, its count, and the id
            'word': '$_id.word',
            'count': 1
        }
    },
    {
        '$match': {
            'word': {'$ne': None}  # Exclude null or non-existent values
        }
    },
    {
        '$match': {
            '$expr': {'$ne': ['$word', '']}  # Exclude empty strings
        }
    },
    {
        '$sort': {'count': -1}
    }
])
Hypotheses
H1
Fake news is generated with heavier use of stop words.
Metric - The average number of stop words in the title should be higher for fake news.
H2
Real news should be short and crisp in order to convey its value easily.
Metric - Fake news articles should be longer than real ones.
H1
We used NLTK to extract stop words from
the title column and compared the
averages between fake and real titles.
The hypothesis is false, as shown by the figure: fake news (0) has a lower average number of stop words in the title than real news (1).
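A minimal sketch of this check, assuming the data is loaded into a pandas DataFrame (the file path is illustrative) and NLTK's English stop-word list is used:

import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

df = pd.read_json("project_data_sample.json")  # illustrative path

def count_stop_words(title):
    # Number of stop words in a single title
    if not isinstance(title, str):
        return 0
    return sum(1 for w in title.lower().split() if w in stop_words)

df['title_stop_words'] = df['title'].apply(count_stop_words)
# Compare the averages for fake (0) and real (1) titles
print(df.groupby('label')['title_stop_words'].mean())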
H2
The hypothesis is true, as shown by the figures:
fake news (0) tends to be longer than real news
(1).
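A minimal sketch of the length comparison, under the same assumptions (illustrative file path):

import pandas as pd

df = pd.read_json("project_data_sample.json")  # illustrative path
# Length of each article in characters and in words
df['text_chars'] = df['text'].astype(str).str.len()
df['text_words'] = df['text'].astype(str).str.split().str.len()
# Compare the averages for fake (0) and real (1) news
print(df.groupby('label')[['text_chars', 'text_words']].mean())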
Insights on Data &
Pre-processing
To gain quick insights from the data, we used word clouds for the titles overall and for the fake/real subsets; a generation sketch follows the figures.
Real News
Fake News
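A minimal sketch of the word-cloud generation, assuming the wordcloud and matplotlib packages are installed (illustrative file path):

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_json("project_data_sample.json")  # illustrative path
for label, name in [(1, "Real News"), (0, "Fake News")]:
    # Join all titles of the given class into one string
    titles = " ".join(df.loc[df['label'] == label, 'title'].dropna().astype(str))
    wc = WordCloud(width=800, height=400, background_color="white").generate(titles)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(name)
plt.show()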
Null Values
The title column contains some null
values, which may cause issues in data
analysis or processing.
We need to fill the null values in the title
column to ensure accurate data analysis.
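A minimal sketch of this step, assuming a pandas DataFrame and filling missing titles with an empty string (illustrative file path):

import pandas as pd

df = pd.read_json("project_data_sample.json")  # illustrative path
print(df['title'].isna().sum())       # how many titles are missing
df['title'] = df['title'].fillna("")  # fill nulls with an empty string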
Text Normalization
To further prepare the data, we applied text normalization techniques, including converting
the title and text to lowercase and removing punctuation marks.
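A minimal sketch of this normalization, again assuming a pandas DataFrame (illustrative file path):

import string
import pandas as pd

df = pd.read_json("project_data_sample.json")  # illustrative path
table = str.maketrans('', '', string.punctuation)
for col in ['title', 'text']:
    # Lowercase both columns and strip punctuation
    df[col] = df[col].astype(str).str.lower().str.translate(table)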
Classification Model
For the binary classification of the news, we chose the Random Forest Classifier.
The data was split into X and y variables, and a train/test split was performed with a 33% test set.
A bag-of-words representation was built from the news text (X_train & X_test), removing English stop words.
The labels y_train & y_test hold the class of the news (Fake = 0 & Real = 1).
The training data is then fed to a RandomForestClassifier with 500 trees, the model is evaluated on the test data, and the resulting confusion matrix is shown below.
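A minimal scikit-learn sketch of this pipeline, following the parameters stated above (bag of words on the text, English stop words removed, 33% test set, 500 trees); the file path and random seeds are illustrative, not the project's exact code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

df = pd.read_json("project_data_sample.json")  # illustrative path
X = df['text'].astype(str)
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

vectorizer = CountVectorizer(stop_words='english')  # bag of words, English stop words removed
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train_bow, y_train)

y_pred = clf.predict(X_test_bow)
print(confusion_matrix(y_test, y_pred))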
Thank You For Your Attention!