SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Downloaden Sie, um offline zu lesen
Exploratory
Data Analysis
.
01
.
02
.
03
04
Agenda
What is Exploratory Data Analysis?
Why EDA is important?
Visualisation
▪ Important Charts for visualisation
Steps Involved in EDA:-
▪ Data Sourcing
▪ Data Cleaning
▪ Univariate analysis with visualisation
▪ Bivariate analysis with visualisation
▪ Derived metrics
Uses Cases
05
Data Science Process
Exploratory
Data Analysis
.
❑ Exploratory Data Analysis is an approach to
analyse the datasets to summarize their
main characteristics in form of visual
methods.
❑ EDA is nothing but an data exploration
technique to understand various aspects of
the data.
❑ The main aim of EDA is to obtain confidence
in a data to an extent where we are ready to
engage a machine learning model.
What is
Exploratory
Data
Analysis?
Why EDA is
Important?
❖ EDA is important to analyse the data it’s a
first steps in data analysis process.
❖ EDA give a basic idea to understand the data
and make sense of the data to figure out the
question you need to ask and find out the
best way to manipulate the dataset to get
the answer of your question.
❖ Exploratory data analysis help us to finding
the errors, discovering data, mapping out
data structure, finding out anomalies.
❖ Exploratory data analysis is important for
business process because we are preparing
dataset for deep through analysis that will
detect you business problem.
❖ EDA help to build a quick and dirty model,
or a baseline model, which can serve as a
comparison against later models that you
will build.
Visualization
.
.
Easily analyse the data and summarize it.
Easily understand the features of the data.
Help to find the trend or pattern of the
data.
Help to get meaningful insights from the
data.
Visualisation is the presentation of the data in the graphical
or visual form to understand the data more clearly.
Visualisation is easy to understand the data.
Histogram represent the frequency distribution of
the data
Histogram
Bar graph represent the total observation in the data for a
particular category.
Bar Chart
Important Charts for Visualisation
Pie chart represent the percentage of the data by each
category.
Pie Chart
Boxplot display the distribution of the data based on five number
summary(minimum, first quartile, median, third quartile, maximum)
Box Plot
Scatter plot represent the relationship between two
numerical variable. It show the correlation between two
variable.
Scatter Plot
Pair plot show the bivariate distribution of the datasets. It show the
pairwise relationship between the variable
Pair-plot
Line chart are used to track change over line and short
period of time. Line chart are used in time series data.
Line Chart
Violin chart are used to plot numeric data
Violin Plot
Box Plot vs ViolinPlot
● Violin Graphs are better than Box Plots
● Violin graph is visually intuitive & attractive.
Image Courtesy: https://blog.bioturing.com/
A heat map is data analysis software that uses color
the way a bar graph uses height and width.
Heatmaps
Strip plots are used to draw a scatter plot for on the
categorical basis
Strip Plot
Steps Involved in EDA
01
03
05
02
04
.
Data Sourcing
Data Cleaning
Univariate Analysis
with Visualisation
Bivariate Analysis with
Visualisation
Derived Metrics
Data
Sourcing
Data Sourcing is the process of gathering data
from multiple sources as external or internal
data collection.
There are two major kind of data which can
be classified according to the source:
1. Public data 2. Private data
Public Data:- The data which is easy to access
without taking any permission from the agencies
is called public data. The agencies made the data
public for the purpose of the research. Like
government and other public sector or
ecommerce sites made there data public.
Private Data:- The data which is not available
on public platform and to access the data we
have to take the permission of organisation is
called private data. Like Banking ,telecom ,retail
sector are there which not made their data
publicly available.
Data
Cleaning
• After collecting the data , the next step is
data cleaning. Data cleaning means that you
get rid of any information that doesn’t need
to be there and clean up by mistake.
• Data Cleaning is the process of clean the data
to improve the quality of the data for further
data analysis and building a machine
learning model.
• The benefit of data cleaning is that all the
incorrect and irrelevant data is gone and we
get the good quality of data which will help
in improving the accuracy of our machine
learning model.
The following are some steps involve in Data
Cleaning:
Handle Missing Values
Standardisation of the data
Outlier Treatment
Handle Invalid values
Some Questions you need to ask yourself about the data before
cleaning the data
.
.
.
.
.
what you are reading, does it
make sense?
Does this data make sense?
Does this data you’re looking at
match the column labels?
Does the data follow the rules for this
field?
After compute the summarize
statistics for numerical data, does it
make sense?
Handle Missing Value
Option
A
Option
B
Option
C
Option
D
Prediction model is one of the advanced method to handle missing values. In this
method dataset with no missing value become training set and dataset with
missing value become the test set and the missing values is treated as target
variable.
Predicting the missing values
This method can be used on independent variable when it has numerical
variables. On categorical feature we apply mode method to fill the missing value.
.
Replacing with mean/median/mode
This method we commonly used to handle missing values. Rows can be deleted if it has
insignificant number of missing value Columns can be delete if it has more than 75% of missing
value
Some machine learning algorithm supports to handle missing value in the
datasets. Like KNN, Naïve Bayes, Random forest.
.
Algorithm Imputation
Delete Rows/Columns
Airlines Ticket Price
Indigo 3887
Air Asia 7662
Jet Airways -
Air India 5221
SpiceJet 4321
Suppose we have Airlines ticket price data in which there is missing value.
Steps to fill the numeric missing value:-
1) Compute the mean/median of the data
(3887+7662+5221+4321)/4 = 5272.75
2) Substitute the Mean of the value in missing place.
Airlines Ticket price
Indigo 3887
Indigo 7675
Air Asia 4236
- 6524
Jet Airways 4321
Suppose we have Missing values in the categorical data:
Then we take the mode of the dataset a to fill the missing values:
Here :
Mode = Indigo
We substitute the Indigo in place of missing value in Airline column.
For categorical Data
For Numerical Data Example
Standardization/
Feature Scaling
Feature scaling is the method to rescale the values
present in the features. In feature scaling we convert
the scale of different measurement into a single
scale. It standardise the whole dataset in one range.
Importance of Feature Scaling:- When we are
dealing with independent variable or features that
differ from each other in terms of range of values or
units of the features, then we have to
normalise/standardise the data so that the difference
in range of values doesn’t affect the outcome of the
data.
Feature Scaling Technique
Standard scaler ensures that for each feature, the mean is
zero and the standard deviation is 1,bringing all feature to the
same magnitude. In simple words Standardization helps you
to scale down your feature based on the standard normal
distribution
Standard Scaler
Normalization helps you to scale down your features
between a range 0 to 1
Min-Max Scaler
.
Normalization
Standardization
Age Income
(£)
New value
24 15000 (15000 – 12000)/18000 = 0.16667
30 12000 (12000 – 12000)/18000 =0
28 30000 (30000 – 12000)/18000 =1
Average = (15000 + 12000 + 30000)/3 = 19000
Standard deviation = ??
Income Minimum = 12000
Income Maximum = 30000
(Max – min) = (30000 – 12000) = 18000
Hence, we have converted the income values between 0 and 1
Please note, the new values have
Minimum = 0
Maximum = 1
Age Income
(£)
New value of Income
24 15000 (15000 - 19000)/std = ??
30 12000 (12000 - 19000)/std = ??
28 30000 (30000-19000)/std = ??
Standardization
Normalisation
Example
.
Outlier Treatment
Outliers are the most extremes values in the data. It is a
abnormal observations that deviate from the norm.
Outliers do not fit in the normal behaviour of the data.
Detect Outliers using following methods:
1.Boxplot 2. Histogram 3. Scatter plot
4.Z-score
5. Interquartile range(values out of 1.5 time of IQR)
Handle Outlier using following methods:-
1.Remove the outliers.
2.Replace outlier with suitable values by using following
methods:-
• Quantile method
• Interquartile range
3.Use that ML model which are not sensitive to outliers
Like:-KNN,Decision Tree,SVM,NaïveBayes,Ensemble
methods
.
Handle Invalid Value
• Encode Unicode properly:-In case the data is being read as junk
characters, try to change encoding, E.g. CP1252 instead of UTF-8.
• Convert incorrect data types:- Correct the incorrect data types
to the correct data types for ease of analysis. E.g. if numeric values
are stored as strings, it would not be possible to calculate metrics
such as mean, median, etc. Some of the common data type
corrections are — string to number: "12,300" to “12300”; string
to date: "2013-Aug" to “2013/08”; number to string: “PIN Code
110001” to "110001"; etc.
• Correct values that go beyond range:- If some of the values are
beyond logical range, e.g. temperature less than -273° C (0° K),
you would need to correct them as required. A close look would
help you check if there is scope for correction, or if the value
needs to be removed.
• Correct wrong structure:- Values that don’t follow a defined
structure can be removed. E.g. In a data set containing pin codes
of Indian cities, a pin code of 12 digits would be an invalid value
and needs to be removed. Similarly, a phone number of 12 digits
would be an invalid value
Univariate Analysis
Frequency
Central Tendency
.
Dispersion
.
Univariate analysis deal with analysing one
feature at a time. Univariate analysis is
important to understand the single variable at
a time. The main purpose of univariate
analysis is to make data easier to interpret.
Mainly, Histogram is used to visualize
univariate
Measure of frequency include frequency table, where
you tabulate how often a particular category or
particular value appear in the dataset.
It is a typical value for a probability distribution for
univariate measure of central tendency we have
Mean, Median, Mode.
It tries to capture how your data points jump
around.
Univariate Types:
Types of Data
Qualitative Quantitative
A variable which able to describe quality of
the population.(Numerical values)
A variable which quantify the
population(distinct categories)
Ordinal Discrete Continuous
Nominal
Nominal
Ordinal
.
Discrete
.
Continuous
It represent qualitative information
without order. Value represent a
discrete units.
Like Gender: Male/Female ,Eye colour.
It represent qualitative information with order.
It indicate the measurement classification are
different and can be ranked. Lets say Economic
status: high/medium/low which can ordered as
low,medium,high.
A number within a range of a value is
usually measured, such as height.
It has a discrete value that means it
take only counted value not a
decimal values. Like count of
student in class
Types of Data
.
Segmented
Univariate Analysis
Segmented Univariate Analysis allow you to compare
subset of data it help us to understand how the
relevant metric varies across the different segment.
The Standard process of segmented univariate
analysis is as follow:
• Take a raw data
• Group by dimensions
• Summarise using a relevant metric like mean
,median.
• Compare the aggregate metric across the
categories.
BivariateAnalysis
Correlation
.
.
Data which has two variables ,you often
want to measure the relationship that
exists between these two variables.
Correlation measure the strength as well as the
direction of the linear relationship between the two
variables. Its range is from -1 to +1.
Bi-variate Types:
Covariance measure how much two random
variable vary together. Its range is from -∞ to +∞.
Covariance
● If one increases as the other increases, the correlation is positive
● If one decreases as the other increases, the correlation is negative
● If one stays constant as the other varies, the correlation is zero
To see the distribution of two categorical variables. For example, if you want to compare the number of boys and girls who
play games, you can make a ‘cross table’ as given below:
To see the distribution of two categorical variables with one continuous variable. For example, you saw how a student’s
percentage in science is distributed based on the father’s occupation (categorical variable 1) and the poverty level
(categorical variable 2).
Example
Derived Metrics
Feature Binning
.
.
Derived metrics create a new variable from the
existing variable to get a insightful information
from the data by analysing the data.
Feature Encoding
From Domain Knowledge
Calculated from Data
.
Feature Binning
Feature binning converts or transform
continuous/numeric variable to categorical
variable. It can also be used to identify missing
values or outliers.
Type of Binning:-
1. Unsupervised Binning
a)Equal width binning
b)Equal frequency binning
2. Supervised Binning
a) Entropy based binning
.
Supervised Binning
.
Un Supervised Binning
It transform continuous or numeric
variable into categorical value without
taking dependent variable into
consideration
It transform continuous or numeric
variable into categorical value taking
dependent variable into consideration.
Equal Width
Equal width separate the continuous variable to several
categories having same range of width.
Equal Frequency
Equal frequency separate the continuous variable into
several categories having approximately same number of
values.
Entropy Based
Entropy based binning separate the continuous
or numeric variable majority of values in a
category belong to same label of class.
Feature Encoding
Feature encoding help us to transform categorical data into numeric data.
Label encoding
Label encoding is technique to transform
categorical variables into numerical variables
by assigning a numerical value to each of the
categories.
One-Hot encoding
This technique is used when independent
variables are nominal. It creates k different
columns each for a category and replaces one
column with 1 rest of the columns is 0.
Here, 0 represents the absence,
and 1 represents the presence of
that category.
Target Encoding
In target encoding, we calculate the average of
the dependent variable for each category and
replace the category variable with the mean
value
TheHashencoder represents categorical independent
variable using the new dimensions. Here, the user
can fix the number of dimensions after
transformation using component argument.
Hash Encoder
Uses Cases
Uses
Cases
Basically EDA is important in every business problem,
it’s a first crucial step in data analysis process.
Some of the use cases where we use EDA is:-
• Cancer Data Analysis :- In this data set we have to
predict who are suffering from cancer and who’s
not.
• Fraud Data Analysis in E-commerce
Transactions :- In this dataset we have to detect
the fraud in a E-commerce transaction.
THANK YOU

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Data Visualization
Introduction to Data VisualizationIntroduction to Data Visualization
Introduction to Data VisualizationStephen Tracy
 
Missing data
Missing dataMissing data
Missing datamandava57
 
Pca(principal components analysis)
Pca(principal components analysis)Pca(principal components analysis)
Pca(principal components analysis)kalung0313
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-Ihktripathy
 
Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data AnalysisUmair Shafique
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 
Data Visualization
Data VisualizationData Visualization
Data Visualizationsimonwandrew
 
Application of predictive analytics
Application of predictive analyticsApplication of predictive analytics
Application of predictive analyticsPrasad Narasimhan
 
The Importance of Data Visualization
The Importance of Data VisualizationThe Importance of Data Visualization
The Importance of Data VisualizationCenterline Digital
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality ReductionSaad Elbeleidy
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysisgokulprasath06
 
Data visualization introduction
Data visualization introductionData visualization introduction
Data visualization introductionManokamnaKochar1
 
Association Rule Learning Part 1: Frequent Itemset Generation
Association Rule Learning Part 1: Frequent Itemset GenerationAssociation Rule Learning Part 1: Frequent Itemset Generation
Association Rule Learning Part 1: Frequent Itemset GenerationKnoldus Inc.
 

Was ist angesagt? (20)

Introduction to Data Visualization
Introduction to Data VisualizationIntroduction to Data Visualization
Introduction to Data Visualization
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Missing data
Missing dataMissing data
Missing data
 
Pca(principal components analysis)
Pca(principal components analysis)Pca(principal components analysis)
Pca(principal components analysis)
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
 
Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data Analysis
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Data Visualization
Data VisualizationData Visualization
Data Visualization
 
Application of predictive analytics
Application of predictive analyticsApplication of predictive analytics
Application of predictive analytics
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
The Importance of Data Visualization
The Importance of Data VisualizationThe Importance of Data Visualization
The Importance of Data Visualization
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Data visualization introduction
Data visualization introductionData visualization introduction
Data visualization introduction
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
EDA by Sastry.pptx
EDA by Sastry.pptxEDA by Sastry.pptx
EDA by Sastry.pptx
 
Missing data handling
Missing data handlingMissing data handling
Missing data handling
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Decision tree
Decision treeDecision tree
Decision tree
 
Association Rule Learning Part 1: Frequent Itemset Generation
Association Rule Learning Part 1: Frequent Itemset GenerationAssociation Rule Learning Part 1: Frequent Itemset Generation
Association Rule Learning Part 1: Frequent Itemset Generation
 

Ähnlich wie Exploratory Data Analysis - Satyajit.pdf

Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2Gokulks007
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONNandakumar P
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessingKnoldus Inc.
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data scienceTanujaSomvanshi1
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in MalaysiaAhmed Elmalla
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - ReportAkanksha Gohil
 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.pptDeadpool120050
 
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxPratikshaSurve4
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2Mahmoud Alfarra
 
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Rohit Dubey
 
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptxEDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptxsaurav3107pandey
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionDerek Kane
 

Ähnlich wie Exploratory Data Analysis - Satyajit.pdf (20)

ML-Unit-4.pdf
ML-Unit-4.pdfML-Unit-4.pdf
ML-Unit-4.pdf
 
1234
12341234
1234
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
 
Pelatihan Data Analitik
Pelatihan Data AnalitikPelatihan Data Analitik
Pelatihan Data Analitik
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.ppt
 
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptx
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2
 
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
 
QQ Plot.pptx
QQ Plot.pptxQQ Plot.pptx
QQ Plot.pptx
 
Data processing
Data processingData processing
Data processing
 
Krupa rm
Krupa rmKrupa rm
Krupa rm
 
EDA-Unit 1.pdf
EDA-Unit 1.pdfEDA-Unit 1.pdf
EDA-Unit 1.pdf
 
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptxEDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model Selection
 

Kürzlich hochgeladen

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制vexqp
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss ConfederationEfruzAsilolu
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdftheeltifs
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样wsppdmt
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........EfruzAsilolu
 

Kürzlich hochgeladen (20)

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 

Exploratory Data Analysis - Satyajit.pdf

  • 2. . 01 . 02 . 03 04 Agenda What is Exploratory Data Analysis? Why EDA is important? Visualisation ▪ Important Charts for visualisation Steps Involved in EDA:- ▪ Data Sourcing ▪ Data Cleaning ▪ Univariate analysis with visualisation ▪ Bivariate analysis with visualisation ▪ Derived metrics Uses Cases 05
  • 4. Exploratory Data Analysis . ❑ Exploratory Data Analysis is an approach to analyse the datasets to summarize their main characteristics in form of visual methods. ❑ EDA is nothing but an data exploration technique to understand various aspects of the data. ❑ The main aim of EDA is to obtain confidence in a data to an extent where we are ready to engage a machine learning model. What is Exploratory Data Analysis?
  • 5. Why EDA is Important? ❖ EDA is important to analyse the data it’s a first steps in data analysis process. ❖ EDA give a basic idea to understand the data and make sense of the data to figure out the question you need to ask and find out the best way to manipulate the dataset to get the answer of your question. ❖ Exploratory data analysis help us to finding the errors, discovering data, mapping out data structure, finding out anomalies. ❖ Exploratory data analysis is important for business process because we are preparing dataset for deep through analysis that will detect you business problem. ❖ EDA help to build a quick and dirty model, or a baseline model, which can serve as a comparison against later models that you will build.
  • 6. Visualization . . Easily analyse the data and summarize it. Easily understand the features of the data. Help to find the trend or pattern of the data. Help to get meaningful insights from the data. Visualisation is the presentation of the data in the graphical or visual form to understand the data more clearly. Visualisation is easy to understand the data.
  • 7. Histogram represent the frequency distribution of the data Histogram Bar graph represent the total observation in the data for a particular category. Bar Chart Important Charts for Visualisation
  • 8. Pie chart represent the percentage of the data by each category. Pie Chart Boxplot display the distribution of the data based on five number summary(minimum, first quartile, median, third quartile, maximum) Box Plot
  • 9. Scatter plot represent the relationship between two numerical variable. It show the correlation between two variable. Scatter Plot Pair plot show the bivariate distribution of the datasets. It show the pairwise relationship between the variable Pair-plot
  • 10. Line chart are used to track change over line and short period of time. Line chart are used in time series data. Line Chart Violin chart are used to plot numeric data Violin Plot
  • 11. Box Plot vs ViolinPlot ● Violin Graphs are better than Box Plots ● Violin graph is visually intuitive & attractive. Image Courtesy: https://blog.bioturing.com/
  • 12. A heat map is data analysis software that uses color the way a bar graph uses height and width. Heatmaps Strip plots are used to draw a scatter plot for on the categorical basis Strip Plot
  • 13. Steps Involved in EDA 01 03 05 02 04 . Data Sourcing Data Cleaning Univariate Analysis with Visualisation Bivariate Analysis with Visualisation Derived Metrics
  • 14. Data Sourcing Data Sourcing is the process of gathering data from multiple sources as external or internal data collection. There are two major kind of data which can be classified according to the source: 1. Public data 2. Private data Public Data:- The data which is easy to access without taking any permission from the agencies is called public data. The agencies made the data public for the purpose of the research. Like government and other public sector or ecommerce sites made there data public. Private Data:- The data which is not available on public platform and to access the data we have to take the permission of organisation is called private data. Like Banking ,telecom ,retail sector are there which not made their data publicly available.
  • 15. Data Cleaning • After collecting the data , the next step is data cleaning. Data cleaning means that you get rid of any information that doesn’t need to be there and clean up by mistake. • Data Cleaning is the process of clean the data to improve the quality of the data for further data analysis and building a machine learning model. • The benefit of data cleaning is that all the incorrect and irrelevant data is gone and we get the good quality of data which will help in improving the accuracy of our machine learning model. The following are some steps involve in Data Cleaning: Handle Missing Values Standardisation of the data Outlier Treatment Handle Invalid values
  • 16. Some Questions you need to ask yourself about the data before cleaning the data . . . . . what you are reading, does it make sense? Does this data make sense? Does this data you’re looking at match the column labels? Does the data follow the rules for this field? After compute the summarize statistics for numerical data, does it make sense?
  • 17. Handle Missing Value Option A Option B Option C Option D Prediction model is one of the advanced method to handle missing values. In this method dataset with no missing value become training set and dataset with missing value become the test set and the missing values is treated as target variable. Predicting the missing values This method can be used on independent variable when it has numerical variables. On categorical feature we apply mode method to fill the missing value. . Replacing with mean/median/mode This method we commonly used to handle missing values. Rows can be deleted if it has insignificant number of missing value Columns can be delete if it has more than 75% of missing value Some machine learning algorithm supports to handle missing value in the datasets. Like KNN, Naïve Bayes, Random forest. . Algorithm Imputation Delete Rows/Columns
  • 18. Airlines Ticket Price Indigo 3887 Air Asia 7662 Jet Airways - Air India 5221 SpiceJet 4321 Suppose we have Airlines ticket price data in which there is missing value. Steps to fill the numeric missing value:- 1) Compute the mean/median of the data (3887+7662+5221+4321)/4 = 5272.75 2) Substitute the Mean of the value in missing place. Airlines Ticket price Indigo 3887 Indigo 7675 Air Asia 4236 - 6524 Jet Airways 4321 Suppose we have Missing values in the categorical data: Then we take the mode of the dataset a to fill the missing values: Here : Mode = Indigo We substitute the Indigo in place of missing value in Airline column. For categorical Data For Numerical Data Example
  • 19. Standardization/ Feature Scaling Feature scaling is the method to rescale the values present in the features. In feature scaling we convert the scale of different measurement into a single scale. It standardise the whole dataset in one range. Importance of Feature Scaling:- When we are dealing with independent variable or features that differ from each other in terms of range of values or units of the features, then we have to normalise/standardise the data so that the difference in range of values doesn’t affect the outcome of the data.
  • 20. Feature Scaling Technique Standard scaler ensures that for each feature, the mean is zero and the standard deviation is 1,bringing all feature to the same magnitude. In simple words Standardization helps you to scale down your feature based on the standard normal distribution Standard Scaler Normalization helps you to scale down your features between a range 0 to 1 Min-Max Scaler . Normalization Standardization
  • 21. Age Income (£) New value 24 15000 (15000 – 12000)/18000 = 0.16667 30 12000 (12000 – 12000)/18000 =0 28 30000 (30000 – 12000)/18000 =1 Average = (15000 + 12000 + 30000)/3 = 19000 Standard deviation = ?? Income Minimum = 12000 Income Maximum = 30000 (Max – min) = (30000 – 12000) = 18000 Hence, we have converted the income values between 0 and 1 Please note, the new values have Minimum = 0 Maximum = 1 Age Income (£) New value of Income 24 15000 (15000 - 19000)/std = ?? 30 12000 (12000 - 19000)/std = ?? 28 30000 (30000-19000)/std = ?? Standardization Normalisation Example
  • 22. . Outlier Treatment Outliers are the most extremes values in the data. It is a abnormal observations that deviate from the norm. Outliers do not fit in the normal behaviour of the data. Detect Outliers using following methods: 1.Boxplot 2. Histogram 3. Scatter plot 4.Z-score 5. Interquartile range(values out of 1.5 time of IQR) Handle Outlier using following methods:- 1.Remove the outliers. 2.Replace outlier with suitable values by using following methods:- • Quantile method • Interquartile range 3.Use that ML model which are not sensitive to outliers Like:-KNN,Decision Tree,SVM,NaïveBayes,Ensemble methods
  • 23. . Handle Invalid Value • Encode Unicode properly:-In case the data is being read as junk characters, try to change encoding, E.g. CP1252 instead of UTF-8. • Convert incorrect data types:- Correct the incorrect data types to the correct data types for ease of analysis. E.g. if numeric values are stored as strings, it would not be possible to calculate metrics such as mean, median, etc. Some of the common data type corrections are — string to number: "12,300" to “12300”; string to date: "2013-Aug" to “2013/08”; number to string: “PIN Code 110001” to "110001"; etc. • Correct values that go beyond range:- If some of the values are beyond logical range, e.g. temperature less than -273° C (0° K), you would need to correct them as required. A close look would help you check if there is scope for correction, or if the value needs to be removed. • Correct wrong structure:- Values that don’t follow a defined structure can be removed. E.g. In a data set containing pin codes of Indian cities, a pin code of 12 digits would be an invalid value and needs to be removed. Similarly, a phone number of 12 digits would be an invalid value
  • 24. Univariate Analysis Frequency Central Tendency . Dispersion . Univariate analysis deal with analysing one feature at a time. Univariate analysis is important to understand the single variable at a time. The main purpose of univariate analysis is to make data easier to interpret. Mainly, Histogram is used to visualize univariate Measure of frequency include frequency table, where you tabulate how often a particular category or particular value appear in the dataset. It is a typical value for a probability distribution for univariate measure of central tendency we have Mean, Median, Mode. It tries to capture how your data points jump around. Univariate Types:
  • 25. Types of Data Qualitative Quantitative A variable which able to describe quality of the population.(Numerical values) A variable which quantify the population(distinct categories) Ordinal Discrete Continuous Nominal
  • 26. Nominal Ordinal . Discrete . Continuous It represent qualitative information without order. Value represent a discrete units. Like Gender: Male/Female ,Eye colour. It represent qualitative information with order. It indicate the measurement classification are different and can be ranked. Lets say Economic status: high/medium/low which can ordered as low,medium,high. A number within a range of a value is usually measured, such as height. It has a discrete value that means it take only counted value not a decimal values. Like count of student in class Types of Data
  • 27. . Segmented Univariate Analysis Segmented Univariate Analysis allow you to compare subset of data it help us to understand how the relevant metric varies across the different segment. The Standard process of segmented univariate analysis is as follow: • Take a raw data • Group by dimensions • Summarise using a relevant metric like mean ,median. • Compare the aggregate metric across the categories.
  • 28. BivariateAnalysis Correlation . . Data which has two variables ,you often want to measure the relationship that exists between these two variables. Correlation measure the strength as well as the direction of the linear relationship between the two variables. Its range is from -1 to +1. Bi-variate Types: Covariance measure how much two random variable vary together. Its range is from -∞ to +∞. Covariance ● If one increases as the other increases, the correlation is positive ● If one decreases as the other increases, the correlation is negative ● If one stays constant as the other varies, the correlation is zero
  • 29. To see the distribution of two categorical variables. For example, if you want to compare the number of boys and girls who play games, you can make a ‘cross table’ as given below: To see the distribution of two categorical variables with one continuous variable. For example, you saw how a student’s percentage in science is distributed based on the father’s occupation (categorical variable 1) and the poverty level (categorical variable 2). Example
  • 30. Derived Metrics Feature Binning . . Derived metrics create a new variable from the existing variable to get a insightful information from the data by analysing the data. Feature Encoding From Domain Knowledge Calculated from Data
  • 31. . Feature Binning Feature binning converts or transform continuous/numeric variable to categorical variable. It can also be used to identify missing values or outliers. Type of Binning:- 1. Unsupervised Binning a)Equal width binning b)Equal frequency binning 2. Supervised Binning a) Entropy based binning
  • 32. . Supervised Binning . Un Supervised Binning It transform continuous or numeric variable into categorical value without taking dependent variable into consideration It transform continuous or numeric variable into categorical value taking dependent variable into consideration. Equal Width Equal width separate the continuous variable to several categories having same range of width. Equal Frequency Equal frequency separate the continuous variable into several categories having approximately same number of values. Entropy Based Entropy based binning separate the continuous or numeric variable majority of values in a category belong to same label of class.
  • 33. Feature Encoding Feature encoding help us to transform categorical data into numeric data. Label encoding Label encoding is technique to transform categorical variables into numerical variables by assigning a numerical value to each of the categories. One-Hot encoding This technique is used when independent variables are nominal. It creates k different columns each for a category and replaces one column with 1 rest of the columns is 0. Here, 0 represents the absence, and 1 represents the presence of that category. Target Encoding In target encoding, we calculate the average of the dependent variable for each category and replace the category variable with the mean value TheHashencoder represents categorical independent variable using the new dimensions. Here, the user can fix the number of dimensions after transformation using component argument. Hash Encoder
  • 34. Uses Cases Uses Cases Basically EDA is important in every business problem, it’s a first crucial step in data analysis process. Some of the use cases where we use EDA is:- • Cancer Data Analysis :- In this data set we have to predict who are suffering from cancer and who’s not. • Fraud Data Analysis in E-commerce Transactions :- In this dataset we have to detect the fraud in a E-commerce transaction.