The document discusses the key steps involved in data pre-processing for machine learning:
1. Data cleaning removes noise from data by handling missing values, smoothing outliers, and resolving inconsistencies.
2. Data transformation strategies include aggregation, feature scaling, normalization, and feature selection to prepare the data for analysis.
3. Data reduction techniques such as dimensionality reduction and sampling shrink large datasets by removing redundant features or clustering the data while retaining most of the information.
KNOLX_Data_preprocessing
1. Presented By: Aayush Srivastava
& Niraj Kumar
Data Pre-processing
& Steps Involved In It
2. Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode; feel free to step out of the
session in case you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit chat during
the session.
3. Our Agenda
01 What and Why Data Preprocessing
02 Data Cleaning
03 Data Transformation
04 Data Reduction
05 Demo
06 Data Integration
5. What is Machine Learning?
● According to Arthur Samuel (1959), machine learning algorithms enable computers to learn from data,
and even improve themselves, without being explicitly programmed.
● Machine learning (ML) is a category of algorithms that allows software applications to become more
accurate at predicting outcomes without being explicitly programmed.
● The basic premise of machine learning is to build algorithms that can receive input data and use statistical
analysis to predict an output, updating outputs as new data becomes available.
9. ● Data preprocessing is an important step in ML.
● The phrase "garbage in, garbage out" is particularly applicable to data.
● It is the process of transforming raw data into a useful, understandable format.
● Real-world or raw data usually has inconsistent formatting, human errors, and can also be incomplete.
● Data preprocessing resolves such issues, making datasets more complete and more efficient for
data analysis.
● It’s a crucial process that can affect the success of data mining and machine learning projects.
● It makes knowledge discovery from data sets faster and can ultimately affect the performance of
machine learning models.
What is Data Pre-Processing
● Data in the real world is “dirty”
○ incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate
data
■ e.g. department = “”
○ noisy: containing errors or outliers
■ e.g. salary = “-10”
○ inconsistent: containing discrepancies in codes or names
■ e.g. Age = 42 but Birthday = “27/02/1997”
● These mistakes, redundancies, missing values, and inconsistencies compromise the integrity of the dataset.
● We need to fix all of these issues for a more accurate outcome; otherwise the system is likely to develop
biases and deviations that produce a poor user experience.
Why Data Pre-Processing
12. Data Understanding: Relevance of data
• What data is available for the task?
• Is this data relevant?
• Is additional relevant data available?
• How much historical data is available?
15. ● Data cleaning or cleansing is the process of cleaning datasets by accounting for missing values, removing
outliers, correcting inconsistent data points, and smoothing noisy data.
● In essence, the motive behind data cleaning is to offer complete and accurate samples for machine learning
models.
Data Cleaning
Some effective data cleaning techniques:
● Remove duplicates
○ Duplicate entries are problematic for multiple reasons.
○ First off, when an entry appears more than once, it receives a disproportionate weight during training.
○ Models that succeed on frequent entries will therefore look like they perform well, while in reality this is not
the case.
● Remove irrelevant data
○ Data often comes from multiple sources, and there is a significant probability that a given table or database
includes entries that do not really belong to our use case. In some cases, filtering out outdated entries will also
be required.
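The two techniques above can be sketched in a few lines of pandas (the table and its column names here are hypothetical, invented purely for illustration):

```python
import pandas as pd

# Hypothetical customer table: one exact duplicate row and one
# inactive customer that is irrelevant to our use case.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "country": ["US", "DE", "DE", "US", "FR"],
    "active": [True, True, True, False, True],
})

# Drop exact duplicate rows so no entry is over-weighted during training.
deduped = df.drop_duplicates()

# Filter out entries that do not belong to the use case,
# e.g. customers that are no longer active.
relevant = deduped[deduped["active"]]

print(len(df), len(deduped), len(relevant))  # 5 4 3
```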
16. ● Fix Errors
○ It probably goes without saying that we will need to carefully remove any errors from our data. Errors as
avoidable as typos could lead to missing out on key findings from our data. Some of these can be
avoided with something as simple as a quick spell-check.
○ Example: spelling mistakes or extra punctuation in data such as an email address could mean you miss out on
communicating with your customers. It could also lead to sending unwanted emails to people who
didn’t sign up for them.
● Handle missing values
○ Missing data is defined as values that are not stored (or not present) for some variable(s) in the
given database.
Data Cleaning
17. ● Remove Noisy Data
○ Noisy data is random error or variance in a measured variable.
○ Incorrect attribute values may be due to:
■ faulty data collection instruments
■ data entry problems
■ data transmission problems
■ technology limitations
■ inconsistency in naming conventions
Data Cleaning
18. Handling missing data
• Ignore the tuple (loss of information).
• Fill in missing values manually: tedious, possibly infeasible.
• Fill in automatically with a global constant, e.g. “unknown” (which may create a new class!).
• Imputation: use the attribute mean/median/mode to fill in the
missing value, or use the most probable value.
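A minimal pandas sketch of these imputation options (the dataframe and its columns are hypothetical): mean or median for numeric attributes, mode for a categorical one:

```python
import pandas as pd

# Hypothetical records with missing values in every column.
df = pd.DataFrame({
    "age":    [25, 30, None, 22, None],
    "salary": [50000, None, 62000, 48000, 55000],
    "dept":   ["IT", "HR", None, "IT", "IT"],
})

# Numeric columns: impute with the median or the mean.
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Categorical column: impute with the mode (most frequent value).
df["dept"] = df["dept"].fillna(df["dept"].mode()[0])

print(df.isna().sum().sum())  # 0 — no missing values remain
```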
19. Handling noisy data
Binning method:
● The binning method is used to smooth data or to handle noisy data.
● In this method, the data is first sorted, and the sorted values are then distributed into a
number of buckets or bins.
● As binning methods consult the neighbourhood of values, they perform local
smoothing.
● Three kinds of smoothing methods:
○ Smoothing by bin means: each value in a bin is replaced by the mean
value of the bin.
○ Smoothing by bin medians: each value in a bin is replaced by the bin's median value.
○ Smoothing by bin boundaries: the minimum and maximum values
in a given bin are identified as the bin boundaries, and each value is then replaced by the closest
boundary value.
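Smoothing by bin means can be sketched in plain Python; the price list below is the classic textbook example of nine sorted values split into three equal-frequency bins:

```python
def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into equal-frequency bins, and
    replace each value by its bin's mean (local smoothing)."""
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        # The last bin absorbs any leftover values.
        end = (i + 1) * size if i < n_bins - 1 else len(data)
        bin_ = data[i * size:end]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

# Bins: [4, 8, 15] -> 9, [21, 21, 24] -> 22, [25, 28, 34] -> 29
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```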
21. Data Integration
● Data integration
○ Combines data from multiple sources, stored using various technologies, into a unified view of
the data.
● Schema integration
○ Integrate metadata from different sources.
○ Entity identification problem: identify real-world entities across
multiple data sources, e.g., A.cust-id ≡ B.cust-#.
● Detecting and resolving data value conflicts
○ For the same real-world entity, attribute values from different
sources may differ, e.g., different scales, or metric vs. British units.
● Removing duplicates and redundant data
22. With data cleaning, we’ve already begun to modify our data, but data transformation will
begin the process of turning the data into the proper format(s) we will need for analysis
and other downstream processes.
Data transformation strategies:
● Aggregation - Data aggregation is the process where data is collected and presented in a summarized format
for statistical analysis. This process finds sums, averages, maxima, etc.
● Feature Scaling - Feature scaling is a technique to standardize the independent features present in the data within a
fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units.
Data Transformation
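Feature scaling as described above can be sketched with NumPy; the salary values are made up for illustration, and two common variants are shown: min-max scaling to a fixed [0, 1] range and z-score standardization:

```python
import numpy as np

# Hypothetical salaries with highly varying magnitudes.
salary = np.array([20000.0, 30000.0, 50000.0, 100000.0])

# Min-max scaling: rescale to the fixed range [0, 1].
minmax = (salary - salary.min()) / (salary.max() - salary.min())

# Standardization (z-score): zero mean, unit variance.
zscore = (salary - salary.mean()) / salary.std()

print(minmax)  # values rescaled between 0 and 1
print(zscore)  # values centred on 0 with unit spread
```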
23. ● Normalization - Data normalization is the method of organizing data to appear similar across all records and fields. In
this technique we rescale each row of data to a length of 1. This is mainly useful for sparse datasets with lots of zeros,
and often yields higher-quality inputs for distance-based models.
Normalization can be of 2 types:
1. L1 Normalization
A normalization technique that modifies the dataset values so that in each row the sum of the
absolute values is 1. It is also known as Least Absolute Deviations.
2. L2 Normalization
A normalization technique that modifies the dataset values so that in each row the sum of the
squares is 1. It is also called Least Squares.
Data Transformation
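A small NumPy sketch of both row-wise normalizations (the matrix is an arbitrary example):

```python
import numpy as np

X = np.array([[1.0, 2.0, 2.0],
              [4.0, 0.0, 3.0]])

# L1 normalization: each row's absolute values sum to 1.
l1 = X / np.abs(X).sum(axis=1, keepdims=True)

# L2 normalization: each row has unit Euclidean length
# (the squares in each row sum to 1).
l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

print(l1[0])  # first row rescaled so |0.2| + |0.4| + |0.4| = 1
```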
24. ● Feature selection - Feature selection is the method of reducing the input variables to your model by using only
relevant data.
Benefits of feature selection:
1. Performing feature selection before data modeling can reduce overfitting.
2. Performing feature selection before data modeling can increase the accuracy of the ML model.
3. Performing feature selection before data modeling reduces the training time.
Data Transformation
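One simple feature selection strategy (among many) is to score each candidate feature by its absolute correlation with the target and keep the top k. The synthetic data below is constructed so that only features 0 and 2 actually influence the target:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))  # 4 candidate features
# Target depends only on features 0 and 2, plus small noise.
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=n)

# Score each feature by |correlation| with the target, keep the top k.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                   for j in range(X.shape[1])])
k = 2
selected = np.argsort(scores)[::-1][:k]
print(sorted(selected.tolist()))  # the informative features
```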
25. ● Dimensionality reduction, also known as dimension reduction, reduces the number of features or input variables in
a dataset.
● The number of features or input variables of a dataset is called its dimensionality.
● The higher the number of features, the more troublesome it is to visualize the training dataset and create a
predictive model.
● In some cases, most of these attributes are correlated, hence redundant; therefore, dimensionality reduction
algorithms can be used to reduce the number of random variables and obtain a set of principal variables.
Data reduction strategies
● Dimensionality reduction(PCA)
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the
dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most
of the information in the large set.
● Aggregation and clustering
1. Remove redundant or closely associated features.
2. Partition the data set into clusters; then store only a representation of each cluster.
3. Can be very effective if the data is clustered, but not if the data is dirty.
4. There are many choices of clustering techniques and clustering algorithms.
Data Reduction
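PCA as described above can be sketched with NumPy's SVD rather than a library implementation; the synthetic 3-D data is generated to lie near a 2-D plane, so two principal components retain almost all of the information:

```python
import numpy as np

rng = np.random.default_rng(1)
# 100 samples in 3-D that actually lie near a 2-D plane.
latent = rng.normal(size=(100, 2))
X = latent @ np.array([[1.0, 0.5, 0.0],
                       [0.0, 1.0, 2.0]]) \
    + rng.normal(scale=0.01, size=(100, 3))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / (S**2).sum()  # variance explained per component

# Project onto the top-2 principal components.
X2 = Xc @ Vt[:2].T
print(X2.shape)            # reduced from 3 features to 2
print(explained[:2].sum()) # fraction of variance retained (close to 1)
```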
26. ● Sampling
1. Choose a representative subset of the data.
2. Simple random sampling may perform very poorly in the presence of skewed class distributions.
3. Develop adaptive sampling methods instead, e.g.:
4. Stratified sampling: divide the population into homogeneous subpopulations called strata based on
specific characteristics (e.g., age, race, gender identity, location).
5. Approximate the percentage of each class (or subpopulation of interest) in the overall database.
Data Reduction
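Stratified sampling can be sketched in plain Python (the 90/10 two-class population below is hypothetical): sampling the same fraction from each stratum preserves the class proportions exactly, which simple random sampling would only approximate:

```python
import random
from collections import Counter

random.seed(42)
# Hypothetical skewed population: 90% class "A", 10% class "B".
population = [("A", i) for i in range(900)] + [("B", i) for i in range(100)]

def stratified_sample(pop, frac):
    """Group items into strata by label, then sample the same
    fraction from each stratum to preserve class proportions."""
    strata = {}
    for label, item in pop:
        strata.setdefault(label, []).append((label, item))
    sample = []
    for label, members in strata.items():
        k = round(len(members) * frac)
        sample.extend(random.sample(members, k))
    return sample

sample = stratified_sample(population, 0.1)
print(Counter(label for label, _ in sample))  # Counter({'A': 90, 'B': 10})
```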
27. Thank You !