Presented By: Aayush Srivastava
& Niraj Kumar
Data Pre-processing
& Steps Involved In It
KnolX Etiquettes
Lack of etiquette and manners is a huge turn off.
● Punctuality: Join the session 5 minutes prior to the session start time. We start on time and conclude on time!
● Feedback: Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
● Silent Mode: Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
● Avoid Disturbance: Avoid unwanted chit-chat during the session.
Our Agenda
01 What and Why Data Preprocessing
02 Data Cleaning
03 Data Transformation
04 Data Reductions
05 Demo
06 Data Integration
Data Cleaning Services
Good data preparation is key to producing valid and reliable models.
What is Machine Learning ?
● According to Arthur Samuel (1959), machine learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed.
● Machine learning (ML) is a category of algorithms that allows software applications to become more accurate at predicting outcomes without being explicitly programmed.
● The basic premise of machine learning is to build algorithms that can receive input data and use statistical
analysis to predict an output while updating outputs as new data becomes available.
Types of Machine Learning
Machine Learning Life Cycle
● Data preprocessing is an important step in ML.
● The phrase "garbage in, garbage out" is particularly applicable to data.
● It is the process of transforming raw data into a useful, understandable format.
● Real-world or raw data usually has inconsistent formatting, human errors, and can also be incomplete.
● Data preprocessing resolves such issues and makes datasets more complete and efficient for data analysis.
● It’s a crucial process that can affect the success of data mining and machine learning projects.
● It makes knowledge discovery from data sets faster and can ultimately affect the performance of
machine learning models.
What is Data Pre-Processing
Why Data Pre-Processing
● Data in the real world is “dirty”
○ incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data.
■ e.g. department = “”
○ noisy: containing errors or outliers
■ salary = “-10”
○ inconsistent: containing discrepancies in codes or names
■ Age = 42 Birthday = “27/02/1997”
● These mistakes, redundancies, missing values, and inconsistencies compromise the integrity of the set.
● We need to fix all those issues to get a more accurate outcome. Otherwise, chances are that the system will develop biases and deviations that produce a poor user experience.
Why Data Pre-Processing
Data Understanding: Relevance of data
• What data is available for the task?
• Is this data relevant?
• Is additional relevant data available?
• How much historical data is available?
Data Pre Processing Steps
● Data cleaning or cleansing is the process of cleaning datasets by accounting for missing values, removing
outliers, correcting inconsistent data points, and smoothing noisy data.
● In essence, the motive behind data cleaning is to offer complete and accurate samples for machine learning
models.
Data Cleaning
Some Effective data cleaning techniques:
● Remove duplicates
○ Duplicate entries are problematic for multiple reasons.
○ First off, when an entry appears more than once, it receives a disproportionate weight during training.
○ Thus, models that succeed on frequent entries will look like they perform well, while in reality they do not.
● Remove irrelevant data
○ Data often comes from multiple sources, and there is a significant probability that a given table or database includes entries that do not really belong to our use case. In some cases, filtering out outdated entries will be required.
● Fix Errors
○ It probably goes without saying that we will need to carefully remove any errors from our data. Errors as avoidable as typos could lead to missing out on key findings from the data. Some of these can be avoided with something as simple as a quick spell-check.
○ Example: Spelling mistakes or extra punctuation in data like an email address could mean you miss out on
communicating with your customers. It could also lead to you sending unwanted emails to people who
didn’t sign up for them.
● Handle missing values
○ Missing data is defined as values that are not stored (or not present) for some variable(s) in the given database.
Data Cleaning
● Remove Noisy Data
○ Noisy data is random error or variance in a measured variable.
○ Incorrect attribute values may be due to:
■ faulty data collection instruments
■ data entry problems
■ data transmission problems
■ technology limitation
■ inconsistency in naming conventions
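As a rough sketch of the cleaning steps above (hypothetical customer data; pandas assumed), duplicates are dropped, an obviously invalid value is flagged, and incomplete rows are set aside for the missing-value handling discussed next:

```python
import pandas as pd
import numpy as np

# Hypothetical table showing the issues above: a duplicate row,
# an invalid (negative) salary, and a missing value.
df = pd.DataFrame({
    "name":   ["Ann", "Ann", "Bob", "Cara"],
    "email":  ["ann@x.com", "ann@x.com", "bob@x.com", None],
    "salary": [50000, 50000, -10, 62000],
})

# 1. Remove duplicate entries so no row gets disproportionate weight.
df = df.drop_duplicates()

# 2. Fix an obvious error: treat a negative salary as missing.
df.loc[df["salary"] < 0, "salary"] = np.nan

# 3. Collect rows that still have missing values for later handling.
incomplete = df[df.isna().any(axis=1)]
```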
Data Cleaning
Handling missing data
● Ignore the tuple (loss of information).
● Fill in missing values manually: tedious, often infeasible.
● Fill them in automatically with a global constant, e.g., “unknown” (which may effectively create a new class!).
● Imputation: use the attribute mean/median/mode to fill in the missing value.
● Use the most probable value to fill in the missing value.
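A minimal sketch of the fill strategies above, on a toy pandas Series (hypothetical values):

```python
import pandas as pd

# Toy attribute with missing values (hypothetical data).
s = pd.Series([4.0, None, 8.0, None, 12.0])

# Global-constant fill: mark missing entries with a sentinel value.
filled_const = s.fillna(-1)

# Mean imputation: replace missing values with the attribute mean.
filled_mean = s.fillna(s.mean())  # mean of 4, 8, 12 is 8
```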
Handling noisy data
Binning method:
● The binning method is used to smooth data or to handle noisy data.
● In this method, the data is first sorted and then the sorted values are distributed into a
number of buckets or bins.
● As binning methods consult the neighbourhood of values, they perform local
smoothing.
● Three kinds of smoothing methods:-
○ Smoothing by bin means : In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin.
○ Smoothing by bin median : In this method each bin value is replaced by its bin median value.
○ Smoothing by bin boundary : In smoothing by bin boundaries, the minimum and maximum values
in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
 - Bin 1: 4, 8, 9, 15
 - Bin 2: 21, 21, 24, 25
 - Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
 - Bin 1: 9, 9, 9, 9 ((4+8+9+15)/4 = 9)
 - Bin 2: 23, 23, 23, 23 ((21+21+24+25)/4 ≈ 23)
 - Bin 3: 29, 29, 29, 29 ((26+28+29+34)/4 ≈ 29)
• Smoothing by bin boundaries:
 - Bin 1: 4, 4, 4, 15
 - Bin 2: 21, 21, 25, 25
 - Bin 3: 26, 26, 26, 34
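The worked example above can be reproduced in a few lines (pure-Python sketch; an equi-depth bin size of 4 is assumed, matching the example):

```python
# Equi-depth binning of the sorted price data, with two of the
# smoothing variants described above.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

def smooth_means(bins):
    # Replace each value by its bin's (rounded) mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_boundaries(bins):
    # Replace each value by the closer of the bin's min/max boundary.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out
```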
Handling noisy data
Data Integration
● Data integration
○ Combines data from multiple sources, stored using various technologies, and provides a unified view of the data.
● Schema integration
○ Integrate metadata from different sources.
○ Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#.
● Detecting and resolving data value conflicts
○ For the same real-world entity, attribute values from different sources may differ, e.g., different scales, metric vs. British units.
● Removing duplicates and redundant data
With data cleaning, we have already begun to modify our data, but data transformation begins the process of turning the data into the proper format(s) we will need for analysis and other downstream processes.
Data transformation Strategies:
● Aggregation - Data aggregation is the process where data is collected and presented in a summarized format for statistical analysis. This includes computing sums, averages, maximums, etc.
● Feature Scaling - Feature Scaling is a technique to standardize the independent features present in the data in a
fixed range. It is performed during the data pre-processing to handle highly varying magnitudes or values or units.
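As an illustrative sketch, min-max scaling (one common feature-scaling choice; the function name and data are hypothetical) rescales each feature into a fixed [0, 1] range:

```python
# Min-max feature scaling to [0, 1]: subtract the minimum, then
# divide by the range, so differing magnitudes become comparable.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]          # toy feature with varying magnitudes
scaled = min_max_scale(ages)     # all values now lie in [0, 1]
```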
Data Transformation
● Normalization - Data normalization is the method of organizing data to appear similar across all records and fields. In this technique, we rescale each row of data to a length of 1, which is mainly useful for sparse datasets with lots of zeros. Done well, this typically yields higher-quality data.
Normalization can be of 2 types:
1. L1 Normalization
It is defined as the normalization technique that modifies the dataset values so that in each row the sum of the absolute values equals 1. It is also known as Least Absolute Deviations.
2. L2 Normalization
It is defined as the normalization technique that modifies the dataset values so that in each row the sum of the squares equals 1. It is also called least squares. It also penalises large weights.
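The two normalization variants follow directly from their definitions; a pure-Python sketch on a toy row:

```python
import math

def l1_normalize(row):
    # Scale the row so its absolute values sum to 1.
    total = sum(abs(v) for v in row)
    return [v / total for v in row]

def l2_normalize(row):
    # Scale the row so its squared values sum to 1 (unit length).
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

row = [3.0, 4.0]  # toy row; L2 norm is 5, L1 norm is 7
```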
Data Transformation
● Feature selection - Feature Selection is the method of reducing the input variables to your model by using only relevant data.
Benefits of feature selection
1. Performing feature selection before data modeling reduces overfitting.
2. Performing feature selection before data modeling increases the accuracy of the ML model.
3. Performing feature selection before data modeling reduces the training time.
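A minimal filter-style sketch of feature selection: dropping near-zero-variance features is one of the simplest relevance criteria (the data and threshold here are hypothetical, not from the slides):

```python
# Keep only features whose variance exceeds a threshold; a constant
# feature carries no information for the model and can be dropped.
def select_by_variance(columns, threshold=0.0):
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    return {name: xs for name, xs in columns.items()
            if variance(xs) > threshold}

data = {
    "constant": [1, 1, 1, 1],   # variance 0: irrelevant
    "useful":   [1, 3, 2, 5],   # varies: kept
}
kept = select_by_variance(data)
```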
Data Transformation
● Dimensionality reduction, also known as dimension reduction, reduces the number of features or input variables in
a dataset.
● The number of features or input variables of a dataset is called its dimensionality.
● The higher the number of features, the more troublesome it is to visualize the training dataset and create a
predictive model.
● In some cases, most of these attributes are correlated, hence redundant; therefore, dimensionality reduction
algorithms can be used to reduce the number of random variables and obtain a set of principal variables.
Data reduction strategies
● Dimensionality reduction (PCA)
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the
dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most
of the information in the large set.
● Aggregation and clustering
1. Remove redundant or closely associated features.
2. Partition the data set into clusters, and store only the cluster representations.
3. This can be very effective if the data is clustered, but not if the data is dirty.
4. There are many choices of clustering techniques and clustering algorithms.
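A sketch of PCA via eigendecomposition of the covariance matrix (NumPy assumed; the synthetic data, with a deliberately redundant third feature, is an illustration):

```python
import numpy as np

# Synthetic data: the third feature is an exact combination of the
# first two, i.e., redundant, so two components retain the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + X[:, 1]

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: ascending eigenvalues

# Keep the top-2 principal components (largest eigenvalues last).
components = eigvecs[:, ::-1][:, :2]
X_reduced = Xc @ components             # 100 x 2 representation
```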
Data Reduction
● Sampling
1. Choose a representative subset of the data.
2. Simple random sampling may perform poorly in the presence of skew.
3. Therefore, develop adaptive sampling methods, such as:
4. Stratified sampling: divide the population into homogeneous subpopulations called strata based on specific characteristics (e.g., age, race, gender identity, location), and
5. approximate the percentage of each class (or subpopulation of interest) in the overall database.
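A stratified-sampling sketch with pandas (hypothetical strata; `DataFrameGroupBy.sample` is assumed available, i.e., pandas ≥ 1.1):

```python
import pandas as pd

# Hypothetical population: 80% stratum A, 20% stratum B.
df = pd.DataFrame({
    "stratum": ["A"] * 80 + ["B"] * 20,
    "value": range(100),
})

# Sample the same fraction from each stratum so the class
# proportions of the population are preserved in the sample.
sample = df.groupby("stratum").sample(frac=0.5, random_state=0)
```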
Data Reduction
Thank You !
Get in touch with us:
Lorem Studio, Lord Building
D4456, LA, USA

Weitere ähnliche Inhalte

Ähnlich wie KNOLX_Data_preprocessing

Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfAmmarAhmedSiddiqui2
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedYugal Kumar
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.pptDeadpool120050
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningNandakumar P
 
Chapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptxChapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptxManishaPatil932723
 
Dimensionality Reduction.pptx
Dimensionality Reduction.pptxDimensionality Reduction.pptx
Dimensionality Reduction.pptxPriyadharshiniG41
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Qualitypriyanka rajput
 
Data pre processing
Data pre processingData pre processing
Data pre processingpommurajopt
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptxLuminous8
 

Ähnlich wie KNOLX_Data_preprocessing (20)

1234
12341234
1234
 
Assignmentdatamining
AssignmentdataminingAssignmentdatamining
Assignmentdatamining
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdf
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.ppt
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
Chapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptxChapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptx
 
Data preprocess
Data preprocessData preprocess
Data preprocess
 
Dimensionality Reduction.pptx
Dimensionality Reduction.pptxDimensionality Reduction.pptx
Dimensionality Reduction.pptx
 
Data preprocessing.pdf
Data preprocessing.pdfData preprocessing.pdf
Data preprocessing.pdf
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Dmblog
DmblogDmblog
Dmblog
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Module-4_Part-II.pptx
Module-4_Part-II.pptxModule-4_Part-II.pptx
Module-4_Part-II.pptx
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptx
 

Mehr von Knoldus Inc.

Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingKnoldus Inc.
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionKnoldus Inc.
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxKnoldus Inc.
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptxKnoldus Inc.
 
GraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfGraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfKnoldus Inc.
 
NuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxNuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxKnoldus Inc.
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingKnoldus Inc.
 
K8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesK8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesKnoldus Inc.
 
Introduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptxIntroduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptxKnoldus Inc.
 
Robusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxRobusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxKnoldus Inc.
 
Optimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxOptimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxKnoldus Inc.
 
Azure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxAzure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxKnoldus Inc.
 
CQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxCQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxKnoldus Inc.
 
ETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationKnoldus Inc.
 
Scripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationScripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationKnoldus Inc.
 
Getting started with dotnet core Web APIs
Getting started with dotnet core Web APIsGetting started with dotnet core Web APIs
Getting started with dotnet core Web APIsKnoldus Inc.
 
Introduction To Rust part II Presentation
Introduction To Rust part II PresentationIntroduction To Rust part II Presentation
Introduction To Rust part II PresentationKnoldus Inc.
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Configuring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAConfiguring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAKnoldus Inc.
 
Advanced Python (with dependency injection and hydra configuration packages)
Advanced Python (with dependency injection and hydra configuration packages)Advanced Python (with dependency injection and hydra configuration packages)
Advanced Python (with dependency injection and hydra configuration packages)Knoldus Inc.
 

Mehr von Knoldus Inc. (20)

Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On Introduction
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptx
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptx
 
GraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfGraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdf
 
NuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxNuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptx
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable Testing
 
K8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesK8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose Kubernetes
 
Introduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptxIntroduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptx
 
Robusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxRobusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptx
 
Optimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxOptimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptx
 
Azure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxAzure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptx
 
CQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxCQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptx
 
ETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake Presentation
 
Scripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationScripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics Presentation
 
Getting started with dotnet core Web APIs
Getting started with dotnet core Web APIsGetting started with dotnet core Web APIs
Getting started with dotnet core Web APIs
 
Introduction To Rust part II Presentation
Introduction To Rust part II PresentationIntroduction To Rust part II Presentation
Introduction To Rust part II Presentation
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Configuring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAConfiguring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRA
 
Advanced Python (with dependency injection and hydra configuration packages)
Advanced Python (with dependency injection and hydra configuration packages)Advanced Python (with dependency injection and hydra configuration packages)
Advanced Python (with dependency injection and hydra configuration packages)
 

Kürzlich hochgeladen

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 

Kürzlich hochgeladen (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Call to Action for Generative AI in 2024

KNOLX_Data_preprocessing

  • 1. Presented By: Aayush Srivastava & Niraj Kumar Data Pre-processing & Steps Involved In It
  • 2. Lack of etiquette and manners is a huge turn off. KnolX Etiquettes Punctuality Join the session 5 minutes prior to the session start time. We start on time and conclude on time! Feedback Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter. Silent Mode Keep your mobile devices in silent mode; feel free to step out of the session in case you need to attend an urgent call. Avoid Disturbance Avoid unwanted chit-chat during the session.
  • 3. Our Agenda 01 What and Why Data Preprocessing 02 Data Cleaning 03 Data Transformation 04 Data Reduction 05 Data Integration 06 Demo
  • 4. Data Cleaning Services Good data preparation is key to producing valid and reliable models.
  • 5. What is Machine Learning? ● According to Arthur Samuel (1959), machine learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed. ● Machine learning (ML) is a category of algorithms that allows software applications to become more accurate at predicting outcomes without being explicitly programmed. ● The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, updating outputs as new data becomes available.
  • 6. Types of Machine Learning
  • 7.
  • 9. What is Data Pre-Processing ● Data preprocessing is an important step in ML. ● The phrase "garbage in, garbage out" is particularly applicable to data. ● It is the process of transforming raw data into a useful, understandable format. ● Real-world (raw) data usually has inconsistent formatting and human errors, and can also be incomplete. ● Data preprocessing resolves such issues and makes datasets more complete and efficient for data analysis. ● It is a crucial process that can affect the success of data mining and machine learning projects. ● It makes knowledge discovery from data sets faster and can ultimately affect the performance of machine learning models.
  • 11. Why Data Pre-Processing ● Data in the real world is "dirty": ○ incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate data ■ e.g. department = "" ○ noisy: containing errors or outliers ■ e.g. salary = "-10" ○ inconsistent: containing discrepancies in codes or names ■ e.g. Age = 42, Birthday = "27/02/1997" ● These mistakes, redundancies, missing values, and inconsistencies compromise the integrity of the data set. ● We need to fix all of these issues to get a more accurate outcome; otherwise, the system is likely to develop biases and deviations that produce a poor user experience.
  • 12. Data Understanding: Relevance of data • What data is available for the task? • Is this data relevant? • Is additional relevant data available? • How much historical data is available?
  • 14.
  • 15. Data Cleaning ● Data cleaning or cleansing is the process of cleaning datasets by accounting for missing values, removing outliers, correcting inconsistent data points, and smoothing noisy data. ● In essence, the motive behind data cleaning is to offer complete and accurate samples for machine learning models. Some effective data cleaning techniques: ● Remove duplicates ○ Duplicate entries are problematic for multiple reasons. ○ When an entry appears more than once, it receives a disproportionate weight during training, so models that succeed on frequent entries look like they perform well while in reality they do not. ● Remove irrelevant data ○ Data often comes from multiple sources, and there is a significant probability that a given table or database includes entries that do not really belong to our use case. In some cases, filtering out outdated entries will also be required.
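The deduplication and relevance filtering described above can be sketched in a few lines of Python; the records and the "US-only" relevance rule are invented purely for illustration:

```python
# Illustrative sketch: deduplicate records and drop irrelevant rows.
# Field names and the relevance criterion are made up for this example.
records = [
    {"name": "Ann", "country": "US", "age": 34},
    {"name": "Ann", "country": "US", "age": 34},   # exact duplicate
    {"name": "Bob", "country": "FR", "age": 29},
    {"name": "Eve", "country": "N/A", "age": 41},  # outside our use case
]

# Remove exact duplicates while preserving the original order.
seen = set()
deduped = []
for r in records:
    key = tuple(sorted(r.items()))  # hashable fingerprint of the record
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Filter out rows that do not belong to the (hypothetical) US-only use case.
relevant = [r for r in deduped if r["country"] == "US"]
print(len(deduped), len(relevant))  # 3 1
```

In a real project this is typically a one-liner with a dataframe library, but the logic is the same: fingerprint each record, keep the first occurrence, then filter on the use-case criterion.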
  • 16. Data Cleaning ● Fix errors ○ It probably goes without saying that we need to carefully remove any errors from our data. Errors as avoidable as typos could cause us to miss key findings in the data. Some of these can be avoided with something as simple as a quick spell-check. ○ Example: spelling mistakes or extra punctuation in data such as an email address could mean you miss out on communicating with your customers. It could also lead to you sending unwanted emails to people who didn't sign up for them. ● Handle missing values ○ Missing data is defined as values or data that are not stored (or not present) for some variable(s) in the given database.
  • 17. Data Cleaning ● Remove noisy data ○ Noisy data are random errors or variance in a measured variable. ○ Incorrect attribute values may be due to: ■ faulty data collection instruments ■ data entry problems ■ data transmission problems ■ technology limitations ■ inconsistency in naming conventions
  • 18. Handling missing data • Ignore the tuple (loss of information). • Fill in missing values manually: tedious, and often infeasible. • Fill them in automatically with a global constant, e.g. "unknown" (which may itself act as a new class!). • Imputation: use the attribute mean/median/mode, or the most probable value, to fill in the missing value.
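Mean imputation, the simplest of the options above, can be sketched as follows; here `None` is assumed to mark a missing value, and the salary figures are invented:

```python
# Minimal sketch of mean imputation for one numeric attribute.
# None marks a missing value (an assumed convention for this example).
salaries = [52000, None, 61000, None, 58000]

observed = [v for v in salaries if v is not None]
mean = sum(observed) / len(observed)  # (52000 + 61000 + 58000) / 3 = 57000.0

# Replace each missing value with the attribute mean.
imputed = [v if v is not None else mean for v in salaries]
print(imputed)  # [52000, 57000.0, 61000, 57000.0, 58000]
```

Median or mode imputation works the same way, just with a different summary statistic; the median is usually preferred when the attribute has outliers.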
  • 19. Handling noisy data Binning method: ● The binning method is used to smooth data, i.e. to handle noisy data. ● In this method, the data is first sorted, and then the sorted values are distributed into a number of buckets or bins. ● Because binning methods consult the neighbourhood of values, they perform local smoothing. ● Three kinds of smoothing: ○ Smoothing by bin means: each value in a bin is replaced by the mean value of the bin. ○ Smoothing by bin medians: each bin value is replaced by its bin's median value. ○ Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value.
  • 20. Handling noisy data • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 ((4+8+9+15)/4 = 9) - Bin 2: 23, 23, 23, 23 ((21+21+24+25)/4 ≈ 23) - Bin 3: 29, 29, 29, 29 ((26+28+29+34)/4 ≈ 29) * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
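The worked example above can be reproduced with a short Python sketch of equi-depth binning, smoothing by bin means, and smoothing by bin boundaries:

```python
# Sketch of the binning example: the price data from the slide.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4  # equi-depth bins of 4 values each
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace every value with the (rounded) bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the nearer of
# the bin's minimum or maximum.
by_bounds = []
for b in bins:
    lo, hi = b[0], b[-1]
    by_bounds.append([lo if v - lo <= hi - v else hi for v in b])

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

Both outputs match the hand-worked numbers on the slide (22.75 and 29.25 round to 23 and 29).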
  • 21. Data Integration ● Data integration ○ It combines data from multiple sources, stored using various technologies, and provides a unified view of the data. ● Schema integration ○ Integrate metadata from different sources. ○ Entity identification problem: identify real-world entities across multiple data sources, e.g. A.cust-id ≡ B.cust-# ● Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ, e.g. different scales, or metric vs. British units. ● Removing duplicates and redundant data.
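A tiny sketch of the two integration problems named above, entity identification (differing key names) and value conflicts (differing units); the sources, field names, and heights are all invented:

```python
# Sketch: integrate two sources where the entity key differs
# (cust_id vs cust_no, mirroring the A.cust-id ≡ B.cust-# example)
# and source B stores height in inches rather than centimetres.
source_a = [{"cust_id": 1, "name": "Ann", "height_cm": 170}]
source_b = [{"cust_no": 1, "height_in": 66.9}]

# Index source B by its key so lookups resolve the entity mapping.
b_index = {r["cust_no"]: r for r in source_b}

merged = []
for r in source_a:
    match = b_index.get(r["cust_id"], {})
    row = dict(r)
    # Resolve the unit conflict: convert inches to centimetres
    # before the two measurements can be compared or reconciled.
    if "height_in" in match:
        row["height_cm_b"] = round(match["height_in"] * 2.54, 1)
    merged.append(row)

print(merged)
```

With real data this join is usually done in a database or a dataframe library, but the two steps are the same: map the keys onto a common entity identifier, then normalize conflicting representations to one scale.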
  • 22. Data Transformation With data cleaning, we've already begun to modify our data, but data transformation begins the process of turning the data into the proper format(s) needed for analysis and other downstream processes. Data transformation strategies: ● Aggregation - Data aggregation is the process where data is collected and presented in a summarized format for statistical analysis. This typically involves computing sums, averages, maxima, etc. ● Feature scaling - Feature scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units.
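Both strategies fit in a few lines; below is a sketch of simple aggregation and min-max feature scaling to the range [0, 1], using an invented age column:

```python
# Sketch: aggregation plus min-max feature scaling (values invented).
ages = [20, 30, 40, 60]

# Aggregation: summarize the column with sum, average, and max.
total = sum(ages)              # 150
average = total / len(ages)    # 37.5
peak = max(ages)               # 60

# Min-max scaling: map each value into the fixed range [0, 1].
lo, hi = min(ages), max(ages)
scaled = [(v - lo) / (hi - lo) for v in ages]
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Min-max scaling is one common choice; standardization (subtracting the mean and dividing by the standard deviation) is the usual alternative when the data contains outliers.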
  • 23. Data Transformation ● Normalization - Data normalization is the method of organizing data to appear similar across all records and fields. In this technique we rescale each row of data to a length of 1. This is mainly useful for sparse datasets with lots of zeros, and generally results in higher-quality data. Normalization can be of 2 types: 1. L1 normalization: modifies the dataset values so that in each row the sum of the absolute values is 1. It is also known as least absolute deviations. 2. L2 normalization: modifies the dataset values so that in each row the sum of the squares is 1. It is also called least squares, and it penalises large values more heavily.
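The two row-wise normalizations can be sketched directly from their definitions; the row values here are invented:

```python
import math

# Sketch of row-wise L1 and L2 normalization (row values invented).
row = [3.0, 4.0]

l1 = sum(abs(v) for v in row)            # 7.0
l2 = math.sqrt(sum(v * v for v in row))  # 5.0

row_l1 = [v / l1 for v in row]  # absolute values now sum to 1
row_l2 = [v / l2 for v in row]  # squared values now sum to 1
print(row_l1, row_l2)
```

After L1 normalization the row is [3/7, 4/7]; after L2 normalization it is [0.6, 0.8], a unit-length vector, which is exactly the "rescale each row to a length of 1" idea from the slide.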
  • 24. Data Transformation ● Feature selection - Feature selection is the method of reducing the input variables to your model by using only relevant data. Benefits of feature selection: 1. Performing feature selection before data modeling reduces overfitting. 2. It increases the accuracy of the ML model. 3. It reduces training time.
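One of the simplest filter-style selection rules is to drop features with (near-)zero variance, since a constant column cannot help the model; the feature names and data below are invented:

```python
# Sketch of filter-style feature selection: drop zero-variance features.
# Feature names and values are invented for illustration.
data = {
    "age":    [25, 32, 47, 51],
    "active": [1, 1, 1, 1],        # constant column -> zero variance
    "income": [40, 55, 90, 120],
}

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Keep only features whose variance exceeds the threshold (here 0).
selected = [name for name, col in data.items() if variance(col) > 0.0]
print(selected)  # ['age', 'income']
```

Real pipelines usually go further, e.g. correlation-based filters or model-based selection, but the principle is the same: score each feature's relevance and keep only those that pass.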
  • 25. Data Reduction ● Dimensionality reduction, also known as dimension reduction, reduces the number of features or input variables in a dataset. ● The number of features or input variables of a dataset is called its dimensionality. ● The higher the number of features, the more troublesome it is to visualize the training dataset and create a predictive model. ● In some cases, most of these attributes are correlated, hence redundant; therefore, dimensionality reduction algorithms can be used to reduce the number of random variables and obtain a set of principal variables. Data reduction strategies: ● Dimensionality reduction (PCA): Principal Component Analysis, or PCA, is a dimensionality-reduction method often used on large data sets, transforming a large set of variables into a smaller one that still contains most of the information in the large set. ● Aggregation and clustering: 1. Remove redundant or closely associated features. 2. Partition the data set into clusters, and store only the cluster representations. 3. This can be very effective if the data is clustered, but not if the data is dirty. 4. There are many choices of clustering definitions and clustering algorithms.
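To make PCA concrete, here is a dependency-free sketch on a small 2-D dataset reduced to 1-D. For a 2×2 covariance matrix the leading eigenvector has a closed form, so no linear-algebra library is needed; the data points are illustrative only:

```python
import math

# Sketch of PCA on a tiny 2-D dataset, reducing to 1 dimension.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
          (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1),
          (1.5, 1.6), (1.1, 0.9)]

# Step 1: center the data on its mean.
n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Step 2: sample covariance matrix [[a, b], [b, c]].
a = sum(x * x for x, _ in centered) / (n - 1)
b = sum(x * y for x, y in centered) / (n - 1)
c = sum(y * y for _, y in centered) / (n - 1)

# Step 3: leading eigenvalue/eigenvector of the 2x2 symmetric matrix.
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
vx, vy = b, lam - a
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Step 4: project each point onto the first principal component (1-D).
scores = [x * vx + y * vy for x, y in centered]

# Fraction of total variance the single retained component explains.
explained = lam / (a + c)
print(round(explained, 3))
```

On this dataset a single component retains roughly 96% of the variance, which is why dropping the second dimension loses very little information; that trade-off is the whole point of PCA.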
  • 26. Data Reduction ● Sampling 1. Choose a representative subset of the data. 2. Simple random sampling may perform poorly when the data is skewed. 3. Instead, develop adaptive sampling methods: 4. Stratified sampling: divide the population into homogeneous subpopulations called strata based on specific characteristics (e.g., age, race, gender identity, location), 5. approximating the percentage of each class (or subpopulation of interest) found in the overall database.
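Stratified sampling can be sketched as follows: group the records by stratum, then draw the same fraction from each group so the sample preserves the population's class balance. The 80/20 class split and the 10% fraction are invented for illustration:

```python
import random

# Sketch of stratified sampling preserving an 80/20 class balance.
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
fraction = 0.1

rng = random.Random(42)  # fixed seed for reproducibility

# Group items by stratum label.
strata = {}
for label, item in population:
    strata.setdefault(label, []).append(item)

# Draw the same fraction from every stratum.
sample = []
for label, items in strata.items():
    k = round(len(items) * fraction)
    sample += [(label, x) for x in rng.sample(items, k)]

counts = {lbl: sum(1 for l, _ in sample if l == lbl) for lbl in strata}
print(counts)  # {'A': 8, 'B': 2}
```

A plain 10-item random sample could easily draw zero "B" records; the stratified version guarantees the 80/20 ratio survives, which is exactly the skew problem the slide warns about.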
  • 27. Thank You!