SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Data Preparation
By
Dr. Abhishek Kumar Singh
Data Issues
• Quality
• Accuracy
• Consistency
• Irrelevant data
Data Preparation
• Data preparation is the process of gathering,
combining, structuring and organizing data so
it can be used in business intelligence (BI),
analytics and data visualization applications.
• The components of data preparation include
data preprocessing, profiling, cleansing,
validation and transformation; it often also
involves pulling together data from different
internal systems and external sources.
Why Data Preperation
• There are several reasons why we need to
prepare the data.
– By preparing data, we actually prepare the miner
so that when using prepared data, the miner
produces better models faster.
– Good data is essential for producing efficient
models of any type.
– Data should be formatted according to required
software tool.
– Data need to be made adequate for given
method.
Benefits of Data Preparation
Data preparation helps:
• Fix errors quickly — Data preparation helps catch
errors before processing. After data has been
removed from its original source, these errors
become more difficult to understand and correct.
• Produce top-quality data — Cleaning and
reformatting datasets ensures that all data used in
analysis will be high quality.
• Make better business decisions — higher quality
data that can be processed and analyzed more
quickly and efficiently leads to more timely, efficient
and high quality business decisions.
Data preparation steps
1) Data Profiling
2) Data discretization
3) Data cleaning
4) Data integration
5) Data transformation
6) Data reduction
Data Profiling: Sourcing, selecting and
auditing appropriate data
Data Quality Measures
Accuracy Uniqueness
Integrity Consistency Density
Completeness Validity Scheme
Conformance
Uniformity
Data Profiling: Sourcing, selecting and
auditing appropriate data
• Assuring and improving data quality are two
of the primary reasons for data preprocessing.
• There are common criteria to measure and
evaluate the quality of data, which can be
categorized into two main elements; accuracy
and uniqueness.
Contd…
Data Profiling: Sourcing, selecting and
auditing appropriate data
• Accuracy is described as an aggregated value
over the quality criteria: Integrity, Consistency,
and Density.
• Intuitively this describes the extent to which
the data are an exact, uniform and complete
representation of the mini-world: the aspects
of the world that the data describe.
Contd…
Data Profiling: Sourcing, selecting and
auditing appropriate data
• Integrity: An integral data collection contains
representations of all the entities in the mini-
world and only of those.
• Access data from any source — no matter the
origin, format or narrative and integrating them
together. Increased access to data means less
manual work, faster insights and faster time to
value realized by your organization.
• Integrity requires both completeness and
validity.
Contd…
Data Profiling: Sourcing, selecting and
auditing appropriate data
• Completeness: Complete data give a
comprehensive representation of the mini-world
and contain no missing values.
• We achieve completeness within data cleansing by
correcting anomalies and not just deleting them.
• It is also possible that additional data are
generated, representing existing entities that are
currently unrepresented in the data.
• A problem with assessing completeness is that you
don't know what you don't know. As a result, there
are no known gold standard data, which can be
used as a reference to measure completeness.
Contd…
Data Profiling: Sourcing, selecting and
auditing appropriate data
• Validity: Data are valid when there are no
constraints violated.
• There are numerous mechanisms to increase
validity including mandatory fields, enforcing
unique values, and data schema/structure.
Contd…
Data Profiling: Sourcing, selecting and
auditing appropriate data
• Consistency: This quality concerns syntactic
anomalies as well as contradictions. The main
challenge concerning data consistency is
choosing which data source you trust for reliable
agreement among data across different sources.
– Schema conformance: This is especially true for the
relational database systems where the adherence of
domain formats relies on the user.
– Uniformity: is directly related to irregularities.
Contd…
Data Profiling: Sourcing, selecting and
auditing appropriate data
• Density: This criterion concerns the quotient of
missing values in the data. There still can be
non-existent values or properties that have to
be represented by null values having the exact
meaning of not being known.
The above three criteria of Integrity, Consistency,
and Density collectively represent the accuracy
measure.
Contd…
Data Cleansing
• Where the data contain noise or anomalies it may
be desirable to identify and remove outliers and
other suspect data points, or take other remedial
action.
• Data cleansing is defined as the process of
detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or
database. Data cleansing can also be referred to
as data cleaning, data scrubbing, or data
reconciliation.
Data Cleansing
• More precisely, the process of data cleansing could
be explained as a four-stage process:
– Define and identify errors in data such as
incompleteness, incorrectness, inaccuracy or
irrelevancy.
– Clean and rectify these errors by replacing, modifying,
or deleting them
– Document error instances and error types; and finally
– Measure and verify to see whether the cleansing meets
the user's specified tolerance limits in terms of
cleanliness.
Contd…
Data anomalies
• The term data anomaly describes any
distortion of data resulting from the data
collection process.
• From this perspective, anomalies include
duplication, inconsistency, missing values,
outliers, noisy data or any kind of distortion
that can cause data imperfections.
Data anomalies
Anomalies can be classified at a high level into three
categories:
• Syntactic Anomalies: describe characteristics concerning
the format and values used for the representation of the
entities. Syntactic anomalies include lexical errors, domain
format errors, syntactical errors, and irregularities.
• Semantic Anomalies: hinder the data collection from being
a comprehensive and non-redundant representation of the
mini-world. These types of anomalies include integrity
constraint violations, contradictions, duplicates and invalid
tuples.
• Coverage Anomalies: decrease the number of entities and
entity properties from the mini-world that is represented in
the data collection. Coverage anomalies are categorized as
missing values and missing tuples.
Contd…
Data cleansing process
1. Data Auditing: This first step mainly identifies the types of anomalies
that reduce data quality. Data auditing checks the data using validation
rules that are pre-specified, and then creates a report of the quality of
the data and its problems. We often apply some statistical tests in this
step for examining the data.
2. Workflow specification: The next step is to detect and eliminate
anomalies by a sequence of operations on the data. The information
collected from data auditing is then used to create a data-cleaning plan.
It identifies the causes of the dirty data and plans steps to resolve them.
3. Workflow execution: The data cleaning plan is executed, applying a
variety of methods on the data set.
4. Post-processing and controlling: The post-processing or control step
involves examination of the workflow results and performs exception
handling for the data mishandled by the workflow.
Dealing with Missing values
• One major task in data cleansing is dealing with
missing values. It is important to determine whether
the data have missing values and, if so, to ensure that
appropriate measures are taken to allow the learning
system to handle this situation.
• Handling data that contain missing values is crucial for
the data cleansing process and data wrangling in
general. In real-life data, most of existing data sets
contain missing values that were not introduced or
were lost in the recording process for many reasons.
Handling outliers
• An outlier is another type of data anomaly that
requires attention in the cleansing process.
Outliers are data that do not conform to the
overall data distribution.
• Outliers can be seen from two different
perspectives; first, they might be seen as glitches
in the data. Alternatively, they might be also seen
as interesting elements that could potentially
represent significant elements in the data.
Data Enrichment/Integration
• Existing data may be augmented through data
enrichment. This commonly involves sourcing of
additional information about the data points on which
data are already held. For example, customer data might
be enriched by obtaining socio-economic data about
individual customers.
• Data integration is a crucial task in data preparation.
Combining data from different sources is not trivial
especially when dealing with large amounts of data and
heterogeneous sources. Data are typically presented in
different forms (structured, semi- structured or
unstructured) as well as from different sources (web,
database) that could be stored locally or distributed.
Data Transformation
• It is frequently necessary to transform data from one
representation to another. There are many reasons for
changing representations:
– To generate symmetric distributions instead of the original
skewed distributions:
– Transformation improves visualisation of data that might
be tightly clustered relative to a few outliers
– Data are transformed to achieve better interpretability.
– Transformations are often used to improve the
compatibility of the data with assumptions underlying a
modelling process, for example, to linearize (straighten)
the relation between two variables whose relationship is
non-linear. Some of the data mining algorithms require the
relationship between data to be linear.
Discretization
• Discretization transforms continuous data into a
discrete form. This is useful in many cases for:
better data representation, data volume
reduction, better data visualization and
representing data at a various level of granularity
for data analysis. Data discretization approaches
are categorized as supervised, unsupervised,
bottom-up or top down. Approaches for data
discretization include Binning, Entropy based,
Nominal to numeric, 3-4-5 rule and Concept
hierarchy.
Data preparation example
• There are multiple values that are commonly used to represent the
virus. A virus like COVID-19 could be represented by ‘SAR-Cov2’,
‘Corona’, ‘Covid’ or ‘Covid-19’ to name a few.
• A data preparation tool could be used in this scenario to identify an
incorrect number of unique values (in the case of virus, a unique
count greater than suitable number in case of covid would raise a
flag, as there are only few names aligned with virus). These values
would then need to be standardized to use only an abbreviation or
only full spelling in every row.
Data binning
• Data binning, bucketing is a data pre-
processing method used to minimize the
effects of small observation errors. The
original data values are divided into small
intervals known as bins and then they are
replaced by a general value calculated for that
bin. This has a smoothing effect on the input
data and may also reduce the chances of over
fitting in case of small datasets.
• There are 2 methods of dividing data into bins.
• Equal Frequency Binning: bins have equal
frequency.
• Equal Width Binning: bins have equal width
with a range of each bin are defined as [min +
w], [min + 2w] …. [min + nw] where w = (max
— min) / (no of bins).
Importance of Data Binning: -
• Binning is used for reducing the cardinality of continuous
and discrete data.
• Binning groups related values together in bins to reduce the
number of distinct values.
• Binning can improve resource utilization and model build
response time dramatically without significant loss in
model quality.
• Binning can improve model quality by strengthening the
relationship between attributes.
• Supervised binning is a form of intelligent binning in which
important characteristics of the data are used to determine
the bin boundaries.
• In supervised binning, the bin boundaries are identified by
a single-predictor decision tree that takes into account the
joint distribution with the target. Supervised binning can be
used for both numerical and categorical attributes.
Advantages (Pros) of data smoothing
• Data smoothing clears the understandability of
different important hidden patterns in the data set.
• Data smoothing can be used to help predict trends.
Prediction is very helpful for getting the right decisions
at the right time.
• Data smoothing helps in getting accurate results from
the data.
Cons of data smoothing
• Data smoothing doesn’t always provide a clear
explanation of the patterns among the data.
• It is possible that certain data points being ignored by
focusing the other data points.

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
Data integration
Data integrationData integration
Data integration
 
Kdd process
Kdd processKdd process
Kdd process
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data Mining
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Preparation and Processing
Data Preparation and ProcessingData Preparation and Processing
Data Preparation and Processing
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
 
Data Modeling PPT
Data Modeling PPTData Modeling PPT
Data Modeling PPT
 
OLAP operations
OLAP operationsOLAP operations
OLAP operations
 
03. Data Exploration.pptx
03. Data Exploration.pptx03. Data Exploration.pptx
03. Data Exploration.pptx
 
Application of data mining
Application of data miningApplication of data mining
Application of data mining
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
 
Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
Data mining
Data mining Data mining
Data mining
 

Ähnlich wie Data Preparation.pptx

Data Analytics-Unit 1 , this Is ppt for student help
Data Analytics-Unit 1 , this Is ppt for student helpData Analytics-Unit 1 , this Is ppt for student help
Data Analytics-Unit 1 , this Is ppt for student help
SaurabhJaiswal790114
 
Data pre processing
Data pre processingData pre processing
Data pre processing
pommurajopt
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
wekineheshete
 

Ähnlich wie Data Preparation.pptx (20)

Data Analytics-Unit 1 , this Is ppt for student help
Data Analytics-Unit 1 , this Is ppt for student helpData Analytics-Unit 1 , this Is ppt for student help
Data Analytics-Unit 1 , this Is ppt for student help
 
Quality Assurance in Knowledge Data Warehouse
Quality Assurance in Knowledge Data WarehouseQuality Assurance in Knowledge Data Warehouse
Quality Assurance in Knowledge Data Warehouse
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
Ch~2.pdf
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
How do you assess the quality and reliability of data sources in data analysi...
How do you assess the quality and reliability of data sources in data analysi...How do you assess the quality and reliability of data sources in data analysi...
How do you assess the quality and reliability of data sources in data analysi...
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
Ch_2.pdf
Ch_2.pdfCh_2.pdf
Ch_2.pdf
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
omama munir 58.pptx
omama munir 58.pptxomama munir 58.pptx
omama munir 58.pptx
 
Chapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.pptChapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.ppt
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data mining
 
Datawarehousing Terminology
Datawarehousing TerminologyDatawarehousing Terminology
Datawarehousing Terminology
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdf
 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptx
 
Introduction to Data Analytics.pptx
Introduction to Data Analytics.pptxIntroduction to Data Analytics.pptx
Introduction to Data Analytics.pptx
 
Unit2
Unit2Unit2
Unit2
 
BDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxBDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptx
 
AIS PPt.pptx
AIS PPt.pptxAIS PPt.pptx
AIS PPt.pptx
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data Analytics
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 

Mehr von DrAbhishekKumarSingh3 (6)

Microsoft word.pptx
Microsoft word.pptxMicrosoft word.pptx
Microsoft word.pptx
 
Sorting and Filtering.pptx
Sorting and Filtering.pptxSorting and Filtering.pptx
Sorting and Filtering.pptx
 
Data Analysis Introduction.pptx
Data Analysis Introduction.pptxData Analysis Introduction.pptx
Data Analysis Introduction.pptx
 
BASIC STRUCTURE OF COMPUTERS.pptx
BASIC STRUCTURE OF COMPUTERS.pptxBASIC STRUCTURE OF COMPUTERS.pptx
BASIC STRUCTURE OF COMPUTERS.pptx
 
How to start writing a paper.pptx
How to start writing a paper.pptxHow to start writing a paper.pptx
How to start writing a paper.pptx
 
Optimization using lp.pptx
Optimization using lp.pptxOptimization using lp.pptx
Optimization using lp.pptx
 

Kürzlich hochgeladen

Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
amitlee9823
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
Nelamangala Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Nelamangala Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Nelamangala Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Nelamangala Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
daisycvs
 
Call Now ☎️🔝 9332606886🔝 Call Girls ❤ Service In Bhilwara Female Escorts Serv...
Call Now ☎️🔝 9332606886🔝 Call Girls ❤ Service In Bhilwara Female Escorts Serv...Call Now ☎️🔝 9332606886🔝 Call Girls ❤ Service In Bhilwara Female Escorts Serv...
Call Now ☎️🔝 9332606886🔝 Call Girls ❤ Service In Bhilwara Female Escorts Serv...
Anamikakaur10
 
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
lizamodels9
 
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
dlhescort
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usage
Matteo Carbone
 
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
amitlee9823
 
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
dollysharma2066
 

Kürzlich hochgeladen (20)

Call Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine ServiceCall Girls In Panjim North Goa 9971646499 Genuine Service
Call Girls In Panjim North Goa 9971646499 Genuine Service
 
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
 
Nelamangala Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Nelamangala Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Nelamangala Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Nelamangala Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Falcon's Invoice Discounting: Your Path to Prosperity
Falcon's Invoice Discounting: Your Path to ProsperityFalcon's Invoice Discounting: Your Path to Prosperity
Falcon's Invoice Discounting: Your Path to Prosperity
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
 
Call Now ☎️🔝 9332606886🔝 Call Girls ❤ Service In Bhilwara Female Escorts Serv...
Call Now ☎️🔝 9332606886🔝 Call Girls ❤ Service In Bhilwara Female Escorts Serv...Call Now ☎️🔝 9332606886🔝 Call Girls ❤ Service In Bhilwara Female Escorts Serv...
Call Now ☎️🔝 9332606886🔝 Call Girls ❤ Service In Bhilwara Female Escorts Serv...
 
Value Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsValue Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and pains
 
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
 
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
 
Organizational Transformation Lead with Culture
Organizational Transformation Lead with CultureOrganizational Transformation Lead with Culture
Organizational Transformation Lead with Culture
 
Uneak White's Personal Brand Exploration Presentation
Uneak White's Personal Brand Exploration PresentationUneak White's Personal Brand Exploration Presentation
Uneak White's Personal Brand Exploration Presentation
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usage
 
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...
Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...
Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...
 
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
 
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
 
Monthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptxMonthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptx
 
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
 

Data Preparation.pptx

  • 2. Data Issues • Quality • Accuracy • Consistency • Irrelevant data
  • 3. Data Preparation • Data preparation is the process of gathering, combining, structuring and organizing data so it can be used in business intelligence (BI), analytics and data visualization applications. • The components of data preparation include data preprocessing, profiling, cleansing, validation and transformation; it often also involves pulling together data from different internal systems and external sources.
  • 4. Why Data Preperation • There are several reasons why we need to prepare the data. – By preparing data, we actually prepare the miner so that when using prepared data, the miner produces better models faster. – Good data is essential for producing efficient models of any type. – Data should be formatted according to required software tool. – Data need to be made adequate for given method.
  • 5. Benefits of Data Preparation Data preparation helps: • Fix errors quickly — Data preparation helps catch errors before processing. After data has been removed from its original source, these errors become more difficult to understand and correct. • Produce top-quality data — Cleaning and reformatting datasets ensures that all data used in analysis will be high quality. • Make better business decisions — higher quality data that can be processed and analyzed more quickly and efficiently leads to more timely, efficient and high quality business decisions.
  • 6. Data preparation steps 1) Data Profiling 2) Data discretization 3) Data cleaning 4) Data integration 5) Data transformation 6) Data reduction
  • 7. Data Profiling: Sourcing, selecting and auditing appropriate data Data Quality Measures Accuracy Uniqueness Integrity Consistency Density Completeness Validity Scheme Conformance Uniformity
  • 8. Data Profiling: Sourcing, selecting and auditing appropriate data • Assuring and improving data quality are two of the primary reasons for data preprocessing. • There are common criteria to measure and evaluate the quality of data, which can be categorized into two main elements; accuracy and uniqueness. Contd…
  • 9. Data Profiling: Sourcing, selecting and auditing appropriate data • Accuracy is described as an aggregated value over the quality criteria: Integrity, Consistency, and Density. • Intuitively this describes the extent to which the data are an exact, uniform and complete representation of the mini-world: the aspects of the world that the data describe. Contd…
  • 10. Data Profiling: Sourcing, selecting and auditing appropriate data • Integrity: An integral data collection contains representations of all the entities in the mini- world and only of those. • Access data from any source — no matter the origin, format or narrative and integrating them together. Increased access to data means less manual work, faster insights and faster time to value realized by your organization. • Integrity requires both completeness and validity. Contd…
  • 11. Data Profiling: Sourcing, selecting and auditing appropriate data • Completeness: Complete data give a comprehensive representation of the mini-world and contain no missing values. • We achieve completeness within data cleansing by correcting anomalies and not just deleting them. • It is also possible that additional data are generated, representing existing entities that are currently unrepresented in the data. • A problem with assessing completeness is that you don't know what you don't know. As a result, there are no known gold standard data, which can be used as a reference to measure completeness. Contd…
  • 12. Data Profiling: Sourcing, selecting and auditing appropriate data • Validity: Data are valid when there are no constraints violated. • There are numerous mechanisms to increase validity including mandatory fields, enforcing unique values, and data schema/structure. Contd…
  • 13. Data Profiling: Sourcing, selecting and auditing appropriate data • Consistency: This quality concerns syntactic anomalies as well as contradictions. The main challenge concerning data consistency is choosing which data source you trust for reliable agreement among data across different sources. – Schema conformance: This is especially true for the relational database systems where the adherence of domain formats relies on the user. – Uniformity: is directly related to irregularities. Contd…
  • 14. Data Profiling: Sourcing, selecting and auditing appropriate data • Density: This criterion concerns the quotient of missing values in the data. There still can be non-existent values or properties that have to be represented by null values having the exact meaning of not being known. The above three criteria of Integrity, Consistency, and Density collectively represent the accuracy measure. Contd…
  • 15. Data Cleansing • Where the data contain noise or anomalies it may be desirable to identify and remove outliers and other suspect data points, or take other remedial action. • Data cleansing is defined as the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Data cleansing can also be referred to as data cleaning, data scrubbing, or data reconciliation.
  • 16. Data Cleansing • More precisely, the process of data cleansing could be explained as a four-stage process: – Define and identify errors in data such as incompleteness, incorrectness, inaccuracy or irrelevancy. – Clean and rectify these errors by replacing, modifying, or deleting them – Document error instances and error types; and finally – Measure and verify to see whether the cleansing meets the user's specified tolerance limits in terms of cleanliness. Contd…
  • 17. Data anomalies • The term data anomaly describes any distortion of data resulting from the data collection process. • From this perspective, anomalies include duplication, inconsistency, missing values, outliers, noisy data or any kind of distortion that can cause data imperfections.
  • 18. Data anomalies Anomalies can be classified at a high level into three categories: • Syntactic Anomalies: describe characteristics concerning the format and values used for the representation of the entities. Syntactic anomalies include lexical errors, domain format errors, syntactical errors, and irregularities. • Semantic Anomalies: hinder the data collection from being a comprehensive and non-redundant representation of the mini-world. These types of anomalies include integrity constraint violations, contradictions, duplicates and invalid tuples. • Coverage Anomalies: decrease the number of entities and entity properties from the mini-world that is represented in the data collection. Coverage anomalies are categorized as missing values and missing tuples. Contd…
  • 19. Data cleansing process 1. Data Auditing: This first step mainly identifies the types of anomalies that reduce data quality. Data auditing checks the data using validation rules that are pre-specified, and then creates a report of the quality of the data and its problems. We often apply some statistical tests in this step for examining the data. 2. Workflow specification: The next step is to detect and eliminate anomalies by a sequence of operations on the data. The information collected from data auditing is then used to create a data-cleaning plan. It identifies the causes of the dirty data and plans steps to resolve them. 3. Workflow execution: The data cleaning plan is executed, applying a variety of methods on the data set. 4. Post-processing and controlling: The post-processing or control step involves examination of the workflow results and performs exception handling for the data mishandled by the workflow.
  • 20. Dealing with Missing values • One major task in data cleansing is dealing with missing values. It is important to determine whether the data have missing values and, if so, to ensure that appropriate measures are taken to allow the learning system to handle this situation. • Handling data that contain missing values is crucial for the data cleansing process and data wrangling in general. In real-life data, most of existing data sets contain missing values that were not introduced or were lost in the recording process for many reasons.
  • 21. Handling outliers • An outlier is another type of data anomaly that requires attention in the cleansing process. Outliers are data that do not conform to the overall data distribution. • Outliers can be seen from two different perspectives; first, they might be seen as glitches in the data. Alternatively, they might be also seen as interesting elements that could potentially represent significant elements in the data.
  • 22. Data Enrichment/Integration • Existing data may be augmented through data enrichment. This commonly involves sourcing of additional information about the data points on which data are already held. For example, customer data might be enriched by obtaining socio-economic data about individual customers. • Data integration is a crucial task in data preparation. Combining data from different sources is not trivial especially when dealing with large amounts of data and heterogeneous sources. Data are typically presented in different forms (structured, semi- structured or unstructured) as well as from different sources (web, database) that could be stored locally or distributed.
  • 23. Data Transformation • It is frequently necessary to transform data from one representation to another. There are many reasons for changing representations: – To generate symmetric distributions instead of the original skewed distributions: – Transformation improves visualisation of data that might be tightly clustered relative to a few outliers – Data are transformed to achieve better interpretability. – Transformations are often used to improve the compatibility of the data with assumptions underlying a modelling process, for example, to linearize (straighten) the relation between two variables whose relationship is non-linear. Some of the data mining algorithms require the relationship between data to be linear.
  • 24. Discretization • Discretization transforms continuous data into a discrete form. This is useful in many cases for: better data representation, data volume reduction, better data visualization and representing data at a various level of granularity for data analysis. Data discretization approaches are categorized as supervised, unsupervised, bottom-up or top down. Approaches for data discretization include Binning, Entropy based, Nominal to numeric, 3-4-5 rule and Concept hierarchy.
  • 25. Data preparation example • There are multiple values that are commonly used to represent the virus. A virus like COVID-19 could be represented by ‘SAR-Cov2’, ‘Corona’, ‘Covid’ or ‘Covid-19’ to name a few. • A data preparation tool could be used in this scenario to identify an incorrect number of unique values (in the case of virus, a unique count greater than suitable number in case of covid would raise a flag, as there are only few names aligned with virus). These values would then need to be standardized to use only an abbreviation or only full spelling in every row.
  • 26. Data binning • Data binning, bucketing is a data pre- processing method used to minimize the effects of small observation errors. The original data values are divided into small intervals known as bins and then they are replaced by a general value calculated for that bin. This has a smoothing effect on the input data and may also reduce the chances of over fitting in case of small datasets.
  • 27. • There are 2 methods of dividing data into bins. • Equal Frequency Binning: bins have equal frequency. • Equal Width Binning: bins have equal width with a range of each bin are defined as [min + w], [min + 2w] …. [min + nw] where w = (max — min) / (no of bins).
  • 28. Importance of Data Binning: - • Binning is used for reducing the cardinality of continuous and discrete data. • Binning groups related values together in bins to reduce the number of distinct values. • Binning can improve resource utilization and model build response time dramatically without significant loss in model quality. • Binning can improve model quality by strengthening the relationship between attributes. • Supervised binning is a form of intelligent binning in which important characteristics of the data are used to determine the bin boundaries. • In supervised binning, the bin boundaries are identified by a single-predictor decision tree that takes into account the joint distribution with the target. Supervised binning can be used for both numerical and categorical attributes.
  • 29. Advantages (Pros) of data smoothing • Data smoothing clears the understandability of different important hidden patterns in the data set. • Data smoothing can be used to help predict trends. Prediction is very helpful for getting the right decisions at the right time. • Data smoothing helps in getting accurate results from the data. Cons of data smoothing • Data smoothing doesn’t always provide a clear explanation of the patterns among the data. • It is possible that certain data points being ignored by focusing the other data points.