SlideShare ist ein Scribd-Unternehmen logo
1 von 13
WHY DATA PREPROCESSING?
Data in the real world is dirty
incomplete: missing attribute values, lack of certain
attributes of interest, or containing only aggregate data
 e.g., occupation=“”
noisy: containing errors or outliers
 e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or
names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
MAJOR TASKS IN DATA PREPROCESSING
Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers
and noisy data, and resolve inconsistencies
Data integration
 Integration of multiple databases, or files
Data transformation
 Normalization and aggregation
Data reduction
 Obtains reduced representation in volume but produces the same or
similar analytical results
Data discretization (for numerical data)
DATA CLEANING
Importance
 “Data cleaning is the number one problem in data
warehousing”
Data cleaning tasks – this routine attempts to
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
MISSING DATA
Data is not always available
 E.g., many tuples have no recorded values for several attributes,
such as customer income in sales data
Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 not register history or changes of the data
DATA INTEGRATION
Data integration:
 combines data from multiple sources(data cubes, multiple db or flat
files)
Issues during data integration
 Schema integration
 integrate metadata (about the data) from different sources
 Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id  B.cust-#(same entity?)
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different
sources are different, e.g., different scales, metric vs. British units
 Removing duplicates and redundant data
 An attribute can be derived from another table (annual revenue)
 Inconsistencies in attribute naming
DATA TRANSFORMATION
Smoothing: remove noise from data (binning, clustering, regression)
Normalization: scaled to fall within a small, specified range such as
–1.0 to 1.0 or 0.0 to 1.0
Attribute/feature construction
 New attributes constructed / added from the given ones
Aggregation: summarization or aggregation operations apply to data
Generalization: concept hierarchy climbing
 Low level/ primitive/raw data are replace by higher level concepts
DATA REDUCTION STRATEGIES
Data is too big to work with – may takes time,
impractical or infeasible analysis
Data reduction techniques
Obtain a reduced representation of the data set that is
much smaller in volume but yet produce the same (or
almost the same) analytical results
Data reduction strategies
Data cube aggregation – apply aggregation operations
(data cube)
CLUSTERING
Partition data set into clusters, and one can store cluster representation only
Can be very effective if data is clustered but not if data is “smeared”/ spread
There are many choices of clustering definitions and clustering algorithms. We will
discuss them later.
SAMPLING
Data reduction technique
A large data set to be represented by much smaller
random sample or subset.
4 types
Simple random sampling without replacement
(SRSWOR).
Simple random sampling with replacement (SRSWR).
Develop adaptive sampling methods such as cluster
sample and stratified sample
DISCRETIZATION AND CONCEPT HIERARCHY
Discretization
 reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals. Interval
labels can then be used to replace actual data values
Concept hierarchies
 reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or senior)
SOME TECHNIQUES
-Binning methods – equal-width, equal-frequency
-Histogram
- Entropy-based methods
SUMMARY
Data preparation is a big issue for data mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been proposed but still an
active area of research

Weitere ähnliche Inhalte

Was ist angesagt?

03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
purnimatm
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
kayathri02
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
bhagathk
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 

Was ist angesagt? (14)

Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
 
Data preprocess
Data preprocessData preprocess
Data preprocess
 
Data preprocessing in Data Mining
Data preprocessing  in Data MiningData preprocessing  in Data Mining
Data preprocessing in Data Mining
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
 

Andere mochten auch

Spss basic1
Spss basic1Spss basic1
Spss basic1
UPM
 
Overview of DATA PREPROCESS..
Overview of DATA PREPROCESS..Overview of DATA PREPROCESS..
Overview of DATA PREPROCESS..
killerkarthic
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Slideshare
 
Pa 298 measures of correlation
Pa 298 measures of correlationPa 298 measures of correlation
Pa 298 measures of correlation
Maria Theresa
 
Costaatt spss presentation
Costaatt spss presentationCostaatt spss presentation
Costaatt spss presentation
kesterdavid
 
One Way Anova
One Way AnovaOne Way Anova
One Way Anova
shoffma5
 
Questionnaire Results and Analysis
Questionnaire Results and AnalysisQuestionnaire Results and Analysis
Questionnaire Results and Analysis
antonia-roberts
 

Andere mochten auch (20)

Spss basic1
Spss basic1Spss basic1
Spss basic1
 
Overview of DATA PREPROCESS..
Overview of DATA PREPROCESS..Overview of DATA PREPROCESS..
Overview of DATA PREPROCESS..
 
Preprocess
PreprocessPreprocess
Preprocess
 
Introduction to spss – part 1
Introduction to spss – part 1Introduction to spss – part 1
Introduction to spss – part 1
 
4 preprocess
4 preprocess4 preprocess
4 preprocess
 
Descriptive statistics ii
Descriptive statistics iiDescriptive statistics ii
Descriptive statistics ii
 
Descriptives & Graphing
Descriptives & GraphingDescriptives & Graphing
Descriptives & Graphing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
BID CE workshop 1 session 08 - Biodiversity Data Cleaning
BID CE workshop 1   session 08 - Biodiversity Data CleaningBID CE workshop 1   session 08 - Biodiversity Data Cleaning
BID CE workshop 1 session 08 - Biodiversity Data Cleaning
 
Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefine
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Correlation in simple terms
Correlation in simple termsCorrelation in simple terms
Correlation in simple terms
 
Pa 298 measures of correlation
Pa 298 measures of correlationPa 298 measures of correlation
Pa 298 measures of correlation
 
Correlation
CorrelationCorrelation
Correlation
 
Correlation in physical science
Correlation in physical science Correlation in physical science
Correlation in physical science
 
Costaatt spss presentation
Costaatt spss presentationCostaatt spss presentation
Costaatt spss presentation
 
Correlation VS Causation
Correlation VS CausationCorrelation VS Causation
Correlation VS Causation
 
Basic One-Way ANOVA
Basic One-Way ANOVABasic One-Way ANOVA
Basic One-Way ANOVA
 
One Way Anova
One Way AnovaOne Way Anova
One Way Anova
 
Questionnaire Results and Analysis
Questionnaire Results and AnalysisQuestionnaire Results and Analysis
Questionnaire Results and Analysis
 

Ähnlich wie Data preprocessing (20)

Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Datapreprocess
DatapreprocessDatapreprocess
Datapreprocess
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preparation
Data preparationData preparation
Data preparation
 
Data preparation
Data preparationData preparation
Data preparation
 
Data preparation
Data preparationData preparation
Data preparation
 

Kürzlich hochgeladen

Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Kürzlich hochgeladen (20)

Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 

Data preprocessing

  • 1.
  • 2. WHY DATA PREPROCESSING? Data in the real world is dirty incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data  e.g., occupation=“” noisy: containing errors or outliers  e.g., Salary=“-10” inconsistent: containing discrepancies in codes or names  e.g., Age=“42” Birthday=“03/07/1997”  e.g., Was rating “1,2,3”, now rating “A, B, C”  e.g., discrepancy between duplicate records
  • 3. MAJOR TASKS IN DATA PREPROCESSING Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies Data integration  Integration of multiple databases, or files Data transformation  Normalization and aggregation Data reduction  Obtains reduced representation in volume but produces the same or similar analytical results Data discretization (for numerical data)
  • 4. DATA CLEANING Importance  “Data cleaning is the number one problem in data warehousing” Data cleaning tasks – this routine attempts to  Fill in missing values  Identify outliers and smooth out noisy data  Correct inconsistent data  Resolve redundancy caused by data integration
  • 5. MISSING DATA Data is not always available  E.g., many tuples have no recorded values for several attributes, such as customer income in sales data Missing data may be due to  equipment malfunction  inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data may not be considered important at the time of entry  not register history or changes of the data
  • 6. DATA INTEGRATION Data integration:  combines data from multiple sources(data cubes, multiple db or flat files) Issues during data integration  Schema integration  integrate metadata (about the data) from different sources  Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id  B.cust-#(same entity?)  Detecting and resolving data value conflicts  for the same real world entity, attribute values from different sources are different, e.g., different scales, metric vs. British units  Removing duplicates and redundant data  An attribute can be derived from another table (annual revenue)  Inconsistencies in attribute naming
  • 7. DATA TRANSFORMATION Smoothing: remove noise from data (binning, clustering, regression) Normalization: scaled to fall within a small, specified range such as –1.0 to 1.0 or 0.0 to 1.0 Attribute/feature construction  New attributes constructed / added from the given ones Aggregation: summarization or aggregation operations apply to data Generalization: concept hierarchy climbing  Low level/ primitive/raw data are replace by higher level concepts
  • 8. DATA REDUCTION STRATEGIES Data is too big to work with – may takes time, impractical or infeasible analysis Data reduction techniques Obtain a reduced representation of the data set that is much smaller in volume but yet produce the same (or almost the same) analytical results Data reduction strategies Data cube aggregation – apply aggregation operations (data cube)
  • 9. CLUSTERING Partition data set into clusters, and one can store cluster representation only Can be very effective if data is clustered but not if data is “smeared”/ spread There are many choices of clustering definitions and clustering algorithms. We will discuss them later.
  • 10. SAMPLING Data reduction technique A large data set to be represented by much smaller random sample or subset. 4 types Simple random sampling without replacement (SRSWOR). Simple random sampling with replacement (SRSWR). Develop adaptive sampling methods such as cluster sample and stratified sample
  • 11. DISCRETIZATION AND CONCEPT HIERARCHY Discretization  reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values Concept hierarchies  reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior)
  • 12. SOME TECHNIQUES -Binning methods – equal-width, equal-frequency -Histogram - Entropy-based methods
  • 13. SUMMARY Data preparation is a big issue for data mining Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization Many methods have been proposed but still an active area of research