WHY DATA PREPROCESSING?
Data in the real world is dirty
incomplete: missing attribute values, lack of certain
attributes of interest, or containing only aggregate data
 e.g., occupation=“”
noisy: containing errors or outliers
 e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or
names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
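The three kinds of dirty data above can be checked for mechanically. The following is a minimal sketch, not a complete validator: the record layout, the `find_issues` helper, and the reading of "03/07/1997" as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

def find_issues(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    issues = []
    # Incomplete: a required attribute is missing or empty.
    if not record.get("occupation"):
        issues.append("incomplete: occupation is empty")
    # Noisy: a value that cannot be right, such as a negative salary.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: salary is negative")
    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    birthday, age = record.get("birthday"), record.get("age")
    if birthday is not None and age is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            issues.append("inconsistent: age does not match birthday")
    return issues

# Hypothetical record combining the slide's three examples.
record = {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
for issue in find_issues(record, today=date(2024, 1, 1)):
    print(issue)
```

Passing `today` explicitly keeps the check reproducible; a real pipeline would likely use the date on which the record was captured.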
MAJOR TASKS IN DATA PREPROCESSING
Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
Data integration
 Integration of multiple databases, or files
Data transformation
 Normalization and aggregation
Data reduction
 Obtain a representation that is reduced in volume but produces the same or
similar analytical results
Data discretization (for numerical data)
DATA CLEANING
Importance
 “Data cleaning is the number one problem in data
warehousing”
Data cleaning tasks – this routine attempts to
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
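Two of these tasks, identifying outliers and smoothing noisy data, can be sketched as follows. The slides do not say which methods to use; the interquartile-range rule and smoothing by bin means are assumed here as common textbook choices, and both function names are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([sum(chunk) / len(chunk)] * len(chunk))
    return smoothed

print(iqr_outliers([30, 32, 35, 31, 29, 33, 34, 400]))
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3))
```

Note that `smooth_by_bin_means` returns the values in sorted order, so it suits a single attribute viewed on its own rather than rows that must stay aligned with other columns.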
MISSING DATA
Data is not always available
 E.g., many tuples have no recorded values for several attributes,
such as customer income in sales data
Missing data may be due to
 equipment malfunction
 data that was inconsistent with other recorded data and was therefore deleted
 data not entered due to misunderstanding
 data not considered important at the time of entry
 history or changes of the data not being recorded
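One simple way to fill in a missing numeric value is to substitute the mean of the values that were recorded. This is a sketch of that one strategy only. The slides do not prescribe it, `None` as the missing-value marker is an assumption, and the income figures are invented for illustration.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no recorded values to compute a mean from")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

# Hypothetical customer incomes with two missing entries.
incomes = [50, None, 70, None]
print(fill_missing_with_mean(incomes))
```

Mean substitution keeps every tuple but shrinks the attribute's variance; dropping the tuple or predicting the value from other attributes are alternatives with different trade-offs.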
DATA INTEGRATION
Data integration:
 combines data from multiple sources (data cubes, multiple databases, or flat
files)
Issues during data integration
 Schema integration
 integrate metadata (about the data) from different sources
 Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-# (same entity?)
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different
sources are different, e.g., different scales, metric vs. British units
 Removing duplicates and redundant data
 An attribute can be derived from another table (annual revenue)
 Inconsistencies in attribute naming
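The integration issues above can be illustrated with two small sources. This sketch assumes the entity identification question has already been answered, that is, that source A's `cust_id` and source B's `cust_no` name the same customers. The field names, the weight attribute, and the 1% tolerance are hypothetical.

```python
POUNDS_PER_KG = 2.20462

def integrate(source_a, source_b, tolerance=0.01):
    """Merge customer rows from two sources and report data value conflicts."""
    # Schema integration: map B's attribute names and units onto A's.
    normalized_b = {
        row["cust_no"]: {"cust_id": row["cust_no"],
                         "weight_kg": row["weight_lb"] / POUNDS_PER_KG}
        for row in source_b
    }
    merged = {row["cust_id"]: dict(row) for row in source_a}
    conflicts = []
    for cust_id, row in normalized_b.items():
        if cust_id not in merged:
            merged[cust_id] = row  # entity seen only in source B
        else:
            known = merged[cust_id]["weight_kg"]
            # Same entity, different values: flag it instead of silently overwriting.
            if abs(known - row["weight_kg"]) > tolerance * known:
                conflicts.append(cust_id)
    return merged, conflicts

source_a = [{"cust_id": 1, "weight_kg": 70.0}, {"cust_id": 2, "weight_kg": 80.0}]
source_b = [{"cust_no": 1, "weight_lb": 154.3},
            {"cust_no": 2, "weight_lb": 200.0},
            {"cust_no": 3, "weight_lb": 110.0}]
merged, conflicts = integrate(source_a, source_b)
print(sorted(merged), conflicts)
```

Keying the merge on the customer identifier also removes duplicates: a customer present in both sources ends up as one merged row.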
DATA TRANSFORMATION
Smoothing: remove noise from data (binning, clustering, regression)
Normalization: attribute values are scaled to fall within a small, specified
range, such as -1.0 to 1.0 or 0.0 to 1.0
Attribute/feature construction
 New attributes constructed / added from the given ones
Aggregation: summarization or aggregation operations are applied to the data
Generalization: concept hierarchy climbing
 Low level / primitive / raw data are replaced by higher level concepts
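Normalization to a specified range is commonly done with min-max scaling, which maps the smallest value to the new minimum and the largest to the new maximum. The sketch below assumes that technique, since the slides name only the target ranges; the function name `min_max` is hypothetical.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

print(min_max([10, 20, 30]))             # range 0.0 to 1.0
print(min_max([10, 20, 30], -1.0, 1.0))  # range -1.0 to 1.0
```

Because the result depends on the observed minimum and maximum, a single extreme value compresses everything else, which is one reason outliers are usually handled before normalizing.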
DATA REDUCTION STRATEGIES
Data may be too big to work with: analysis may take a long time,
or be impractical or infeasible
Data reduction techniques
Obtain a reduced representation of the data set that is
much smaller in volume yet produces the same (or
almost the same) analytical results
Data reduction strategies
Data cube aggregation – apply aggregation operations
(data cube)
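Data cube aggregation can be pictured as rolling detailed rows up to a coarser level, for example quarterly sales up to annual totals. The sketch below shows that roll-up under stated assumptions: the `roll_up` helper, the field names, and the sales figures are invented for illustration.

```python
from collections import defaultdict

def roll_up(rows, level, measure="amount"):
    """Sum the measure after keeping only the dimensions named in level."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[dim] for dim in level)
        totals[key] += row[measure]
    return dict(totals)

# Hypothetical quarterly sales figures.
sales = [
    {"year": 2023, "quarter": "Q1", "amount": 100.0},
    {"year": 2023, "quarter": "Q2", "amount": 150.0},
    {"year": 2023, "quarter": "Q3", "amount": 200.0},
    {"year": 2023, "quarter": "Q4", "amount": 250.0},
    {"year": 2024, "quarter": "Q1", "amount": 120.0},
    {"year": 2024, "quarter": "Q2", "amount": 180.0},
]
annual = roll_up(sales, level=("year",))
print(annual)
```

Six rows reduce to two, and any analysis that only needs annual totals gives the same answer on the smaller table.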
CLUSTERING
Partition the data set into clusters and store only the cluster representation
Can be very effective if the data is clustered, but not if it is "smeared" / spread out
There are many choices of clustering definitions and clustering algorithms. We will
discuss them later.
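As a rough illustration of storing only the cluster representation, the sketch below runs a tiny one-dimensional k-means and keeps just each centroid and its member count. K-means is an assumed choice, since the slides leave the algorithm open; the starting centroids and the data values are invented.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means; returns (centroid, member_count) pairs as the stored summary."""
    centroids = list(centroids)
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return [(c, len(g)) for c, g in zip(centroids, groups)]

values = [1, 2, 3, 50, 51, 52]
summary = kmeans_1d(values, centroids=[0, 100])
print(summary)
```

Six values are replaced by two (centroid, count) pairs. With spread-out data the centroids would describe their members poorly, which is the "smeared" case mentioned above.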
SAMPLING
Data reduction technique
A large data set is represented by a much smaller
random sample or subset.
4 types
Simple random sampling without replacement
(SRSWOR).
Simple random sampling with replacement (SRSWR).
Adaptive sampling methods, such as cluster
sampling and stratified sampling
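The four sampling types can be sketched with Python's standard `random` module. The helper names, the proportional allocation used for the stratified sample, and the example data are assumptions made for illustration.

```python
import random
from collections import defaultdict

def srswor(data, n, rng):
    """Simple random sample without replacement: no item is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample with replacement: an item may be drawn repeatedly."""
    return rng.choices(data, k=n)

def group_by(data, key):
    groups = defaultdict(list)
    for item in data:
        groups[key(item)].append(item)
    return groups

def cluster_sample(data, key, n_clusters, rng):
    """Pick whole clusters at random and keep every member of each."""
    groups = group_by(data, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [item for name in chosen for item in groups[name]]

def stratified_sample(data, key, fraction, rng):
    """Draw the same fraction from every stratum so small groups stay represented."""
    sample = []
    for members in group_by(data, key).values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
people = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
print(len(srswor(people, 10, rng)), len(srswr(people, 10, rng)))
print(len(stratified_sample(people, key=lambda p: p[0], fraction=0.1, rng=rng)))
```

In the stratified case the 80/20 split of the population carries over to the sample (8 young, 2 senior), whereas a simple random sample of 10 could miss the smaller group entirely.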
DISCRETIZATION AND CONCEPT HIERARCHY
Discretization
 reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals. Interval
labels can then be used to replace actual data values
Concept hierarchies
 reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or senior)
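Climbing the concept hierarchy for age can be as simple as a lookup against cut points. The cut points used below (35 and 60) are assumptions chosen for illustration; the slides name the concepts but not the boundaries.

```python
def age_to_concept(age, cut_points=((35, "young"), (60, "middle-aged"))):
    """Map a numeric age to a higher level concept using ascending cut points."""
    for upper_bound, label in cut_points:
        if age < upper_bound:
            return label
    return "senior"

ages = [22, 35, 47, 72]
print([age_to_concept(a) for a in ages])
```

After the mapping the attribute has three distinct values instead of dozens, at the cost of no longer distinguishing, say, 36 from 59.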
SOME TECHNIQUES
- Binning methods: equal-width, equal-frequency
- Histogram
- Entropy-based methods
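The two binning methods differ in what they hold constant: equal-width binning gives every interval the same span, while equal-frequency binning gives every interval roughly the same number of values. A minimal sketch of both, with hypothetical function names and example data, returning a bin index for each input value:

```python
def equal_width_bins(values, k):
    """Split [min, max] into k intervals of equal span; return each value's bin index."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum lands exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bins by rank so each bin holds about len(values) / k values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(data, 3))
print(equal_frequency_bins(data, 3))
```

Equal-width bins are easy to interpret but can be left nearly empty by skewed data; equal-frequency bins stay balanced, though tied values may be split across neighbouring bins.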
SUMMARY
Data preparation is a big issue for data mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been proposed, but this is still an
active area of research
