SlideShare ist ein Scribd-Unternehmen logo
1 von 13
WHY DATA PREPROCESSING?
Data in the real world is dirty

 incomplete: missing attribute values, lack of certain
attributes of interest, or containing only aggregate
data
 e.g., occupation=“”

 noisy: containing errors or outliers
 e.g., Salary=“-10”

 inconsistent: containing discrepancies in codes or
names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
MAJOR TASKS IN DATA PREPROCESSING

Data cleaning

 Fill in missing values, smooth noisy data, identify or remove
outliers and noisy data, and resolve inconsistencies

Data integration

 Integration of multiple databases, or files

Data transformation

 Normalization and aggregation

Data reduction

 Obtains reduced representation in volume but produces the same
or similar analytical results

Data discretization (for numerical data)
DATA CLEANING
Importance

 “Data cleaning is the number one problem in data
warehousing”

Data cleaning tasks – this routine attempts to
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
MISSING DATA
Data is not always available
 E.g., many tuples have no recorded values for several
attributes, such as customer income in sales data

Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 not register history or changes of the data
DATA INTEGRATION
Data integration:
 combines data from multiple sources(data cubes, multiple db or
flat files)
Issues during data integration
 Schema integration
 integrate metadata (about the data) from different sources
 Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id B.cust-#(same entity?)
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different
sources are different, e.g., different scales, metric vs. British
units
 Removing duplicates and redundant data
 An attribute can be derived from another table (annual revenue)
 Inconsistencies in attribute naming
DATA TRANSFORMATION
Smoothing: remove noise from data
(binning, clustering, regression)
Normalization: scaled to fall within a small, specified range
such as –1.0 to 1.0 or 0.0 to 1.0
Attribute/feature construction
 New attributes constructed / added from the given ones
Aggregation: summarization or aggregation operations
apply to data
Generalization: concept hierarchy climbing
 Low level/ primitive/raw data are replace by higher level
concepts
DATA REDUCTION STRATEGIES
Data is too big to work with – may takes
time, impractical or infeasible analysis
Data reduction techniques
 Obtain a reduced representation of the data set that
is much smaller in volume but yet produce the same
(or almost the same) analytical results

Data reduction strategies
 Data cube aggregation – apply aggregation
operations (data cube)
CLUSTERING
Partition data set into clusters, and one can store cluster representation
only
Can be very effective if data is clustered but not if data is “smeared”/

spread
There are many choices of clustering definitions and clustering algorithms.
We will discuss them later.
SAMPLING
Data reduction technique
 A large data set to be represented by much smaller
random sample or subset.

4 types
 Simple random sampling without replacement
(SRSWOR).
 Simple random sampling with replacement
(SRSWR).
 Develop adaptive sampling methods such as cluster
sample and stratified sample
DISCRETIZATION AND CONCEPT HIERARCHY
Discretization
 reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values

Concept hierarchies
 reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age)
by higher level concepts (such as young, middle-aged, or
senior)
SOME TECHNIQUES

-Binning methods – equal-width, equal-frequency
-Histogram
- Entropy-based methods
SUMMARY
Data preparation is a big issue for data
mining

Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization

Many methods have been proposed but
still an active area of research

Weitere ähnliche Inhalte

Was ist angesagt?

03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
purnimatm
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
kayathri02
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
bhagathk
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 

Was ist angesagt? (14)

Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
 
Data preprocess
Data preprocessData preprocess
Data preprocess
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data preprocessing in Data Mining
Data preprocessing  in Data MiningData preprocessing  in Data Mining
Data preprocessing in Data Mining
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 

Andere mochten auch (16)

Photos for Prohibition
Photos for ProhibitionPhotos for Prohibition
Photos for Prohibition
 
Herramientas 2
Herramientas 2Herramientas 2
Herramientas 2
 
Patrick2
Patrick2Patrick2
Patrick2
 
Chapter 5
Chapter 5Chapter 5
Chapter 5
 
Music video analysis
Music video analysisMusic video analysis
Music video analysis
 
Chapter 6(1)
Chapter 6(1)Chapter 6(1)
Chapter 6(1)
 
Chapter 4 (1)
Chapter 4 (1)Chapter 4 (1)
Chapter 4 (1)
 
Gun control
Gun controlGun control
Gun control
 
Chapter 1(1)
Chapter 1(1)Chapter 1(1)
Chapter 1(1)
 
Morae - Türkçe
Morae - TürkçeMorae - Türkçe
Morae - Türkçe
 
Fys video analysis (5)
Fys video analysis (5)Fys video analysis (5)
Fys video analysis (5)
 
El comercio exterior_en_mexico
El comercio exterior_en_mexicoEl comercio exterior_en_mexico
El comercio exterior_en_mexico
 
Amin aminoaxit
Amin aminoaxitAmin aminoaxit
Amin aminoaxit
 
Chapter 2(1)
Chapter 2(1)Chapter 2(1)
Chapter 2(1)
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Chuyên đề.este
Chuyên đề.esteChuyên đề.este
Chuyên đề.este
 

Ähnlich wie Data preprocessing (20)

Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
 
Preprocess
PreprocessPreprocess
Preprocess
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Datapreprocess
DatapreprocessDatapreprocess
Datapreprocess
 
data processing.pdf
data processing.pdfdata processing.pdf
data processing.pdf
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preparation
Data preparationData preparation
Data preparation
 

Kürzlich hochgeladen

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Kürzlich hochgeladen (20)

General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 

Data preprocessing

  • 1.
  • 2. WHY DATA PREPROCESSING? Data in the real world is dirty  incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data  e.g., occupation=“”  noisy: containing errors or outliers  e.g., Salary=“-10”  inconsistent: containing discrepancies in codes or names  e.g., Age=“42” Birthday=“03/07/1997”  e.g., Was rating “1,2,3”, now rating “A, B, C”  e.g., discrepancy between duplicate records
  • 3. MAJOR TASKS IN DATA PREPROCESSING Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies Data integration  Integration of multiple databases, or files Data transformation  Normalization and aggregation Data reduction  Obtains reduced representation in volume but produces the same or similar analytical results Data discretization (for numerical data)
  • 4. DATA CLEANING Importance  “Data cleaning is the number one problem in data warehousing” Data cleaning tasks – this routine attempts to  Fill in missing values  Identify outliers and smooth out noisy data  Correct inconsistent data  Resolve redundancy caused by data integration
  • 5. MISSING DATA Data is not always available  E.g., many tuples have no recorded values for several attributes, such as customer income in sales data Missing data may be due to  equipment malfunction  inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data may not be considered important at the time of entry  not register history or changes of the data
  • 6. DATA INTEGRATION Data integration:  combines data from multiple sources(data cubes, multiple db or flat files) Issues during data integration  Schema integration  integrate metadata (about the data) from different sources  Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id B.cust-#(same entity?)  Detecting and resolving data value conflicts  for the same real world entity, attribute values from different sources are different, e.g., different scales, metric vs. British units  Removing duplicates and redundant data  An attribute can be derived from another table (annual revenue)  Inconsistencies in attribute naming
  • 7. DATA TRANSFORMATION Smoothing: remove noise from data (binning, clustering, regression) Normalization: scaled to fall within a small, specified range such as –1.0 to 1.0 or 0.0 to 1.0 Attribute/feature construction  New attributes constructed / added from the given ones Aggregation: summarization or aggregation operations apply to data Generalization: concept hierarchy climbing  Low level/ primitive/raw data are replace by higher level concepts
  • 8. DATA REDUCTION STRATEGIES Data is too big to work with – may takes time, impractical or infeasible analysis Data reduction techniques  Obtain a reduced representation of the data set that is much smaller in volume but yet produce the same (or almost the same) analytical results Data reduction strategies  Data cube aggregation – apply aggregation operations (data cube)
  • 9. CLUSTERING Partition data set into clusters, and one can store cluster representation only Can be very effective if data is clustered but not if data is “smeared”/ spread There are many choices of clustering definitions and clustering algorithms. We will discuss them later.
  • 10. SAMPLING Data reduction technique  A large data set to be represented by much smaller random sample or subset. 4 types  Simple random sampling without replacement (SRSWOR).  Simple random sampling with replacement (SRSWR).  Develop adaptive sampling methods such as cluster sample and stratified sample
  • 11. DISCRETIZATION AND CONCEPT HIERARCHY Discretization  reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values Concept hierarchies  reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior)
  • 12. SOME TECHNIQUES -Binning methods – equal-width, equal-frequency -Histogram - Entropy-based methods
  • 13. SUMMARY Data preparation is a big issue for data mining Data preparation includes  Data cleaning and data integration  Data reduction and feature selection  Discretization Many methods have been proposed but still an active area of research