Data preprocessing

Content
• What is data preprocessing, and why preprocess the data?
• Data cleaning
• Data integration
• Data transformation
• Data reduction

PAAS Group
Data preprocessing is a data mining technique that transforms raw data into an understandable format.

Why preprocess the data?

Data Preprocessing
• Data in the real world is:
– incomplete: lacking attribute values or certain attributes of interest
– noisy: containing errors or outliers
– inconsistent: lacking compatibility or agreement between two or more facts

• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data

Measures of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility

Data preprocessing techniques
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction

Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

• Data integration
– Integration of multiple databases, data cubes, or files

• Data transformation
– Normalization and aggregation

• Data reduction
– Obtains a reduced representation that is smaller in volume but produces the same or similar analytical results
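
The normalization task named in the list above can be illustrated with min-max scaling. This is only a sketch: the slides do not say which normalization method is meant, and the function name `min_max_normalize` and the sample income values are assumptions made for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map every value to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Hypothetical attribute values, used only to exercise the function.
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
```

After scaling, the smallest value maps to the lower bound and the largest to the upper bound, so attributes measured on very different scales become comparable.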

Data Cleaning
“Data cleaning attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in real-world data.”

Data Cleaning - Missing Values
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant
• Use the attribute mean
• Use the most probable value (decision tree, Bayesian formalism)
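
Three of the strategies above (ignoring the tuple, using a global constant, using the attribute mean) can be sketched in a few lines. The sketch assumes that a missing value is encoded as `None` and that tuples are dictionaries; the helper names and the sample records are hypothetical.

```python
from statistics import mean

MISSING = None  # assumption: missing values are encoded as None


def drop_incomplete(rows):
    """Ignore the tuple: keep only rows that have no missing value."""
    return [r for r in rows if MISSING not in r.values()]


def fill_constant(rows, attr, constant="Unknown"):
    """Use a global constant in place of each missing value of attr."""
    return [{**r, attr: constant if r[attr] is MISSING else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Use the mean of the observed values of attr for each missing value."""
    observed = [r[attr] for r in rows if r[attr] is not MISSING]
    attr_mean = mean(observed)
    return [{**r, attr: attr_mean if r[attr] is MISSING else r[attr]} for r in rows]


# Hypothetical records, used only to exercise the helpers.
customers = [
    {"name": "a", "income": 30000},
    {"name": "b", "income": None},
    {"name": "c", "income": 50000},
]
```

Each helper returns a new list and leaves the input untouched, which makes it easy to compare strategies on the same data.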

Data Cleaning - Noisy Data
• Binning
• Clustering
• Combined computer and human inspection
• Regression
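
As an illustration of binning, the sketch below smooths sorted values by bin means, using equal-frequency bins. The slides do not specify a binning variant, so this choice, the function name, and the sample prices are assumptions.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Hypothetical price data, used only to exercise the function.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
# smoothed_prices == [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```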

Data Cleaning - Inconsistent Data
• Manually, using external references
• Knowledge engineering tools

A Few Important Terms
• Discrepancy Detection
– Human Error
– Data Decay
– Deliberate Errors

• Metadata
• Unique Rules
• Null Rules

Data Integration
“Data integration combines data from multiple sources into a coherent data store (data warehouse).”

Data Integration - Issues
• Entity identification problem
• Redundancy
• Tuple duplication
• Detecting data value conflicts

Data Transformation
“Transforming or consolidating data into a form suitable for mining is known as data transformation.”

Handling Redundant Data in Data Integration
• Redundant data often occur when multiple databases are integrated
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data can often be detected by correlation analysis
• Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
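
Correlation analysis for redundancy detection can be sketched with the Pearson correlation coefficient. The slides do not name a specific measure, so Pearson's coefficient is an assumption here, as are the 0.95 threshold and the sample revenue figures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists.
    Raises ZeroDivisionError if either list has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical attributes: annual revenue is derived from monthly revenue.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson(monthly_revenue, annual_revenue)
is_redundant = abs(r) > 0.95  # assumed threshold, to be tuned per data set
```

A coefficient close to 1 or -1 suggests that one attribute carries almost the same information as the other, so one of them is a candidate for removal.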
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
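
Aggregation and generalization can be sketched on a small sales table. The record layout, the city-to-country concept hierarchy, and the function names below are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Delhi": "India", "Mumbai": "India", "Paris": "France"}

# Hypothetical records: (location, year, quarter, amount).
sales = [
    ("Delhi", 2023, "Q1", 100),
    ("Delhi", 2023, "Q2", 150),
    ("Mumbai", 2023, "Q1", 200),
    ("Paris", 2023, "Q1", 80),
]


def aggregate_by_year(records):
    """Aggregation: summarize quarterly amounts into yearly totals per location."""
    totals = defaultdict(int)
    for location, year, _quarter, amount in records:
        totals[(location, year)] += amount
    return dict(totals)


def generalize_location(records, hierarchy):
    """Generalization: climb the concept hierarchy from city to country."""
    return [(hierarchy[loc], year, quarter, amount)
            for loc, year, quarter, amount in records]


by_city = aggregate_by_year(sales)
by_country = aggregate_by_year(generalize_location(sales, CITY_TO_COUNTRY))
```

Applying generalization before aggregation yields country-level totals, which is one way to read "concept hierarchy climbing" in combination with summarization.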
Data Reduction

“Data reduction techniques are applied to obtain a
reduced representation of the data set that is much
smaller in volume, yet closely maintains the integrity
of base data.”

Data Reduction - Strategies
• Data cube aggregation
• Dimension reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation

Example of Decision Tree Induction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
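[Figure: induced decision tree. The root tests A4, the internal nodes test A6 and A1, and the leaves are labeled Class 1 and Class 2.]
Attributes that do not appear in the tree are treated as irrelevant and dropped.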
Reduced attribute set: {A1, A4, A6}
Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension
• Related to quantization problems
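
A histogram used for data reduction can be sketched with equal-width buckets that keep only a count and a sum per bucket. Equal-width partitioning is an assumption (the slides do not name a bucketing rule), and the function name and sample prices are hypothetical.

```python
def equal_width_histogram(values, num_buckets):
    """Partition values into equal-width buckets; keep only count and sum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    if width == 0:
        width = 1  # all values identical; any positive width works
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(num_buckets)
    ]
    for v in values:
        # Clamp so that the maximum value falls into the last bucket.
        index = min(int((v - lo) / width), num_buckets - 1)
        buckets[index]["count"] += 1
        buckets[index]["sum"] += v
    return buckets


# Hypothetical price data, used only to exercise the function.
prices = [1, 1, 5, 5, 5, 8, 10, 14, 15, 18, 20, 21, 25, 30]
histogram = equal_width_histogram(prices, num_buckets=3)
averages = [b["sum"] / b["count"] for b in histogram if b["count"]]
```

Storing three (count, sum) pairs instead of fourteen values is the reduction; the per-bucket average can be recomputed from the pair when needed.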

Clustering
• Partition the data set into clusters and store only the cluster representation
• Can be very effective if the data is clustered
• Clustering can be hierarchical, with the clusters stored in multidimensional index tree structures
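
Storing only a cluster representation can be sketched with a small one-dimensional k-means. The slides do not name a clustering algorithm, so k-means, the deterministic initialization, the (centroid, count) summary, and the sample readings are all assumptions.

```python
def assign(values, centroids):
    """Put each value into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, k, iterations=20):
    """Cluster 1-D values and return one (centroid, count) summary per cluster."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: take k evenly spaced points from the sorted data.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(values, centroids)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    clusters = assign(values, centroids)
    return [{"centroid": centroids[i], "count": len(clusters[i])} for i in range(k)]


# Hypothetical readings that form two well-separated groups.
readings = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
summary = kmeans_1d(readings, k=2)
```

Here six values are replaced by two summaries, which works well because the data is clearly clustered; on evenly spread data the summaries would lose much more information.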

Sampling
• Allows a large data set to be represented by a much smaller random sample (subset) of the data.
• Let a large data set D contain N tuples.
• Methods to reduce data set D:
– Simple random sample without replacement (SRSWOR)
– Simple random sample with replacement (SRSWR)
– Cluster sample
– Stratified sample
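
The four sampling methods can be sketched with Python's standard `random` module. The function names, the fixed seed, the age-group strata, and the block-of-ten clusters are assumptions made for illustration.

```python
import random


def srswor(data, s, rng):
    """Simple random sample without replacement: s distinct tuples."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample with replacement: a tuple may be drawn repeatedly."""
    return [rng.choice(data) for _ in range(s)]


def cluster_sample(clusters, s, rng):
    """Draw s whole clusters without replacement and keep all their tuples."""
    chosen = rng.sample(clusters, s)
    return [t for cluster in chosen for t in cluster]


def stratified_sample(data, key, fraction, rng):
    """Draw the same fraction from every stratum (at least one tuple each)."""
    strata = {}
    for t in data:
        strata.setdefault(key(t), []).append(t)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(42)  # fixed seed so that the sketch is repeatable

# Hypothetical data set D with N = 100 tuples of (id, age_group).
D = [(i, "young" if i < 20 else "senior") for i in range(100)]
pages = [D[i:i + 10] for i in range(0, 100, 10)]  # assumed clusters of 10 tuples

without_replacement = srswor(D, 5, rng)
with_replacement = srswr(D, 5, rng)
from_clusters = cluster_sample(pages, 2, rng)
from_strata = stratified_sample(D, key=lambda t: t[1], fraction=0.1, rng=rng)
```

Stratified sampling keeps the small "young" group represented in proportion to its size, which a simple random sample of the same size does not guarantee.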

Sampling
[Figure: SRSWOR (simple random sample without replacement) and SRSWR samples drawn from the raw data]

Sampling
[Figure: cluster/stratified sample drawn from the raw data]

