SlideShare a Scribd company logo
1 of 18
Data Preprocessing

BY:
K.KOTTAISAMY
Data Preprocessing

Data cleaning
Data integration
Data transformation
Data reduction

Data discretization
Major tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies

Data integration
Integration of multiple databases, data cubes, or files

Data transformation
Normalization and aggregation
Data reduction
Reduced representation in volume but produces the same or
similar analytical results

Data discretization
Part of data reduction but with particular
importance, especially for numerical data
Data Preprocessing
Data Cleaning
Data in the Real World Is Dirty:
Lots of potentially incorrect data, e.g., instrument
faulty, human or computer error, transmission error
incomplete: lacking attribute values, containing only
aggregate data
e.g., Occupation = “ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary = “−10” (an error)
inconsistent: containing discrepancies in codes or
names, e.g.,
Age = “42”, Birthday = “03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Data Integration
Data integration:
combines data from multiple sources

Schema integration
integrate metadata from different sources

Detecting and resolving data value conflicts
for the same real world entity, attribute values from
different sources are different,
e.g., different scales, metric vs. British units

Removing duplicates and redundant data
Data Transformation
Smoothing:

remove noise from data

Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified
range
min-max normalization
z-score normalization
normalization by decimal scaling
Data Transformation: Normalization
Min-max normalization: to [new_minA, new_maxA]
v'

v minA
(new _ maxA new _ minA) new _ minA
maxA minA

Ex.

Let income range $12,000 to $98,000 normalized to
[0.0, 1.0]. Then $73,000 is mapped to

73,600
98,000

12 ,000
(1.0 0)
12 ,000

0

0.716
Data Transformation: Normalization

Z-score normalization (μ: mean, σ: standard
deviation):

v'

v

A
A

Ex. Let μ = 54,000, σ = 16,000. Then

73,600 54 ,000
16 ,000

1.225
Data Transformation: Normalization

Normalization by decimal scaling

v
v'
10 j
Where j is the smallest integer such that Max(|ν’|) < 1
Data reduction
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to
run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction — e.g., remove unimportant
attributes
Data Compression
Discretization and concept hierarchy generation
Data cube aggregation

The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of
interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Dimensionality Reduction

Feature selection (attribute subset selection):
Select a minimum set of attributes that is sufficient for
the data mining task.

Heuristic methods
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
Data Discretization

Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers

Discretization:
Some classification algorithms only accept
categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Data preprocessing

More Related Content

What's hot

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kambererror007
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALASaikiran Panjala
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 
Decision tree induction
Decision tree inductionDecision tree induction
Decision tree inductionthamizh arasi
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
5 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/25 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/2Fabio Fumarola
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Md. Main Uddin Rony
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data Visualization in Data Science
Data Visualization in Data ScienceData Visualization in Data Science
Data Visualization in Data ScienceMaloy Manna, PMP®
 
DATA PREPROCESSING AND DATA CLEANSING
DATA PREPROCESSING AND DATA CLEANSINGDATA PREPROCESSING AND DATA CLEANSING
DATA PREPROCESSING AND DATA CLEANSINGAhtesham Ullah khan
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecturepcherukumalla
 
Data preprocessing in Machine Learning
Data preprocessing in Machine LearningData preprocessing in Machine Learning
Data preprocessing in Machine LearningPyingkodi Maran
 
Introduction to Data Visualization
Introduction to Data Visualization Introduction to Data Visualization
Introduction to Data Visualization Ana Jofre
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretizationKrish_ver2
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHoang Nguyen
 

What's hot (20)

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Decision tree induction
Decision tree inductionDecision tree induction
Decision tree induction
 
Data preprocess
Data preprocessData preprocess
Data preprocess
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
5 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/25 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/2
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data Visualization
Data VisualizationData Visualization
Data Visualization
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Data Visualization in Data Science
Data Visualization in Data ScienceData Visualization in Data Science
Data Visualization in Data Science
 
Kdd process
Kdd processKdd process
Kdd process
 
DATA PREPROCESSING AND DATA CLEANSING
DATA PREPROCESSING AND DATA CLEANSINGDATA PREPROCESSING AND DATA CLEANSING
DATA PREPROCESSING AND DATA CLEANSING
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Data preprocessing in Machine Learning
Data preprocessing in Machine LearningData preprocessing in Machine Learning
Data preprocessing in Machine Learning
 
Introduction to Data Visualization
Introduction to Data Visualization Introduction to Data Visualization
Introduction to Data Visualization
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 

Similar to Data preprocessing

Data preprocessing
Data preprocessingData preprocessing
Data preprocessingvenkadesh236
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingvenkadesh236
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingvenkadesh236
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.pptRevathy V R
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingTony Nguyen
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingJames Wong
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingFraboni Ec
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingYoung Alista
 
Data MIning: Data processing
Data MIning: Data processingData MIning: Data processing
Data MIning: Data processingDatamining Tools
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
DatapreprocessingpptShree Hari
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ngsaranya12345
 
Data pre processing
Data pre processingData pre processing
Data pre processingjunnubabu
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPTANUSUYA T K
 

Similar to Data preprocessing (20)

Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data MIning: Data processing
Data MIning: Data processingData MIning: Data processing
Data MIning: Data processing
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data Mining
Data MiningData Mining
Data Mining
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 

Recently uploaded

9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 

Recently uploaded (20)

9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 

Data preprocessing

  • 2. Data Preprocessing Data cleaning Data integration Data transformation Data reduction Data discretization
  • 3. Major tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation
  • 4. Data reduction Reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data
  • 6. Data Cleaning Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty, human or computer error, transmission error incomplete: lacking attribute values, containing only aggregate data e.g., Occupation = “ ” (missing data) noisy: containing noise, errors, or outliers e.g., Salary = “−10” (an error)
  • 7. inconsistent: containing discrepancies in codes or names, e.g., Age = “42”, Birthday = “03/07/2010” Was rating “1, 2, 3”, now rating “A, B, C” discrepancy between duplicate records Intentional (e.g., disguised missing data) Jan. 1 as everyone’s birthday?
  • 8. Data Integration Data integration: combines data from multiple sources Schema integration integrate metadata from different sources Detecting and resolving data value conflicts for the same real world entity, attribute values from different sources are different, e.g., different scales, metric vs. British units Removing duplicates and redundant data
  • 9. Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range min-max normalization z-score normalization normalization by decimal scaling
  • 10. Data Transformation: Normalization Min-max normalization: to [new_minA, new_maxA] v' v minA (new _ maxA new _ minA) new _ minA maxA minA Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,000 is mapped to 73,600 98,000 12 ,000 (1.0 0) 12 ,000 0 0.716
  • 11. Data Transformation: Normalization Z-score normalization (μ: mean, σ: standard deviation): v' v A A Ex. Let μ = 54,000, σ = 16,000. Then 73,600 54 ,000 16 ,000 1.225
  • 12. Data Transformation: Normalization Normalization by decimal scaling v v' 10 j Where j is the smallest integer such that Max(|ν’|) < 1
  • 13. Data reduction Why data reduction? A database/data warehouse may store terabytes of data Complex data analysis/mining may take a very long time to run on the complete data set Data reduction Obtain a reduced representation of the data set that is much smaller in volume but yet produce the same analytical results
  • 14. Data reduction strategies Data cube aggregation Dimensionality reduction — e.g., remove unimportant attributes Data Compression Discretization and concept hierarchy generation
  • 15. Data cube aggregation The lowest level of a data cube (base cuboid) The aggregated data for an individual entity of interest E.g., a customer in a phone calling data warehouse Multiple levels of aggregation in data cubes Further reduce the size of data to deal with
  • 16. Dimensionality Reduction Feature selection (attribute subset selection): Select a minimum set of attributes that is sufficient for the data mining task. Heuristic methods step-wise forward selection step-wise backward elimination combining forward selection and backward elimination
  • 17. Data Discretization Three types of attributes: Nominal — values from an unordered set Ordinal — values from an ordered set Continuous — real numbers Discretization: Some classification algorithms only accept categorical attributes. Reduce data size by discretization Prepare for further analysis