SlideShare ist ein Scribd-Unternehmen logo
1 von 19
-
1
Intro to Data Warehousing
Data Warehousing vs Data Mining & Data
Preprocessing in Data Mining
Ch Anwar ul Hassan (Lecturer)
Department of Computer Science and Software
Engineering
Capital University of Sciences & Technology, Islamabad
Pakistan
anwarchaudary@gmail.com
Slide 2
• Data Science is an area
• A data warehouse is built to support management
functions whereas data mining is used to extract
useful information and patterns from data. Data
warehousing is the process of compiling information
into a data warehouse.
Difference between Data Warehousing and Data
Mining
Slide 3
• It is a technology that aggregates structured data from one or
more sources so that it can be compared and analyzed rather
than transaction processing. A data warehouse is designed to
support management decision-making process by providing a
platform for data cleaning, data integration and data
consolidation. A data warehouse contains subject-oriented,
integrated, time-variant and non-volatile data.
Data Warehousing
Slide 4
• It is the process of finding patterns and correlations within
large data sets to identify relationships between data. Data
mining tools allow a business organization to predict customer
behavior. Data mining tools are used to build risk models and
detect fraud. Data mining is used in market analysis and
management, fraud detection, corporate analysis and risk
management.
Data Mining
Slide 5
• Data preprocessing is a data mining technique which is used to
transform the raw data in a useful and efficient format.
Data Preprocessing in Data Mining
Slide 6
1.Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some data is missing in the data. It can be handled in
various ways. Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values
manually, by attribute mean or the most probable value.
Steps Involved in Data Preprocessing:
Slide 7
(b). Noisy Data:
Noisy data is a meaningless data that can’t be interpreted by machines.It can be
generated due to faulty data collection, data entry errors etc. It can be handled in
following ways :
Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are performed to
complete the task. Each segmented is handled separately. One can replace all data
in a segment by its mean or boundary values can be used to complete the task.
Data Preprocessing in Data Mining
Slide 8
Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
Clustering:
This approach groups the similar data in a cluster. The outliers may be undetected
or it will fall outside the clusters.
Data Preprocessing in Data Mining
Slide 9
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable
for mining process. This involves following ways:
Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or
0.0 to 1.0)
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes
to help the mining process.
Data Preprocessing in Data Mining
Slide 10
Discretization:
• This is done to replace the raw values of numeric attribute by interval levels
or conceptual levels (interval variable is a measurement variable).
Concept Hierarchy Generation:
• Here attributes are converted from level to higher level in hierarchy. For
Example-The attribute “city” can be converted to “country”.
Data Preprocessing in Data Mining
Slide 11
3. Data Reduction:
Since data mining is a technique that is used to handle huge amount of data.
While working with huge volume of data, analysis became harder in such cases.
In order to get rid of this, we uses data reduction technique. It aims to increase
the storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
Attribute Subset Selection:
The highly relevant attributes should be used, rest all can be discarded. For
performing attribute selection, one can use level of significance and p- value of
the attribute. The attribute having p-value greater than significance level can be
discarded.
Data Preprocessing in Data Mining
Slide 12
Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
This technique is used to aggregate data in a simpler form. For example,
imagine that information you gathered for your analysis for the years 2012 to
2014, that data includes the revenue of your company every three months. They
involve you in the annual sales, rather than the quarterly average, So we can
summarize the data in such a way that the resulting data summarizes the total
sales per year instead of per quarter. It summarizes the data.
Numerosity Reduction:
In this reduction technique the actual data is replaced with mathematical models
or smaller representation of the data instead of actual data, it is important to
only store the model parameter. Or non-parametric method such as clustering,
histogram, sampling.
Data Preprocessing in Data Mining
Slide 13
Dimensionality Reduction:
Whenever we come across any data which is weakly important, then we use
the attribute required for our analysis. It reduces data size as it eliminates
outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes later on we decide best of
the original attributes on the set based on their relevance to other attributes. We
know it as a p-value in statistics.
Suppose there are the following attributes in the data set in which few attributes
are redundant.
Data Preprocessing in Data Mining
Slide 14
Step-wise Backward Selection –
This selection starts with a set of complete attributes in the original data and at
each point, it eliminates the worst remaining attribute in the set.
Suppose there are the following attributes in the data set in which few attributes
are redundant.
Data Preprocessing in Data Mining
Slide 15
Step-wise Backward Selection –
This selection starts with a set of complete attributes in the original data and at
each point, it eliminates the worst remaining attribute in the set.
Suppose there are the following attributes in the data set in which few attributes
are redundant.
Combination of forwarding and Backward Selection –
It allows us to remove the worst and select best attributes, saving time and
making the process faster.
Data Preprocessing in Data Mining
Slide 16
Data Compression:
The data compression technique reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two
types based on their compression techniques.
Lossless Compression –
Encoding techniques allows a simple and minimal data size reduction. Lossless data
compression uses algorithms to restore the precise original data from the compressed
data.
Lossy Compression –
Methods such as Discrete Wavelet transform technique, principal component
analysis) are examples of this compression. For e.g., JPEG image format is a lossy
compression, but we can find the meaning equivalent to the original the image. In
lossy-data compression, the decompressed data may differ to the original data but are
useful enough to retrieve information from them.
Data Preprocessing in Data Mining
Slide 17
Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide the attributes of the continuous
nature into data with intervals. We replace many constant values of the attributes by
labels of small intervals. This means that mining results are shown in a concise, and
easily understandable way.
Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points)
to divide the whole set of attributes and repeat of this method up to the end, then the
process is known as top-down discretization also known as splitting.
Bottom-up discretization –
If you first consider all the constant values as split-points, some are discarded through
a combination of the neighborhood values in the interval, that process is called
bottom-up discretization.
Data Preprocessing in Data Mining
Slide 18
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such
as 43 for age) to high-level concepts (categorical variables such as middle age or
Senior).
For numeric data following techniques can be followed:
Binning – is the process of changing numerical variables into categorical
counterparts. The number of categorical counterparts depends on the number of bins
specified by the user.
Data Preprocessing in Data Mining
Slide 19
Histogram analysis – Like the process of binning, the histogram is used to partition
the value for the attribute X, into disjoint ranges called brackets. There are several
partitioning rules:
Equal Frequency partitioning: Partitioning the values based on their number of
occurrences in the data set.
Equal Width partitioning : partitioning the values in a fixed gap based on the
number of bins i.e. a set of values ranging from 0-20.
Clustering: Grouping the similar data together.
Data Preprocessing in Data Mining

Weitere ähnliche Inhalte

Was ist angesagt?

Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data MiningIffat Firozy
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingkayathri02
 
Distributed Decision Tree Induction
Distributed Decision Tree InductionDistributed Decision Tree Induction
Distributed Decision Tree Inductiongregoryg
 
International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES) International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES) irjes
 
Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27IJARIIE JOURNAL
 
Data Reduction
Data ReductionData Reduction
Data ReductionRajan Shah
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction StratergiesAnjaliSoorej
 
Multidimentional data model
Multidimentional data modelMultidimentional data model
Multidimentional data modeljagdish_93
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 
Decision tree clustering a columnstores tuple reconstruction
Decision tree clustering  a columnstores tuple reconstructionDecision tree clustering  a columnstores tuple reconstruction
Decision tree clustering a columnstores tuple reconstructioncsandit
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data MiningR A Akerkar
 

Was ist angesagt? (18)

Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Distributed Decision Tree Induction
Distributed Decision Tree InductionDistributed Decision Tree Induction
Distributed Decision Tree Induction
 
International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES) International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)
 
Descriptive Analytics: Data Reduction
 Descriptive Analytics: Data Reduction Descriptive Analytics: Data Reduction
Descriptive Analytics: Data Reduction
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
 
Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27
 
Data Reduction
Data ReductionData Reduction
Data Reduction
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction Stratergies
 
Data Mining
Data MiningData Mining
Data Mining
 
Multidimentional data model
Multidimentional data modelMultidimentional data model
Multidimentional data model
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data1
Data1Data1
Data1
 
Data Warehouse Designing: Dimensional Modelling and E-R Modelling
Data Warehouse Designing: Dimensional Modelling and E-R ModellingData Warehouse Designing: Dimensional Modelling and E-R Modelling
Data Warehouse Designing: Dimensional Modelling and E-R Modelling
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Decision tree clustering a columnstores tuple reconstruction
Decision tree clustering  a columnstores tuple reconstructionDecision tree clustering  a columnstores tuple reconstruction
Decision tree clustering a columnstores tuple reconstruction
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data Mining
 

Ähnlich wie Intro to Data warehousing lecture 17

UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningNandakumar P
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessingKnoldus Inc.
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptxLuminous8
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptbelay41
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptxHarsha Patel
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedYugal Kumar
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data PreparationUmair Shafique
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfAmmarAhmedSiddiqui2
 
Chapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptxChapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptxManishaPatil932723
 

Ähnlich wie Intro to Data warehousing lecture 17 (20)

UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Unit II.pdf
Unit II.pdfUnit II.pdf
Unit II.pdf
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
Data processing
Data processingData processing
Data processing
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptx
 
Data .pptx
Data .pptxData .pptx
Data .pptx
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.ppt
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptx
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
 
ETL QA
ETL QAETL QA
ETL QA
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
 
DataPreprocessing.pptx
DataPreprocessing.pptxDataPreprocessing.pptx
DataPreprocessing.pptx
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdf
 
Chapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptxChapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptx
 
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
 

Mehr von AnwarrChaudary

Intro to Data warehousing lecture 20
Intro to Data warehousing   lecture 20Intro to Data warehousing   lecture 20
Intro to Data warehousing lecture 20AnwarrChaudary
 
Intro to Data warehousing lecture 19
Intro to Data warehousing   lecture 19Intro to Data warehousing   lecture 19
Intro to Data warehousing lecture 19AnwarrChaudary
 
Intro to Data warehousing lecture 18
Intro to Data warehousing   lecture 18Intro to Data warehousing   lecture 18
Intro to Data warehousing lecture 18AnwarrChaudary
 
Intro to Data warehousing lecture 16
Intro to Data warehousing   lecture 16Intro to Data warehousing   lecture 16
Intro to Data warehousing lecture 16AnwarrChaudary
 
Intro to Data warehousing lecture 15
Intro to Data warehousing   lecture 15Intro to Data warehousing   lecture 15
Intro to Data warehousing lecture 15AnwarrChaudary
 
Intro to Data warehousing lecture 14
Intro to Data warehousing   lecture 14Intro to Data warehousing   lecture 14
Intro to Data warehousing lecture 14AnwarrChaudary
 
Intro to Data warehousing lecture 13
Intro to Data warehousing   lecture 13Intro to Data warehousing   lecture 13
Intro to Data warehousing lecture 13AnwarrChaudary
 
Intro to Data warehousing lecture 12
Intro to Data warehousing   lecture 12Intro to Data warehousing   lecture 12
Intro to Data warehousing lecture 12AnwarrChaudary
 
Intro to Data warehousing lecture 11
Intro to Data warehousing   lecture 11Intro to Data warehousing   lecture 11
Intro to Data warehousing lecture 11AnwarrChaudary
 
Intro to Data warehousing lecture 10
Intro to Data warehousing   lecture 10Intro to Data warehousing   lecture 10
Intro to Data warehousing lecture 10AnwarrChaudary
 
Intro to Data warehousing lecture 09
Intro to Data warehousing   lecture 09Intro to Data warehousing   lecture 09
Intro to Data warehousing lecture 09AnwarrChaudary
 
Intro to Data warehousing lecture 08
Intro to Data warehousing   lecture 08Intro to Data warehousing   lecture 08
Intro to Data warehousing lecture 08AnwarrChaudary
 
Intro to Data warehousing lecture 07
Intro to Data warehousing   lecture 07Intro to Data warehousing   lecture 07
Intro to Data warehousing lecture 07AnwarrChaudary
 
Intro to Data warehousing Lecture 06
Intro to Data warehousing   Lecture 06Intro to Data warehousing   Lecture 06
Intro to Data warehousing Lecture 06AnwarrChaudary
 
Intro to Data warehousing lecture 05
Intro to Data warehousing   lecture 05Intro to Data warehousing   lecture 05
Intro to Data warehousing lecture 05AnwarrChaudary
 
Intro to Data warehousing Lecture 04
Intro to Data warehousing   Lecture 04Intro to Data warehousing   Lecture 04
Intro to Data warehousing Lecture 04AnwarrChaudary
 
Intro to Data warehousing lecture 03
Intro to Data warehousing   lecture 03Intro to Data warehousing   lecture 03
Intro to Data warehousing lecture 03AnwarrChaudary
 
Intro to Data warehousing lecture 02
Intro to Data warehousing   lecture 02Intro to Data warehousing   lecture 02
Intro to Data warehousing lecture 02AnwarrChaudary
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data WarehouseAnwarrChaudary
 
Introduction to Software Engineering
Introduction to Software EngineeringIntroduction to Software Engineering
Introduction to Software EngineeringAnwarrChaudary
 

Mehr von AnwarrChaudary (20)

Intro to Data warehousing lecture 20
Intro to Data warehousing   lecture 20Intro to Data warehousing   lecture 20
Intro to Data warehousing lecture 20
 
Intro to Data warehousing lecture 19
Intro to Data warehousing   lecture 19Intro to Data warehousing   lecture 19
Intro to Data warehousing lecture 19
 
Intro to Data warehousing lecture 18
Intro to Data warehousing   lecture 18Intro to Data warehousing   lecture 18
Intro to Data warehousing lecture 18
 
Intro to Data warehousing lecture 16
Intro to Data warehousing   lecture 16Intro to Data warehousing   lecture 16
Intro to Data warehousing lecture 16
 
Intro to Data warehousing lecture 15
Intro to Data warehousing   lecture 15Intro to Data warehousing   lecture 15
Intro to Data warehousing lecture 15
 
Intro to Data warehousing lecture 14
Intro to Data warehousing   lecture 14Intro to Data warehousing   lecture 14
Intro to Data warehousing lecture 14
 
Intro to Data warehousing lecture 13
Intro to Data warehousing   lecture 13Intro to Data warehousing   lecture 13
Intro to Data warehousing lecture 13
 
Intro to Data warehousing lecture 12
Intro to Data warehousing   lecture 12Intro to Data warehousing   lecture 12
Intro to Data warehousing lecture 12
 
Intro to Data warehousing lecture 11
Intro to Data warehousing   lecture 11Intro to Data warehousing   lecture 11
Intro to Data warehousing lecture 11
 
Intro to Data warehousing lecture 10
Intro to Data warehousing   lecture 10Intro to Data warehousing   lecture 10
Intro to Data warehousing lecture 10
 
Intro to Data warehousing lecture 09
Intro to Data warehousing   lecture 09Intro to Data warehousing   lecture 09
Intro to Data warehousing lecture 09
 
Intro to Data warehousing lecture 08
Intro to Data warehousing   lecture 08Intro to Data warehousing   lecture 08
Intro to Data warehousing lecture 08
 
Intro to Data warehousing lecture 07
Intro to Data warehousing   lecture 07Intro to Data warehousing   lecture 07
Intro to Data warehousing lecture 07
 
Intro to Data warehousing Lecture 06
Intro to Data warehousing   Lecture 06Intro to Data warehousing   Lecture 06
Intro to Data warehousing Lecture 06
 
Intro to Data warehousing lecture 05
Intro to Data warehousing   lecture 05Intro to Data warehousing   lecture 05
Intro to Data warehousing lecture 05
 
Intro to Data warehousing Lecture 04
Intro to Data warehousing   Lecture 04Intro to Data warehousing   Lecture 04
Intro to Data warehousing Lecture 04
 
Intro to Data warehousing lecture 03
Intro to Data warehousing   lecture 03Intro to Data warehousing   lecture 03
Intro to Data warehousing lecture 03
 
Intro to Data warehousing lecture 02
Intro to Data warehousing   lecture 02Intro to Data warehousing   lecture 02
Intro to Data warehousing lecture 02
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 
Introduction to Software Engineering
Introduction to Software EngineeringIntroduction to Software Engineering
Introduction to Software Engineering
 

Kürzlich hochgeladen

Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 

Kürzlich hochgeladen (20)

INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 

Intro to Data warehousing lecture 17

  • 1. - 1 Intro to Data Warehousing Data Warehousing vs Data Mining & Data Preprocessing in Data Mining Ch Anwar ul Hassan (Lecturer) Department of Computer Science and Software Engineering Capital University of Sciences & Technology, Islamabad Pakistan anwarchaudary@gmail.com
  • 2. Slide 2 • Data Science is an area • A data warehouse is built to support management functions whereas data mining is used to extract useful information and patterns from data. Data warehousing is the process of compiling information into a data warehouse. Difference between Data Warehousing and Data Mining
  • 3. Slide 3 • It is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed rather than transaction processing. A data warehouse is designed to support management decision-making process by providing a platform for data cleaning, data integration and data consolidation. A data warehouse contains subject-oriented, integrated, time-variant and non-volatile data. Data Warehousing
  • 4. Slide 4 • It is the process of finding patterns and correlations within large data sets to identify relationships between data. Data mining tools allow a business organization to predict customer behavior. Data mining tools are used to build risk models and detect fraud. Data mining is used in market analysis and management, fraud detection, corporate analysis and risk management. Data Mining
  • 5. Slide 5 • Data preprocessing is a data mining technique which is used to transform the raw data in a useful and efficient format. Data Preprocessing in Data Mining
  • 6. Slide 6 1.Data Cleaning: The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It involves handling of missing data, noisy data etc. (a). Missing Data: This situation arises when some data is missing in the data. It can be handled in various ways. Some of them are: Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple. Fill the Missing values: There are various ways to do this task. You can choose to fill the missing values manually, by attribute mean or the most probable value. Steps Involved in Data Preprocessing:
  • 7. Slide 7 (b). Noisy Data: Noisy data is a meaningless data that can’t be interpreted by machines.It can be generated due to faulty data collection, data entry errors etc. It can be handled in following ways : Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segmented is handled separately. One can replace all data in a segment by its mean or boundary values can be used to complete the task. Data Preprocessing in Data Mining
  • 8. Slide 8 Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables). Clustering: This approach groups the similar data in a cluster. The outliers may be undetected or it will fall outside the clusters. Data Preprocessing in Data Mining
  • 9. Slide 9 2. Data Transformation: This step is taken in order to transform the data in appropriate forms suitable for mining process. This involves following ways: Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0) Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help the mining process. Data Preprocessing in Data Mining
  • 10. Slide 10 Discretization: • This is done to replace the raw values of numeric attribute by interval levels or conceptual levels (interval variable is a measurement variable). Concept Hierarchy Generation: • Here attributes are converted from level to higher level in hierarchy. For Example-The attribute “city” can be converted to “country”. Data Preprocessing in Data Mining
  • 11. Slide 11 3. Data Reduction: Since data mining is a technique that is used to handle huge amount of data. While working with huge volume of data, analysis became harder in such cases. In order to get rid of this, we uses data reduction technique. It aims to increase the storage efficiency and reduce data storage and analysis costs. The various steps to data reduction are: Attribute Subset Selection: The highly relevant attributes should be used, rest all can be discarded. For performing attribute selection, one can use level of significance and p- value of the attribute. The attribute having p-value greater than significance level can be discarded. Data Preprocessing in Data Mining
  • 12. Slide 12 Data Cube Aggregation: Aggregation operation is applied to data for the construction of the data cube. This technique is used to aggregate data in a simpler form. For example, imagine that information you gathered for your analysis for the years 2012 to 2014, that data includes the revenue of your company every three months. They involve you in the annual sales, rather than the quarterly average, So we can summarize the data in such a way that the resulting data summarizes the total sales per year instead of per quarter. It summarizes the data. Numerosity Reduction: In this reduction technique the actual data is replaced with mathematical models or smaller representation of the data instead of actual data, it is important to only store the model parameter. Or non-parametric method such as clustering, histogram, sampling. Data Preprocessing in Data Mining
  • 13. Slide 13 Dimensionality Reduction: Whenever we come across any data which is weakly important, then we use the attribute required for our analysis. It reduces data size as it eliminates outdated or redundant features. Step-wise Forward Selection – The selection begins with an empty set of attributes later on we decide best of the original attributes on the set based on their relevance to other attributes. We know it as a p-value in statistics. Suppose there are the following attributes in the data set in which few attributes are redundant. Data Preprocessing in Data Mining
  • 14. Slide 14 Step-wise Backward Selection – This selection starts with a set of complete attributes in the original data and at each point, it eliminates the worst remaining attribute in the set. Suppose there are the following attributes in the data set in which few attributes are redundant. Data Preprocessing in Data Mining
  • 15. Slide 15 Step-wise Backward Selection – This selection starts with a set of complete attributes in the original data and at each point, it eliminates the worst remaining attribute in the set. Suppose there are the following attributes in the data set in which few attributes are redundant. Combination of forwarding and Backward Selection – It allows us to remove the worst and select best attributes, saving time and making the process faster. Data Preprocessing in Data Mining
  • 16. Slide 16 Data Compression: The data compression technique reduces the size of the files using different encoding mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two types based on their compression techniques. Lossless Compression – Encoding techniques allows a simple and minimal data size reduction. Lossless data compression uses algorithms to restore the precise original data from the compressed data. Lossy Compression – Methods such as Discrete Wavelet transform technique, principal component analysis) are examples of this compression. For e.g., JPEG image format is a lossy compression, but we can find the meaning equivalent to the original the image. In lossy-data compression, the decompressed data may differ to the original data but are useful enough to retrieve information from them. Data Preprocessing in Data Mining
  • 17. Slide 17 Discretization & Concept Hierarchy Operation: Techniques of data discretization are used to divide the attributes of the continuous nature into data with intervals. We replace many constant values of the attributes by labels of small intervals. This means that mining results are shown in a concise, and easily understandable way. Top-down discretization – If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set of attributes and repeat of this method up to the end, then the process is known as top-down discretization also known as splitting. Bottom-up discretization – If you first consider all the constant values as split-points, some are discarded through a combination of the neighborhood values in the interval, that process is called bottom-up discretization. Data Preprocessing in Data Mining
  • 18. Slide 18 Concept Hierarchies: It reduces the data size by collecting and then replacing the low-level concepts (such as 43 for age) to high-level concepts (categorical variables such as middle age or Senior). For numeric data following techniques can be followed: Binning – is the process of changing numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user. Data Preprocessing in Data Mining
  • 19. Slide 19 Histogram analysis – Like the process of binning, the histogram is used to partition the value for the attribute X, into disjoint ranges called brackets. There are several partitioning rules: Equal Frequency partitioning: Partitioning the values based on their number of occurrences in the data set. Equal Width partitioning : partitioning the values in a fixed gap based on the number of bins i.e. a set of values ranging from 0-20. Clustering: Grouping the similar data together. Data Preprocessing in Data Mining