Data cleaning is an essential part of building a data warehouse: it improves data quality by detecting and removing errors and inconsistencies. Because data warehouses integrate large amounts of data from a variety of sources, the probability of dirty data is high, and clean data is vital for the decision making that the warehouse supports. The data cleaning process involves data analysis, definition of transformation rules, verification, transformation, and backflow of the cleaned data. Tools can support the different phases of data cleaning, from data profiling to specialized cleaning of particular domains.
1. Role of Data Cleaning in Data Warehouse
Presentation by
Ramakant Soni
Assistant Professor, BKBIET, Pilani
ramakant.soni@bkbiet.ac.in
2. Introduction
What is a Data Warehouse?
A data warehouse is an information delivery system in which data is integrated and transformed into information used largely for strategic decision making. Historical data from the enterprise's various operational systems is collected and combined with relevant data from outside sources to form the integrated content of the data warehouse.
What is Data Cleaning?
Data cleaning, also called data cleansing or scrubbing, deals with detecting and
removing errors and inconsistencies from data in order to improve the quality
of data.
3. Steps to Build a Data Warehouse: The ETL Process
Figure 1. ETL Process
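The ETL flow in Figure 1 can be summarized in code. Below is a minimal sketch in Python; the CSV source file, the `customers` target table, and the specific normalizations are illustrative assumptions, not details from the slides.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from an operational source (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalize values to conform to the target schema."""
    return [
        {
            "name": row["name"].strip().title(),  # unify capitalization
            "city": row["city"].strip().upper(),  # unify city spellings
        }
        for row in rows
    ]

def load(rows, db_path="warehouse.db"):
    """Load: insert the transformed rows into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, city TEXT)")
    con.executemany("INSERT INTO customers VALUES (:name, :city)", rows)
    con.commit()
    con.close()

# Example usage (assumes a customers.csv with 'name' and 'city' columns):
# load(transform(extract("customers.csv")))
```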
4. Need for Data Cleaning
• Data warehouses require and provide extensive support for data cleaning.
• They load and continuously refresh huge amounts of data from a variety of sources, so the probability of “dirty data” is high.
• Data warehouses are used for decision making, so the correctness of data
is vital to avoid wrong conclusions.
5. Requirements
A data cleaning approach should satisfy several requirements:
• Detect and remove all major errors and inconsistencies both in individual
data sources and when integrating multiple sources. The approach should
be supported by tools to limit manual inspection and programming effort.
• Data cleaning should not be performed in isolation but together with
schema-related data transformations based on comprehensive metadata.
• Mapping functions should be specified in a declarative way for data cleaning and be reusable for other data sources as well as for query processing (see the sketch after this list).
• A workflow infrastructure should be supported to execute all data
transformation steps for multiple sources and large data sets in a reliable
and efficient way.
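As a concrete illustration of the declarative-mapping requirement, the sketch below expresses cleaning rules as plain data (a column-to-function table) so that the same rules can be reused for any source with matching columns. All column names and rules here are illustrative assumptions.

```python
# Declarative, reusable mapping rules: each rule is data (column -> function),
# so the same rule table can be applied to any source with matching columns.
RULES = {
    "phone": lambda v: "".join(ch for ch in v if ch.isdigit()),  # keep digits only
    "email": lambda v: v.strip().lower(),                        # canonical form
    "state": lambda v: v.strip().upper()[:2],                    # two-letter code
}

def apply_rules(record, rules=RULES):
    """Apply every rule whose column exists in the record; pass others through."""
    return {col: rules.get(col, lambda v: v)(val) for col, val in record.items()}

print(apply_rules({"email": "  Jane@Example.COM ", "phone": "(0151) 123-456"}))
# {'email': 'jane@example.com', 'phone': '0151123456'}
```

Because the rules are data rather than code baked into one pipeline, they can be shared with query processing or applied to a new source without rewriting the cleaning logic.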
7. Single-Source Problems
The data quality of a source largely depends on the degree to which it is governed by
schema and integrity constraints controlling permissible data values.
• Sources without schema, such as files, have few restrictions on what data can be
entered and stored, giving rise to a high probability of errors and inconsistencies.
• Database systems enforce the restrictions of a specific data model (e.g., the relational approach requires simple attribute values, referential integrity, etc.) as well as application-specific integrity constraints.
Schema-level problems occur because of the lack of appropriate model-specific or application-specific integrity constraints.
Instance-level problems relate to errors and inconsistencies that cannot be prevented at the schema level (e.g., misspellings).
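Since instance-level problems cannot be prevented by a schema, they must be detected by explicit checks. Below is a minimal sketch of such checks for a schema-less source; the date format and age range constraints are illustrative assumptions.

```python
import re

# Instance-level checks for a schema-less source (e.g., a flat file):
# no schema prevents these errors, so each constraint is tested explicitly.
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def find_violations(record):
    """Return a list of constraint violations for one record."""
    errors = []
    if not DATE.match(record.get("birth_date", "")):
        errors.append("birth_date not in YYYY-MM-DD format")
    if not record.get("age", "").isdigit() or not 0 <= int(record["age"]) <= 120:
        errors.append("age missing or outside 0-120")
    return errors

print(find_violations({"birth_date": "12/02/1970", "age": "x"}))
# ['birth_date not in YYYY-MM-DD format', 'age missing or outside 0-120']
```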
9. Multi-Source Problems
The problems present in single sources are aggravated when multiple sources are integrated. Each source may contain dirty data, and because the sources are developed independently, the data in them may be represented differently, overlap, or contradict.
Result: a large degree of heterogeneity.
Problem in cleaning: identifying overlapping data, in particular matching records that refer to the same real-world entity. This is also referred to as the object identity or duplicate elimination problem.
Frequently, the information is only partially redundant, and the sources may complement each other by providing additional information about an entity.
Solution: duplicate information should be purged, and complementing information should be consolidated and merged, in order to achieve a consistent view of real-world entities (see the sketch below).
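Below is a minimal sketch of this purge-and-merge step: records from two sources are grouped under a normalized match key, duplicates are collapsed, and complementing fields are consolidated. The record layout, key function, and sample data are illustrative assumptions; real tools use far more sophisticated similarity matching.

```python
# Duplicate elimination across two sources: group records by a normalized
# key, purge duplicates, and merge complementing fields.
def key(record):
    """Normalize the name for matching (case, whitespace, punctuation)."""
    return "".join(ch for ch in record["name"].lower() if ch.isalnum())

def merge(records):
    """Consolidate matching records, preferring the first non-empty value."""
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged

source_a = [{"name": "Kristen Smith", "phone": "", "city": "Seattle"}]
source_b = [{"name": "kristen smith", "phone": "555-0100", "city": ""}]

groups = {}
for rec in source_a + source_b:
    groups.setdefault(key(rec), []).append(rec)

consolidated = [merge(recs) for recs in groups.values()]
print(consolidated)
# [{'name': 'Kristen Smith', 'city': 'Seattle', 'phone': '555-0100'}]
```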
10. Example: Multi-Source Problem
Figure 2. Multi-Source problem example
11. Data Cleaning Phases
In general, data cleaning involves several phases:
• Data analysis
• Definition of transformation workflow and mapping rules
• Verification
• Transformation
• Backflow of cleaned data
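Below is a minimal sketch of how these phases compose, with illustrative transform and verify helpers: the mapping rules are verified on a sample before the full transformation runs, and the cleaned records are returned for backflow to the sources.

```python
# Phase ordering: verify mapping rules on a sample or copy of the data
# before the full transformation, then return cleaned records for backflow.
def run_cleaning(records, transform, verify):
    sample = records[:100]                      # analysis/test sample
    trial = [transform(r) for r in sample]
    if not all(verify(r) for r in trial):       # verification phase
        raise ValueError("mapping rules failed verification; refine and repeat")
    cleaned = [transform(r) for r in records]   # transformation phase
    return cleaned                              # caller writes these back (backflow)

# Illustrative usage: trim whitespace, verify nothing is left empty.
cleaned = run_cleaning(
    [{"name": "  Ann "}, {"name": "Bob"}],
    transform=lambda r: {"name": r["name"].strip()},
    verify=lambda r: bool(r["name"]),
)
print(cleaned)  # [{'name': 'Ann'}, {'name': 'Bob'}]
```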
12. Data Cleaning Process
Figure 3. Data cleaning process: data analysis and definition of the transformation workflow and mapping rules → verification and transformation → backflow of cleaned data
13. Data Cleaning Tool Support
A large variety of tools is available to support data transformation and data cleaning:
• Data analysis tools
1. Data profiling tools, e.g., Migration Architect (Evoke Software); a profiling sketch follows this list
2. Data mining tools, e.g., WizRule (WizSoft)
• Data reengineering tools use discovered patterns and rules for cleaning, e.g., Integrity (Vality Software)
• Specialized cleaning tools deal with particular domains:
1. Special-domain cleaning, e.g., IDCentric (FirstLogic)
2. Duplicate elimination, e.g., MatchIt (HelpItSystems)
• ETL tools use a repository built on a DBMS to manage all metadata about data sources, target schemas, mapping scripts, etc. in a uniform way, e.g., Extract (ETI), CopyManager (Information Builders)
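To illustrate what a data profiling tool derives, the sketch below computes simple per-attribute metadata (null rate, distinct count, most common value patterns). The '9'-for-digit, 'A'-for-letter pattern encoding is an assumed convention, not any particular tool's output.

```python
from collections import Counter

# Data profiling: derive per-attribute metadata of the kind profiling
# tools report, from the instances themselves.
def pattern(value):
    """Encode a value's shape: '9' for digits, 'A' for letters."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile(rows, column):
    values = [r.get(column, "") for r in rows]
    return {
        "null_rate": sum(1 for v in values if not v) / len(values),
        "distinct": len(set(values)),
        "top_patterns": Counter(pattern(v) for v in values if v).most_common(3),
    }

rows = [{"zip": "12345"}, {"zip": "1234A"}, {"zip": ""}]
print(profile(rows, "zip"))
# {'null_rate': 0.333..., 'distinct': 3, 'top_patterns': [('99999', 1), ('9999A', 1)]}
```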
14. References
1. Erhard Rahm and Hong Hai Do, "Data Cleaning: Problems and Current Approaches", University of Leipzig.
2. Shridhar B. Dandin, "Data Cleaning, a Problem that is Redolent of Data Integration in Data Warehousing", BKBIET Pilani.
3. Arthur D. Chapman, "Principles and Methods of Data Cleaning".