Data cleaning is an essential part of building a data warehouse: it improves data quality by detecting and removing errors and inconsistencies. Because data warehouses integrate large amounts of data from a variety of sources, the probability of dirty data is high, and clean data is vital for the decision making that the warehouse supports. The data cleaning process involves data analysis, definition of transformation rules, verification, transformation, and backflow of the cleaned data. Tools can support the different phases of data cleaning, from data profiling to specialized cleaning of particular domains.
1. Role of Data Cleaning in Data Warehouse
Presentation by
Ramakant Soni
Assistant Professor, BKBIET, Pilani
ramakant.soni@bkbiet.ac.in
2. Introduction
What is a Data Warehouse?
A data warehouse is an information delivery system in which data is integrated and transformed into information used largely for strategic decision making. Historical data from the enterprise's various operational systems is collected and combined with relevant data from outside sources to form the integrated content of the data warehouse.
What is Data Cleaning?
Data cleaning, also called data cleansing or scrubbing, deals with detecting and
removing errors and inconsistencies from data in order to improve the quality
of data.
3. Steps to Build a Data Warehouse: The ETL Process
Figure 1. ETL Process
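The ETL flow in Figure 1 can be summarized in code. Below is a minimal sketch in Python; the CSV source file, the `customers` target table, and the specific normalizations are illustrative assumptions, not details from the slides.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from an operational source (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalize values to conform to the target schema."""
    return [
        {
            "name": row["name"].strip().title(),  # unify capitalization
            "city": row["city"].strip().upper(),  # unify city spellings
        }
        for row in rows
    ]

def load(rows, db_path="warehouse.db"):
    """Load: insert the transformed rows into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, city TEXT)")
    con.executemany("INSERT INTO customers VALUES (:name, :city)", rows)
    con.commit()
    con.close()

# Example usage (assumes a customers.csv with 'name' and 'city' columns):
# load(transform(extract("customers.csv")))
```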
4. Need for Data Cleaning
• Data warehouses require and provide extensive support for data cleaning.
• They load and continuously refresh huge amounts of data from a variety of sources, so the probability of “dirty data” is high.
• Data warehouses are used for decision making, so the correctness of data
is vital to avoid wrong conclusions.
5. Requirements
A data cleaning approach should satisfy several requirements:
• Detect and remove all major errors and inconsistencies both in individual
data sources and when integrating multiple sources. The approach should
be supported by tools to limit manual inspection and programming effort.
• Data cleaning should not be performed in isolation but together with
schema-related data transformations based on comprehensive metadata.
• Mapping functions should be specified in a declarative way for data cleaning and be reusable for other data sources as well as for query processing (see the sketch after this list).
• A workflow infrastructure should be supported to execute all data
transformation steps for multiple sources and large data sets in a reliable
and efficient way.
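As a concrete illustration of the declarative-mapping requirement, the sketch below expresses cleaning rules as plain data (a column-to-function table) so that the same rules can be reused for any source with matching columns. All column names and rules here are illustrative assumptions.

```python
# Declarative, reusable mapping rules: each rule is data (column -> function),
# so the same rule table can be applied to any source with matching columns.
RULES = {
    "phone": lambda v: "".join(ch for ch in v if ch.isdigit()),  # keep digits only
    "email": lambda v: v.strip().lower(),                        # canonical form
    "state": lambda v: v.strip().upper()[:2],                    # two-letter code
}

def apply_rules(record, rules=RULES):
    """Apply every rule whose column exists in the record; pass others through."""
    return {col: rules.get(col, lambda v: v)(val) for col, val in record.items()}

print(apply_rules({"email": "  Jane@Example.COM ", "phone": "(0151) 123-456"}))
# {'email': 'jane@example.com', 'phone': '0151123456'}
```

Because the rules are data rather than code baked into one pipeline, they can be shared with query processing or applied to a new source without rewriting the cleaning logic.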
7. Single-Source Problems
The data quality of a source largely depends on the degree to which it is governed by
schema and integrity constraints controlling permissible data values.
• Sources without schema, such as files, have few restrictions on what data can be
entered and stored, giving rise to a high probability of errors and inconsistencies.
• Database systems enforce the restrictions of a specific data model (e.g., the relational approach requires simple attribute values, referential integrity, etc.) as well as application-specific integrity constraints.
Schema-level problems occur because of the lack of appropriate model-specific or application-specific integrity constraints.
Instance-level problems relate to errors and inconsistencies that cannot be prevented at the schema level (e.g., misspellings).
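Since instance-level problems cannot be prevented by a schema, they must be detected by explicit checks. Below is a minimal sketch of such checks for a schema-less source; the date format and age range constraints are illustrative assumptions.

```python
import re

# Instance-level checks for a schema-less source (e.g., a flat file):
# no schema prevents these errors, so each constraint is tested explicitly.
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def find_violations(record):
    """Return a list of constraint violations for one record."""
    errors = []
    if not DATE.match(record.get("birth_date", "")):
        errors.append("birth_date not in YYYY-MM-DD format")
    if not record.get("age", "").isdigit() or not 0 <= int(record["age"]) <= 120:
        errors.append("age missing or outside 0-120")
    return errors

print(find_violations({"birth_date": "12/02/1970", "age": "x"}))
# ['birth_date not in YYYY-MM-DD format', 'age missing or outside 0-120']
```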
9. Multi-Source Problems
The problems present in single sources are aggravated when multiple sources are integrated. Each source may contain dirty data, and because the sources are developed independently, the data in them may be represented differently, overlap, or contradict.
Result: a large degree of heterogeneity.
Problem in cleaning: identifying overlapping data, in particular matching records that refer to the same real-world entity. This is also referred to as the object identity or duplicate elimination problem.
Frequently, the information is only partially redundant, and the sources may complement each other by providing additional information about an entity.
Solution: duplicate information should be purged, and complementing information should be consolidated and merged, in order to achieve a consistent view of real-world entities (see the sketch below).
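Below is a minimal sketch of this purge-and-merge step: records from two sources are grouped under a normalized match key, duplicates are collapsed, and complementing fields are consolidated. The record layout, key function, and sample data are illustrative assumptions; real tools use far more sophisticated similarity matching.

```python
# Duplicate elimination across two sources: group records by a normalized
# key, purge duplicates, and merge complementing fields.
def key(record):
    """Normalize the name for matching (case, whitespace, punctuation)."""
    return "".join(ch for ch in record["name"].lower() if ch.isalnum())

def merge(records):
    """Consolidate matching records, preferring the first non-empty value."""
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged

source_a = [{"name": "Kristen Smith", "phone": "", "city": "Seattle"}]
source_b = [{"name": "kristen smith", "phone": "555-0100", "city": ""}]

groups = {}
for rec in source_a + source_b:
    groups.setdefault(key(rec), []).append(rec)

consolidated = [merge(recs) for recs in groups.values()]
print(consolidated)
# [{'name': 'Kristen Smith', 'city': 'Seattle', 'phone': '555-0100'}]
```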
10. Example: Multi-Source Problem
Figure 2. Multi-Source problem example
11. Data Cleaning Phases
In general, data cleaning involves several phases:
• Data analysis
• Definition of transformation workflow and mapping rules
• Verification
• Transformation
• Backflow of cleaned data
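Below is a minimal sketch of how these phases compose, with illustrative transform and verify helpers: the mapping rules are verified on a sample before the full transformation runs, and the cleaned records are returned for backflow to the sources.

```python
# Phase ordering: verify mapping rules on a sample or copy of the data
# before the full transformation, then return cleaned records for backflow.
def run_cleaning(records, transform, verify):
    sample = records[:100]                      # analysis/test sample
    trial = [transform(r) for r in sample]
    if not all(verify(r) for r in trial):       # verification phase
        raise ValueError("mapping rules failed verification; refine and repeat")
    cleaned = [transform(r) for r in records]   # transformation phase
    return cleaned                              # caller writes these back (backflow)

# Illustrative usage: trim whitespace, verify nothing is left empty.
cleaned = run_cleaning(
    [{"name": "  Ann "}, {"name": "Bob"}],
    transform=lambda r: {"name": r["name"].strip()},
    verify=lambda r: bool(r["name"]),
)
print(cleaned)  # [{'name': 'Ann'}, {'name': 'Bob'}]
```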
12. Data Cleaning Process
Figure 3. Data cleaning process: data analysis and definition of the transformation workflow and mapping rules → verification and transformation → backflow of cleaned data
13. Data Cleaning Tool Support
A large variety of tools is available to support data transformation and data cleaning:
• Data analysis tools
1. Data profiling tools, e.g., Migration Architect (Evoke Software); a profiling sketch follows this list
2. Data mining tools, e.g., WizRule (WizSoft)
• Data reengineering tools use discovered patterns and rules for cleaning, e.g., Integrity (Vality Software)
• Specialized cleaning tools deal with particular domains:
1. Special-domain cleaning, e.g., IDCentric (FirstLogic)
2. Duplicate elimination, e.g., MatchIt (HelpItSystems)
• ETL tools use a repository built on a DBMS to manage all metadata about data sources, target schemas, mapping scripts, etc. in a uniform way, e.g., Extract (ETI), CopyManager (Information Builders)
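To illustrate what a data profiling tool derives, the sketch below computes simple per-attribute metadata (null rate, distinct count, most common value patterns). The '9'-for-digit, 'A'-for-letter pattern encoding is an assumed convention, not any particular tool's output.

```python
from collections import Counter

# Data profiling: derive per-attribute metadata of the kind profiling
# tools report, from the instances themselves.
def pattern(value):
    """Encode a value's shape: '9' for digits, 'A' for letters."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile(rows, column):
    values = [r.get(column, "") for r in rows]
    return {
        "null_rate": sum(1 for v in values if not v) / len(values),
        "distinct": len(set(values)),
        "top_patterns": Counter(pattern(v) for v in values if v).most_common(3),
    }

rows = [{"zip": "12345"}, {"zip": "1234A"}, {"zip": ""}]
print(profile(rows, "zip"))
# {'null_rate': 0.333..., 'distinct': 3, 'top_patterns': [('99999', 1), ('9999A', 1)]}
```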
14. References
1. Erhard Rahm and Hong Hai Do, "Data Cleaning: Problems and Current Approaches", University of Leipzig.
2. Shridhar B. Dandin, "Data Cleaning, a Problem that is Redolent of Data Integration in Data Warehousing", BKBIET Pilani.
3. Arthur D. Chapman, "Principles and Methods of Data Cleaning".