Presentation on
Role of Data Cleaning in Data Warehouse
Ramakant Soni
Assistant Professor, BKBIET, Pilani
ramakant.soni@bkbiet.ac.in
Introduction
What is a Data Warehouse?
A data warehouse is an information delivery system that integrates and
transforms data into information used largely for strategic decision making.
Historical data from the enterprise's various operational systems is collected
and combined with relevant data from outside sources to form the integrated
content of the data warehouse.
What is Data Cleaning?
Data cleaning, also called data cleansing or scrubbing, deals with detecting and
removing errors and inconsistencies from data in order to improve data quality.
RAMAKANT SONI, BKBIET
Steps to Build a Data Warehouse: ETL Process
Figure 1. ETL Process
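The extract, transform, and load steps can be sketched in a few lines. This is only an illustrative sketch, not the process shown in the original figure: the CSV content, the `customer` table, and the function names are all assumptions made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system (made-up data).
SOURCE_CSV = """id,name,city
1, Alice ,pilani
2,Bob,JAIPUR
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace and normalize city capitalization."""
    return [
        {"id": int(r["id"]), "name": r["name"].strip(), "city": r["city"].strip().title()}
        for r in rows
    ]

def load(rows, conn):
    """Load: write the transformed rows into the (assumed) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.executemany("INSERT INTO customer VALUES (:id, :name, :city)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
warehouse_rows = conn.execute("SELECT id, name, city FROM customer ORDER BY id").fetchall()
```

Data cleaning belongs to the transform step: the rows are cleaned before they reach the warehouse table.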
Need for Data Cleaning
• Data warehouses require and provide extensive support for data cleaning.
• They load and continuously refresh huge amounts of data from a variety of
sources, so the probability of “dirty data” is high.
• Data warehouses are used for decision making, so the correctness of data
is vital to avoid wrong conclusions.
Requirements
A data cleaning approach should satisfy several requirements:
• Detect and remove all major errors and inconsistencies both in individual
data sources and when integrating multiple sources. The approach should
be supported by tools to limit manual inspection and programming effort.
• Data cleaning should not be performed in isolation but together with
schema-related data transformations based on comprehensive metadata.
• Mapping functions for data cleaning should be specified declaratively and
be reusable for other data sources as well as for query processing.
• A workflow infrastructure should be supported to execute all data
transformation steps for multiple sources and large data sets in a reliable
and efficient way.
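One way to read the "declarative mapping" requirement is that the rules are data, separate from the code that applies them, so the same rule set can be reused for another source. The sketch below assumes this reading; the `MAPPING` table, the column names, and the sample record are hypothetical.

```python
# Hypothetical declarative mapping: target column -> (source column, cleaning function).
MAPPING = {
    "customer_name": ("name", lambda v: " ".join(v.split()).title()),
    "phone": ("tel", lambda v: "".join(ch for ch in v if ch.isdigit())),
    "city": ("town", str.strip),
}

def apply_mapping(record, mapping):
    """Apply every mapping rule to one source record and return the target record."""
    return {target: clean(record[source]) for target, (source, clean) in mapping.items()}

# Made-up source record with inconsistent spacing, casing, and phone formatting.
source_a = {"name": "  ramesh   KUMAR ", "tel": "(0159) 624-2214", "town": " Pilani "}
cleaned = apply_mapping(source_a, MAPPING)
```

Because `apply_mapping` knows nothing about specific columns, supporting a new source only requires a new mapping table, not new code.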
Data Quality Problems
Single-source problems
The data quality of a source largely depends on the degree to which it is governed by
schema and integrity constraints controlling permissible data values.
• Sources without schema, such as files, have few restrictions on what data can be
entered and stored, giving rise to a high probability of errors and inconsistencies.
• Database systems enforce the restrictions of a specific data model (e.g., the relational
approach requires simple attribute values, referential integrity, etc.) as well as
application-specific integrity constraints.
Schema-Level problems occur because of the lack of appropriate model-specific or
application-specific integrity constraints.
Instance-Level problems relate to errors and inconsistencies that cannot be prevented
at the schema level (e.g., misspellings).
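The difference between the two problem classes can be illustrated with a small SQLite table. This is a sketch with made-up data and an assumed `employee` schema: the constraints reject a duplicate key and an impossible age, but a misspelled city passes every schema check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema-level rules: uniqueness, mandatory values, and a permissible value range.
conn.execute("""
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        age    INTEGER CHECK (age BETWEEN 18 AND 70),
        city   TEXT
    )
""")

rows = [
    (1, "Anita", 34, "Pilani"),
    (1, "Vikram", 41, "Jaipur"),   # duplicate key: prevented at the schema level
    (2, "Meena", 230, "Delhi"),    # age out of range: prevented at the schema level
    (3, "Suresh", 29, "Pilanii"),  # misspelled city: instance-level, not caught
]

rejected = []
for row in rows:
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append((row[0], row[1]))

accepted = conn.execute("SELECT emp_id, city FROM employee ORDER BY emp_id").fetchall()
```

The misspelling ends up in `accepted`, which is why instance-level problems need cleaning steps beyond integrity constraints.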
Example: Single Source Problem
Multi-source problems
The problems in single sources are aggravated when multiple sources are integrated.
Each source may contain dirty data, and because the sources are independent, their
data may be represented differently, overlap, or contradict each other.
Result: A large degree of heterogeneity.
Problem in cleaning: Identifying overlapping data, in particular matching records
that refer to the same real-world entity. This problem is also referred to as the
object identity problem or the duplicate elimination problem.
Frequently, the information is only partially redundant and the sources may
complement each other by providing additional information about an entity.
Solution: Duplicate information should be purged, and complementing information
should be consolidated and merged in order to achieve a consistent view of
real-world entities.
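Matching and merging can be sketched as follows. This is a simplified heuristic, not the method of any particular tool: the records are made up, the name normalization and the 0.85 similarity threshold are assumptions, and real duplicate elimination usually compares several attributes.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))

def same_entity(a, b, threshold=0.85):
    """Heuristic match on normalized names; the threshold is an assumed value."""
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold

def merge(a, b):
    """Consolidate two matching records, keeping the first non-empty value per field."""
    return {k: a.get(k) or b.get(k) for k in sorted(set(a) | set(b))}

# Two hypothetical sources describing the same person in different formats.
crm = {"name": "Kristen Smith", "phone": "", "city": "New York"}
billing = {"name": "Smith, Kristen", "phone": "555-0100", "email": "k.smith@example.com"}

merged = merge(crm, billing) if same_entity(crm, billing) else None
```

The merged record keeps the city from one source and the phone and email from the other, which is the "complementing information" case described above.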
Example: Multi-Source Problem
Figure 2. Multi-Source problem example
Data Cleaning Phases
In general, data cleaning involves several phases:
• Data analysis
• Definition of transformation workflow and mapping rules
• Verification
• Transformation
• Backflow of cleaned data
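The phases listed above can be walked through on a toy column. The sketch assumes a made-up `gender` attribute with inconsistent encodings; the rule table and the dictionary standing in for the source system are hypothetical.

```python
from collections import Counter

records = [
    {"id": 1, "gender": "M"},
    {"id": 2, "gender": "male"},
    {"id": 3, "gender": "F"},
    {"id": 4, "gender": "f"},
    {"id": 5, "gender": "M"},
]

# 1. Data analysis: profile the distinct values of the column.
profile = Counter(r["gender"] for r in records)

# 2. Definition of mapping rules, derived from the profile above.
rules = {"m": "M", "male": "M", "f": "F", "female": "F"}

def transform(rows):
    """4. Transformation: apply the rules, leaving unknown values unchanged."""
    return [{**r, "gender": rules.get(r["gender"].lower(), r["gender"])} for r in rows]

# 3. Verification: try the rules on a sample before the full run.
sample = transform(records[:2])
assert {r["gender"] for r in sample} <= {"M", "F"}

cleaned = transform(records)

# 5. Backflow: write cleaned values back so the source no longer holds the errors.
source_system = {r["id"]: r for r in records}
source_system.update({r["id"]: r for r in cleaned})
```

Verification runs on a sample first so that a faulty rule is caught before it is applied to the full data set.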
Data Cleaning Process
Data analysis and definition of transformation workflow and mapping rules → Verification and transformation → Backflow of cleaned data
Figure 3. Data Cleaning Process
Data Cleaning Tool Support
A large variety of tools is available to support data transformation and data cleaning:
• Data analysis tools
1. Data profiling tools, e.g. MigrationArchitect (Evoke Software)
2. Data mining tools, e.g. WizRule (WizSoft)
• Data reengineering tools use discovered patterns and rules for cleaning,
e.g. Integrity (Vality Software)
• Specialized cleaning tools deal with a particular domain
1. Special domain cleaning, e.g. IDCentric (FirstLogic)
2. Duplicate elimination, e.g. MatchIt (HelpItSystems)
• ETL tools use a repository built on a DBMS to manage all metadata about data sources,
target schemas, mapping scripts, etc. in a uniform way,
e.g. Extract (ETI), CopyManager (InformationBuilders)
References
1. Erhard Rahm, Hong Hai Do (University of Leipzig). Data Cleaning: Problems and Current Approaches.
2. Shridhar B. Dandin (BKBIET Pilani). Data cleaning, a problem that is redolent of Data Integration in Data Warehousing.
3. Arthur D. Chapman. Principles and methods of data cleaning.
Thank You