1
Data Cleaning Techniques
Shahid Rajaee Teacher Training University
Faculty of Computer Engineering
PRESENTED BY:
Amir Masoud Sefidian
2
Today’s Lecture Content
• Introduction
• Enhanced Technique to Clean Data in the Data Warehouse
  • DWCLEANSER: A Framework for Approximate Duplicate Detection
• Data Quality Mining
  • Data Quality Mining With Association Rules
  • Data Cleaning Using Functional Dependencies
4
Introduction
• Data quality is a central issue in quality information management.
• Data quality problems can occur anywhere in an information system.
• These problems are addressed by data cleaning:
• The process of detecting inaccurate, incomplete, or unreasonable data and then improving quality by correcting the detected errors => reduces errors and improves data quality.
• Data cleaning can be time-consuming and tedious, but it cannot be ignored.
• Data quality criteria: accuracy, integrity, completeness, validity, consistency, schema conformance, uniqueness, etc.
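Criteria such as completeness and validity can be checked mechanically per record. A minimal sketch in Python (the field names, required-field list, and value ranges are illustrative assumptions, not taken from the referenced papers):

```python
def completeness(record, required):
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in required if record.get(f) not in (None, ""))
    return filled / len(required)

def is_valid_age(value):
    """Domain/validity check: age must be an integer in a plausible range."""
    return isinstance(value, int) and 0 <= value <= 130

record = {"name": "Alice", "age": 34, "email": ""}   # hypothetical record
required = ["name", "age", "email"]
print(completeness(record, required))  # 2 of 3 required fields are filled
print(is_valid_age(record["age"]))     # prints True
```

Records scoring below a chosen completeness threshold, or failing a domain check, would be flagged for correction rather than silently dropped.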
6
An Enhanced Technique to Clean Data in the Data Warehouse
• Uses a new algorithm that detects and corrects most common error types and expected problems, such as lexical errors, domain format errors, irregularities, integrity constraint violations, duplicates, and missing values.
• Presents a solution that works on quantitative data and any data with a limited value domain.
• Offers user interaction: the user selects the rules, the sources, and the desired targets.
• The algorithm is able to clean the data completely, addressing all specified mistakes and inconsistencies in the data or numerical values.
• The time taken to process large data sets matters less than obtaining high-quality data, since a large volume of data can be treated in one pass.
• The main focus is on achieving good data quality.
• The algorithm runs at an adequate pace and scales well to large data volumes without significant performance degradation.
7
Flowchart of proposed technique
The proposed model can easily be deployed in a data warehouse using the following algorithm:
8
The user selects the rules needed by the data cleaning system, along with the layout and descriptions of the data set's fields, which are used in implementing the algorithm.
Comparison of the Proposed Technique with Some Existing Techniques
1009 records containing many anomalies were examined before and after processing by different available methods (such as statistics and clustering). The large difference in the number of remaining anomalies confirms the effectiveness and quality of this algorithm.
11
DWCLEANSER: A Framework for Approximate Duplicate Detection
• A novel framework for detecting exact as well as approximate duplicates in a data warehouse.
• Decreases the complexity of previously designed frameworks by providing efficient data cleaning techniques.
• Provides comprehensive metadata support for the whole cleaning process.
• Provisions are also suggested for handling outliers and missing fields.
12
Existing Framework
The previously designed framework is a sequential, token-based framework that offers the fundamental services of data cleaning in six steps:
1) Selection of attributes:
Attributes are identified and selected for further processing in the following steps.
2) Formation of tokens:
The selected attributes are used to form tokens for similarity computation.
3) Clustering/blocking of records:
A blocking/clustering algorithm groups the records based on the calculated similarity and a block-token key.
4) Similarity computation for selected attributes:
The Jaccard similarity method is used to compare token values of selected attributes in a field.
5) Detection and elimination of duplicate records:
A rule-based detection and elimination approach detects and eliminates duplicates within one or more clusters.
6) Merge:
The cleansed data is combined and stored.
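Steps 2 and 4 above can be sketched together: tokens are formed from an attribute value, and the Jaccard similarity of two token sets is the size of their intersection over the size of their union. A minimal illustration (tokenizing by lower-cased whitespace-separated words is an assumption; the framework does not specify its tokenizer):

```python
def tokens(value):
    """Step 2: form a token set from an attribute value."""
    return set(value.lower().split())

def jaccard(a, b):
    """Step 4: Jaccard similarity of two token sets, |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # two empty values are considered identical
    return len(a & b) / len(a | b)

t1 = tokens("International Business Machines")
t2 = tokens("International Business Machines Corp")
print(jaccard(t1, t2))  # 3 shared tokens out of 4 total -> 0.75
```

A pair of attribute values whose similarity exceeds the framework's threshold would then be routed into the same cluster for duplicate checking.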
13
14
Proposed Framework: DWCLEANSER
15
1. Field Selection
• Records are decomposed into fields.
• Fields are analyzed to gather data about their type, relationships with other fields, key fields, and integrity constraints, so that enough metadata exists about the decomposed fields.
• Missing fields are stored in a separate temporary table and preserved in the repository along with their source record, relation name, data types, and integrity constraints.
• Missing fields are reviewed by the DBA to verify the reason for their existence:
(1) if the data is missing, it can be recaptured;
(2) if the value is not known, efforts can be made to gather the data to complete the record or to fill the missing field with a valid value.
If no valid data can be collected, the value is preserved in the repository for further verification and is not used in the cleaning procedure.
16
2. Computation of Rules
Certain rules are computed that will be used during the cleaning process.
Threshold value:
The threshold value is calculated based on experiments conducted in previous research.
Values below the threshold increase the number of false positives; values above it fail to detect all duplicates; values in between can be used to recognize approximate duplicates.
Rules for classification of fields:
Selected fields are classified on the basis of their data types.
Rules for data quality attributes:
The previous framework focused on only three data quality attributes: completeness, accuracy, and consistency.
Two further quality attributes are proposed in the new framework: validity and integrity.
17
3. Formation of Clusters
• A recursive record-matching algorithm is used for initial cluster formation, with a slight modification: it matches fields rather than whole records.
• Clusters are stored in a priority queue.
• Priorities of clusters in the queue are assigned on the basis of their ability to detect duplicate data sets: the cluster that detected the most recent match is assigned the highest priority.
4. Match Score
Match scores are assigned by applying the Smith-Waterman algorithm (an edit-distance-based strategy).
The calculations of this method are stored in a matrix.
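The Smith-Waterman score can be computed with the dynamic-programming matrix mentioned above: each cell holds the best local-alignment score ending at that character pair, floored at zero. A minimal sketch (the match/mismatch/gap weights are illustrative defaults, not values from the paper):

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between two strings."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]  # the score matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Floor at 0: a local alignment can restart anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("jonathan", "jonhatan"))  # high score despite the swap
```

Because the score tolerates gaps and local restarts, transposed or slightly misspelled field values still receive a high match score, which is what makes it useful for approximate-duplicate detection.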
5. Detection of Exact and Approximate Duplicates
When a new field is to be matched against the data sets present in a cluster, a Union-Find structure is used first.
If it fails to detect a match, the Smith-Waterman algorithm is employed.
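A Union-Find (disjoint-set) structure, as used in step 5, groups values already known to be duplicates so that a new value can be tested for cluster membership in near-constant time before falling back to the costlier Smith-Waterman comparison. A minimal sketch (path halving is one common implementation choice; the example values are illustrative):

```python
class UnionFind:
    """Disjoint-set structure: merge duplicate groups, query membership fast."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

uf = UnionFind()
uf.union("IBM", "I.B.M.")        # values already judged to be duplicates
uf.union("I.B.M.", "IBM Corp")
print(uf.find("IBM Corp") == uf.find("IBM"))  # prints True: same cluster
```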
6. Handling of Outliers and Missing Fields
Records that do not match any existing cluster are called outliers or singleton records.
Singleton records may be stored in a separate file in the repository for future analysis and comparisons.
18
7. Updating Metadata/Repository
Metadata and repositories are an integral part of the proposed framework. Important components of the repository:
1. Data dictionary: stores information about the relations, their sources, schema, etc.
2. Rules directory: stores all calculated values of thresholds, quality attributes, matching scores, etc.
3. Log files: store
• information about the selected fields and their source records;
• the classification of the fields by data type into three categories: numeric, strings, and characters.
4. Outlier & missing field files: store the outliers and missing fields with related information such as type and source relation.
19
Comparison of Existing and Proposed Framework
21
Data Quality Mining
The data mining process:
• Involves data collection, cleaning the data, building a model, and monitoring the models.
• Automatically extracts hidden and intrinsic information from collections of data.
• Offers various techniques that are suitable for data cleaning.
Some commonly used data mining techniques:
Association rule mining:
• Takes an input and induces rules as output; the outputs can be association rules.
• Association rules describe relationships among large data sets and the co-occurrence of items.
Functional dependency:
• Shows the connection and association between attributes: how one specific combination of values on one set of attributes determines one specific combination of values on another set.
23
Data Quality Mining With Association Rules
Objective:
Association rules are used here to detect, quantify, explain, and correct data quality deficiencies in very large databases: they find relationships among the items in a huge database and, in addition, improve data quality.
Association rule mining generates rules over all transactions, which are checked by their confidence level.
The strength of the rules is determined by the following steps:
• Determine the transaction type.
• Generate the association rules.
• Assign a score to each transaction based on the generated rules.
Score: the sum of the confidence values of the rules the transaction violates.
A rule violation occurs when a tuple satisfies the rule body but not its consequent.
Idea: transactions assigned high scores are suspected of deficiencies.
A minimal confidence threshold is suggested to restrict the rule set and improve the results.
Transactions are sorted according to their score values.
Based on the score, the system decides whether to accept or reject the data, or else issue a warning.
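The scoring steps above can be sketched as follows; the rules, their confidence values, and the threshold are illustrative assumptions, not values from the paper:

```python
# Hypothetical association rules: (body, consequent, confidence).
rules = [
    ({"bread"}, {"butter"}, 0.9),
    ({"beer"}, {"chips"}, 0.7),
]

def score(transaction, rules, min_conf=0.6):
    """Sum the confidence of every rule the transaction violates
    (body satisfied but consequent absent)."""
    s = 0.0
    for body, head, conf in rules:
        if conf < min_conf:
            continue  # the suggested threshold restricts the rule set
        if body <= transaction and not head <= transaction:
            s += conf  # rule violated
    return s

t = {"bread", "beer", "chips"}
print(score(t, rules))  # violates bread -> butter only, so prints 0.9
```

Transactions would then be sorted by this score, with the highest-scoring ones flagged for rejection or a warning.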
24
Data Cleaning Using Functional Dependencies
A functional dependency (FD) captures the relationship between attributes and candidate keys in tuples.
FD discovery can find too many FDs; used directly in a cleaning process, this can make the process intractable (NP time) and degrade the performance of the data cleaning.
A cleaning engine is developed by combining:
an FD discovery technique + a data cleaning technique
+
a feature from query optimization called the selectivity value, used to decrease the number of FDs discovered (pruning unlikely FDs).
26
SYSTEM ARCHITECTURE
27
SYSTEM ARCHITECTURE
Data collector
• Retrieves data from a relational database, improves some aspects of data quality (corrects basic typos, invalid domains, and invalid formats), and prepares the data for the next module (in relational format).
FD engine
• An FD-finding module.
• Dirty data usually contains errors, so the approximate-FD technique is used to tolerate errors while finding FDs.
• The selectivity-value technique ranks the candidates in the pruning step, and only the candidates with high or low rank are selected from the FD computation step.
• At the same time, any errors detected by this modified FD engine are suspicious tuples for cleaning.
• The errors can be separated into 2 types:
o Errors from finding non-candidate-key FDs indicate inconsistent data.
o Errors from finding candidate-key FDs indicate potentially duplicated data.
• The discovered FDs, together with all suspicious error tuples, are sent to the next step.
28
SYSTEM ARCHITECTURE
Cleaning engine:
Receives:
• the suspicious error tuples
• the FDs selected by the FD engine
Then:
Weights are assigned to the data (more errors produce a higher weight), and tuples with low weights are used to repair the high-weight tuples.
FD repairing technique:
After updating the weights, the engine uses the FDs to clean the data with a cost-based algorithm (low-cost data repairs high-cost data).
Duplicate elimination:
The last step finds duplicate data by improving the sorted-neighborhood-method algorithm: the candidate-key FDs from the FD engine are used to assign keys, and the data is sorted on the attributes on the left-hand side of the FDs.
Relational database:
The other modules store and retrieve their data through this module.
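The duplicate-elimination step described above can be sketched with a basic sorted-neighborhood pass: sort the records on a key built from the FD's left-hand-side attributes, then compare each record only with its neighbours inside a sliding window. The window size, records, and comparison function here are illustrative assumptions:

```python
def sorted_neighborhood(rows, key_attrs, window=3, same=None):
    """Sort records on a key from the FD's left-hand-side attributes,
    then compare each record only with neighbours inside the window."""
    same = same or (lambda a, b: a == b)
    ordered = sorted(rows, key=lambda r: tuple(r[a] for a in key_attrs))
    pairs = []
    for i, r in enumerate(ordered):
        for s in ordered[i + 1 : i + window]:
            if same(r, s):
                pairs.append((r, s))  # candidate duplicate pair
    return pairs

rows = [
    {"id": 1, "name": "alice"},
    {"id": 3, "name": "bob"},
    {"id": 2, "name": "alice"},
]
dups = sorted_neighborhood(rows, ["name"],
                           same=lambda a, b: a["name"] == b["name"])
print(len(dups))  # prints 1: the two "alice" records meet after sorting
```

Sorting brings likely duplicates next to each other, so the quadratic all-pairs comparison shrinks to a linear pass, which is the method's appeal for large tables.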
29
SELECTING THE FD
The selectivity value is applied to rank the candidates in order to find the appropriate FDs.
1. Selectivity value
The selectivity value measures the distribution of an attribute's values. If the selectivity value of an attribute:
• is high => the attribute's values are highly distributed;
• is low => the attribute's values are more likely to be uniform.
A highly distributed attribute is potentially a candidate key and can be used to eliminate duplicates.
The least distributed attributes can be used to repair distorted attribute values in the cleaning engine.
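The selectivity value can be read as the ratio of distinct values to total values of an attribute. A minimal sketch (the paper's exact formula may differ, so treat this definition as an assumption; the data is illustrative):

```python
def selectivity(rows, attr):
    """Selectivity of an attribute: distinct values / total values.
    High -> widely distributed (candidate-key-like); low -> near-constant."""
    values = [r[attr] for r in rows]
    return len(set(values)) / len(values)

rows = [{"id": i, "country": "TH"} for i in range(4)]
print(selectivity(rows, "id"))       # prints 1.0: every value is distinct
print(selectivity(rows, "country"))  # prints 0.25: one repeated value
```

Here "id" would be ranked as a candidate-key-like attribute and "country" as a near-invariant one, matching the high/low split described above.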
30
SELECTING THE FD
2. Ranking the candidates
After calculating the selectivity values to determine the candidates' ranks, the ranks are sorted in ascending order.
To choose potentially good candidates:
A low-ranking threshold and a high-ranking threshold are defined as pruning points; the selected candidates are those with either high-ranking or low-ranking values.
A high-ranking candidate has high selectivity and is potentially a candidate key.
A low-ranking candidate potentially has an invariant value, which can be functionally determined by some attribute in a trivial manner; thus, it can be treated as a non-candidate key on the right-hand side.
The middle-ranking candidates are not precise, so they are ignored.
31
SELECTING THE FD
3. Improving the pruning step:
The pruning step generates the candidate set by computing the candidates from the previous level (level l − 1).
(Figure: pruning lattice example)
32
Improved pruning method
• Begins the pruning by getting the set of candidates at the previous level and then checks the candidates.
• If a pair is not yet an FD and lies in either the high or the low accepted ranking => a StoreCandidate function stores a new candidate formed from candidate_x and candidate_y at the current level.
• Other candidates, in neither the low nor the high ranking, are ignored.
33
Results
50,000 real customer tuples are used as the data source, separated into 3 sets:
o the first dataset has 10% duplicates,
o the second dataset has 10% errors,
o the last dataset has 10% duplicates and errors.
The results showed that this work can identify duplicates and anomalies with high recall and a low false-positive rate.
PROBLEM:
The combined solution is sensitive to data size:
• as data volume increases, the discovery algorithm's speed decreases;
• as the number of attributes increases, the discovery creates more FD candidates and generates too many FDs, including noisy ones.
34
Strengths and Limitations of Data Quality Mining Methods

Association rules:
• Reduces the number of rules to generate for a transaction.
• Avoids a severe pitfall of association rule mining.
• Limitation: it is difficult to generate association rules for all transactions.

Functional dependency:
• Easily identifies suspicious tuples for cleaning.
• Decreases the number of functional dependencies discovered.
• Limitation: not suitable for large databases, because it is difficult to sort all the records.
35
Main References:
1. Hamad, M. M., and Jihad, A. A. (2011). "An Enhanced Technique to Clean Data in the Data Warehouse". 2011 Developments in E-systems Engineering.
2. Thakur, G., Singh, M., Pahwa, P., and Tyagi, N. (2011). "DWCLEANSER: A Framework for Approximate Duplicate Detection". Advances in Computing and Information Technology, pp. 355–364.
3. Natarajan, K., Li, J., and Koronios, A. (2010). "Data Mining Techniques for Data Cleaning". Engineering Asset Lifecycle Management, Springer London, pp. 796–804.
4. Kaewbuadee, K., Temtanapat, Y., and Peachavanish, R. (2006). "Data Cleaning Using Functional Dependency from Data Mining Process". IADIS International Journal on Computer Science and Information Systems, vol. 1, no. 2, pp. 117–131. ISSN 1646-3692.
Questions?

Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 

Recently uploaded (20)

Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 

Data Cleaning Techniques

  • 6. 6 An Enhanced Technique to Clean Data in the Data Warehouse
    • Uses a new algorithm that detects and corrects most error types and expected problems, such as lexical errors, domain format errors, irregularities, integrity constraint violations, duplicates, and missing values.
    • Presents a solution that works on quantitative data and on any data with a limited set of values.
    • Offers user interaction: the user selects the rules, the sources, and the desired targets.
    • The algorithm can clean the data completely, addressing all mistakes and inconsistencies in the specified data or numerical values.
    • The time taken to process huge data is less important than obtaining high-quality data, since a huge amount of data can be treated in a single pass.
    • The main focus is on achieving good data quality.
    • The pace of implementation of the algorithm is adequate: it scales well to processing large amounts of data without significant degradation in most relative performance measures.
  • 7. 7 Flowchart of the Proposed Technique
    The proposed model can easily be implemented in a data warehouse by the following algorithm:
  • 8. 8 The user selects any rules needed in the data cleaning system, together with the layout and descriptions of the data set fields, which are used in implementing the algorithm.
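The user-selected rule step above can be sketched as a simple rule-driven cleaning pass. This is a minimal illustration under assumed inputs, not the paper's actual algorithm: the rules, field names, and default repair values are all hypothetical.

```python
# Minimal sketch of a rule-driven cleaning pass (hypothetical rules/fields,
# not the paper's algorithm). Each rule returns (is_valid, repair_value).
def clean_records(records, rules):
    cleaned, report = [], []
    for rec in records:
        fixed = dict(rec)
        for field, rule in rules.items():
            ok, repaired = rule(fixed.get(field))
            if not ok:
                report.append((rec, field, fixed.get(field)))  # log the anomaly
                fixed[field] = repaired                        # repair in place
        cleaned.append(fixed)
    return cleaned, report

# Example rules: a domain rule for "age" and a missing-value rule for "city".
rules = {
    "age":  lambda v: (isinstance(v, int) and 0 <= v <= 120, 0),
    "city": lambda v: (v not in (None, ""), "UNKNOWN"),
}
records = [{"age": 35, "city": "Anbar"}, {"age": 400, "city": ""}]
cleaned, report = clean_records(records, rules)
```

Here the second record violates both rules, so it is repaired and both anomalies are logged for later review, mirroring the user-in-the-loop design described above.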
  • 9. 9 Comparison of the Proposed Technique with Some Existing Techniques
    1009 records containing many anomalies were examined before and after processing by different available methods (such as statistics and clustering); the large difference in the number of anomalies confirms the effectiveness and quality of this algorithm.
  • 10. 10 Today’s Lecture Content
    • Introduction
    • Enhanced Technique to Clean Data in the Data Warehouse
    • DWCLEANSER: A Framework for Approximate Duplicate Detection
    • Data Quality Mining
      • Data Quality Mining With Association Rules
      • Data Cleaning Using Functional Dependencies
  • 11. 11 DWCLEANSER: A Framework for Approximate Duplicate Detection
    • A novel framework for detecting exact as well as approximate duplicates in a data warehouse.
    • Decreases the complexity of previously designed frameworks by providing efficient data cleaning techniques.
    • Provides comprehensive metadata support to the whole cleaning process.
    • Provisions have also been suggested to take care of outliers and missing fields.
  • 13. 13 Existing Framework
    The previously designed framework is a sequential, token-based framework that offers the fundamental services of data cleaning in six steps:
    1) Selection of attributes: attributes are identified and selected for further processing in the following steps.
    2) Formation of tokens: the selected attributes are used to form tokens for similarity computation.
    3) Clustering/blocking of records: a blocking/clustering algorithm groups the records based on the calculated similarity and a block-token key.
    4) Similarity computation for selected attributes: the Jaccard similarity method is used to compare token values of selected attributes in a field.
    5) Detection and elimination of duplicate records: a rule-based detection and elimination approach detects and eliminates duplicates within one cluster or across many clusters.
    6) Merge: the cleansed data is combined and stored.
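Step 4's Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. A small sketch; the whitespace tokenization shown here is an assumption, since the framework builds its own tokens in step 2:

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token collections: |A intersect B| / |A union B|."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0          # two empty fields are treated as identical
    return len(a & b) / len(a | b)

# Three of the seven distinct tokens are shared, so similarity = 3/7.
sim = jaccard("J Smith 42 Oak St".lower().split(),
              "John Smith 42 Oak Street".lower().split())
```

Comparing this value against the thresholds of step 2 is what separates exact, approximate, and non-duplicate field pairs.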
  • 15. 15 1. Field Selection
    • Records are decomposed into fields.
    • Fields are analyzed to gather data about their types, relationships with other fields, key fields, and integrity constraints, so that enough metadata about the decomposed fields is available.
    • Missing fields are stored in a separate temporary table and preserved in the repository along with their source record, relation name, data types, and integrity constraints.
    • Missing fields are reviewed by the DBA to verify the reason for their existence:
    (1) if the data is missing, it can be recaptured;
    (2) if the value is not known, efforts can be made to gather the data to complete the record, or the missing field can be filled with a valid value. If no valid data can be collected, the value is preserved in the repository for further verification and is not used in the cleaning procedure.
  • 16. 16 2. Computation of Rules
    Certain rules are computed that will be used during the cleaning process.
    • Threshold value: calculated based on experiments conducted in previous research. Values lower than the threshold increase the number of false positives; values above it fail to detect all duplicates; values in between can be used to recognize approximate duplicates.
    • Rules for classification of fields: selected fields are classified on the basis of their data types.
    • Rules for data quality attributes: the previous framework focused on only 3 quality attributes of data (completeness, accuracy, and consistency). Two further quality attributes are proposed in the new framework: validity and integrity.
  • 17. 17 3. Formation of Clusters
    • A recursive record matching algorithm is used for initial cluster formation, with a slight modification: it matches fields rather than whole records.
    • Clusters are stored in a priority queue.
    • Priorities of clusters in the queue are assigned on the basis of their ability to detect duplicate data sets: the cluster that detected the most recent match is assigned the highest priority.
    4. Match Score
    Match scores are assigned by applying the Smith-Waterman algorithm (an edit-distance-based strategy). The calculations of this method are stored in a matrix.
    5. Detection of Exact and Approximate Duplicates
    When a new field is to be matched against any data set present in a cluster, a Union-Find structure is used. If it fails to detect a match, the Smith-Waterman algorithm is employed.
    6. Handling of Outliers and Missing Fields
    Records that do not match any existing cluster are called outliers or singleton records. Singleton records may be stored in a separate file in the repository for future analysis and comparisons.
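The Smith-Waterman score in step 4 comes from a dynamic-programming matrix in which each cell holds the best local-alignment score ending at that pair of characters, with negative values reset to zero; the largest cell is the match score. A minimal sketch, where the match/mismatch/gap weights are assumptions rather than values from the paper:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.
    H[i][j] = best score of an alignment ending at a[i-1], b[j-1]."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)  # 0 resets
            best = max(best, H[i][j])
    return best

score = smith_waterman("smith", "smyth")   # four matches, one mismatch
```

A pair of fields whose score exceeds the threshold from step 2 would be flagged as an approximate duplicate.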
  • 18. 18 7. Updating Metadata/Repository
    Metadata and repositories are an integral part of the proposed framework. Important components of the repositories:
    1. Data dictionary: stores information about the relations, their sources, schema, etc.
    2. Rules directory: stores all the calculated values of thresholds, quality attributes, matching scores, etc.
    3. Log files: store information about the selected fields and their source records, and the classification of the fields by data type into 3 explicit categories: numeric, strings, and characters.
    4. Outlier and missing-field files: store the outliers and missing fields with related information such as type and source relation.
  • 19. 19 Comparison of Existing and Proposed Framework
  • 20. 20 Today’s Lecture Content
    • Introduction
    • Enhanced Technique to Clean Data in the Data Warehouse
    • DWCLEANSER: A Framework for Approximate Duplicate Detection
    • Data Quality Mining
      • Data Quality Mining With Association Rules
      • Data Cleaning Using Functional Dependencies
  • 21. 21 Data Quality Mining
    The data mining process:
    • Involves data collection, cleaning the data, building a model, and monitoring the model.
    • Automatically extracts hidden and intrinsic information from collections of data.
    • Offers various techniques that are suitable for data cleaning.
    Some commonly used data mining techniques:
    • Association rule mining: takes an input and induces rules as output; the outputs can be association rules. Association rules describe relationships among large data sets and the co-occurrence of items.
    • Functional dependency: shows the connection and association between attributes, i.e., how one specific combination of values on one set of attributes determines one specific combination of values on another set.
  • 22. 22 Today’s Lecture Content
    • Introduction
    • Enhanced Technique to Clean Data in the Data Warehouse
    • DWCLEANSER: A Framework for Approximate Duplicate Detection
    • Data Quality Mining
      • Data Quality Mining With Association Rules
      • Data Cleaning Using Functional Dependencies
  • 23. 23 Data Quality Mining with Association Rules
    Objective: detect, quantify, explain, and correct data quality deficiencies in very large databases; find relationships among the items in a huge database and, in addition, improve the data quality.
    Association rule mining generates rules over all transactions, which are checked against their confidence level. The strength of the rules is found by the following steps:
    • Determine the transaction type.
    • Generate the association rules.
    • Assign a score to each transaction based on the generated rules.
    Score: the sum of the confidence values of the rules the transaction violates. A rule violation occurs when a tuple satisfies the rule body but not its consequent.
    Idea: transactions assigned high scores are suspected of deficiencies.
    A minimal confidence threshold is suggested to restrict the rule set and improve the results.
    Transactions are sorted according to their score values. Based on the score, the system decides whether to accept or reject the data, or else issue a warning.
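The scoring step above can be sketched as follows; the item sets, rules, and confidence values are invented for illustration, not mined from real data:

```python
# Score a transaction by summing the confidences of the rules it violates:
# a rule (body -> head) is violated when the transaction contains the body
# but not the head. (Illustrative rules and confidences only.)
def violation_score(transaction, rules):
    return sum(conf for body, head, conf in rules
               if body <= transaction and not head <= transaction)

rules = [
    ({"bread"}, {"butter"}, 0.9),   # bread -> butter, confidence 0.9
    ({"beer"},  {"chips"},  0.6),   # beer  -> chips,  confidence 0.6
]
s_clean = violation_score({"bread", "butter"}, rules)  # violates nothing
s_dirty = violation_score({"bread", "beer"}, rules)    # violates both rules
```

Sorting transactions by this score surfaces the most suspicious ones first; the system then accepts, rejects, or warns based on the score, as described above.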
  • 24. 24 Data Cleaning Using Functional Dependencies
    • A functional dependency (FD) is an important feature for describing the relationship between attributes and candidate keys in tuples.
    • FD discovery can find too many FDs; used directly in a cleaning process, this can drive the process to NP time and degrade the performance of the data cleaning.
    • A cleaning engine is developed by combining an FD discovery technique with a data cleaning technique, and by using a query optimization feature called the selectivity value to decrease the number of FDs discovered (pruning unlikely FDs).
  • 25. 25 Today’s Lecture Content
    • Introduction
    • Enhanced Technique to Clean Data in the Data Warehouse
    • DWCLEANSER: A Framework for Approximate Duplicate Detection
    • Data Quality Mining
      • Data Quality Mining With Association Rules
      • Data Cleaning Using Functional Dependencies
  • 27. 27 System Architecture
    Data collector
    • Retrieves data from a relational database, improves some aspects of data quality (corrects basic typos, invalid domains, and invalid formats), and prepares the data for the next module (in a relational format).
    FD engine
    • An FD-finding module.
    • Dirty data usually contains some errors, so the approximate FD technique is used to find FDs in the presence of errors.
    • The selectivity value technique is applied to rank the candidates in the pruning step, and only candidates with high or low rank are selected in the FD computation step.
    • At the same time, any errors detected by this modified FD engine are suspicious tuples for cleaning.
    • The errors can be separated into 2 types:
    o Errors from finding non-candidate-key FDs are inconsistent data.
    o Errors from finding candidate-key FDs are potentially duplicated data.
    • The discovered FDs, together with all suspicious error tuples, are sent to the next step.
  • 28. 28 System Architecture (continued)
    Cleaning engine
    Receives the suspicious error tuples and the FDs selected by the FD engine, then assigns a weight to the data (more errors produce a higher weight); tuples with low weights are used to repair the high-weight tuples.
    • FD repairing technique: after updating the weights, the engine uses the FDs to clean the data with a cost-based algorithm (low-cost data repairs high-cost data).
    • Duplicate elimination: the last step finds duplicate data with an improved sorted-neighborhood method, using the candidate-key FDs from the FD engine to assign keys and sorting the data on the attributes on the left-hand side of the FDs.
    Relational database
    • The other modules store and retrieve data through this module.
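The duplicate-elimination step can be sketched as a basic sorted-neighborhood pass: sort the records on a key built from the FD's left-hand-side attributes, then compare only records inside a small sliding window. The key fields, window size, and equality test below are hypothetical placeholders, not the paper's improved variant:

```python
# Basic sorted-neighborhood sketch (hypothetical key fields and window size).
def sorted_neighborhood(records, key_fields, window=3, same=None):
    same = same or (lambda r1, r2: r1 == r2)   # placeholder match predicate
    key = lambda r: tuple(str(r[f]).lower() for f in key_fields)
    ordered = sorted(records, key=key)          # sort on the FD's LHS attributes
    duplicates = []
    for i, rec in enumerate(ordered):
        # Compare only against the previous (window - 1) records.
        for other in ordered[max(0, i - window + 1):i]:
            if same(rec, other):
                duplicates.append((other, rec))
    return duplicates

recs = [{"name": "ali", "city": "baghdad"},
        {"name": "sara", "city": "basra"},
        {"name": "ali", "city": "baghdad"}]
dups = sorted_neighborhood(recs, ["name", "city"])
```

Sorting first is what makes the window comparison cheap: true duplicates land next to each other, so each record is compared against only a handful of neighbors instead of the whole table.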
  • 29. 29 Selecting the FD
    The selectivity value is applied to rank the candidates in order to find appropriate FDs.
    1. Selectivity value
    The selectivity value measures the distribution of an attribute's values. If the selectivity value of an attribute:
    • is high, the attribute's values are highly distributed;
    • is low, the attribute's values are more likely to be repeated.
    A highly distributed attribute is potentially a candidate key and can be used to eliminate duplicates. The least distributed attribute can be used to repair distorted attribute values in the cleaning engine.
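The selectivity value described above is simply the ratio of distinct values to total tuples for an attribute. A small sketch with invented rows:

```python
# Selectivity of an attribute: distinct values / total tuples.
# High selectivity suggests a candidate key; low selectivity suggests a
# near-constant column, a likely right-hand side of a trivial FD.
def selectivity(rows, attr):
    values = [r[attr] for r in rows]
    return len(set(values)) / len(values)

rows = [{"id": i, "country": "IQ"} for i in range(4)]   # invented sample rows
s_id = selectivity(rows, "id")              # 4 distinct / 4 rows = 1.0
s_country = selectivity(rows, "country")    # 1 distinct / 4 rows = 0.25
```

Here "id" would land in the high-ranking group (candidate-key material) and "country" in the low-ranking group, while middle-ranking attributes would be ignored, as slide 30 describes.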
  • 30. 30 Selecting the FD
    2. Ranking the candidates
    After calculating the selectivity values to determine the candidate ranks, the ranks are sorted in ascending order. To choose potentially good candidates, a low ranking threshold and a high ranking threshold are defined as pruning points; the selected candidates are those with either high or low ranking values.
    • A high-ranking candidate has high selectivity and is potentially a candidate key.
    • A low-ranking candidate potentially has an invariant value that can be functionally determined by some attribute in a trivial manner; thus it can be treated as a non-candidate key on the right-hand side of an FD.
    • The middle ranking is not precise, so it is ignored.
  • 31. 31 Selecting the FD
    3. Improving the pruning step
    The pruning step generates the candidate set by computing the candidates from level 1. (Slide figure: pruning lattice example.)
  • 32. 32 Improved Pruning Method
    • Pruning begins by getting the set of candidates at level 1 and then checking the candidates.
    • If a pair is not an FD and falls in either the high or the low accepted ranking, the StoreCandidate function stores a new candidate built from candidate_x and candidate_y at the current level.
    • Candidates that are in neither the low nor the high ranking are ignored.
  • 33. 33 Results
    50,000 real customer tuples are used as the data source, separated into 3 datasets:
    o the first dataset has 10% duplicates,
    o the second dataset has 10% errors,
    o the last dataset has 10% duplicates and errors.
    Results showed that this work can identify duplicates and anomalies with high recall and a low false-positive rate.
    Problem: the combined solution is sensitive to data size:
    • As data volume increases, the speed of the discovery algorithm decreases.
    • As the number of attributes increases, discovery creates more FD candidates and generates too many FDs, including noisy ones.
  • 34. 34 Strengths and Limitations of Data Quality Mining Methods
    Association rules
    + Reduces the number of rules to generate for a transaction
    + Avoids a severe pitfall of association rule mining
    - Difficult to generate association rules for all transactions
    Functional dependency
    + Easily identifies suspicious tuples for cleaning
    + Decreases the number of functional dependencies discovered
    - Not suitable for large databases, because it is difficult to sort all the records
  • 35. 35 Main References
    1. Hamad, Mortadha M., and Alaa Abdulkhar Jihad (2011). "An Enhanced Technique to Clean Data in the Data Warehouse." 2011 Developments in E-systems Engineering.
    2. Thakur, G., Singh, M., Pahwa, P., and Tyagi, N. (2011). "DWCLEANSER: A Framework for Approximate Duplicate Detection." Advances in Computing and Information Technology, pp. 355-364.
    3. Natarajan, K., Li, J., and Koronios, A. (2010). "Data Mining Techniques for Data Cleaning." Engineering Asset Lifecycle Management, Springer London, pp. 796-804.
    4. Kaewbuadee, Kollayut, Yae Temtanapat, and Ratchata Peachavanish (2006). "Data Cleaning Using Functional Dependency from Data Mining Process." International Journal on Computer Science and Information System (IADIS), 1(2), pp. 117-131. ISSN: 1646-3692.