1
Data Cleaning Techniques
Shahid Rajaee Teacher Training University
Faculty of Computer Engineering
PRESENTED BY:
Amir Masoud Sefidian
2
Today’s Lecture Content
• Introduction
• Enhanced Technique to Clean Data in the Data Warehouse
  • DWCLEANSER: A Framework for Approximate Duplicate Detection
• Data Quality Mining
  • Data Quality Mining With Association Rules
  • Data Cleaning Using Functional Dependencies
4
Introduction
• Data quality is a central issue in quality information management.
• Data quality problems can occur anywhere in an information system.
• These problems are addressed by data cleaning:
• The process of detecting inaccurate, incomplete, or unreasonable data and then improving quality by correcting the detected errors => reduces errors and improves data quality.
• Data cleaning can be time-consuming and tedious, but it cannot be ignored.
• Data quality criteria: accuracy, integrity, completeness, validity, consistency, schema conformance, uniqueness, etc.
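Criteria such as completeness and validity can be checked mechanically per record. A minimal sketch in Python (the field names, required-field list, and value ranges are illustrative assumptions, not taken from the referenced papers):

```python
def completeness(record, required):
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in required if record.get(f) not in (None, ""))
    return filled / len(required)

def is_valid_age(value):
    """Domain/validity check: age must be an integer in a plausible range."""
    return isinstance(value, int) and 0 <= value <= 130

record = {"name": "Alice", "age": 34, "email": ""}   # hypothetical record
required = ["name", "age", "email"]
print(completeness(record, required))  # 2 of 3 required fields are filled
print(is_valid_age(record["age"]))     # prints True
```

Records scoring below a chosen completeness threshold, or failing a domain check, would be flagged for correction rather than silently dropped.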
6
An Enhanced Technique to Clean Data in the Data Warehouse
• Uses a new algorithm that detects and corrects most common error types and expected problems, such as lexical errors, domain format errors, irregularities, integrity constraint violations, duplicates, and missing values.
• Presents a solution that works on quantitative data and any data with a limited value domain.
• Offers user interaction: the user selects the rules, the sources, and the desired targets.
• The algorithm is able to clean the data completely, addressing all specified mistakes and inconsistencies in the data or numerical values.
• The time taken to process large data sets matters less than obtaining high-quality data, since a large volume of data can be treated in one pass.
• The main focus is on achieving good data quality.
• The algorithm runs at an adequate pace and scales well to large data volumes without significant performance degradation.
7
Flowchart of proposed technique
The proposed model can easily be deployed in a data warehouse using the following algorithm:
8
The user selects the rules needed by the data cleaning system, along with the layout and descriptions of the data set's fields, which are used in implementing the algorithm.
Comparison of the Proposed Technique with Some Existing Techniques
1009 records containing many anomalies were examined before and after processing by different available methods (such as statistics and clustering). The large difference in the number of remaining anomalies confirms the effectiveness and quality of this algorithm.
11
DWCLEANSER: A Framework for Approximate Duplicate Detection
• A novel framework for detecting exact as well as approximate duplicates in a data warehouse.
• Decreases the complexity of previously designed frameworks by providing efficient data cleaning techniques.
• Provides comprehensive metadata support for the whole cleaning process.
• Provisions are also suggested for handling outliers and missing fields.
12
Existing Framework
The previously designed framework is a sequential, token-based framework that offers the fundamental services of data cleaning in six steps:
1) Selection of attributes:
Attributes are identified and selected for further processing in the following steps.
2) Formation of tokens:
The selected attributes are used to form tokens for similarity computation.
3) Clustering/blocking of records:
A blocking/clustering algorithm groups the records based on the calculated similarity and a block-token key.
4) Similarity computation for selected attributes:
The Jaccard similarity method is used to compare token values of selected attributes in a field.
5) Detection and elimination of duplicate records:
A rule-based detection and elimination approach detects and eliminates duplicates within one or more clusters.
6) Merge:
The cleansed data is combined and stored.
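Steps 2 and 4 above can be sketched together: tokens are formed from an attribute value, and the Jaccard similarity of two token sets is the size of their intersection over the size of their union. A minimal illustration (tokenizing by lower-cased whitespace-separated words is an assumption; the framework does not specify its tokenizer):

```python
def tokens(value):
    """Step 2: form a token set from an attribute value."""
    return set(value.lower().split())

def jaccard(a, b):
    """Step 4: Jaccard similarity of two token sets, |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # two empty values are considered identical
    return len(a & b) / len(a | b)

t1 = tokens("International Business Machines")
t2 = tokens("International Business Machines Corp")
print(jaccard(t1, t2))  # 3 shared tokens out of 4 total -> 0.75
```

A pair of attribute values whose similarity exceeds the framework's threshold would then be routed into the same cluster for duplicate checking.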
13
14
Proposed Framework: DWCLEANSER
15
1. Field Selection
• Records are decomposed into fields.
• Fields are analyzed to gather data about their type, relationships with other fields, key fields, and integrity constraints, so that enough metadata exists about the decomposed fields.
• Missing fields are stored in a separate temporary table and preserved in the repository along with their source record, relation name, data types, and integrity constraints.
• Missing fields are reviewed by the DBA to verify the reason for their existence:
(1) if the data is missing, it can be recaptured;
(2) if the value is not known, efforts can be made to gather the data to complete the record or to fill the missing field with a valid value.
If no valid data can be collected, the value is preserved in the repository for further verification and is not used in the cleaning procedure.
16
2. Computation of Rules
Certain rules are computed that will be used during the cleaning process.
Threshold value:
The threshold value is calculated based on experiments conducted in previous research.
Values below the threshold increase the number of false positives; values above it fail to detect all duplicates; values in between can be used to recognize approximate duplicates.
Rules for classification of fields:
Selected fields are classified on the basis of their data types.
Rules for data quality attributes:
The previous framework focused on only three data quality attributes: completeness, accuracy, and consistency.
Two further quality attributes are proposed in the new framework: validity and integrity.
17
3. Formation of Clusters
• A recursive record-matching algorithm is used for initial cluster formation, with a slight modification: it matches fields rather than whole records.
• Clusters are stored in a priority queue.
• Priorities of clusters in the queue are assigned on the basis of their ability to detect duplicate data sets: the cluster that detected the most recent match is assigned the highest priority.
4. Match Score
Match scores are assigned by applying the Smith-Waterman algorithm (an edit-distance-based strategy).
The calculations of this method are stored in a matrix.
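The Smith-Waterman score can be computed with the dynamic-programming matrix mentioned above: each cell holds the best local-alignment score ending at that character pair, floored at zero. A minimal sketch (the match/mismatch/gap weights are illustrative defaults, not values from the paper):

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between two strings."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]  # the score matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Floor at 0: a local alignment can restart anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("jonathan", "jonhatan"))  # high score despite the swap
```

Because the score tolerates gaps and local restarts, transposed or slightly misspelled field values still receive a high match score, which is what makes it useful for approximate-duplicate detection.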
5. Detection of Exact and Approximate Duplicates
When a new field is to be matched against the data sets present in a cluster, a Union-Find structure is used first.
If it fails to detect a match, the Smith-Waterman algorithm is employed.
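A Union-Find (disjoint-set) structure, as used in step 5, groups values already known to be duplicates so that a new value can be tested for cluster membership in near-constant time before falling back to the costlier Smith-Waterman comparison. A minimal sketch (path halving is one common implementation choice; the example values are illustrative):

```python
class UnionFind:
    """Disjoint-set structure: merge duplicate groups, query membership fast."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

uf = UnionFind()
uf.union("IBM", "I.B.M.")        # values already judged to be duplicates
uf.union("I.B.M.", "IBM Corp")
print(uf.find("IBM Corp") == uf.find("IBM"))  # prints True: same cluster
```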
6. Handling of Outliers and Missing Fields
Records that do not match any existing cluster are called outliers or singleton records.
Singleton records may be stored in a separate file in the repository for future analysis and comparisons.
18
7. Updating Metadata/Repository
Metadata and repositories are an integral part of the proposed framework. Important components of the repository:
1. Data dictionary: stores information about the relations, their sources, schema, etc.
2. Rules directory: stores all calculated values of thresholds, quality attributes, matching scores, etc.
3. Log files: store
• information about the selected fields and their source records;
• the classification of the fields by data type into three categories: numeric, strings, and characters.
4. Outlier & missing field files: store the outliers and missing fields with related information such as type and source relation.
19
Comparison of Existing and Proposed Framework
21
Data Quality Mining
The data mining process:
• Involves data collection, cleaning the data, building a model, and monitoring the models.
• Automatically extracts hidden and intrinsic information from collections of data.
• Offers various techniques that are suitable for data cleaning.
Some commonly used data mining techniques:
Association rule mining:
• Takes an input and induces rules as output; the outputs can be association rules.
• Association rules describe relationships among large data sets and the co-occurrence of items.
Functional dependency:
• Shows the connection and association between attributes: how one specific combination of values on one set of attributes determines one specific combination of values on another set.
23
Data Quality Mining With Association Rules
Objective:
Association rules are used here to detect, quantify, explain, and correct data quality deficiencies in very large databases: they find relationships among the items in a huge database and, in addition, improve data quality.
Association rule mining generates rules over all transactions, which are checked by their confidence level.
The strength of the rules is determined by the following steps:
• Determine the transaction type.
• Generate the association rules.
• Assign a score to each transaction based on the generated rules.
Score: the sum of the confidence values of the rules the transaction violates.
A rule violation occurs when a tuple satisfies the rule body but not its consequent.
Idea: transactions assigned high scores are suspected of deficiencies.
A minimal confidence threshold is suggested to restrict the rule set and improve the results.
Transactions are sorted according to their score values.
Based on the score, the system decides whether to accept or reject the data, or else issue a warning.
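The scoring steps above can be sketched as follows; the rules, their confidence values, and the threshold are illustrative assumptions, not values from the paper:

```python
# Hypothetical association rules: (body, consequent, confidence).
rules = [
    ({"bread"}, {"butter"}, 0.9),
    ({"beer"}, {"chips"}, 0.7),
]

def score(transaction, rules, min_conf=0.6):
    """Sum the confidence of every rule the transaction violates
    (body satisfied but consequent absent)."""
    s = 0.0
    for body, head, conf in rules:
        if conf < min_conf:
            continue  # the suggested threshold restricts the rule set
        if body <= transaction and not head <= transaction:
            s += conf  # rule violated
    return s

t = {"bread", "beer", "chips"}
print(score(t, rules))  # violates bread -> butter only, so prints 0.9
```

Transactions would then be sorted by this score, with the highest-scoring ones flagged for rejection or a warning.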
24
Data Cleaning Using Functional Dependencies
A functional dependency (FD) captures the relationship between attributes and candidate keys in tuples.
FD discovery can find too many FDs; used directly in a cleaning process, this can make the process intractable (NP time) and degrade the performance of the data cleaning.
A cleaning engine is developed by combining:
an FD discovery technique + a data cleaning technique
+
a feature from query optimization called the selectivity value, used to decrease the number of FDs discovered (pruning unlikely FDs).
26
SYSTEM ARCHITECTURE
27
SYSTEM ARCHITECTURE
Data collector
• Retrieves data from a relational database, improves some aspects of data quality (corrects basic typos, invalid domains, and invalid formats), and prepares the data for the next module (in relational format).
FD engine
• An FD-finding module.
• Dirty data usually contains errors, so the approximate-FD technique is used to tolerate errors while finding FDs.
• The selectivity-value technique ranks the candidates in the pruning step, and only the candidates with high or low rank are selected from the FD computation step.
• At the same time, any errors detected by this modified FD engine are suspicious tuples for cleaning.
• The errors can be separated into 2 types:
o Errors from finding non-candidate-key FDs indicate inconsistent data.
o Errors from finding candidate-key FDs indicate potentially duplicated data.
• The discovered FDs, together with all suspicious error tuples, are sent to the next step.
28
SYSTEM ARCHITECTURE
Cleaning engine:
Receives:
• the suspicious error tuples
• the FDs selected by the FD engine
Then:
Weights are assigned to the data (more errors produce a higher weight), and tuples with low weights are used to repair the high-weight tuples.
FD repairing technique:
After updating the weights, the engine uses the FDs to clean the data with a cost-based algorithm (low-cost data repairs high-cost data).
Duplicate elimination:
The last step finds duplicate data by improving the sorted-neighborhood-method algorithm: the candidate-key FDs from the FD engine are used to assign keys, and the data is sorted on the attributes on the left-hand side of the FDs.
Relational database:
The other modules store and retrieve their data through this module.
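The duplicate-elimination step described above can be sketched with a basic sorted-neighborhood pass: sort the records on a key built from the FD's left-hand-side attributes, then compare each record only with its neighbours inside a sliding window. The window size, records, and comparison function here are illustrative assumptions:

```python
def sorted_neighborhood(rows, key_attrs, window=3, same=None):
    """Sort records on a key from the FD's left-hand-side attributes,
    then compare each record only with neighbours inside the window."""
    same = same or (lambda a, b: a == b)
    ordered = sorted(rows, key=lambda r: tuple(r[a] for a in key_attrs))
    pairs = []
    for i, r in enumerate(ordered):
        for s in ordered[i + 1 : i + window]:
            if same(r, s):
                pairs.append((r, s))  # candidate duplicate pair
    return pairs

rows = [
    {"id": 1, "name": "alice"},
    {"id": 3, "name": "bob"},
    {"id": 2, "name": "alice"},
]
dups = sorted_neighborhood(rows, ["name"],
                           same=lambda a, b: a["name"] == b["name"])
print(len(dups))  # prints 1: the two "alice" records meet after sorting
```

Sorting brings likely duplicates next to each other, so the quadratic all-pairs comparison shrinks to a linear pass, which is the method's appeal for large tables.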
29
SELECTING THE FD
The selectivity value is applied to rank the candidates in order to find the appropriate FDs.
1. Selectivity value
The selectivity value measures the distribution of an attribute's values. If the selectivity value of an attribute:
• is high => the attribute's values are highly distributed;
• is low => the attribute's values are more likely to be uniform.
A highly distributed attribute is potentially a candidate key and can be used to eliminate duplicates.
The least distributed attributes can be used to repair distorted attribute values in the cleaning engine.
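The selectivity value can be read as the ratio of distinct values to total values of an attribute. A minimal sketch (the paper's exact formula may differ, so treat this definition as an assumption; the data is illustrative):

```python
def selectivity(rows, attr):
    """Selectivity of an attribute: distinct values / total values.
    High -> widely distributed (candidate-key-like); low -> near-constant."""
    values = [r[attr] for r in rows]
    return len(set(values)) / len(values)

rows = [{"id": i, "country": "TH"} for i in range(4)]
print(selectivity(rows, "id"))       # prints 1.0: every value is distinct
print(selectivity(rows, "country"))  # prints 0.25: one repeated value
```

Here "id" would be ranked as a candidate-key-like attribute and "country" as a near-invariant one, matching the high/low split described above.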
30
SELECTING THE FD
2. Ranking the candidates
After calculating the selectivity values to determine the candidates' ranks, the ranks are sorted in ascending order.
To choose potentially good candidates:
A low-ranking threshold and a high-ranking threshold are defined as pruning points; the selected candidates are those with either high-ranking or low-ranking values.
A high-ranking candidate has high selectivity and is potentially a candidate key.
A low-ranking candidate potentially has an invariant value, which can be functionally determined by some attribute in a trivial manner; thus, it can be treated as a non-candidate key on the right-hand side.
The middle-ranking candidates are not precise, so they are ignored.
31
SELECTING THE FD
3. Improving the pruning step:
The pruning step generates the candidate set by computing the candidates from the previous level (level l − 1).
(Figure: pruning lattice example)
32
Improved pruning method
• Begins the pruning by getting the set of candidates at the previous level and then checks the candidates.
• If a pair is not yet an FD and lies in either the high or the low accepted ranking => a StoreCandidate function stores a new candidate formed from candidate_x and candidate_y at the current level.
• Other candidates, in neither the low nor the high ranking, are ignored.
33
Results
50,000 real customer tuples are used as the data source, separated into 3 sets:
o the first dataset has 10% duplicates,
o the second dataset has 10% errors,
o the last dataset has 10% duplicates and errors.
The results showed that this work can identify duplicates and anomalies with high recall and a low false-positive rate.
PROBLEM:
The combined solution is sensitive to data size:
• as data volume increases, the discovery algorithm's speed decreases;
• as the number of attributes increases, the discovery creates more FD candidates and generates too many FDs, including noisy ones.
34
Strengths and Limitations of Data Quality Mining Methods

Association rules:
• Reduces the number of rules to generate for a transaction.
• Avoids a severe pitfall of association rule mining.
• Limitation: it is difficult to generate association rules for all transactions.

Functional dependency:
• Easily identifies suspicious tuples for cleaning.
• Decreases the number of functional dependencies discovered.
• Limitation: not suitable for large databases, because it is difficult to sort all the records.
35
Main References:
1. Hamad, M. M., and Jihad, A. A. (2011). "An Enhanced Technique to Clean Data in the Data Warehouse". 2011 Developments in E-systems Engineering.
2. Thakur, G., Singh, M., Pahwa, P., and Tyagi, N. (2011). "DWCLEANSER: A Framework for Approximate Duplicate Detection". Advances in Computing and Information Technology, pp. 355–364.
3. Natarajan, K., Li, J., and Koronios, A. (2010). "Data Mining Techniques for Data Cleaning". Engineering Asset Lifecycle Management, Springer London, pp. 796–804.
4. Kaewbuadee, K., Temtanapat, Y., and Peachavanish, R. (2006). "Data Cleaning Using Functional Dependency from Data Mining Process". IADIS International Journal on Computer Science and Information Systems, vol. 1, no. 2, pp. 117–131. ISSN 1646-3692.
Questions?

Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 

Recently uploaded (20)

Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 

Data Cleaning Techniques

  • 6. 6 An Enhanced Technique to Clean Data in the Data Warehouse
    • Uses a new algorithm that detects and corrects most error types and expected problems, such as lexical errors, domain format errors, irregularities, integrity constraint violations, duplicates, and missing values.
    • Presents a solution that works on quantitative data and on any data with a limited set of values.
    • Offers user interaction: the user selects the rules, the sources, and the desired targets.
    • The algorithm can clean the data completely, addressing all mistakes and inconsistencies in the specified data or numerical values.
    • The time taken to process huge data is less important than obtaining high-quality data, since a huge amount of data can be treated in a single pass.
    • The main focus is on achieving good data quality.
    • The pace of implementation of the algorithm is adequate: it scales well to processing large amounts of data without significant degradation in most relative performance measures.
  • 7. 7 Flowchart of the Proposed Technique
    The proposed model can easily be implemented in a data warehouse by the following algorithm:
  • 8. 8 The user selects any rules needed in the data cleaning system, together with the layout and descriptions of the data set fields, which are used in implementing the algorithm.
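The user-selected rule step above can be sketched as a simple rule-driven cleaning pass. This is a minimal illustration under assumed inputs, not the paper's actual algorithm: the rules, field names, and default repair values are all hypothetical.

```python
# Minimal sketch of a rule-driven cleaning pass (hypothetical rules/fields,
# not the paper's algorithm). Each rule returns (is_valid, repair_value).
def clean_records(records, rules):
    cleaned, report = [], []
    for rec in records:
        fixed = dict(rec)
        for field, rule in rules.items():
            ok, repaired = rule(fixed.get(field))
            if not ok:
                report.append((rec, field, fixed.get(field)))  # log the anomaly
                fixed[field] = repaired                        # repair in place
        cleaned.append(fixed)
    return cleaned, report

# Example rules: a domain rule for "age" and a missing-value rule for "city".
rules = {
    "age":  lambda v: (isinstance(v, int) and 0 <= v <= 120, 0),
    "city": lambda v: (v not in (None, ""), "UNKNOWN"),
}
records = [{"age": 35, "city": "Anbar"}, {"age": 400, "city": ""}]
cleaned, report = clean_records(records, rules)
```

Here the second record violates both rules, so it is repaired and both anomalies are logged for later review, mirroring the user-in-the-loop design described above.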
  • 9. 9 Comparison of the Proposed Technique with Some Existing Techniques
    1009 records containing many anomalies were examined before and after processing by different available methods (such as statistics and clustering); the large difference in the number of anomalies confirms the effectiveness and quality of this algorithm.
  • 10. 10 Today’s Lecture Content
    • Introduction
    • Enhanced Technique to Clean Data in the Data Warehouse
    • DWCLEANSER: A Framework for Approximate Duplicate Detection
    • Data Quality Mining
      • Data Quality Mining With Association Rules
      • Data Cleaning Using Functional Dependencies
  • 11. 11 DWCLEANSER: A Framework for Approximate Duplicate Detection
    • A novel framework for detecting exact as well as approximate duplicates in a data warehouse.
    • Decreases the complexity of previously designed frameworks by providing efficient data cleaning techniques.
    • Provides comprehensive metadata support to the whole cleaning process.
    • Provisions have also been suggested to take care of outliers and missing fields.
  • 13. 13 Existing Framework
    The previously designed framework is a sequential, token-based framework that offers the fundamental services of data cleaning in six steps:
    1) Selection of attributes: attributes are identified and selected for further processing in the following steps.
    2) Formation of tokens: the selected attributes are used to form tokens for similarity computation.
    3) Clustering/blocking of records: a blocking/clustering algorithm groups the records based on the calculated similarity and a block-token key.
    4) Similarity computation for selected attributes: the Jaccard similarity method is used to compare token values of selected attributes in a field.
    5) Detection and elimination of duplicate records: a rule-based detection and elimination approach detects and eliminates duplicates within one cluster or across many clusters.
    6) Merge: the cleansed data is combined and stored.
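Step 4's Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. A small sketch; the whitespace tokenization shown here is an assumption, since the framework builds its own tokens in step 2:

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token collections: |A intersect B| / |A union B|."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0          # two empty fields are treated as identical
    return len(a & b) / len(a | b)

# Three of the seven distinct tokens are shared, so similarity = 3/7.
sim = jaccard("J Smith 42 Oak St".lower().split(),
              "John Smith 42 Oak Street".lower().split())
```

Comparing this value against the thresholds of step 2 is what separates exact, approximate, and non-duplicate field pairs.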
  • 15. 15 1. Field Selection
    • Records are decomposed into fields.
    • Fields are analyzed to gather data about their types, relationships with other fields, key fields, and integrity constraints, so that enough metadata about the decomposed fields is available.
    • Missing fields are stored in a separate temporary table and preserved in the repository along with their source record, relation name, data types, and integrity constraints.
    • Missing fields are reviewed by the DBA to verify the reason for their existence:
    (1) if the data is missing, it can be recaptured;
    (2) if the value is not known, efforts can be made to gather the data to complete the record, or the missing field can be filled with a valid value. If no valid data can be collected, the value is preserved in the repository for further verification and is not used in the cleaning procedure.
  • 16. 16 2. Computation of Rules
    Certain rules are computed that will be used during the cleaning process.
    • Threshold value: calculated based on experiments conducted in previous research. Values lower than the threshold increase the number of false positives; values above it fail to detect all duplicates; values in between can be used to recognize approximate duplicates.
    • Rules for classification of fields: selected fields are classified on the basis of their data types.
    • Rules for data quality attributes: the previous framework focused on only 3 quality attributes of data (completeness, accuracy, and consistency). Two further quality attributes are proposed in the new framework: validity and integrity.
  • 17. 17 3. Formation of Clusters
    • A recursive record matching algorithm is used for initial cluster formation, with a slight modification: it matches fields rather than whole records.
    • Clusters are stored in a priority queue.
    • Priorities of clusters in the queue are assigned on the basis of their ability to detect duplicate data sets: the cluster that detected the most recent match is assigned the highest priority.
    4. Match Score
    Match scores are assigned by applying the Smith-Waterman algorithm (an edit-distance-based strategy). The calculations of this method are stored in a matrix.
    5. Detection of Exact and Approximate Duplicates
    When a new field is to be matched against any data set present in a cluster, a Union-Find structure is used. If it fails to detect a match, the Smith-Waterman algorithm is employed.
    6. Handling of Outliers and Missing Fields
    Records that do not match any existing cluster are called outliers or singleton records. Singleton records may be stored in a separate file in the repository for future analysis and comparisons.
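The Smith-Waterman score in step 4 comes from a dynamic-programming matrix in which each cell holds the best local-alignment score ending at that pair of characters, with negative values reset to zero; the largest cell is the match score. A minimal sketch, where the match/mismatch/gap weights are assumptions rather than values from the paper:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.
    H[i][j] = best score of an alignment ending at a[i-1], b[j-1]."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)  # 0 resets
            best = max(best, H[i][j])
    return best

score = smith_waterman("smith", "smyth")   # four matches, one mismatch
```

A pair of fields whose score exceeds the threshold from step 2 would be flagged as an approximate duplicate.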
  • 18. 18 7. Updating Metadata/Repository
    Metadata and repositories are an integral part of the proposed framework. Important components of the repositories:
    1. Data dictionary: stores information about the relations, their sources, schema, etc.
    2. Rules directory: stores all the calculated values of thresholds, quality attributes, matching scores, etc.
    3. Log files: store information about the selected fields and their source records, and the classification of the fields by data type into 3 explicit categories: numeric, strings, and characters.
    4. Outlier and missing-field files: store the outliers and missing fields with related information such as type and source relation.
  • 19. 19 Comparison of Existing and Proposed Framework
  • 20. 20 Today’s Lecture Content
    • Introduction
    • Enhanced Technique to Clean Data in the Data Warehouse
    • DWCLEANSER: A Framework for Approximate Duplicate Detection
    • Data Quality Mining
      • Data Quality Mining With Association Rules
      • Data Cleaning Using Functional Dependencies
  • 21. 21 Data Quality Mining
    The data mining process:
    • Involves data collection, cleaning the data, building a model, and monitoring the model.
    • Automatically extracts hidden and intrinsic information from collections of data.
    • Offers various techniques that are suitable for data cleaning.
    Some commonly used data mining techniques:
    • Association rule mining: takes an input and induces rules as output; the outputs can be association rules. Association rules describe relationships among large data sets and the co-occurrence of items.
    • Functional dependency: shows the connection and association between attributes, i.e., how one specific combination of values on one set of attributes determines one specific combination of values on another set.
  • 22. 22 Today’s Lecture Content
    • Introduction
    • Enhanced Technique to Clean Data in the Data Warehouse
    • DWCLEANSER: A Framework for Approximate Duplicate Detection
    • Data Quality Mining
      • Data Quality Mining With Association Rules
      • Data Cleaning Using Functional Dependencies
  • 23. 23 Data Quality Mining with Association Rules
    Objective: detect, quantify, explain, and correct data quality deficiencies in very large databases; find relationships among the items in a huge database and, in addition, improve the data quality.
    Association rule mining generates rules over all transactions, which are checked against their confidence level. The strength of the rules is found by the following steps:
    • Determine the transaction type.
    • Generate the association rules.
    • Assign a score to each transaction based on the generated rules.
    Score: the sum of the confidence values of the rules the transaction violates. A rule violation occurs when a tuple satisfies the rule body but not its consequent.
    Idea: transactions assigned high scores are suspected of deficiencies.
    A minimal confidence threshold is suggested to restrict the rule set and improve the results.
    Transactions are sorted according to their score values. Based on the score, the system decides whether to accept or reject the data, or else issue a warning.
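The scoring step above can be sketched as follows; the item sets, rules, and confidence values are invented for illustration, not mined from real data:

```python
# Score a transaction by summing the confidences of the rules it violates:
# a rule (body -> head) is violated when the transaction contains the body
# but not the head. (Illustrative rules and confidences only.)
def violation_score(transaction, rules):
    return sum(conf for body, head, conf in rules
               if body <= transaction and not head <= transaction)

rules = [
    ({"bread"}, {"butter"}, 0.9),   # bread -> butter, confidence 0.9
    ({"beer"},  {"chips"},  0.6),   # beer  -> chips,  confidence 0.6
]
s_clean = violation_score({"bread", "butter"}, rules)  # violates nothing
s_dirty = violation_score({"bread", "beer"}, rules)    # violates both rules
```

Sorting transactions by this score surfaces the most suspicious ones first; the system then accepts, rejects, or warns based on the score, as described above.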
  • 24. 24 Data Cleaning Using Functional Dependencies
    • A functional dependency (FD) is an important feature for describing the relationship between attributes and candidate keys in tuples.
    • FD discovery can find too many FDs; used directly in a cleaning process, this can drive the process to NP time and degrade the performance of the data cleaning.
    • A cleaning engine is developed by combining an FD discovery technique with a data cleaning technique, and by using a query optimization feature called the selectivity value to decrease the number of FDs discovered (pruning unlikely FDs).
  • 25. 25 Today’s Lecture Content
    • Introduction
    • Enhanced Technique to Clean Data in the Data Warehouse
    • DWCLEANSER: A Framework for Approximate Duplicate Detection
    • Data Quality Mining
      • Data Quality Mining With Association Rules
      • Data Cleaning Using Functional Dependencies
  • 27. 27 System Architecture
    Data collector
    • Retrieves data from a relational database, improves some aspects of data quality (corrects basic typos, invalid domains, and invalid formats), and prepares the data for the next module (in a relational format).
    FD engine
    • An FD-finding module.
    • Dirty data usually contains some errors, so the approximate FD technique is used to find FDs in the presence of errors.
    • The selectivity value technique is applied to rank the candidates in the pruning step, and only candidates with high or low rank are selected in the FD computation step.
    • At the same time, any errors detected by this modified FD engine are suspicious tuples for cleaning.
    • The errors can be separated into 2 types:
    o Errors from finding non-candidate-key FDs are inconsistent data.
    o Errors from finding candidate-key FDs are potentially duplicated data.
    • The discovered FDs, together with all suspicious error tuples, are sent to the next step.
  • 28. 28 System Architecture (continued)
    Cleaning engine
    Receives the suspicious error tuples and the FDs selected by the FD engine, then assigns a weight to the data (more errors produce a higher weight); tuples with low weights are used to repair the high-weight tuples.
    • FD repairing technique: after updating the weights, the engine uses the FDs to clean the data with a cost-based algorithm (low-cost data repairs high-cost data).
    • Duplicate elimination: the last step finds duplicate data with an improved sorted-neighborhood method, using the candidate-key FDs from the FD engine to assign keys and sorting the data on the attributes on the left-hand side of the FDs.
    Relational database
    • The other modules store and retrieve data through this module.
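The duplicate-elimination step can be sketched as a basic sorted-neighborhood pass: sort the records on a key built from the FD's left-hand-side attributes, then compare only records inside a small sliding window. The key fields, window size, and equality test below are hypothetical placeholders, not the paper's improved variant:

```python
# Basic sorted-neighborhood sketch (hypothetical key fields and window size).
def sorted_neighborhood(records, key_fields, window=3, same=None):
    same = same or (lambda r1, r2: r1 == r2)   # placeholder match predicate
    key = lambda r: tuple(str(r[f]).lower() for f in key_fields)
    ordered = sorted(records, key=key)          # sort on the FD's LHS attributes
    duplicates = []
    for i, rec in enumerate(ordered):
        # Compare only against the previous (window - 1) records.
        for other in ordered[max(0, i - window + 1):i]:
            if same(rec, other):
                duplicates.append((other, rec))
    return duplicates

recs = [{"name": "ali", "city": "baghdad"},
        {"name": "sara", "city": "basra"},
        {"name": "ali", "city": "baghdad"}]
dups = sorted_neighborhood(recs, ["name", "city"])
```

Sorting first is what makes the window comparison cheap: true duplicates land next to each other, so each record is compared against only a handful of neighbors instead of the whole table.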
  • 29. 29 Selecting the FD
    The selectivity value is applied to rank the candidates in order to find appropriate FDs.
    1. Selectivity value
    The selectivity value measures the distribution of an attribute's values. If the selectivity value of an attribute:
    • is high, the attribute's values are highly distributed;
    • is low, the attribute's values are more likely to be repeated.
    A highly distributed attribute is potentially a candidate key and can be used to eliminate duplicates. The least distributed attribute can be used to repair distorted attribute values in the cleaning engine.
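The selectivity value described above is simply the ratio of distinct values to total tuples for an attribute. A small sketch with invented rows:

```python
# Selectivity of an attribute: distinct values / total tuples.
# High selectivity suggests a candidate key; low selectivity suggests a
# near-constant column, a likely right-hand side of a trivial FD.
def selectivity(rows, attr):
    values = [r[attr] for r in rows]
    return len(set(values)) / len(values)

rows = [{"id": i, "country": "IQ"} for i in range(4)]   # invented sample rows
s_id = selectivity(rows, "id")              # 4 distinct / 4 rows = 1.0
s_country = selectivity(rows, "country")    # 1 distinct / 4 rows = 0.25
```

Here "id" would land in the high-ranking group (candidate-key material) and "country" in the low-ranking group, while middle-ranking attributes would be ignored, as slide 30 describes.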
  • 30. 30 Selecting the FD
    2. Ranking the candidates
    After calculating the selectivity values to determine the candidate ranks, the ranks are sorted in ascending order. To choose potentially good candidates, a low ranking threshold and a high ranking threshold are defined as pruning points; the selected candidates are those with either high or low ranking values.
    • A high-ranking candidate has high selectivity and is potentially a candidate key.
    • A low-ranking candidate potentially has an invariant value that can be functionally determined by some attribute in a trivial manner; thus it can be treated as a non-candidate key on the right-hand side of an FD.
    • The middle ranking is not precise, so it is ignored.
  • 31. 31 Selecting the FD
    3. Improving the pruning step
    The pruning step generates the candidate set by computing the candidates from level 1. (Slide figure: pruning lattice example.)
  • 32. 32 Improved Pruning Method
    • Pruning begins by getting the set of candidates at level 1 and then checking the candidates.
    • If a pair is not an FD and falls in either the high or the low accepted ranking, the StoreCandidate function stores a new candidate built from candidate_x and candidate_y at the current level.
    • Candidates that are in neither the low nor the high ranking are ignored.
  • 33. 33 Results
    50,000 real customer tuples are used as the data source, separated into 3 datasets:
    o the first dataset has 10% duplicates,
    o the second dataset has 10% errors,
    o the last dataset has 10% duplicates and errors.
    Results showed that this work can identify duplicates and anomalies with high recall and a low false-positive rate.
    Problem: the combined solution is sensitive to data size:
    • As data volume increases, the speed of the discovery algorithm decreases.
    • As the number of attributes increases, discovery creates more FD candidates and generates too many FDs, including noisy ones.
  • 34. 34 Strengths and Limitations of Data Quality Mining Methods
    Association rules
    + Reduces the number of rules to generate for a transaction
    + Avoids a severe pitfall of association rule mining
    - Difficult to generate association rules for all transactions
    Functional dependency
    + Easily identifies suspicious tuples for cleaning
    + Decreases the number of functional dependencies discovered
    - Not suitable for large databases, because it is difficult to sort all the records
  • 35. 35 Main References
    1. Hamad, Mortadha M., and Alaa Abdulkhar Jihad (2011). "An Enhanced Technique to Clean Data in the Data Warehouse." 2011 Developments in E-systems Engineering.
    2. Thakur, G., Singh, M., Pahwa, P., and Tyagi, N. (2011). "DWCLEANSER: A Framework for Approximate Duplicate Detection." Advances in Computing and Information Technology, pp. 355-364.
    3. Natarajan, K., Li, J., and Koronios, A. (2010). "Data Mining Techniques for Data Cleaning." Engineering Asset Lifecycle Management, Springer London, pp. 796-804.
    4. Kaewbuadee, Kollayut, Yae Temtanapat, and Ratchata Peachavanish (2006). "Data Cleaning Using Functional Dependency from Data Mining Process." International Journal on Computer Science and Information System (IADIS), 1(2), pp. 117-131. ISSN: 1646-3692.