2. Outline
• Problem
• Motivating example
• AJAX framework
• Logical layer
• Physical layer
• Matching
• Neighborhood join algorithm
• Multi-pass neighborhood algorithm
• Evaluation
• Related work
• Conclusion
• Discussion
3. Problem
• Data cleaning is a difficult problem.
• Current solutions (ETL and reengineering tools):
• Are not sophisticated enough to design data flow graphs efficiently and effectively.
• Are non-interactive.
• Hinder the stepwise refinement process that is crucial to data cleaning.
5. AJAX framework
• Logical layer:
• Declarative language to express data cleaning using logical operators (extension of SQL).
• Physical layer:
• Specify the algorithm.
• Optimization.
• Exceptions as a mechanism to solicit user interaction.
6. Logical layer
• Five operations:
• Mapping
• View
• Matching (important)
• Clustering
• Merging
• Duplicate elimination is handled by a sequence of match, cluster, and merge.
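The match, cluster, merge sequence can be illustrated with a minimal Python sketch. This is not AJAX syntax: the function names, the similarity predicate, the use of transitive closure (union-find) for clustering, and the "keep the longest string" merge rule are all assumptions chosen for illustration.

```python
from itertools import combinations

def match(records, similar):
    """Return index pairs of records judged similar (naive all-pairs)."""
    return [(i, j) for i, j in combinations(range(len(records)), 2)
            if similar(records[i], records[j])]

def cluster(n, pairs):
    """Group record indices by transitive closure over matched pairs."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def merge(records, groups):
    """Pick one representative per cluster (here: the longest string)."""
    return [max((records[i] for i in g), key=len) for g in groups]

# Hypothetical data and similarity rule: same last name, case-insensitive.
records = ["J. Smith", "John Smith", "Jon Smith", "A. Jones"]
same_last = lambda a, b: a.split()[-1].lower() == b.split()[-1].lower()

pairs = match(records, same_last)
groups = cluster(len(records), pairs)
deduplicated = merge(records, groups)  # ["John Smith", "A. Jones"]
```

Keeping the three steps as separate functions mirrors the slide's point: each step can be swapped or refined independently.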
7. Physical layer
• Implementations are written in a 3GL and registered within the AJAX library.
• Matching algorithms:
• Naïve.
• Neighborhood Join optimization (NJ).
• Multi-pass Neighborhood optimization (MPN).
8. NJ optimization
• Apply distance filters to the naïve algorithm.
• Devise a function over input tuples that is cheaper to compute than the actual similarity.
• E.g., use prefixes of strings.
• The actual similarity is computed only for pairs that pass the filter.
• Damerau-Levenshtein distance is used for similarity.
• Transitive closure.
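The filter-then-verify idea can be sketched in Python as follows. Two assumptions are made here: the cheap filter is the difference in string lengths (swapped in for the prefix example on the slide, because it is a simple lower bound on edit distance and therefore causes no false dismissals), and the distance is the restricted "optimal string alignment" variant of Damerau-Levenshtein. The function names and the nested-loop join are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]

def nj_match(left, right, max_dist):
    """Match strings within max_dist, skipping pairs the filter rules out."""
    pairs = []
    for s in left:
        for t in right:
            # Cheap filter: the distance is at least the length difference,
            # so pairs failing this test can never match.
            if abs(len(s) - len(t)) > max_dist:
                continue
            # Expensive check, reached only by pairs that pass the filter.
            if osa_distance(s, t) <= max_dist:
                pairs.append((s, t))
    return pairs

matches = nj_match(["smith", "jones"], ["smiht", "johnson"], max_dist=1)
# matches == [("smith", "smiht")]
```

In this example the filter discards both comparisons against "johnson" before any distance is computed.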
10. MPN optimization
• NJ does not allow false dismissals.
• MPN relaxes this requirement.
• Algorithm :
• Outer join on relations.
• Select key for each record.
• Sort all keys.
• Compare records that are close to each other, within a fixed-size window.
• Multiple passes allowed.
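The steps above can be sketched in Python as a sorted-neighborhood loop. This is a minimal illustration under stated assumptions: the input relations are taken to be already combined into one list, and the function name, key functions, and similarity rule are hypothetical.

```python
def mpn_match(records, key_funcs, window, similar):
    """One pass per key: sort by the key, compare records inside the window."""
    matches = set()
    for key in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            # A window of size w compares each record with the next w - 1.
            for j in order[pos + 1:pos + window]:
                if similar(records[i], records[j]):
                    matches.add((min(i, j), max(i, j)))
    return matches

# Hypothetical records: (first name, last name).
people = [("john", "smith"), ("jon", "smith"),
          ("john", "smyth"), ("mary", "jones")]
share_field = lambda a, b: any(x == y for x, y in zip(a, b))
by_last = lambda r: r[1]
by_first = lambda r: r[0]

one_pass = mpn_match(people, [by_last], window=2, similar=share_field)
two_pass = mpn_match(people, [by_last, by_first], window=2, similar=share_field)
# one_pass == {(0, 1)}; two_pass == {(0, 1), (0, 2)}
```

The second pass, sorted on a different key, brings records 0 and 2 next to each other and recovers a pair that the first pass missed. Pairs that never fall in the same window under any key are dismissed, which is the relaxation the slide describes.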
11. Evaluation
• MPN is faster but less accurate than NJ.
• The NJ algorithm achieves a recall of 1 much faster than the MPN method for more unstructured domains:
• E.g., event name vs. author name
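For reference, recall here is the fraction of true matching pairs that an algorithm finds, so a recall of 1 means no false dismissals. A small Python sketch, with a hypothetical function name and made-up pair sets:

```python
def recall(found_pairs, true_pairs):
    """Fraction of the true matching pairs that were found."""
    if not true_pairs:
        return 1.0  # assumed convention: nothing to find, nothing missed
    return len(set(found_pairs) & set(true_pairs)) / len(set(true_pairs))

truth = {(0, 1), (0, 2), (1, 2)}
score = recall({(0, 1), (0, 2)}, truth)  # 2 of 3 true pairs found
```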
12. Related work
• AJAX has more operations than related languages:
• SQL doesn’t have merging and clustering operations or exception support.
• WHIRL doesn’t have merging and clustering.
• AJAX and Potter’s Wheel are both interactive.
• Potter’s Wheel’s automatic discrepancy detection algorithm can be integrated into AJAX.
13. Conclusion
• AJAX framework:
• Logical and physical separation.
• Declarative language to specify transformations.
• Exceptions as a way to solicit interactions.