Declarative Data Cleaning :
Language, Model, and
Algorithms
(Galhardas et al., Proc. VLDB, 2001)
Outline
• Problem
• Motivating example
• AJAX framework
• Logical layer
• Physical layer
• Matching
• Neighborhood join algorithm
• Multi-pass neighborhood algorithm
• Evaluation
• Related work
• Conclusion
• Discussion
Problem
• Data cleaning is a difficult problem.
• Current solutions (ETL and reengineering tools):
• Not sophisticated enough to design data flow graphs efficiently and effectively.
• Non-interactive.
• Hinder the stepwise refinement process that is crucial to data cleaning.
Motivating Example
AJAX framework
• Logical layer:
• Declarative language to express data cleaning using logical operators (extension of SQL).
• Physical layer:
• Specify algorithm.
• Optimization.
• Exceptions as a mechanism to solicit user interaction.
Logical layer
• Five operations:
• Mapping
• View
• Matching (important)
• Clustering
• Merging
• Duplicate elimination is handled by a sequence of match, cluster, and merge.
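The match, cluster, merge sequence above can be sketched in Python. This is only an illustration, not AJAX's own language or implementation: the function names, the `difflib` similarity measure, the 0.8 threshold, and the "keep the longest string" merge rule are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def match(records, threshold=0.8):
    """Matching: return index pairs whose similarity reaches the threshold."""
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        ratio = SequenceMatcher(None, records[i].lower(), records[j].lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j))
    return pairs


def cluster(n, pairs):
    """Clustering: group indices by the transitive closure of matched pairs."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def merge(records, groups):
    """Merging: keep one representative per cluster (assumed rule: longest string)."""
    return [max((records[i] for i in g), key=len) for g in groups]


def eliminate_duplicates(records, threshold=0.8):
    """Duplicate elimination as match, then cluster, then merge."""
    return merge(records, cluster(len(records), match(records, threshold)))
```

Calling `eliminate_duplicates(["John Smith", "Jon Smith", "Mary Jones"])` groups the first two records into one cluster and leaves the third on its own.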
Physical layer
• Implementations are written in a 3GL and registered within the AJAX library.
• Matching algorithms :
• Naïve.
• Neighborhood Join optimization (NJ).
• Multi-pass Neighborhood optimization (MPN).
NJ optimization
• Apply distance filters to the naïve algorithm.
• Devise a function over the input tuples that is cheaper to compute than the actual similarity.
• E.g., use prefixes of strings.
• The actual similarity is computed only for pairs that pass the filter.
• Damerau-Levenshtein distance is used for similarity.
• Transitive closure.
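A minimal Python sketch of the filter-then-verify idea above. Two assumptions are made here and are not taken from the slides: the cheap filter is the difference in string lengths (chosen because it is a lower bound on the edit distance, so it never discards a true match), and the expensive similarity is the optimal string alignment distance, a restricted variant of Damerau-Levenshtein. The names `osa_distance` and `filtered_match` are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def filtered_match(left, right, max_dist):
    """Return matching pairs and the number of expensive distance computations."""
    pairs = []
    expensive_calls = 0
    for s in left:
        for t in right:
            # Cheap filter: a length gap above max_dist already rules the pair out.
            if abs(len(s) - len(t)) > max_dist:
                continue
            expensive_calls += 1
            if osa_distance(s, t) <= max_dist:
                pairs.append((s, t))
    return pairs, expensive_calls
```

With `left = ["smith", "jones"]`, `right = ["smiht", "jonathan"]` and `max_dist = 1`, the filter skips both comparisons against "jonathan", so only two of the four pairs reach the edit-distance computation.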
NJ optimization
MPN optimization
• NJ does not allow false dismissals.
• MPN relaxes this requirement.
• Algorithm :
• Outer join on relations.
• Select key for each record.
• Sort all keys.
• Compare only records that lie within a fixed-size window of each other in the sorted order.
• Multiple passes allowed.
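The key, sort, and window steps above can be sketched as follows. This is a simplified illustration over a single list of records, not the paper's implementation; `multipass_neighborhood`, `key_funcs`, and `is_match` are hypothetical names, and the window is assumed to count the record itself (so `window=2` compares each record with its immediate successor only).

```python
def multipass_neighborhood(records, key_funcs, window, is_match):
    """Run one sort-and-scan pass per key function and union the matches."""
    matched = set()
    for key in key_funcs:
        # Sort record indices by the key selected for this pass.
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            # Compare only with the next (window - 1) records in sorted order.
            for j in order[pos + 1 : pos + window]:
                if is_match(records[i], records[j]):
                    matched.add((min(i, j), max(i, j)))
    return matched
```

Records that never end up in the same window under any key are never compared, which is where false dismissals can come from. Adding passes with different keys reduces that risk at the cost of more comparisons.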
Evaluation
• MPN faster but less accurate than NJ.
• The NJ algorithm is able to achieve a recall of 1 much faster than the MPN method for more unstructured domains:
• E.g., event name vs author name
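For reference, recall here is the fraction of true duplicate pairs that an algorithm actually reports. A minimal sketch, assuming pairs are represented as tuples of record identifiers (the function name and representation are hypothetical):

```python
def recall(found_pairs, true_pairs):
    """Fraction of the true matching pairs that were found."""
    true_pairs = set(true_pairs)
    if not true_pairs:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(found_pairs) & true_pairs) / len(true_pairs)
```

A recall of 1 means no true pair was dismissed, which is what a filter without false dismissals guarantees once every candidate has been examined.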
Related work
• AJAX has more operations than related languages:
• SQL doesn’t have merging and clustering operations or exception support.
• WHIRL doesn’t have merging and clustering.
• AJAX and Potter’s Wheel are both interactive.
• Potter’s Wheel’s automatic discrepancy detection algorithm can be integrated into AJAX.
Conclusion
• AJAX framework :
• Logical and physical separation.
• Declarative language to specify transformations.
• Exceptions as a way to solicit interactions.
