SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Downloaden Sie, um offline zu lesen
Time-completeness trade-offs in record linkage
            using Adaptive query Processing
         Paolo Missier, Alvaro A. A. Fernandes
    School of Computer Science, University of Manchester

            Roald Lengu, Giovanna Guerrini
              DISI, Universita' di Genova, Italy

                       Marco Mesiti
              DiCo, Universita' di Milano, Italy


                SEBD 2009, Camogli, Italy
                      June, 2009
Conclusions
• Optimistic approach to online record linkage
     – Based on implicit referential integrity assumption
     – When assumption not true, goes back to pessimistic


• Technical approach based on autonomic computing
     – Adaptive query processing
     – Mix of exact / approximate physical join operators


• Applications: on-the-fly integration scenarios (mashups,
  personal dataspaces, sensor data streams)

• Need to relax the referential integrity assumption -- use
  other sources for result size estimation


P. Missier, SEBD, June 2009, Camogli, Italy
Context: On-the-fly data integration
 • “Situational Applications”
 • Data mashups and personal dataspaces
 • slight mismatches in records lead to incomplete
   integration
      – due to different encodings, conventions
      – due to errors in data
                          R       A=B         S
     C        D       A                           B        A

                                          Madchester


     Y      X     Manchester
                                                               A case of record linkage
                                          Manchester       Z




          Y       X   Manchester        Manchester     Z

                                                                                   3
P. Missier, SEBD, June 2009, Camogli, Italy
Offline vs online linkage
  • Offline record linkage:
       – performed once before queries involving joins
       – 1. reconcile R and S on joining attributes R.A, S.B using
         your favourite record linkage technique
                                         R,S →    R’,S’

       – 2. perform regular equijoin on the transformed tables:

                                              R   S
       ➡ ok for tables that can be analysed ahead of the join
       ➡ ok when multiple queries issued on integrated tables




                                                                             4
P. Missier, SEBD, June 2009, Camogli, Italy
Offline vs online linkage
  • Offline record linkage:
       – performed once before queries involving joins
       – 1. reconcile R and S on joining attributes R.A, S.B using
         your favourite record linkage technique
                                         R,S →    R’,S’

       – 2. perform regular equijoin on the transformed tables:

                                              R   S
       ➡ ok for tables that can be analysed ahead of the join
       ➡ ok when multiple queries issued on integrated tables


             • Online linkage:
                  – performed while answering a query
                  – exact join similarity (or approximate)              join
                                                                               4
P. Missier, SEBD, June 2009, Camogli, Italy
Record linkage and similarity joins
     Historical timeline:




     from:
     N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and
     algorithms. Tutorial in SIGMOD '06.
                                                                                      5
P. Missier, SEBD, June 2009, Camogli, Italy
Record linkage and similarity joins
     Historical timeline:




     from:
     N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and
     algorithms. Tutorial in SIGMOD '06.
                                                                                      5
P. Missier, SEBD, June 2009, Camogli, Italy
Measuring string similarity using q-grams
• q-grams map string s to a set q(s) of substrings of length q:
   Ex.: 3-grams:

     q(“Manchester”) =
      {‘Man’, ‘anc’, ‘nch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}.

     q(“Madchester”) =
      {‘Mad’, ‘adc’, ‘dch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}.
                        |q(s1 ) ∩ q(s2 )|
        sim(s1 , s2 ) =                               (Jaccard coefficient)
                        |q(s1 ) ∪ q(s2 )|


                    sim(r1 , r2 ) < θ1 → not match
        θ2 < sim(r1 , r2 ) → match


P. Missier, SEBD, June 2009, Camogli, Italy
Similarity symmetric hash join
                 R                            S

            insert                                insert
  R hash table
                           probe                  S hash table
                                                                 set intersection computed
        m       x                                                using a variation of the
                           probe
        n        y
                                              y      r           symmetric hash join
                                              x      s
                                                                 new primitive join
                                                                 operator [CGK06]
                         [R.m,S.s]
                         [R.n, S.r]


    • Main sources of complexity:
         – overhead for storing and indexing q-grams
         – cost of computing set intersection

                                      Efficient relational representation:
                               [CGK06] S. Chaudhuri, V. Ganti and R. Kaushik,
                     “A primitive operator for similarity joins in data cleaning” (ICDE’06)‫‏‬   7
P. Missier, SEBD, June 2009, Camogli, Italy
Is full similarity join always necessary?
• Pessimistic: always pays full complexity cost
• Typical mismatch rate in real datasets around 5%

   Research Goal: explore optimistic approach
 • detect mismatches and react as you go
 • requires estimates of incremental join result size
 • statistical + reactive
           • expect to sacrifice join result completeness for
               faster execution

     Approach:
     - combine exact and approximate join operators
     using Adaptive Query Processing techniques

                                                                        8
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework




    [KC03] J. O. Kephart and D. M. Chess. The vision of
    autonomic computing. IEEE Computer, 36(1):41–50, 2003.
                                                                      9
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework


                                                    monitor




                                         respond              assess




                                                                           9
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework

                                                                monitor incremental
                                                                  join result size
                                                    monitor




                                         respond              assess




                                                                                      9
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework

                                                                monitor incremental
                                                                  join result size
                                                    monitor


                                                                               estimate
                                                                           join result size

                                         respond              assess

                                                                            compute
                                                                           divergence
                                                                            from exp




                                                                                              9
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework

                                                                monitor incremental
                                                                  join result size
                                                    monitor


                                                                               estimate
                                                                           join result size

                                         respond              assess

                                                                            compute
               swap                                                        divergence
               join                                                         from exp
             operators



   • when using exact join:
     • if observed/estimated sizes diverge “too much”, then switch to
     approximate join
   • when using approximate join:
     • if observed/estimated converge, then switch to exact join
                                                                                              9
P. Missier, SEBD, June 2009, Camogli, Italy
Technical approach and challenges


  • Assess:
        – Need additional assumption for result size estimation
        – Estimating result size at specific points during join execution


  • Respond:
        – When and how can physical join operators switched?
        – Can we avoid loss of work?
           • operator replacement in pipelined query plans [EFP06]




  [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the
  replacement of pipelined physical join operators in adaptive query processing. In EDBT
  Workshops 2006, LNCS 4254
                                                                                     10
P. Missier, SEBD, June 2009, Camogli, Italy
Assessment
• Expectation of matching records
• Used to derive simple result size estimation model
                      R (parent)              S (child)
                                                                        symmetric hash
                  c       x                                     n       join, again
                                                y         b
                  d       y
                                                x         a



• When there are no mismatches:
  after scanning n < |S| tuples on S:
    P(a=x in |S| has been matched) = P(tuple c=x is in top n of R) = n/|R|

    Thus, join result size On after n tuples is a binomial random
    variable:                           n
                                     On ∼ bin(n,                    )
                                                              |R|
                                                                                    11
P. Missier, SEBD, June 2009, Camogli, Italy
Detecting divergent observed result size
             ¯
Observation On is outlier wrt expected result size On
=> divergence             ¯
                Pn,p(n) (On ≤ O) ≤ θout
where Pn,p(n) (.) is the binomial cdf with parameters n, p(n)




                                                          12
Detecting divergent observed result size
             ¯
Observation On is outlier wrt expected result size On
=> divergence             ¯
                Pn,p(n) (On ≤ O) ≤ θout
where Pn,p(n) (.) is the binomial cdf with parameters n, p(n)




     outlier detection on experimental datasets with various
                        mismatch patterns



                                                               12
Detecting divergent observed result size
             ¯
Observation On is outlier wrt expected result size On
=> divergence             ¯
                Pn,p(n) (On ≤ O) ≤ θout
where Pn,p(n) (.) is the binomial cdf with parameters n, p(n)




     outlier detection on experimental datasets with various
                        mismatch patterns

        Note: when the data does not follow our referential
         constraint hypothesis, the model leads to a pure
                          similarity join                      12
Condition for operator replacement
• Goal of AQP:
   – switch operators without loss of intermediate work
• sufficient condition: switch when the operator reaches
  a quiescent state [EPF06]




[EPF06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the
replacement of pipelined physical join operators in adaptive query processing. In EDBT
                                                                                   13
Workshops 2006, LNCS 4254
Adaptivity: responder state machine
         Each state corresponds to a combination of physical
                           join operators


              left: exact                                         left: approximate
             right: exact                                        right: approximate




          left: exact                                            left: approximate
     right: approximate                                              right: exact



                                                                                      14
P. Missier, SEBD, June 2009, Camogli, Italy
Rationale for state transitions


                                                      lex /
                                                      rex


  evidence that                               lex /           lap /   evidence that left
left and /or right                            rap             rex     and /or right input
 input perturbed                                                          no longer
                                                                          perturbed
                                                      lap /
                                                      rap



     this is formalized in terms of predicates on the
     observable variables
     - outlier detection and more (omitted)

P. Missier, SEBD, June 2009, Camogli, Italy
Experimental evaluation
• Marginal gain of hybrid algorithm:
     – level of completeness
                                       • R: result size for approx join only
                                       • r: result size for exact only
                                       • rabs: result size actually observed
                                                 grel = (rabs – r) / (R – r)‫‏‬




                                                                                16
P. Missier, SEBD, June 2009, Camogli, Italy
Experimental evaluation
• Marginal gain of hybrid algorithm:
     – level of completeness
                                       • R: result size for approx join only
                                       • r: result size for exact only
                                       • rabs: result size actually observed
                                                 grel = (rabs – r) / (R – r)‫‏‬
  • Marginal Cost:
       – baseline: exact join throughout
          • model marginal cost of hybrid algorithm
                                       unit cost of executing one step in one state
                                              – (experimental)
                                       • number of steps in each state
                                       • unit state transition cost (experimental)
                                       • number of state transitions over entire join
                                                                               16
P. Missier, SEBD, June 2009, Camogli, Italy
Test datasets
    Datasets chosen as representative of 4 distinct patterns




       we expect our results to vary:
   • uniform perturbation: evidence grows slowly => slow reaction
   • bursty perturbation: strong evidence => timely reaction
P. Missier, SEBD, June 2009, Camogli, Italy
Results




P. Missier, SEBD, June 2009, Camogli, Italy
Results




P. Missier, SEBD, June 2009, Camogli, Italy
Conclusions
• Optimistic approach to online record linkage
     – Based on implicit referential integrity assumption
     – When assumption not true, goes back to pessimistic


• Technical approach based on autonomic computing
     – Adaptive query processing
     – Mix of exact / approximate physical join operators


• Applications: on-the-fly integration scenarios (mashups,
  personal dataspaces, sensor data streams)

• Need to relax the referential integrity assumption -- use
  other sources for result size estimation


P. Missier, SEBD, June 2009, Camogli, Italy

Weitere ähnliche Inhalte

Mehr von Paolo Missier

Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
Paolo Missier
 

Mehr von Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Kürzlich hochgeladen (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

Paper presentation @ SEBD '09

  • 1. Time-completeness trade-offs in record linkage using Adaptive query Processing Paolo Missier, Alvaro A. A. Fernandes School of Computer Science, University of Manchester Roald Lengu, Giovanna Guerrini DISI, Universita' di Genova, Italy Marco Mesiti DiCo, Universita' di Milano, Italy SEBD 2009, Camogli, Italy June, 2009
  • 2. Conclusions • Optimistic approach to online record linkage – Based on implicit referential integrity assumption – When assumption not true, goes back to pessimistic • Technical approach based on autonomic computing – Adaptive query processing – Mix of exact / approximate physical join operators • Applications: on-the-fly integration scenarios (mashups, personal dataspaces, sensor data streams) • Need to relax the referential integrity assumption -- use other sources for result size estimation P. Missier, SEBD, June 2009, Camogli, Italy
  • 3. Context: On-the-fly data integration • “Situational Applications” • Data mashups and personal dataspaces • slight mismatches in records lead to incomplete integration – due to different encodings, conventions – due to errors in data R A=B S C D A B A Madchester Y X Manchester A case of record linkage Manchester Z Y X Manchester Manchester Z 3 P. Missier, SEBD, June 2009, Camogli, Italy
  • 4. Offline vs online linkage • Offline record linkage: – performed once before queries involving joins – 1. reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique R,S → R’,S’ – 2. perform regular equijoin on the transformed tables: R S ➡ ok for tables that can be analysed ahead of the join ➡ ok when multiple queries issued on integrated tables 4 P. Missier, SEBD, June 2009, Camogli, Italy
  • 5. Offline vs online linkage • Offline record linkage: – performed once before queries involving joins – 1. reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique R,S → R’,S’ – 2. perform regular equijoin on the transformed tables: R S ➡ ok for tables that can be analysed ahead of the join ➡ ok when multiple queries issued on integrated tables • Online linkage: – performed while answering a query – exact join similarity (or approximate) join 4 P. Missier, SEBD, June 2009, Camogli, Italy
  • 6. Record linkage and similarity joins Historical timeline: from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06. 5 P. Missier, SEBD, June 2009, Camogli, Italy
  • 7. Record linkage and similarity joins Historical timeline: from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06. 5 P. Missier, SEBD, June 2009, Camogli, Italy
  • 8. Measuring string similarity using q-grams • q-grams map string s to a set q(s) of substrings of length q: Ex.: 3-grams: q(“Manchester”) = {‘Man’, ‘anc’, ‘nch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}. q(“Madchester”) = {‘Mad’, ‘adc’, ‘dch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}. |q(s1 ) ∩ q(s2 )| sim(s1 , s2 ) = (Jaccard coefficient) |q(s1 ) ∪ q(s2 )| sim(r1 , r2 ) < θ1 → not match θ2 < sim(r1 , r2 ) → match P. Missier, SEBD, June 2009, Camogli, Italy
  • 9. Similarity symmetric hash join R S insert insert R hash table probe S hash table set intersection computed m x using a variation of the probe n y y r symmetric hash join x s new primitive join operator [CGK06] [R.m,S.s] [R.n, S.r] • Main sources of complexity: – overhead for storing and indexing q-grams – cost of computing set intersection Efficient relational representation: [CGK06] S. Chaudhuri, V. Ganti and R. Kaushik, “A primitive operator for similarity joins in data cleaning” (ICDE’06)‫‏‬ 7 P. Missier, SEBD, June 2009, Camogli, Italy
  • 10. Is full similarity join always necessary? • Pessimistic: always pays full complexity cost • Typical mismatch rate in real datasets around 5% Research Goal: explore optimistic approach • detect mismatches and react as you go • requires estimates of incremental join result size • statistical + reactive • expect to sacrifice join result completeness for faster execution Approach: - combine exact and approximate join operators using Adaptive Query Processing techniques 8 P. Missier, SEBD, June 2009, Camogli, Italy
  • 11. Autonomic computing framework [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003. 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 12. Autonomic computing framework monitor respond assess 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 13. Autonomic computing framework monitor incremental join result size monitor respond assess 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 14. Autonomic computing framework monitor incremental join result size monitor estimate join result size respond assess compute divergence from exp 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 15. Autonomic computing framework monitor incremental join result size monitor estimate join result size respond assess compute swap divergence join from exp operators • when using exact join: • if observed/estimated sizes diverge “too much”, then switch to approximate join • when using approximate join: • if observed/estimated converge, then switch to exact join 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 16. Technical approach and challenges • Assess: – Need additional assumption for result size estimation – Estimating result size at specific points during join execution • Respond: – When and how can physical join operators switched? – Can we avoid loss of work? • operator replacement in pipelined query plans [EFP06] [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254 10 P. Missier, SEBD, June 2009, Camogli, Italy
  • 17. Assessment • Expectation of matching records • Used to derive simple result size estimation model R (parent) S (child) symmetric hash c x n join, again y b d y x a • When there are no mismatches: after scanning n < |S| tuples on S: P(a=x in |S| has been matched) = P(tuple c=x is in top n of R) = n/|R| Thus, join result size On after n tuples is a binomial random variable: n On ∼ bin(n, ) |R| 11 P. Missier, SEBD, June 2009, Camogli, Italy
  • 18. Detecting divergent observed result size ¯ Observation On is outlier wrt expected result size On => divergence ¯ Pn,p(n) (On ≤ O) ≤ θout where Pn,p(n) (.) is the binomial cdf with parameters n, p(n) 12
  • 19. Detecting divergent observed result size ¯ Observation On is outlier wrt expected result size On => divergence ¯ Pn,p(n) (On ≤ O) ≤ θout where Pn,p(n) (.) is the binomial cdf with parameters n, p(n) outlier detection on experimental datasets with various mismatch patterns 12
  • 20. Detecting divergent observed result size ¯ Observation On is outlier wrt expected result size On => divergence ¯ Pn,p(n) (On ≤ O) ≤ θout where Pn,p(n) (.) is the binomial cdf with parameters n, p(n) outlier detection on experimental datasets with various mismatch patterns Note: when the data does not follow our referential constraint hypothesis, the model leads to a pure similarity join 12
  • 21. Condition for operator replacement • Goal of AQP: – switch operators without loss of intermediate work • sufficient condition: switch when the operator reaches a quiescent state [EPF06] [EPF06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT 13 Workshops 2006, LNCS 4254
  • 22. Adaptivity: responder state machine Each state corresponds to a combination of physical join operators left: exact left: approximate right: exact right: approximate left: exact left: approximate right: approximate right: exact 14 P. Missier, SEBD, June 2009, Camogli, Italy
  • 23. Rationale for state transitions lex / rex evidence that lex / lap / evidence that left left and /or right rap rex and /or right input input perturbed no longer perturbed lap / rap this is formalized in terms of predicates on the observable variables - outlier detection and more (omitted) P. Missier, SEBD, June 2009, Camogli, Italy
  • 24. Experimental evaluation • Marginal gain of hybrid algorithm: – level of completeness • R: result size for approx join only • r: result size for exact only • rabs: result size actually observed grel = (rabs – r) / (R – r)‫‏‬ 16 P. Missier, SEBD, June 2009, Camogli, Italy
  • 25. Experimental evaluation • Marginal gain of hybrid algorithm: – level of completeness • R: result size for approx join only • r: result size for exact only • rabs: result size actually observed grel = (rabs – r) / (R – r)‫‏‬ • Marginal Cost: – baseline: exact join throughout • model marginal cost of hybrid algorithm unit cost of executing one step in one state – (experimental) • number of steps in each state • unit state transition cost (experimental) • number of state transitions over entire join 16 P. Missier, SEBD, June 2009, Camogli, Italy
  • 26. Test datasets Datasets chosen as representative of 4 distinct patterns we expect our results to vary: • uniform perturbation: evidence grows slowly => slow reaction • bursty perturbation: strong evidence => timely reaction P. Missier, SEBD, June 2009, Camogli, Italy
  • 27. Results P. Missier, SEBD, June 2009, Camogli, Italy
  • 28. Results P. Missier, SEBD, June 2009, Camogli, Italy
  • 29. Conclusions • Optimistic approach to online record linkage – Based on implicit referential integrity assumption – When assumption not true, goes back to pessimistic • Technical approach based on autonomic computing – Adaptive query processing – Mix of exact / approximate physical join operators • Applications: on-the-fly integration scenarios (mashups, personal dataspaces, sensor data streams) • Need to relax the referential integrity assumption -- use other sources for result size estimation P. Missier, SEBD, June 2009, Camogli, Italy