Approximate entity reconciliation
 for on-the-fly integration in data mashups

     Paolo Missier, Alvaro A. A. Fernandes
School of Computer Science, University of Manchester

        Roald Lengu, Giovanna Guerrini
          DISI, Universita' di Genova, Italy

                   Marco Mesiti
          DiCo, Universita' di Milano, Italy
Outline
• New data integration scenarios:
  – occasional integration with little prior knowledge about the
    sources
• Context: Data mashups and personal dataspaces

• How to ensure that we are not missing any data in
  the process?
  – how costly (i.e. response time) is it to guarantee
    completeness?
  – can we trade completeness for response time?


• Technically speaking: convergence of
  – record linkage (an old data quality favourite)
  – approximate joins
  – adaptive query processing
Early example
• sources 1..n: collection of car insurance DBs
  • data changes frequently
  • schemas can be analysed / integrated using traditional
    techniques
• source n+1: reference street atlas




• target app: mapping accident hotspots
  • alert service to drivers, for example
  • useful information for decision makers

(image from housingmaps.com)
Mashups
  The IBM view, 2006
  VLDB 2006 Keynote by Anant Jhingran (CTO, Information
  Management, IBM Silicon Valley Laboratory, San Jose, CA):
 Enterprise information mashups: integrating information,
 simply

  Situational Applications
• Applications that come together for solving some
  immediate business problems
• constructed “on the fly” for some transient need
  and possibly short-lasting

• Data never seen before, consumed on the spot
  – would take too long for the IT department to provide them
  – RSS feeds / data streams
IBM Mashup Center
• IBM Mashup Center
 – mashup workflow
 – leverages Lotus, DB2 plus LDAP, Web Services, ...




Yahoo Pipes

Is there actually a “join” in the set of operators?
Also Google Mashup Editor, and more...
Dataspaces




Integration in dataspaces




Assumptions



– no prior knowledge of data sets (streams) to be joined
– assumptions on implicit parent-child attribute relationships
– no guarantee of matching values



• sources 1..n: collection of car insurance DBs
• source n+1: reference street atlas
• target app: mapping accident hotspots
The broad context: record linkage
• Are two (slightly) different records two different surface
  representations of the same real-world entity?
       Record A                    Record B                   Issue
       Name: John Smith            Name: John Smith           Record values incomplete
       SSN:                        SSN: 123-45-6789
       Address: 477 Cedar Street   Address:

       Name: Brendan Hughes        Name: Brenda Hughes        Twins or typo?
       Address: 564 Hickory Pl.    Address: 564 Hickory Pl.

       Name: Jean Smith            Name:                      Conflict between forenames
       Phone #: (337) 555-6676     Phone #: (337) 555 5676    and phone number

       Name: Alice Jones           Name: Lois Avon            Same SSN, different
       SSN: 123-45-6789            SSN: 123-45-6789           names?




 • A difficult / uncertain decision process
   • which attributes should I consider for matching
   • what are the different weights
   • context: relative frequency of values?
   • external knowledge, user input


Results on record linkage
A mature field - ample literature
– 1969: I.P. Fellegi and A.B. Sunter, “A Theory for Record Linkage,” J. Am. Statistical Assoc., vol. 64,
  no. 328, pp. 1183-1210, Dec. 1969



– 2007: A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, “Duplicate Record Detection: A Survey”, IEEE
  Transactions on Knowledge and Data Engineering, vol. 19, no. 1, Jan. 2007

– 2006: N. Koudas (University of Toronto), S. Sarawagi (IIT Bombay), D. Srivastava (AT&T Labs-Research),
  “Record Linkage: Similarity Measures and Algorithms”, SIGMOD 2006 Data Quality tutorial
  (excerpt from the tutorial slide “Application: Merging Lists”, dated 7/3/06):
  • Application: merge address lists (customer lists, company lists) to avoid redundancy
  • Current status: “standardize”, different values treated as distinct for analysis
    • Lot of heterogeneity
    • Need approximate joins
  • Relevant technologies
    • Approximate joins
    • Clustering/partitioning
Offline vs online linkage
• Offline linkage:
  – performed once before queries involving joins
  – reconcile R and S on joining attributes R.A, S.B using your
    favourite record linkage technique
                     R → R′, S → S′

  – perform regular equijoin on the transformed tables:

                        R′ ⋈ S′
  ➡ok for tables that can be analysed ahead of the join
  ➡suitable when multiple queries issued on integrated tables




• Online linkage:
  – performed just-in-time before a query
  – the exact join is replaced by an approximate join
Integration with approximate joins
• Assume relational data: tables R, S
• Assume schema integration is understood
  – we focus on data integration only


• Ultimately, data integration involves joining tables
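                R ⋈(A=B) S

   R (columns C, D, A):   (Y, X, Microsoft)
   S (join column B):     (Mcrosoft), (Microsoft, Z)
   exact join result:     (Y, X, Microsoft, Microsoft, Z)

• ordinary “exact” match misses out on the similar values
  (here “Mcrosoft” is not joined with “Microsoft”)
• compromises integration completeness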
Approximate joins
Historical timeline:




from:
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and
algorithms. Tutorial in SIGMOD '06.
Edit distance / similarity functions
• Core sub-problem in approximate join:
  – define / choose distance function between values in pairs
    of joining attributes

1. Similarity function sim(r1 , r2 ) between record pairs r1 , r2

2. Decision rules of the form

        sim(r1 , r2 ) < θ1 → not match
  θ1 ≤ sim(r1 , r2 ) ≤ θ2 → unknown
        θ2 < sim(r1 , r2 ) → match
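
The three decision rules above can be written as a small function. This is a minimal sketch, assuming similarity scores in [0, 1]; the function name `classify` and the threshold values in the usage lines are illustrative, not taken from the talk.

```python
def classify(sim, theta1, theta2):
    """Three-way decision rule on a similarity score.

    sim < theta1            -> "not match"
    theta1 <= sim <= theta2 -> "unknown"
    theta2 < sim            -> "match"
    """
    if not 0.0 <= theta1 <= theta2 <= 1.0:
        raise ValueError("expected 0 <= theta1 <= theta2 <= 1")
    if sim < theta1:
        return "not match"
    if sim > theta2:
        return "match"
    return "unknown"


# Illustrative thresholds (assumed values).
for score in (0.2, 0.6, 0.9):
    print(score, classify(score, theta1=0.4, theta2=0.8))
```

Pairs that fall in the "unknown" band are the ones that would typically need extra evidence, such as external knowledge or user input.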




 A common choice of similarity function in the context of
 approximate joins is one based on string q-grams

Measuring string similarity using q-grams
• q-grams map a string s to the set q(s) of its substrings of length q:
  Ex.: 3-grams:

  q(“Microsoft Corporation”) =
    {‘Mic’, ‘icr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’,
     ‘rpo’, ‘por’, ‘ora’, ‘rat’, ‘ati’, ‘tio’, ‘ion’}

  q(“Mcrosoft Corporation”) =
    {‘Mcr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’,
     ‘rpo’, ‘por’, ‘ora’, ‘rat’, ‘ati’, ‘tio’, ‘ion’}

  $$\mathrm{sim}(s_1, s_2) = \frac{|q(s_1) \cap q(s_2)|}{|q(s_1) \cup q(s_2)|}$$   (Jaccard coefficient)

  Here the two sets share 17 of 20 distinct 3-grams, so sim = 17/20 = 0.85.
  This is a commonly used measure of string similarity
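
The q-gram Jaccard measure can be reproduced in a few lines. This is a minimal sketch that uses unpadded q-grams and set semantics; the function names are illustrative, and the value returned for two strings shorter than q is an assumed convention.

```python
def qgrams(s, q=3):
    """Set of all substrings of s with length q (no padding characters)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(s1, s2, q=3):
    """Jaccard coefficient of the two q-gram sets."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        return 1.0  # assumed convention when neither string has a q-gram
    return len(a & b) / len(a | b)


print(jaccard("Microsoft Corporation", "Mcrosoft Corporation"))  # 0.85
```

A single dropped character only disturbs the q-grams that overlap it, which is why the score stays high for this kind of typo.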
Online linkage using q-grams
  – approximate join is a θ-join:
                              R ⋈(θA,B) S
  – where θA,B incorporates a similarity measure, e.g. Jaccard



• Naïve method: for each record pair, compute similarity
  score
  – I/O and CPU intensive, not scalable


• Goal: reduce O(n²) cost to O(n·w), where w << n
  – Reduce number of pairs on which similarity is computed
  – Take advantage of efficient relational join methods
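
One common way to reduce the number of compared pairs is to index one input by its q-grams and compute similarity only for pairs that share at least one q-gram. The sketch below illustrates that idea under stated assumptions: it is not the operator used in the talk, all names are hypothetical, and the 0.6 threshold is an arbitrary example value.

```python
from collections import defaultdict


def qgrams(s, q=3):
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def candidate_pairs(r_values, s_values, q=3):
    """Index pairs (i, j) whose values share at least one q-gram."""
    index = defaultdict(set)              # q-gram -> positions in s_values
    for j, s in enumerate(s_values):
        for g in qgrams(s, q):
            index[g].add(j)
    pairs = set()
    for i, r in enumerate(r_values):
        for g in qgrams(r, q):
            for j in index.get(g, ()):    # probe the index, never all of S
                pairs.add((i, j))
    return pairs


def approximate_join(r_values, s_values, threshold=0.6, q=3):
    """Jaccard similarity is computed on candidate pairs only."""
    out = []
    for i, j in sorted(candidate_pairs(r_values, s_values, q)):
        a, b = qgrams(r_values[i], q), qgrams(s_values[j], q)
        if len(a & b) / len(a | b) >= threshold:
            out.append((r_values[i], s_values[j]))
    return out


print(approximate_join(["Microsoft", "Oracle"], ["Mcrosoft", "Microsoft", "IBM"]))
```

In this example only two of the six possible pairs are ever scored; how much is saved in practice depends on how selective the q-grams are.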


Efficient relational approximate joins
 Idea:
 reduce approximate join to aggregated set intersection:

 dis(s1, s2) ≤ d only if |q(s1) ∩ q(s2)| ≥ max(|s1|, |s2|) − (d − 1) × q − 1
 (a necessary condition: pairs that fail it can be pruned without computing dis)

In practice:
• known similarity measures can be used to compare pairs
of records
• cheap filters (length, count, position) to prune non-matches
• Implementation using standard SQL
  • cost-based join methods
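
The length and count filters can be sketched as follows. The sketch assumes that dis is the edit distance and that q-grams are taken as a bag over the string padded with q − 1 copies of `#` at the start and `$` at the end, which is the setting in which the count bound above is usually stated; both padding characters are assumed not to occur in the data, and all function names are illustrative.

```python
from collections import Counter


def padded_qgrams(s, q=3):
    """Bag of q-grams of s padded with q-1 '#' on the left and q-1 '$' on the right."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_length_filter(s1, s2, d):
    return abs(len(s1) - len(s2)) <= d


def passes_count_filter(s1, s2, d, q=3):
    shared = sum((padded_qgrams(s1, q) & padded_qgrams(s2, q)).values())
    return shared >= max(len(s1), len(s2)) - (d - 1) * q - 1


def edit_distance(s1, s2):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]


def within_distance(s1, s2, d, q=3):
    """Cheap filters first; the expensive distance only for surviving pairs."""
    if not passes_length_filter(s1, s2, d) or not passes_count_filter(s1, s2, d, q):
        return False
    return edit_distance(s1, s2) <= d


print(within_distance("Microsoft", "Mcrosoft", d=1))   # True
print(within_distance("Microsoft", "Oracle", d=1))     # False
```

The filters never reject a true match; they only avoid the quadratic-time distance computation for pairs that cannot qualify.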

   Efficient relational representation:
   [CGK06] S. Chaudhuri, V. Ganti and R. Kaushik,
   “A primitive operator for similarity joins in data cleaning” (ICDE’06)
Is full approximate join always necessary?
• Remaining source of complexity:
  – overhead for storing and indexing q-grams
  – cost of computing set intersection


• Typical mismatch rate in real datasets around 5%
• Complexity of full-fledged approximate join not fully
  justified


  Research hypothesis: time-completeness trade-offs


    Offer users the option to trade completeness of integration
    for the time required to complete the join
Adaptive query processing
    Idea:
    implement a hybrid join algorithm that combines
    exact and approximate join


  Intuition:
  leverage known results on Adaptive Query Processing
   – developed in the context of query re-optimization
   – switch physical join operators in mid-flight

[DIR07] A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing.
Foundations and Trends in Databases, 1(1):1–140, 2007

See also VLDB 2007 Tutorial at
http://www.vldb2007.org/program/slides/s1426-deshpande.pdf


Autonomic computing framework




(figure: the autonomic control loop monitor → assess → respond)
• monitor: incremental result size
• assess: estimate result size, compute divergence
• respond: switch join operators

[KC03] J. O. Kephart and D. M. Chess. The vision of
autonomic computing. IEEE Computer, 36(1):41–50, 2003.

start with an exact join (optimistically)
at step t during the execution:
• estimate the expected size Ot of the join result at that point
• monitor the actual (observed) size Ōt of the result
  • when using exact join: if Ot and Ōt diverge “too much”, then switch to
    approximate join
  • when using approximate join: if Ot and Ōt are very close, then switch to
    exact join
Technical approach and challenges
        Need to add several new capabilities to a standard query
                       processing infrastructure

• Assess:
  – estimating result size at specific points during join execution
• Respond:
  – switching between join operators at specific points during
    execution
     • Adaptive Query Processing (AQP): operator replacement
       in pipelined query plans [EFP06]

  – adding an approximate join operator to the query processor
    [CGK06]

[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the
replacement of pipelined physical join operators in adaptive query processing. In EDBT
Workshops 2006, LNCS 4254
[CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in
data cleaning. In ICDE 2006, p. 5.
Symmetric hash join
Well-known join operator
– basis for approximate join [CGK06]
– can be applied to streams of data
  • it reads tuples from whichever input is available and incrementally
    produces output based on the tuples received so far.
– a pipelined operator ← this is a key requirement for use in AQP


when a tuple appears at either input, it is incrementally added to the
corresponding hash table and probed against the opposite hash table.

(figure: inputs R and S; each input builds its own hash table and probes the other one)
  R hash table:  tuple m (join value x), tuple n (join value y)
  S hash table:  tuple r (join value y), tuple s (join value x)
  output:        [R.m, S.s], [R.n, S.r]
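
The build-then-probe behaviour of a symmetric hash join can be sketched as a generator over an interleaved input stream. This is a minimal illustration of the textbook exact operator, not the adaptive operator of the talk; the stream encoding, the function name, and the key extractors are assumptions made for the example.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s):
    """Pipelined equijoin over items ("R", tuple) or ("S", tuple).

    Each arriving tuple is first added to its own hash table (build) and then
    looked up in the opposite table (probe); results are emitted immediately.
    """
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    keys = {"R": key_r, "S": key_s}
    for side, tup in stream:
        other = "S" if side == "R" else "R"
        k = keys[side](tup)
        tables[side][k].append(tup)                 # build
        for match in tables[other].get(k, ()):      # probe
            yield (tup, match) if side == "R" else (match, tup)


# The example from the figure: R tuples (name, join value), S tuples (join value, name).
stream = [("R", ("m", "x")), ("S", ("y", "r")), ("R", ("n", "y")), ("S", ("x", "s"))]
for r, s in symmetric_hash_join(stream, key_r=lambda t: t[1], key_s=lambda t: t[0]):
    print(r, s)
```

The order in which results appear depends on the arrival order of the tuples, which is assumed here; the set of results does not.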
Estimating result size
• Exploit implicit parent-child key assumption:
  – at the end of join, we expect a result of size |S|
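  (figure: R (parent) holds tuples c and d with key values x and y; S (child) holds
  tuples b and a referencing y and x; n marks the number of S tuples scanned so far)

• When there are no mismatches:
  after scanning n < |S| tuples on S:
  P(a=x in S has been matched) = P(tuple c=x is in top n of R) = n/|R|

  Thus, join result size On is a binomial random variable:

  $$O_n \sim \mathrm{bin}\left(n, \frac{n}{|R|}\right)$$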
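
As a worked illustration of this model, the mean and variance follow from the standard binomial formulas. The sizes |R| = 1000 and n = 200 below are assumed example values, not figures from the talk.

$$E[O_n] = n \cdot \frac{n}{|R|} = \frac{n^2}{|R|}, \qquad \mathrm{Var}[O_n] = n \cdot \frac{n}{|R|}\left(1 - \frac{n}{|R|}\right)$$

$$|R| = 1000,\ n = 200:\quad E[O_n] = \frac{200^2}{1000} = 40, \qquad \mathrm{Var}[O_n] = 200 \cdot 0.2 \cdot 0.8 = 32, \qquad \sqrt{32} \approx 5.66$$

So after 200 scanned tuples, an observed result size far below 40 (several standard deviations) would be surprising under the no-mismatch assumption.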
Detecting divergent observed result size
Observation $\bar{O}_n$ is an outlier with respect to the expected result size
$O_n$ after n tuples have been scanned, if:

$$P_{n,p(n)}(O_n \le \bar{O}_n) \le \theta_{out}$$

where $P_{n,p(n)}(\cdot)$ is the cumulative distribution function of
a binomial with parameters n, p(n)
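
The outlier test can be sketched directly from the binomial cumulative distribution function. This is a minimal sketch under stated assumptions: p(n) = n/|R| as in the estimation model, capped at 1; the default threshold 0.05 is an arbitrary example value; the function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ bin(n, p), by direct summation of the mass function."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def is_outlier(observed, n, r_size, theta_out=0.05):
    """True if an observed result size this small is improbable under the model."""
    p = min(1.0, n / r_size)      # p(n) = n/|R|, capped at 1 (assumption)
    return binom_cdf(observed, n, p) <= theta_out


# With |R| = 1000 and n = 200 the expected size is 40.
print(is_outlier(40, 200, 1000))   # False
print(is_outlier(20, 200, 1000))   # True
```

Direct summation is adequate for moderate n; for very long scans a library implementation of the binomial CDF would be the more robust choice.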
Instantiating the MAR framework
(figure: the monitor → assess → respond loop; ✔ marks the steps defined so far)
• monitor: incremental result size (label: On) ✔
• assess: estimate result size ✔; compute divergence predicates σ(t), µ(t), π(t) ✔
• respond: switch join operators

Divergence predicates:
• $\sigma(n) \equiv P_{n,p(n)}(O_n \le \bar{O}_n) \le \theta_{out}$: discrepancy detected
• $\mu_i(t) \equiv \frac{A_{t,W}}{W} \le \theta_{curpert}$: current perturbations on left/right?
• $\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \le \theta_{pastpert}$: past perturbations on left/right?
Responder’s state machine
• Operator switch defined in terms of state transitions
• Owing to symmetry, we can use a different operator
  on each of the two tables

Four possible states:
• left: exact, right: exact
• left: approximate, right: approximate
• left: exact, right: approximate
• left: approximate, right: exact
Rationale for state transitions


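(figure: state diagram over the four operator combinations)
• states: lex/rex, lex/rap, lap/rex, lap/rap
  (lex, lap = left exact, left approximate; rex, rap = right exact, right approximate)
• transitions towards approximate operators: evidence that left and/or right input perturbed
• transitions towards exact operators: evidence that left and/or right input no longer perturbed

   predicates σ(t), µ(t), π(t) provide the evidence needed to
   drive the transitions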
Assessment → state transitions
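$$\sigma(n) \equiv P_{n,p(n)}(O_n \le \bar{O}_n) \le \theta_{out}$$
$$\mu_i(t) \equiv \frac{A_{t,W}}{W} \le \theta_{curpert}$$
$$\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \le \theta_{pastpert}$$

$$\varphi_0(t) = \neg\sigma(t) \wedge \mu_{left}(t) \wedge \mu_{right}(t)$$
$$\varphi_1(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \neg\mu_{right}(t)$$
$$\varphi_2(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \mu_{right}(t) \wedge \pi_{left}(t)$$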
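
The predicates and the three conditions can be evaluated literally as written. The sketch below only computes their truth values: the slides do not state which state transition each condition triggers, so no transition table is included. The quantity A is treated as a given number, and all names (`Assessment`, `mu`, `pi`, `phi`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    sigma: bool      # sigma(t): observed result size is an outlier
    mu_left: bool    # mu_left(t)
    mu_right: bool   # mu_right(t)
    pi_left: bool    # pi_left(t)


def mu(a_tw, window, theta_curpert):
    """mu_i(t): A_{t,W} / W <= theta_curpert."""
    return a_tw / window <= theta_curpert


def pi(past_mu, theta_pastpert):
    """pi_i(t): number of earlier steps t' < t at which mu_i(t') held, compared to the threshold."""
    return sum(1 for m in past_mu if m) <= theta_pastpert


def phi(a):
    """Truth values of (phi_0, phi_1, phi_2) for one assessment."""
    phi0 = (not a.sigma) and a.mu_left and a.mu_right
    phi1 = a.sigma and (not a.mu_left) and (not a.mu_right)
    phi2 = a.sigma and (not a.mu_left) and a.mu_right and a.pi_left
    return phi0, phi1, phi2


print(phi(Assessment(sigma=True, mu_left=False, mu_right=True, pi_left=True)))
```

Keeping the predicates separate from the transition logic makes it easy to plug the resulting booleans into whichever state machine the responder uses.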
Completing the loop
(figure: the complete monitor → assess → respond loop, every step now marked ✔)
• monitor: incremental result size (labels: On, δadapt) ✔
• assess: estimate result size ✔, compute divergence ✔
• respond: switch join operators ✔
• the predicates σ, µi, πi and the conditions ϕ0, ϕ1, ϕ2 are those of the previous slides
Note on operator replacement
• Details on how to switch operators on the fly are
  omitted
   – main point: pipelined operators expose specific quiescent
     states where replacement can take place with no loss of
     work [EPF06]




[EPF06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the
replacement of pipelined physical join operators in adaptive query processing. In
EDBT Workshops 2006, LNCS 4254
Experimental evaluation
 Trade-off analysis

• Benefits:
  – achieved level of result completeness
  – baseline: approximate join throughout
     • model marginal gain of hybrid algorithm

• Cost:
  – baseline: exact join throughout
     • model marginal cost of hybrid algorithm




Test datasets
Datasets chosen as representative of 4 distinct patterns




  we expect our results to vary:
• uniform perturbation: evidence grows slowly => slow reaction
• bursty perturbation: strong evidence => timely reaction
Parameter tuning and gain/cost models
• Each of the MAR parameters was tuned empirically
• Experiments were executed using the best configuration
  found
• Nice result: the parameter setting is largely independent
  of the specific variant pattern


Relative gain $g_{rel}$:
• $R$: result size for approx join only
• $r$: result size for exact only
• $r_{abs}$: result size actually observed
                      $g_{rel} = (r_{abs} - r) / (R - r)$
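The relative gain can be computed directly from the three result sizes. The sketch below is only illustrative: the function and parameter names are assumptions, and the guard against R = r (no gain is possible when the approximate join finds nothing extra) is an added convention, not something stated in the talk.

```python
def relative_gain(r_abs: float, r_exact: float, r_approx: float) -> float:
    """g_rel = (r_abs - r) / (R - r).

    r_abs    : result size actually observed for the hybrid algorithm
    r_exact  : r, result size for the exact join only
    r_approx : R, result size for the approximate join only
    """
    if r_approx == r_exact:
        # Assumption: the ratio is undefined when both baselines coincide.
        raise ValueError("R and r coincide: relative gain is undefined")
    return (r_abs - r_exact) / (r_approx - r_exact)
```

For example, with r = 100, R = 200 and an observed result size of 150, the hybrid algorithm recovers half of the tuples that only the approximate join would find, so `relative_gain(150, 100, 200)` is 0.5.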


(details on cost model omitted)
Cost model
• unit cost of executing one step in state i: $w_i$
  – weights determined experimentally
• number of steps in each state: $t_i$
• unit state transition cost (experimental): $v_i$
• number of state transitions: $tr_i$
total absolute cost:
              $c_{abs} = \sum_i sc_i + \sum_i tc_i$
relative cost:
• $c$: best cost (exact only)
• $C$: worst cost (approx only)
                  $c_{rel} = c_{abs} / (C - c)$
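A small Python sketch of the cost model follows. The slide does not define sc_i and tc_i; the sketch assumes sc_i = w_i · t_i (cost of the steps spent in state i) and tc_i = v_i · tr_i (cost of the transitions associated with state i). That reading, and all function names, are assumptions made for illustration.

```python
from typing import Sequence


def absolute_cost(w: Sequence[float], t: Sequence[int],
                  v: Sequence[float], tr: Sequence[int]) -> float:
    """c_abs = sum_i sc_i + sum_i tc_i, assuming sc_i = w_i * t_i and tc_i = v_i * tr_i."""
    if not (len(w) == len(t) == len(v) == len(tr)):
        raise ValueError("all per-state sequences must have the same length")
    state_cost = sum(wi * ti for wi, ti in zip(w, t))            # sum_i sc_i
    transition_cost = sum(vi * tri for vi, tri in zip(v, tr))    # sum_i tc_i
    return state_cost + transition_cost


def relative_cost(c_abs: float, c_best: float, c_worst: float) -> float:
    """c_rel = c_abs / (C - c), with c the exact-only cost and C the approx-only cost."""
    if c_worst == c_best:
        raise ValueError("C and c coincide: relative cost is undefined")
    return c_abs / (c_worst - c_best)
```

With two states, unit step costs w = [1.0, 3.0], step counts t = [10, 5], unit transition costs v = [2.0, 2.0] and one transition each, the absolute cost is 10 + 15 + 2 + 2 = 29.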
Results
Discussion
• Results similar across different variant patterns
  – good!

• Transition cost is not overwhelming:
  – we never pay more for hybrid than for approx
  – this gives us a good space for trade-offs
  – we could let users tune the algorithm without fear of
    “breaking” it
Conclusions
• An exact / approximate hybrid approach to joins in the presence of
  violations of implicit referential integrity across tables
  – relational setting


• Approach based on autonomic computing principles
  – Adaptive query processing techniques


• Application: on-the-fly integration scenarios (mashups,
  personal dataspaces)

• Results: cost / completeness trade-off analysis
  – encouraging initial experimental results


     The study requires additional testing on real datasets
References used in the presentation
• A. Halevy and D. Maier, Dataspaces: the Tutorial, VLDB 2008
  tutorial, Auckland, NZ, Aug 2008

• N. Koudas, S. Sarawagi, D. Srivastava, Record Linkage: Similarity
  Measures and Algorithms, VLDB 2006 tutorial, Seoul, Korea, 2006

• [FS69] I.P. Fellegi and A.B. Sunter, A Theory for Record Linkage, J.
  Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969

• [EIV07] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate
  Record Detection: A Survey, IEEE Transactions on Knowledge and
  Data Engineering, VOL. 19, NO. 1, Jan 2007

• [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic
  computing. IEEE Computer, 36(1):41–50, 2003.

• [EPF06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A
  foundation for the replacement of pipelined physical join operators
  in adaptive query processing. In EDBT Workshops 2006, LNCS
  4254

• [CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive
  operator for similarity joins in data cleaning. In ICDE 2006, p. 5

Invited talk @ Cardiff University, 2008: Approximate entity reconciliation for on-the-fly integration in data mashups

  • 1. Approximate entity reconciliation for on-the-fly integration in data mashups Paolo Missier, Alvaro A. A. Fernandes School of Computer Science, University of Manchester Roald Lengu, Giovanna Guerrini DISI, Universita' di Genova, Italy Marco Mesiti DiCo, Universita' di Milano, Italy
  • 2. Outline • New data integration scenarios: – occasional integration with little prior knowledge about the sources • Context: Data mashups and personal dataspaces • How to ensure that we are not missing any data in the process? – how costly (i.e. response time) is it to guarantee completeness? – can we trade completeness for response time? • Technically speaking: convergence of – record linkage (an old data quality favourite) – approximate joins – adaptive query processing 2
  • 3. Early example • sources 1..n: collection of car insurance DBs • data changes frequently • schemas can be analysed / integrated using traditional techniques • source n+1: reference street atlas 3
  • 4. Early example • sources 1..n: collection of car insurance DBs • data changes frequently • schemas can be analysed / integrated using traditional techniques • source n+1: reference street atlas • target app: mapping accidents hotspots • alert service to drivers, for example • useful information for decision makers 3 (image from housingmaps.com)
  • 5. Mashups The IBM view, 2006 VLDB 2006 Keynote by Anant Jhingran (CTO, Information Management, IBM Silicon Valley Laboratory, San Jose, CA): Enterprise information mashups: integrating information, simply Situational Applications • Applications that come together for solving some immediate business problems • constructed “on the fly” for some transient need and possibly short-lasting • Data never seen before, consumed on the spot – would take too long for the IT department to provide them – RSS feeds / data streams 4
  • 6. IBM Mashup Center • IBM Mashup Center – mashup workflow – leverages Lotus, DB2 plus LDAP, Web Services, ... 5
  • 7. Yahoo pipes Is there actually a “join” in the set of operators? also google mashup editor, and more... 6
  • 14. Assumptions – no prior knowledge of data sets (streams) to be joined – assumptions on implicit parent-child attribute relationships – no guarantee of matching values • sources 1..n: collection of car insurance DBs • source n+1: reference street atlas • target app: mapping accidents hotspots 9
  • 15. The broad context: record linkage • Are two (slightly) different records two different surface representations of the same real-world entity? Name: John Smith Name: John Smith Record values incomplete SSN: SSN: 123-45-6789 Address: 477 Cedar Street Address: Brendan Hughes Brenda Hughes Twins or typo? Address: 564 Hickory Pl. Address: 564 Hickory Pl. Name: Jean Smith Name: Conflict between forenames Phone #: (337) 555-6676 Phone #: (337) 555 5676 and phone number Name: Alice Jones Names: Lois Avon Same SSN, different SSN: 123-45-6789 SSN: 123-45-6789 names:?? 10
  • 16. The broad context: record linkage • Are two (slightly) different records two different surface representations of the same real-world entity? Name: John Smith Name: John Smith Record values incomplete SSN: SSN: 123-45-6789 Address: 477 Cedar Street Address: Brendan Hughes Brenda Hughes Twins or typo? Address: 564 Hickory Pl. Address: 564 Hickory Pl. Name: Jean Smith Name: Conflict between forenames Phone #: (337) 555-6676 Phone #: (337) 555 5676 and phone number Name: Alice Jones Names: Lois Avon Same SSN, different SSN: 123-45-6789 SSN: 123-45-6789 names:?? • A difficult / uncertain decision process • which attributes should I consider for matching • what are the different weights • context: relative frequency of values? • external knowledge, user input 10
  • 17. Results on record linkage A mature field - ample literature – 1969: I.P. Fellegi and A.B. Sunter, “A Theory for Record Linkage,” J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969 – 2007: A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, “Duplicate Record Detection: A Survey”, IEEE Transactions on Knowledge and Data Engineering, VOL. 19, NO. 1, Jan 2007 11
  • 18. Results on record linkage A mature field - ample literature – 1969: I.P. Fellegi and A.B. Sunter, “A Theory for Record Linkage,” J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969 – 2007: A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, “Duplicate Record Detection: A Survey”, IEEE Transactions on Knowledge and Data Engineering, VOL. 19, NO. 1, Jan 2007 Record Linkage: Similarity Measures and Algorithms Nick Koudas (University of Toronto) Sunita Sarawagi (IIT Bombay) Divesh Srivastava (AT&T Labs-Research) Sigmod 2006 Data Quality tutorial 11
  • 19. Results on record linkage A mature field - ample literature – 1969: I.P. Fellegi and A.B. Sunter, “A Theory for Record Linkage,” J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969 – 2007: A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, “Duplicate Record Detection: A Survey”, IEEE Transactions on Knowledge and Data Engineering, VOL. 19, NO. 1, Jan 2007 Record Linkage: Similarity Measures and Algorithms, Nick Koudas (University of Toronto), Sunita Sarawagi (IIT Bombay), Divesh Srivastava (AT&T Labs-Research), Sigmod 2006 Data Quality tutorial. Application: Merging Lists • Application: merge address lists (customer lists, company lists) to avoid redundancy • Current status: “standardize”, different values treated as distinct for analysis • Lot of heterogeneity • Need approximate joins • Relevant technologies: approximate joins, clustering/partitioning 11
  • 21. Offline vs online linkage • Offline linkage: – performed once before queries involving joins – reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique: R → R′, S → S′ – perform regular equijoin on the transformed tables: R′ ⋈ S′ ➡ok for tables that can be analysed ahead of the join ➡suitable when multiple queries issued on integrated tables 12
  • 22. Offline vs online linkage • Offline linkage: – performed once before queries involving joins – reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique: R → R′, S → S′ – perform regular equijoin on the transformed tables: R′ ⋈ S′ ➡ok for tables that can be analysed ahead of the join ➡suitable when multiple queries issued on integrated tables • Online linkage: – performed just-in-time before a query – exact join approximate join 12
  • 23. Integration with approximate joins
    • Assume relational data: tables R, S
    • Assume schema integration is understood
      – we focus on data integration only
    • Ultimately, data integration involves joining tables: R ⋈ S on A = B
    [Figure: example tables R and S whose join columns hold the values “Mcrosoft” and “Microsoft”; only the “Microsoft” = “Microsoft” pair appears in the join result]
    • ordinary “exact” match misses out on the similar values
    • compromises integration completeness
  • 24. Approximate joins
    [Figure: historical timeline]
    from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06.
  • 28. Edit distance / similarity functions
    • Core sub-problem in approximate join:
      – define / choose distance function between values in pairs of joining attributes
    1. Similarity function $sim(r_1, r_2)$ between record pairs $r_1, r_2$
    2. Decision rules of the form:
       $sim(r_1, r_2) < \theta_1$ → not match
       $\theta_1 \le sim(r_1, r_2) \le \theta_2$ → unknown
       $\theta_2 < sim(r_1, r_2)$ → match
    A common choice of similarity function in the context of approximate joins is one based on string q-grams
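The three-way decision rule above can be sketched in a few lines of Python. This is only an illustration: the names `Decision` and `decide` are hypothetical, and the threshold values used in the usage example are arbitrary rather than taken from the slides.

```python
from enum import Enum


class Decision(Enum):
    NOT_MATCH = "not match"
    UNKNOWN = "unknown"
    MATCH = "match"


def decide(sim: float, theta1: float, theta2: float) -> Decision:
    """Classify a record pair from its similarity score.

    sim < theta1            -> not match
    theta1 <= sim <= theta2 -> unknown
    theta2 < sim            -> match
    """
    if theta1 > theta2:
        raise ValueError("theta1 must not exceed theta2")
    if sim < theta1:
        return Decision.NOT_MATCH
    if sim <= theta2:
        return Decision.UNKNOWN
    return Decision.MATCH


# Usage with arbitrary, illustrative thresholds.
print(decide(0.9, theta1=0.4, theta2=0.8))
```

Pairs that fall in the "unknown" band are the ones a linkage process would typically hand to a further check.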
  • 29. Measuring string similarity using q-grams
    • q-grams map a string s to a set q(s) of substrings of length q
      Ex.: 3-grams:
      q(“Microsoft Corporation”) = {‘Mic’, ‘icr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’}
      q(“Mcrosoft Corporation”) = {‘Mcr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’, ‘rp#’}
    $sim(s_1, s_2) = \frac{|q(s_1) \cap q(s_2)|}{|q(s_1) \cup q(s_2)|}$ (Jaccard coefficient)
    This is a commonly used measure of string similarity
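A minimal Python sketch of the q-gram Jaccard similarity follows. It assumes unpadded q-grams taken over the whole string, which may differ from the exact tokenisation used on the slide; the function names `qgrams` and `jaccard` are illustrative.

```python
def qgrams(s: str, q: int = 3) -> set[str]:
    """Return the set of all length-q substrings of s (no padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(s1: str, s2: str, q: int = 3) -> float:
    """Jaccard coefficient |q(s1) & q(s2)| / |q(s1) | q(s2)|."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        # Both strings are shorter than q: treat them as identical sets.
        return 1.0
    return len(a & b) / len(a | b)


# "Microsoft" has 7 distinct 3-grams, "Mcrosoft" has 6, and they share 5,
# so the union has 8 elements.
print(jaccard("Microsoft", "Mcrosoft"))
```

A single missing character removes up to q of the q-grams of the longer string, which is why the score drops well below 1 even for a one-letter typo.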
  • 30. Online linkage using q-grams
    – approximate join is a θ-join: $R \bowtie_{\theta_{A,B}} S$
    – where $\theta_{A,B}$ incorporates a similarity measure, e.g. Jaccard
    • Naïve method: for each record pair, compute similarity score
      – I/O and CPU intensive, not scalable
    • Goal: reduce $O(n^2)$ cost to $O(n \cdot w)$, where $w \ll n$
      – Reduce number of pairs on which similarity is computed
      – Take advantage of efficient relational join methods
  • 31. Efficient relational approximate joins
    Idea: reduce approximate join to aggregated set intersection. Two strings within distance d must share a minimum number of q-grams:
    $dis(s_1, s_2) \le d \implies |q(s_1) \cap q(s_2)| \ge \max(|s_1|, |s_2|) - (d - 1) \times q - 1$
    so pairs that share fewer q-grams can be discarded without computing their distance.
    In practice:
    • known similarity measures can be used to compare pairs of records
    • cheap filters (length, count, position) to prune non-matches
    • implementation using standard SQL
    • cost-based join methods
    Efficient relational representation: [CGK06] S. Chaudhuri, V. Ganti and R. Kaushik, “A primitive operator for similarity joins in data cleaning” (ICDE’06)
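The count filter on this slide can be sketched as follows. The sketch assumes that dis is edit distance and that q-grams are counted as a multiset over strings padded with q − 1 marker characters on each side (the setting in which this bound is usually stated); the pad characters `#` and `$` and both function names are assumptions made for illustration. The filter only prunes: a pair that passes still needs its actual distance verified.

```python
from collections import Counter


def padded_qgrams(s: str, q: int = 3, left: str = "#", right: str = "$") -> Counter:
    """Multiset of q-grams of s, padded with q-1 marker characters per side.

    Assumes the pad characters do not occur in the data.
    """
    padded = left * (q - 1) + s + right * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_count_filter(s1: str, s2: str, d: int, q: int = 3) -> bool:
    """Necessary condition for dis(s1, s2) <= d.

    Returns False only when the pair shares too few q-grams to be
    within distance d.
    """
    shared = sum((padded_qgrams(s1, q) & padded_qgrams(s2, q)).values())
    threshold = max(len(s1), len(s2)) - (d - 1) * q - 1
    return shared >= threshold


# One deleted character: the pair survives the filter for d = 1.
print(passes_count_filter("Microsoft", "Mcrosoft", d=1))
# Unrelated strings share no q-grams and are pruned.
print(passes_count_filter("Microsoft", "Oracle", d=1))
```

Because the test reduces to counting shared q-grams per pair, it maps naturally onto a relational join followed by a GROUP BY and HAVING clause, which is the point of the "standard SQL" bullet.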
  • 32. Is full approximate join always necessary?
    • Remaining source of complexity:
      – overhead for storing and indexing q-grams
      – cost of computing set intersection
    • Typical mismatch rate in real datasets around 5%
    • Complexity of full-fledged approximate join not fully justified
    Research hypothesis: time-completeness trade-offs
    Offer users the option to trade completeness of integration for the time required to complete the join
  • 33. Adaptive query processing
    Idea: implement a hybrid join algorithm that combines exact and approximate join
    Intuition: leverage known results on Adaptive Query Processing
      – developed in the context of query re-optimization
      – switch physical join operators in mid-flight
    [DIR07] A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, 2007
    See also VLDB 2007 Tutorial at http://www.vldb2007.org/program/slides/s1426-deshpande.pdf
  • 34. Autonomic computing framework [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.
  • 36. Autonomic computing framework
    [Figure: monitor → assess → respond loop. Monitor: incremental result size. Assess: estimate result size, compute divergence. Respond: switch join operators]
    Start with an exact join (optimistically). At step t during the execution:
    • estimate the expected size of the join result Ōt at that point
    • monitor the actual size Ot of the result
    • when using exact join: if Ōt and Ot diverge “too much”, then switch to approximate join
    • when using approximate join: if Ōt and Ot are very close, then switch to exact join
  • 37. Technical approach and challenges
    Need to add several new capabilities to a standard query processing infrastructure
    • Assess:
      – estimating result size at specific points during join execution
    • Respond:
      – switching between join operators at specific points during execution
        • Adaptive Query Processing (AQP): operator replacement in pipelined query plans [EFP06]
      – adding an approximate join operator to the query processor [CGK06]
    [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
    [CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE 2006, p. 5.
  • 45. Symmetric hash join
    Well-known join operator
      – basis for approximate join [CGK06]
      – can be applied to streams of data
        • it can read tuples from whichever input is available, and it incrementally produces output based on the tuples received so far
      – a pipelined operator ← this is a key requirement for use in AQP
    When a tuple appears at either input, it is incrementally added to the corresponding hash table (build) and probed against the opposite hash table (probe).
    [Figure: inputs R and S each build into their own hash table and probe the opposite one. R hash table: m (key x), n (key y). S hash table: r (key y), s (key x). Output: [R.m, S.s], [R.n, S.r]]
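The build-then-probe behaviour described above can be sketched as a Python generator. This is a simplified in-memory illustration, not the operator used in the presented system: the input is modelled as a single interleaved stream of `(side, key, payload)` triples, and the name `symmetric_hash_join` is hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator


def symmetric_hash_join(
    stream: Iterable[tuple[str, Hashable, object]],
) -> Iterator[tuple[object, object]]:
    """Pipelined equijoin over an interleaved stream of R and S tuples.

    Each element is (side, key, payload) with side "R" or "S".
    Yields (r_payload, s_payload) pairs as soon as they can be produced.
    """
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for side, key, payload in stream:
        if side not in tables:
            raise ValueError(f"unknown input side: {side!r}")
        other = "S" if side == "R" else "R"
        # Build: add the tuple to the hash table of its own input.
        tables[side][key].append(payload)
        # Probe: look up the same key in the opposite hash table.
        for match in tables[other].get(key, []):
            yield (payload, match) if side == "R" else (match, payload)


# The example from the slide: R holds m (key x) and n (key y),
# S holds r (key y) and s (key x).
arrivals = [("R", "x", "m"), ("R", "y", "n"), ("S", "y", "r"), ("S", "x", "s")]
for pair in symmetric_hash_join(arrivals):
    print(pair)
```

Since every arriving tuple is handled to completion before the next one is read, the operator has a well-defined resting point between tuples, which is what makes it a candidate for mid-flight replacement.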
  • 46. Estimating result size
    • Exploit implicit parent-child key assumption:
      – at the end of the join, we expect a result of size |S|
    [Figure: example tables R (parent) and S (child) joined on key values x and y]
    • When there are no mismatches, after scanning n < |S| tuples on S:
      P(a=x in S has been matched) = P(tuple c=x is in top n of R) = n/|R|
    Thus, the join result size $O_n$ is a binomial random variable:
    $O_n \sim bin(n, \frac{n}{|R|})$
  • 47. Detecting divergent observed result size
    The observed result size after n tuples have been scanned is an outlier with respect to the expected result size if:
    $P_{n,p(n)}(O_n \le \bar{O}) \le \theta_{out}$
    where $P_{n,p(n)}(\cdot)$ is the cumulative distribution function for a binomial with parameters $n$, $p(n)$
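The outlier test can be sketched with the standard library alone. The sketch assumes p(n) = n/|R| as on the result-size slide and reads the test as "the binomial CDF evaluated at the observed result size is at most θout"; the default threshold of 0.05 and the function names are illustrative assumptions, not values from the presentation.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ bin(n, p), computed by direct summation."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def is_outlier(observed: int, n: int, r_size: int, theta_out: float = 0.05) -> bool:
    """True if the observed result size is improbably small after n tuples.

    Uses p(n) = n / |R|; theta_out is an illustrative default.
    """
    if not 0 < n <= r_size:
        raise ValueError("need 0 < n <= |R|")
    return binom_cdf(observed, n, n / r_size) <= theta_out


# With |R| = 100 and n = 50 the expected result size is n * p = 25.
print(is_outlier(25, n=50, r_size=100))  # close to expectation
print(is_outlier(10, n=50, r_size=100))  # far below expectation
```

Direct summation is adequate for small n; for large inputs a library routine for the binomial CDF would be the more practical choice.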
  • 51. Instantiating the MAR framework
    [Figure: monitor → assess → respond loop. Monitor (✔): incremental result size $O_n$. Assess (✔): estimate result size, compute divergence predicates σ(t), µ(t), π(t). Respond: switch join operators]
    – Discrepancy detected: $\sigma(n) \equiv P_{n,p(n)}(O_n \le \bar{O}) \le \theta_{out}$
    – Current perturbations on left/right?: $\mu_i(t) \equiv \frac{A_{t,W}}{W} \le \theta_{curpert}$
    – Past perturbations on left/right?: $\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \le \theta_{pastpert}$
  • 52. Responder’s state machine
    • Operator switch defined in terms of state transitions
    • Owing to symmetry, we can use a different operator on each of the two tables
    [Figure: state machine with four states]
      – left: exact, right: exact
      – left: approximate, right: approximate
      – left: exact, right: approximate
      – left: approximate, right: exact
  • 53. Rationale for state transitions
    [Figure: state diagram over the states lex / rex, lex / rap, lap / rex and lap / rap, with transitions labelled “evidence that left and/or right input perturbed” and “evidence that left and/or right input no longer perturbed”]
    The predicates σ(t), µ(t), π(t) provide the evidence needed to drive the transitions
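  • 54. Assessment → state transitions
    $\sigma(n) \equiv P_{n,p(n)}(O_n \le \bar{O}) \le \theta_{out}$
    $\mu_i(t) \equiv \frac{A_{t,W}}{W} \le \theta_{curpert}$
    $\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \le \theta_{pastpert}$
    $\varphi_0(t) = \lnot\sigma(t) \land \mu_{left}(t) \land \mu_{right}(t)$
    $\varphi_1(t) = \sigma(t) \land \lnot\mu_{left}(t) \land \lnot\mu_{right}(t)$
    $\varphi_2(t) = \sigma(t) \land \lnot\mu_{left}(t) \land \mu_{right}(t) \land \pi_{left}(t)$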
  • 55. Completing the loop
    [Figure: the monitor → assess → respond loop with every step ticked (✔) and annotated with $\delta_{adapt}$: monitor the incremental result size $O_n$; estimate result size and compute divergence via σ, µ, π; switch join operators via $\varphi_0$, $\varphi_1$, $\varphi_2$]
  • 56. Note on operator replacement
    • Details on how to switch operators on the fly are omitted
      – main point: pipelined operators expose specific quiescent states where replacement can take place with no loss of work [EFP06]
    [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
  • 58. Experimental evaluation
    Trade-off analysis
    • Benefits:
      – achieved level of result completeness
      – baseline: approximate join throughout
        • model marginal gain of hybrid algorithm
    • Cost:
      – baseline: exact join throughout
        • model marginal cost of hybrid algorithm
  • 59. Test datasets
    Datasets chosen as representative of 4 distinct patterns across which we expect our results to vary, for example:
    • uniform perturbation: evidence grows slowly => slow reaction
    • bursty perturbation: strong evidence => timely reaction
  • 60. Parameter tuning and gain/cost models
    • Each of the MAR parameters tuned empirically
    • Experiments executed using the best possible configuration
    • Nice result: parameter setting is quite independent of the specific variant pattern
    Relative gain $g_{rel}$:
    • R: result size for approx join only
    • r: result size for exact only
    • $r_{abs}$: result size actually observed
    $g_{rel} = (r_{abs} - r) / (R - r)$
    (details on cost model omitted)
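As a worked illustration of the relative gain, the short Python function below evaluates the formula on made-up numbers. The function name and the sample sizes are hypothetical and are not results from the experiments.

```python
def relative_gain(r_abs: float, r_exact: float, r_approx: float) -> float:
    """g_rel = (r_abs - r) / (R - r).

    r_exact  (r): result size for exact join only
    r_approx (R): result size for approximate join only
    r_abs:        result size actually observed with the hybrid join
    """
    if r_approx == r_exact:
        raise ValueError("approximate and exact results have the same size")
    return (r_abs - r_exact) / (r_approx - r_exact)


# Made-up example: the exact join finds 80 pairs, the approximate join 100,
# and the hybrid join 95, i.e. it recovers 15 of the 20 missed pairs.
print(relative_gain(r_abs=95, r_exact=80, r_approx=100))
```

A value of 0 means the hybrid join did no better than the exact join, and a value of 1 means it recovered everything the approximate join would have found.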
  • 61. Cost model
    • unit cost of executing one step in state i: $w_i$
      – weights determined experimentally
    • number of steps in each state: $t_i$
    • unit state transition cost (experimental): $v_i$
    • number of state transitions: $tr_i$
    total absolute cost: $c_{abs} = \sum_i sc_i + \sum_i tc_i$
    relative cost: $c_{rel} = c_{abs} / (C - c)$, where
      – c: best cost (exact only)
      – C: worst cost (approx only)
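The cost model can be evaluated with a few lines of Python. The slide does not spell out the per-state terms, so this sketch assumes sc_i = w_i · t_i (step cost in state i) and tc_i = v_i · tr_i (transition cost for state i); that reading, the function names and the numbers in the example are all assumptions made for illustration.

```python
from typing import Sequence


def absolute_cost(
    w: Sequence[float], t: Sequence[int], v: Sequence[float], tr: Sequence[int]
) -> float:
    """c_abs = sum_i sc_i + sum_i tc_i.

    Assumes sc_i = w[i] * t[i] and tc_i = v[i] * tr[i].
    """
    if not len(w) == len(t) == len(v) == len(tr):
        raise ValueError("all per-state sequences must have the same length")
    step_cost = sum(wi * ti for wi, ti in zip(w, t))
    transition_cost = sum(vi * tri for vi, tri in zip(v, tr))
    return step_cost + transition_cost


def relative_cost(c_abs: float, c_best: float, c_worst: float) -> float:
    """c_rel = c_abs / (C - c), with c the exact-only and C the approx-only cost."""
    if c_worst == c_best:
        raise ValueError("best and worst cost coincide")
    return c_abs / (c_worst - c_best)


# Made-up two-state example: 10 steps at unit cost 1.0 and 5 steps at
# unit cost 3.0, plus one transition per state at unit cost 2.0.
c_abs = absolute_cost(w=[1.0, 3.0], t=[10, 5], v=[2.0, 2.0], tr=[1, 1])
print(c_abs)
print(relative_cost(c_abs, c_best=15.0, c_worst=45.0))
```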
  • 64. Discussion
    • Results similar across different variant patterns
      – good!
    • Transition cost is not overwhelming:
      – we never pay more for hybrid than for approx
      – this gives us a good space for trade-offs
      – we could let users tune the algorithm without fear of “breaking” it
  • 65. Conclusions
    • An exact / approximate hybrid approach to joins in the presence of violations of implicit referential integrity across tables
      – relational setting
    • Approach based on autonomic computing principles
      – Adaptive query processing techniques
    • Application: on-the-fly integration scenarios (mashups, personal dataspaces)
    • Results: cost / completeness trade-off analysis
      – initial encouraging experimental conclusions
    Study requires additional testing on real datasets
  • 66. References used in the presentation
    • A. Halevy and D. Maier, Dataspaces: the Tutorial, VLDB 2008 tutorial, Auckland, NZ, Aug 2008
    • N. Koudas, S. Sarawagi, D. Srivastava, Record Linkage: Similarity Measures and Algorithms, VLDB 2006 tutorial, Seoul, Korea, 2006
    • [FS69] I.P. Fellegi and A.B. Sunter, A Theory for Record Linkage, J. Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969
    • [EIV07] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate Record Detection: A Survey, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, Jan 2007
    • [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.
    • [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
    • [CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE 2006, p. 5