Invited talk @ Cardiff University, 2008: Approximate entity reconciliation for on-the-fly integration in data mashups

Approximate entity reconciliation
for on-the-fly integration in data mashups

Paolo Missier, Alvaro A. A. Fernandes
School of Computer Science, University of Manchester

Roald Lengu, Giovanna Guerrini
DISI, Università di Genova, Italy

Marco Mesiti
DiCo, Università di Milano, Italy

Outline
• New data integration scenarios:
– occasional integration with little prior knowledge about the
sources
• Context: Data mashups and personal dataspaces

• How to ensure that we are not missing any data in
the process?
– how costly (i.e. response time) is it to guarantee
completeness?
– can we trade completeness for response time?

• Technically speaking: convergence of
– record linkage (an old data quality favourite)
– approximate joins
– adaptive query processing

Early example
• sources 1..n: collection of car insurance DBs
• data changes frequently
• schemas can be analysed / integrated using traditional
techniques
• source n+1: reference street atlas

• target app: mapping accident hotspots
• alert service to drivers, for example
• useful information for decision makers

(image from housingmaps.com)

Mashups
The IBM view, 2006
VLDB 2006 Keynote by Anant Jhingran (CTO, Information
Management, IBM Silicon Valley Laboratory, San Jose, CA):
Enterprise information mashups: integrating information,
simply

Situational Applications
• Applications that come together for solving some
immediate business problems
• constructed “on the fly” for some transient need
and possibly short-lasting

• Data never seen before, consumed on the spot
– would take too long for the IT department to provide them
– RSS feeds / data streams

IBM Mashup Center
• IBM Mashup Center
– mashup workflow
– leverages Lotus, DB2 plus LDAP, Web Services, ...


Yahoo Pipes

Is there actually a “join” in the set of operators?
Also Google Mashup Editor, and more...

Integration in dataspaces


Assumptions

– no prior knowledge of data sets (streams) to be joined
– assumptions on implicit parent-child attribute relationships
– no guarantee of matching values

• sources 1..n: collection of car
insurance DBs
• target app: mapping accident
hotspots


The broad context: record linkage
• Are two (slightly) different records two different surface
representations of the same real-world entity?

Record 1                  | Record 2                  | Observation
Name: John Smith          | Name: John Smith          | Record values incomplete
SSN:                      | SSN: 123-45-6789          |
Address: 477 Cedar Street | Address:                  |
Brendan Hughes            | Brenda Hughes             | Twins or typo?
Address: 564 Hickory Pl.  | Address: 564 Hickory Pl.  |
Name: Jean Smith          | Name:                     | Conflict between forenames and phone number
Phone #: (337) 555-6676   | Phone #: (337) 555 5676   |
Name: Alice Jones         | Names: Lois Avon          | Same SSN, different names?
SSN: 123-45-6789          | SSN: 123-45-6789          |

• A difficult / uncertain decision process
• which attributes should I consider for matching
• what are the different weights
• context: relative frequency of values?
• external knowledge, user input


Results on record linkage
A mature field - ample literature
– 1969: I.P. Fellegi and A.B. Sunter, “A Theory for Record Linkage,” J. Am. Statistical Assoc., vol. 64,
no. 328, pp. 1183-1210, Dec. 1969

– 2007: A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, “Duplicate Record Detection: A Survey”, IEEE
Transactions on Knowledge and Data Engineering, VOL. 19, NO. 1, Jan 2007


Record Linkage: Similarity Measures and Algorithms
Nick Koudas (University of Toronto)
Sunita Sarawagi (IIT Bombay)
Divesh Srivastava (AT&T Labs-Research)
SIGMOD 2006 Data Quality tutorial

Application: Merging Lists
• Application: merge address lists (customer lists, company lists)
to avoid redundancy
• Current status: “standardize”, different values treated as
distinct for analysis
• Lot of heterogeneity
• Need approximate joins
• Relevant technologies
• Approximate joins
• Clustering/partitioning

Offline vs online linkage
• Offline linkage:
– performed once before queries involving joins
– reconcile R and S on joining attributes R.A, S.B using your
favourite record linkage technique
$R \to R'$, $S \to S'$

– perform regular equijoin on the transformed tables:

$R' \bowtie_{A=B} S'$
➡ok for tables that can be analysed ahead of the join
➡suitable when multiple queries issued on integrated tables


• Online linkage:
– performed just-in-time before a query
– the exact join is replaced by an approximate join

Integration with approximate joins
• Assume relational data: tables R, S
• Assume schema integration is understood
– we focus on data integration only

• Ultimately, data integration involves joining tables
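$R \bowtie_{A=B} S$

[Figure: example join of R and S on A = B, where the join values include “Mcrosoft” and “Microsoft”; the exact join returns only the tuple (Y, X, Microsoft, Microsoft, Z).]

• ordinary “exact” match misses out on the similar values
• compromises integration completeness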


Approximate joins
Historical timeline:

from:
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and
algorithms. Tutorial in SIGMOD '06.

Edit distance / similarity functions
• Core sub-problem in approximate join:
– define / choose distance function between values in pairs
of joining attributes

1. Similarity function sim(r1 , r2 ) between record pairs r1 , r2

2. Decision rules of the form

sim(r1 , r2 ) < θ1 → not match
θ1 ≤ sim(r1 , r2 ) ≤ θ2 → unknown
θ2 < sim(r1 , r2 ) → match
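
The three decision rules can be sketched as a small classifier. This is only an illustration: the names `Decision` and `classify` and the example thresholds are assumptions, not part of the talk.

```python
from enum import Enum


class Decision(Enum):
    NOT_MATCH = "not match"
    UNKNOWN = "unknown"
    MATCH = "match"


def classify(sim: float, theta1: float, theta2: float) -> Decision:
    """Apply the three-way rule: below theta1, between the thresholds, above theta2."""
    if theta1 > theta2:
        raise ValueError("theta1 must not exceed theta2")
    if sim < theta1:
        return Decision.NOT_MATCH
    if sim <= theta2:
        return Decision.UNKNOWN
    return Decision.MATCH
```

With hypothetical thresholds θ1 = 0.4 and θ2 = 0.8, a pair scoring 0.6 falls in the "unknown" band and would need further evidence or user input.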

A common choice of similarity function in the context of
approximate joins is one based on string q-grams


Measuring string similarity using q-grams
• q-grams map a string s to the set q(s) of its substrings of length q:
Ex.: 3-grams:

q(“Microsoft Corp”) =
{‘Mic’, ‘icr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’}

q(“Mcrosoft Corp”) =
{‘Mcr’, ‘cro’, ‘ros’, ‘oso’, ‘sof’, ‘oft’, ‘ft ’, ‘t C’, ‘ Co’, ‘Cor’, ‘orp’}

$$\mathrm{sim}(s_1, s_2) = \frac{|q(s_1) \cap q(s_2)|}{|q(s_1) \cup q(s_2)|} \quad \text{(Jaccard coefficient)}$$
This is a commonly used measure of string similarity
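
A minimal sketch of the q-gram Jaccard similarity, assuming unpadded q-grams and plain Python sets. The function names are illustrative, and the strings “Microsoft Corp” and “Mcrosoft Corp” are chosen here so that the 3-grams are easy to check by hand.

```python
def qgrams(s: str, q: int = 3) -> set[str]:
    """Return the set of all substrings of s with length q (no padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(s1: str, s2: str, q: int = 3) -> float:
    """Jaccard coefficient of the q-gram sets of s1 and s2."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        return 1.0  # convention chosen here for two strings shorter than q
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # 10 shared 3-grams out of 13 distinct ones
    print(jaccard("Microsoft Corp", "Mcrosoft Corp"))
```

The single missing character removes two 3-grams and introduces one new one, so the two strings still share 10 of the 13 distinct 3-grams.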

Online linkage using q-grams
– approximate join is a θ-join:
$R \bowtie_{\theta_{A,B}} S$
– where $\theta_{A,B}$ incorporates a similarity measure, e.g. Jaccard

• Naïve method: for each record pair, compute similarity
score
– I/O and CPU intensive, not scalable

• Goal: reduce O(n²) cost to O(n·w), where w << n
– Reduce number of pairs on which similarity is computed
– Take advantage of efficient relational join methods
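
One way to reduce the number of pairs that are scored, sketched under assumptions: build an inverted index from q-grams to the records of S, and keep only the pairs that share at least `min_shared` q-grams. The function names and the `min_shared` parameter are hypothetical; this is not the SQL-based method of the cited work.

```python
from collections import defaultdict


def qgrams(s: str, q: int = 3) -> set[str]:
    """Return the set of all substrings of s with length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def candidate_pairs(r_values, s_values, q=3, min_shared=1):
    """Return index pairs (i, j) whose strings share at least min_shared q-grams."""
    # Inverted index: q-gram -> positions of the S values containing it.
    index = defaultdict(set)
    for j, s in enumerate(s_values):
        for g in qgrams(s, q):
            index[g].add(j)

    pairs = set()
    for i, r in enumerate(r_values):
        shared = defaultdict(int)  # S position -> number of shared q-grams
        for g in qgrams(r, q):
            for j in index.get(g, ()):
                shared[j] += 1
        pairs.update((i, j) for j, count in shared.items() if count >= min_shared)
    return pairs
```

Only the surviving candidate pairs would then be passed to the similarity function, so unrelated values such as “Oracle” and “IBM” are never compared.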


Efficient relational approximate joins
Idea:
reduce approximate join to aggregated set intersection:

$$\mathrm{dis}(s_1, s_2) \le d \ \text{ only if } \ |q(s_1) \cap q(s_2)| \ge \max(|s_1|, |s_2|) - (d - 1) \times q - 1$$

In practice:
• known similarity measures can be used to compare pairs
of records
• cheap filters (length, count, position) to prune non-matches
• Implementation using standard SQL
• cost-based join methods
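
The count filter can be illustrated with a short sketch. It assumes that q-grams are taken over strings padded with q − 1 copies of `#` at the start and `$` at the end, and that they are counted as a multiset; the padding characters and function names are assumptions made here, not details from the slides.

```python
from collections import Counter


def padded_qgrams(s: str, q: int = 3) -> Counter:
    """Multiset of q-grams of s after padding both ends with q - 1 symbols."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def shared_qgrams(s1: str, s2: str, q: int = 3) -> int:
    """Size of the multiset intersection of the padded q-grams."""
    return sum((padded_qgrams(s1, q) & padded_qgrams(s2, q)).values())


def passes_count_filter(s1: str, s2: str, d: int, q: int = 3) -> bool:
    """Necessary condition for edit distance <= d; failing pairs can be pruned."""
    bound = max(len(s1), len(s2)) - (d - 1) * q - 1
    return shared_qgrams(s1, s2, q) >= bound


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, kept here for checking."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

The filter only prunes: a pair that passes still has to be verified with the actual distance or similarity function.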

Efficient relational representation:
[CGK06] S. Chaudhuri, V. Ganti and R. Kaushik,
“A primitive operator for similarity joins in data cleaning” (ICDE’06)

Is full approximate join always necessary?
• Remaining source of complexity:
– overhead for storing and indexing q-grams
– cost of computing set intersection

• Typical mismatch rate in real datasets around 5%
• Complexity of full-fledged approximate join not fully
justified

Research hypothesis: time-completeness trade-offs

Offer users the option to trade completeness of integration
for the time required to complete the join


Adaptive query processing
Idea:
implement a hybrid join algorithm that combines
exact and approximate join

Intuition:
leverage known results on Adaptive Query Processing
– developed in the context of query re-optimization
– switch physical join operators in mid-flight

[DIR07] A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing.
Foundations and Trends in Databases, 1(1):1–140, 2007

See also VLDB 2007 Tutorial at
http://www.vldb2007.org/program/slides/s1426-deshpande.pdf


Autonomic computing framework

[KC03] J. O. Kephart and D. M. Chess. The vision of
autonomic computing. IEEE Computer, 36(1):41–50, 2003.

[Figure: monitor / assess / respond loop. Monitor: incremental result size. Assess: estimate result size, compute divergence. Respond: switch join operators.]

Start with an exact join (optimistically).
At step t during the execution:
• estimate the expected size Ōt of the join result at that point
• monitor the actual size Ot of the result
• when using exact join: if Ōt and Ot diverge “too much”, switch to
approximate join
• when using approximate join: if Ōt and Ot are very close, switch to
exact join

Technical approach and challenges
Need to add several new capabilities to a standard query
processing infrastructure

• Assess:
– estimating result size at specific points during join execution
• Respond:
– switching between join operators at specific points during
execution
• Adaptive Query Processing (AQP): operator replacement
in pipelined query plans [EFP06]

– adding an approximate join operator to the query processor
[CGK06]

[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the
replacement of pipelined physical join operators in adaptive query processing. In EDBT
Workshops 2006, LNCS 4254
[CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in
data cleaning. In ICDE 2006, p. 5.

Symmetric hash join
Well-known join operator
– basis for approximate join [CGK06]
– can be applied to streams of data
• it can read tuples from whichever input is available, and it incrementally
produces output based on the tuples received so far.
– a pipelined operator ← this is a key requirement for use in AQP
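
A minimal sketch of a pipelined symmetric hash join, assuming tuples arrive as `(side, key, payload)` triples in a single interleaved stream. This input encoding and the function name are illustrative assumptions. Each arriving tuple is first added to its own side's hash table (build) and then looked up in the other side's table (probe), so results are emitted as soon as both halves of a match have been seen.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator, Tuple


def symmetric_hash_join(
    arrivals: Iterable[Tuple[str, Hashable, object]]
) -> Iterator[Tuple[object, object]]:
    """Yield (r_payload, s_payload) pairs incrementally as tuples arrive."""
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for side, key, payload in arrivals:
        if side not in tables:
            raise ValueError(f"unknown input side: {side!r}")
        other = "S" if side == "R" else "R"
        tables[side][key].append(payload)          # build step
        for match in tables[other].get(key, ()):   # probe step
            yield (payload, match) if side == "R" else (match, payload)


if __name__ == "__main__":
    stream = [("R", "x", "m"), ("R", "y", "n"), ("S", "x", "s"), ("S", "y", "r")]
    for pair in symmetric_hash_join(stream):
        print(pair)
```

Because the function is a generator, a consumer can stop or inspect the partial result between any two arrivals, which is the property that adaptive processing relies on.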

[Figure: symmetric hash join over inputs R and S. Each input builds its own hash table: the R hash table holds tuples m (join value x) and n (join value y); the S hash table holds tuples r (join value y) and s (join value x).]

When a tuple appears at either input, it is incrementally added to the
corresponding hash table and probed against the opposite hash table.
In the example, probing produces the results [R.m, S.s] and [R.n, S.r].

Estimating result size
• Exploit implicit parent-child key assumption:
– at the end of join, we expect a result of size |S|
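[Figure: R (parent) holds tuples c (key x) and d (key y); S (child) holds tuples b (value y) and a (value x); n marks the number of tuples scanned so far.]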

• When there are no mismatches:
after scanning n < |S| tuples on S:
P(tuple a = x in S has been matched) = P(tuple c = x is among the first n tuples of R) = n/|R|

Thus, the join result size $O_n$ is a binomial random variable:

$$O_n \sim \mathrm{bin}\left(n, \frac{n}{|R|}\right)$$
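
As a worked illustration of the binomial model, with made-up numbers that do not come from the talk: take $|R| = 1000$ and $n = 100$. The standard mean and variance of a binomial variable then give:

```latex
p(n) = \frac{n}{|R|} = \frac{100}{1000} = 0.1,
\qquad
E[O_n] = n \, p(n) = \frac{n^2}{|R|} = 10,
\qquad
\mathrm{Var}[O_n] = n \, p(n) \, (1 - p(n)) = 100 \times 0.1 \times 0.9 = 9 .
```

So after 100 tuples roughly 10 results are expected, with a standard deviation of 3.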

Detecting divergent observed result size
Observation $O_n$ is an outlier with respect to the expected result size
$\bar{O}_n$ after n tuples have been scanned, if:

$$P_{n,p(n)}(\bar{O}_n \le O) \le \theta_{out}$$

where $P_{n,p(n)}(\cdot)$ is the cumulative distribution function for
a binomial with parameters n, p(n)
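
The outlier test can be sketched with the standard library alone. The sketch assumes p(n) = n/|R|, as in the binomial model of the result size, and reads the test as "the probability of seeing a result this small or smaller is at most θout". The default threshold of 0.05 and the function names are hypothetical.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ bin(n, p), computed by direct summation."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def is_outlier(observed: int, n: int, r_size: int, theta_out: float = 0.05) -> bool:
    """True if the observed result size is improbably small after n tuples."""
    p = min(1.0, n / r_size)  # assumed success probability p(n) = n / |R|
    return binomial_cdf(observed, n, p) <= theta_out
```

With |R| = 100 and n = 50, about 25 results are expected; observing 25 is unremarkable, while observing only 10 would be flagged.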


Instantiating the MAR framework

[Figure: monitor / assess / respond loop. Monitor: incremental result size $O_n$ (✔). Assess: estimate result size (✔), compute divergence predicates σ(t), µ(t), π(t) (✔). Respond: switch join operators.]

$\sigma(n) \equiv P_{n,p(n)}(\bar{O}_n \le O) \le \theta_{out}$ (discrepancy detected)
$\mu_i(t) \equiv \frac{A_{t,W}}{W} \le \theta_{curpert}$ (current perturbations on left/right?)
$\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \le \theta_{pastpert}$ (past perturbations on left/right?)

Responder’s state machine
• Operator switch defined in terms of state transitions
• Owing to symmetry, we can use a different operator
on each of the two tables, giving four states:
– left: exact, right: exact
– left: approximate, right: approximate
– left: exact, right: approximate
– left: approximate, right: exact

Rationale for state transitions

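[Figure: state diagram over the four states lex/rex, lex/rap, lap/rex and lap/rap. Evidence that the left and/or right input is perturbed moves the operator from lex/rex towards lap/rap; evidence that the left and/or right input is no longer perturbed moves it back.]

Predicates σ(t), µ(t), π(t) provide the evidence needed to
drive the transitions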

Assessment → state transitions
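$\sigma(n) \equiv P_{n,p(n)}(\bar{O}_n \le O) \le \theta_{out}$
$\mu_i(t) \equiv \frac{A_{t,W}}{W} \le \theta_{curpert}$
$\pi_i(t) \equiv \sum_{t' < t} I(\mu_i(t')) \le \theta_{pastpert}$

$\varphi_0(t) = \neg\sigma(t) \wedge \mu_{left}(t) \wedge \mu_{right}(t)$
$\varphi_1(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \neg\mu_{right}(t)$
$\varphi_2(t) = \sigma(t) \wedge \neg\mu_{left}(t) \wedge \mu_{right}(t) \wedge \pi_{left}(t)$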
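
The three guard predicates are plain boolean combinations, which a short sketch makes explicit. It assumes σ, µ and π have already been evaluated to booleans at step t; the `Evidence` container is a hypothetical name, and the sketch does not say which state transition each guard triggers, since the slides do not specify that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Truth values of the assessment predicates at one step t."""
    sigma: bool      # sigma(t)
    mu_left: bool    # mu_left(t)
    mu_right: bool   # mu_right(t)
    pi_left: bool    # pi_left(t)


def phi0(e: Evidence) -> bool:
    return (not e.sigma) and e.mu_left and e.mu_right


def phi1(e: Evidence) -> bool:
    return e.sigma and (not e.mu_left) and (not e.mu_right)


def phi2(e: Evidence) -> bool:
    return e.sigma and (not e.mu_left) and e.mu_right and e.pi_left
```

Since φ0 requires ¬σ while φ1 and φ2 require σ, and φ1 and φ2 disagree on µright, at most one of the three guards holds at any step.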

Completing the loop

[Figure: the complete monitor / assess / respond loop, annotated with $O_n$ and δadapt. Monitor: incremental result size (✔). Assess: estimate result size (✔), compute divergence (✔). Respond: switch join operators (✔), shown alongside the predicates σ, µ, π and the conditions ϕ0(t), ϕ1(t), ϕ2(t).]

Note on operator replacement
• Details on how to switch operators on the fly are
omitted
– main point: pipelined operators expose specific quiescent
states where replacement can take place with no loss of
work [EFP06]

[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the
replacement of pipelined physical join operators in adaptive query processing. In
EDBT Workshops 2006, LNCS 4254

Experimental evaluation
Trade-off analysis

• Benefits:
– achieved level of result completeness
– baseline: approximate join throughout
• model marginal gain of hybrid algorithm

• Cost
– baseline: exact join throughout
• model marginal cost of hybrid algorithm


Test datasets
Datasets chosen as representative of 4 distinct patterns

we expect our results to vary:
• uniform perturbation: evidence grows slowly => slow reaction
• bursty perturbation: strong evidence => timely reaction

Parameters tuning and gain/cost models
• Each of the MAR parameters tuned empirically
• Experiments executed using the best possible
configuration
• Nice result: parameter setting is quite independent
from the specific variant pattern

Relative gain $g_{rel}$:
• R: result size for approx join only
• r: result size for exact join only
• $r_{abs}$: result size actually observed
$$g_{rel} = \frac{r_{abs} - r}{R - r}$$
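
A small sketch of the relative gain computation, with made-up result sizes. The function and parameter names are illustrative.

```python
def relative_gain(r_abs: float, r_exact: float, r_approx: float) -> float:
    """Fraction of the exact-to-approximate completeness gap closed by the hybrid join."""
    if r_approx == r_exact:
        raise ValueError("exact and approximate result sizes coincide; gain is undefined")
    return (r_abs - r_exact) / (r_approx - r_exact)
```

For example, if the exact join returns 900 tuples, the approximate join 1000, and the hybrid 950, the relative gain is 0.5: the hybrid recovers half of the tuples that the exact join misses.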

(details on cost model omitted)

Cost model
• unit cost of executing one step in state i: $w_i$
– weights determined experimentally
• number of steps in each state: $t_i$
• unit state transition cost (experimental): $v_i$
• number of state transitions: $tr_i$

Total absolute cost:
$$c_{abs} = \sum_i sc_i + \sum_i tc_i$$

Relative cost:
• c: best cost (exact only)
• C: worst cost (approx only)
$$c_{rel} = \frac{c_{abs}}{C - c}$$
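
A sketch of the cost model with hypothetical numbers. The slides do not define the per-state terms, so the sketch assumes the state cost is the unit step cost times the number of steps, and the transition cost is the unit transition cost times the number of transitions. The relative cost follows the formula exactly as stated on the slide.

```python
def absolute_cost(steps, step_weights, transitions, transition_weights):
    """c_abs as the sum of assumed state costs w_i * t_i and transition costs v_i * tr_i."""
    state_cost = sum(step_weights[s] * count for s, count in steps.items())
    transition_cost = sum(transition_weights[s] * count for s, count in transitions.items())
    return state_cost + transition_cost


def relative_cost(c_abs: float, c_exact: float, c_approx: float) -> float:
    """c_rel = c_abs / (C - c), with c the exact-only and C the approx-only cost."""
    if c_approx == c_exact:
        raise ValueError("best and worst cost coincide; relative cost is undefined")
    return c_abs / (c_approx - c_exact)


if __name__ == "__main__":
    steps = {"lex/rex": 100, "lap/rap": 20}          # t_i
    weights = {"lex/rex": 1.0, "lap/rap": 5.0}       # w_i
    transitions = {"lex/rex": 1, "lap/rap": 1}       # tr_i
    switch_costs = {"lex/rex": 10.0, "lap/rap": 10.0}  # v_i
    print(absolute_cost(steps, weights, transitions, switch_costs))
```

The state names reuse the lex/rex and lap/rap labels from the responder's state machine; all weights and counts are invented for illustration.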

Discussion
• Results similar across different variant patterns
– good!

• Transition cost is not overwhelming:
– we never pay more for hybrid than for approx
– this gives us a good space for trade-offs
– we could let users tune the algorithm without fear of
“breaking” it

Conclusions
• An exact / approximate hybrid approach to joins with
violations of implicit referential integrity across tables
– relational setting

• Approach based on autonomic computing principles
– Adaptive query processing techniques

• Application: on-the-fly integration scenarios (mashups,
personal dataspaces)

• Results: cost / completeness trade-off analysis
– initial encouraging experimental conclusions

Study requires additional testing on real datasets

References used in the presentation
• A. Halevy and D. Maier, Dataspaces: the Tutorial, VLDB 2008
tutorial, Auckland, NZ, Aug 2008

• N. Koudas, S. Sarawagi, D. Srivastava, Record Linkage: Similarity
Measures and Algorithms, SIGMOD 2006 tutorial

• [FS69] I.P. Fellegi and A.B. Sunter, A Theory for Record Linkage, J.
Am. Statistical Assoc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969

• [EIV07] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate
Record Detection: A Survey, IEEE Transactions on Knowledge and
Data Engineering, vol. 19, no. 1, Jan 2007

• [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic
computing. IEEE Computer, 36(1):41–50, 2003.

• [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A
foundation for the replacement of pipelined physical join operators
in adaptive query processing. In EDBT Workshops 2006, LNCS
4254

• [CGK06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive
operator for similarity joins in data cleaning. In ICDE 2006, p. 5
