Authors: Nguyen Quoc Viet Hung (1), Nguyen Thanh Tam (1), Zoltán Miklós (2), Karl Aberer (1),
Avigdor Gal (3), and Matthias Weidlich (4)
1 École Polytechnique Fédérale de Lausanne
2 Université de Rennes 1
3 Technion – Israel Institute of Technology
4 Imperial College London
Pay-as-you-go Reconciliation in Schema Matching Networks
1. Pay-as-you-go Reconciliation in Schema
Matching Networks
2. ICDE | 2014 2
Schema Matching - Where?
Schema matching is the process of establishing correspondences between the
attributes of schemas, for the purpose of data integration
Large enterprises
Cloud
WWW
Collaborative Systems
P2P Networks
3. Private PhD Thesis Defense | 12.2013 3
Schema Matching Network
A network of schemas that are matched against each other
Traditional approach: mediated schema
Requires consensus on the schema
Must be updated frequently
Our approach: schema matching network
[Figure: schemas S1, S2, S3 arranged under each of the two approaches]
4. Pay-as-you-go Reconciliation
Reconciliation is the process of asking a human user for feedback on correspondences.
Need for reconciliation: automatic techniques rely on heuristics, so their results are inherently uncertain.
[Figure: example network of three schemas (s1: EoverI, s2: BBC, s3: DVDizzy) with attributes a1: releaseDate, a2: screeningDate, a3: availabilityDate, a4: productionDate and candidate correspondences c1 to c5]
Attribute names are quite similar, so automatic matching tools often fail to identify the correct correspondences.
Pay-as-you-go reconciliation:
Uncertainty reduction: incrementally improve matching quality with minimal user effort
Instantiation (selective matching): instantiate a single trusted set of correspondences
5. System Overview
General approach:
1. Develop a probabilistic matching network (pSMN) that can measure the overall uncertainty of the network
2. Reduce network uncertainty: guide user feedback so that it requires minimal effort
3. Instantiate a selective matching: maintain a good set of attribute correspondences so that the system is usable at any time
6. Outline
Probabilistic Schema Matching Network (pSMN):
Model
Computation
Uncertainty Reduction
Instantiation of the selective matching
Experimental results
Conclusion and future work
7. pSMN - Modeling
A schema matching network is modeled as a quintuple $N = \langle S, G_S, \Gamma, C, P \rangle$
$S$: set of schemas $s$
$G_S$: interaction graph, which represents the connections in the network
$C$: set of attribute correspondences
$\Gamma$: set of integrity constraints
An integrity constraint is the formulation of a natural property, for example:
1-1 constraint
Cycle constraint (transitivity)
Etc.
$P = \{p_c\}$: a set of probabilities. Each probability $p_c$ is associated with a correspondence $c \in C$.
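The model above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class name `PSMN`, the encoding of a correspondence as a pair of (schema, attribute) tuples, and the helper `violates_one_to_one` are all assumptions made for the example, as are the schema and attribute names.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable

# Hypothetical encoding: an attribute is a (schema, attribute) pair and a
# correspondence is a frozenset holding exactly two such attributes.


@dataclass
class PSMN:
    schemas: dict[str, set[str]]                # S: schema name -> attributes
    graph: set[frozenset]                       # G_S: pairs of matched schemas
    constraints: list[Callable[[set], bool]]    # Gamma: violation checks
    correspondences: set[frozenset]             # C
    probabilities: dict[frozenset, float] = field(default_factory=dict)  # P


def violates_one_to_one(selected):
    """True if one attribute is matched to two attributes of the same schema."""
    for c1, c2 in combinations(selected, 2):
        shared = c1 & c2
        if len(shared) == 1:
            (other1,) = c1 - shared
            (other2,) = c2 - shared
            if other1[0] == other2[0]:  # both partners live in the same schema
                return True
    return False


# Illustrative network with made-up schemas and attributes.
a = frozenset({("s1", "x"), ("s2", "y")})
b = frozenset({("s1", "x"), ("s2", "z")})
c = frozenset({("s1", "x"), ("s3", "w")})
network = PSMN(
    schemas={"s1": {"x"}, "s2": {"y", "z"}, "s3": {"w"}},
    graph={frozenset({"s1", "s2"}), frozenset({"s1", "s3"})},
    constraints=[violates_one_to_one],
    correspondences={a, b, c},
)
```

Here `a` and `b` together break the 1-1 constraint because they map the same attribute of s1 to two attributes of s2, while `a` and `c` are compatible.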
8. pSMN - Computing
Probability of a correspondence
Semantics: the probability $p_c$ indicates how likely the correspondence $c$ is to be correct
Source: integrity constraints and user input. Idea: a correspondence that is involved in many violations has a high chance of being problematic.
Computation:
Step 1: construct all possible matching instances $\Omega = \{I_1, \ldots, I_n\}$. A matching instance is a maximal set of correspondences satisfying all integrity constraints and user input.
Step 2: compute $p_c$ by the formula:
$p_c = \frac{\#\,\text{matching instances containing } c}{\#\,\text{all possible matching instances}}$ (i.e. $p_c = \frac{|\{I \in \Omega : c \in I\}|}{|\Omega|}$)
Challenge: probability computation has a high complexity. We use non-uniform sampling and a view-maintenance technique to approximate the probability efficiently.
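The two computation steps can be illustrated by brute force on a tiny network. This sketch deliberately swaps the sampling and view-maintenance technique for exhaustive enumeration, which is only feasible for a handful of correspondences. The conflict pairs in `CONFLICTS` are an assumption standing in for the integrity constraints; they are chosen so that exactly two matching instances exist, {c1, c2, c3} and {c1, c4, c5}.

```python
from itertools import combinations

# Assumed conflict pairs (not from the slides): the two correspondences in a
# pair cannot both hold, for example because together they break a constraint.
CONFLICTS = {frozenset(pair) for pair in [("c2", "c4"), ("c2", "c5"),
                                          ("c3", "c4"), ("c3", "c5")]}


def is_consistent(selected):
    """True if no conflicting pair is fully contained in `selected`."""
    return not any(pair <= selected for pair in CONFLICTS)


def matching_instances(correspondences):
    """Step 1: enumerate all maximal consistent subsets by brute force."""
    cs = sorted(correspondences)
    consistent = [frozenset(sub)
                  for r in range(len(cs) + 1)
                  for sub in combinations(cs, r)
                  if is_consistent(frozenset(sub))]
    # Keep only sets that are not strictly contained in another consistent set.
    return [i for i in consistent if not any(i < j for j in consistent)]


def correspondence_probabilities(correspondences, instances):
    """Step 2: p_c = (instances containing c) / (all instances)."""
    return {c: sum(c in i for i in instances) / len(instances)
            for c in correspondences}


C = {"c1", "c2", "c3", "c4", "c5"}
omega = matching_instances(C)
p = correspondence_probabilities(C, omega)
```

With these assumed conflicts, c1 appears in both instances and receives probability 1, while each of c2 to c5 appears in one of the two instances and receives probability 0.5.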
Network uncertainty: quantify the uncertainty of the pSMN based on entropy:
$H(C, P) = -\sum_{c \in C} \left[ p_c \log p_c + (1 - p_c) \log(1 - p_c) \right]$
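A short sketch of the entropy formula follows. The function name `network_uncertainty` and the example probabilities are illustrative, and the base-2 logarithm is an assumption, since the slide does not state the base. Terms with probability 0 or 1 contribute nothing, following the usual convention that 0 log 0 = 0.

```python
import math


def network_uncertainty(probabilities):
    """Sum of binary entropies over all correspondences (log base 2 assumed)."""
    total = 0.0
    for p in probabilities.values():
        if 0.0 < p < 1.0:  # certain correspondences add no uncertainty
            total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total


# Illustrative probabilities: c1 is certain, the other four are coin flips.
example = {"c1": 1.0, "c2": 0.5, "c3": 0.5, "c4": 0.5, "c5": 0.5}
h = network_uncertainty(example)
```

Each of the four correspondences with probability 0.5 contributes one bit, so the uncertainty of this example is 4 bits.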
10. Reduce Network Uncertainty
Goal: guide the user to give feedback with minimal effort
Problem (UNCERTAINTY MINIMIZATION WITH LIMITED EFFORT BUDGET). Given a probabilistic matching network $\langle S, G_S, \Gamma, C, P \rangle$ and a budget of user effort $k$, find a set of correspondences $C' \subseteq C$ with $|C'| \le k$ to validate, such that the resulting network uncertainty $H(C, P)$ is minimal.
11. Approach – Use heuristic ordering
Idea: present the correspondences with the highest information gain to the user first.
Information gain: the reduction in network uncertainty achieved by validating $c$:
$IG(c) = H(C, P) - H(C, P \mid c)$
$H(C, P \mid c)$: expected network uncertainty when knowing the true value of $c$
Example: two possible solutions, {c1, c2, c3} and {c1, c4, c5}.
Ask c1 first: the network is unchanged, so there is no uncertainty reduction.
Ask c2 first: only one solution is left, so the network becomes certain.
[Figure: schemas SA, SB, SC with correspondences c1 to c5, shown before feedback, after asking c1 (unchanged), and after asking c2 (only c1, c2, c3 remain)]
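The example can be worked through in code. This is a sketch under stated assumptions: probabilities are taken uniformly over the remaining matching instances, validating c keeps either the instances that contain c or those that do not, the logarithm is base 2, and the function names are illustrative.

```python
import math


def probs(correspondences, instances):
    """p_c as the fraction of instances that contain c."""
    return {c: sum(c in i for i in instances) / len(instances)
            for c in correspondences}


def entropy(p_map):
    """Network uncertainty as a sum of binary entropies (log base 2 assumed)."""
    h = 0.0
    for p in p_map.values():
        if 0.0 < p < 1.0:
            h -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return h


def information_gain(c, correspondences, instances):
    """IG(c) = H(C, P) - expected uncertainty after the user validates c."""
    p = probs(correspondences, instances)
    approved = [i for i in instances if c in i]         # user says c is correct
    disapproved = [i for i in instances if c not in i]  # user says c is wrong
    expected = 0.0
    for weight, remaining in ((p[c], approved), (1 - p[c], disapproved)):
        if remaining:
            expected += weight * entropy(probs(correspondences, remaining))
    return entropy(p) - expected


C = ["c1", "c2", "c3", "c4", "c5"]
solutions = [{"c1", "c2", "c3"}, {"c1", "c4", "c5"}]
gains = {c: information_gain(c, C, solutions) for c in C}
first_question = max(C, key=gains.get)
```

Here c1 belongs to both solutions, so its gain is zero, while any of c2 to c5 separates the two solutions and removes all remaining uncertainty.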
12. Instantiate a selective matching
Goal: maintain a single trusted set of correspondences
Goodness measures for a set of correspondences $I \subseteq C$:
Repair distance: the information loss from eliminating some correspondences to guarantee the integrity constraints:
$\Delta(I) = |C \setminus I|$
Likelihood: represents the collective correctness of the correspondences:
$u(I) = \prod_{c \in I} p_c$
Instantiation problem: given a schema matching network, identify a set of correspondences $I \subseteq C$ with minimal repair distance (w.r.t. $C$) and maximal likelihood.
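Both goodness measures are short enough to sketch directly. The function names and the probability values below are made up for illustration.

```python
import math


def repair_distance(instance, correspondences):
    """Number of correspondences of C that are dropped from the instance."""
    return len(set(correspondences) - set(instance))


def likelihood(instance, probabilities):
    """Product of the probabilities of all correspondences in the instance."""
    return math.prod(probabilities[c] for c in instance)


# Hypothetical probabilities for three candidate correspondences.
P = {"c1": 0.9, "c2": 0.8, "c3": 0.5}
I = {"c1", "c2"}
delta = repair_distance(I, P.keys())  # c3 is dropped
u = likelihood(I, P)                  # 0.9 * 0.8
```

Dropping c3 costs one unit of repair distance but keeps the likelihood at roughly 0.72 instead of roughly 0.36 for the full set, which shows why the two objectives have to be balanced.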
13. Approach
The instantiation problem is NP-complete, so we use a heuristic approach
Algorithm:
Step 1 (initialization): pick a sampled matching instance with minimal repair distance
Step 2 (optimization): randomized local search
[Figure: matching instances (all satisfying the constraints) plotted by repair distance and likelihood; legend: non-sampled instance, sampled instance, sampled + minimal repair distance; randomized local search moves from I0 to Iopt, the instance with minimal repair distance and maximal likelihood]
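One possible shape of the two steps is sketched below. This is not the authors' algorithm: the neighbourhood move (add a random correspondence, drop whatever conflicts with it, then greedily extend), the lexicographic score, the conflict pairs, and the probabilities are all assumptions chosen to keep the example small.

```python
import math
import random


def score(instance, all_cs, p):
    """Lower is better: repair distance first, then negated likelihood."""
    return (len(all_cs) - len(instance), -math.prod(p[c] for c in instance))


def compatible(x, instance, conflicts):
    """True if x conflicts with no member of the instance."""
    return all(frozenset((x, y)) not in conflicts for y in instance)


def local_search(samples, all_cs, p, conflicts, steps=100, seed=0):
    rng = random.Random(seed)
    pool = sorted(all_cs)
    # Step 1: start from a sampled instance with minimal repair distance.
    best = frozenset(min(samples, key=lambda i: len(all_cs) - len(i)))
    # Step 2: randomized local search around the current best instance.
    for _ in range(steps):
        c = rng.choice(pool)
        if c in best:
            continue
        # Force c in and remove everything that conflicts with it.
        cand = frozenset(x for x in best
                         if frozenset((x, c)) not in conflicts) | {c}
        # Greedily re-extend so the candidate is maximal again.
        rest = sorted(all_cs - cand)
        rng.shuffle(rest)
        for x in rest:
            if compatible(x, cand, conflicts):
                cand = cand | {x}
        if score(cand, all_cs, p) < score(best, all_cs, p):
            best = cand
    return best


ALL = {"c1", "c2", "c3", "c4", "c5"}
CONFLICTS = {frozenset(pair) for pair in [("c2", "c4"), ("c2", "c5"),
                                          ("c3", "c4"), ("c3", "c5")]}
PROBS = {"c1": 0.9, "c2": 0.8, "c3": 0.7, "c4": 0.4, "c5": 0.3}
result = local_search([{"c1", "c4", "c5"}], ALL, PROBS, CONFLICTS)
```

Starting from the sampled instance {c1, c4, c5}, the search can reach {c1, c2, c3}, which has the same repair distance but a higher likelihood under the assumed probabilities.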
15. Experiment – Dataset and Setting
Datasets:
Business Partner: schemas from enterprise systems
Purchase Order: purchase order e‐business schemas
University Application Form: schemas from Web interfaces of American university
application forms
WebForm: schemas from Web forms of different domains
Thalia: schemas describing university courses
Metrics:
Precision: measures quality improvement at each user interaction step $i$, with $G$ being the exact match.
$P_i = |D_i \cap G| / |D_i|$
User effort: the percentage of feedback steps relative to the size of the matcher output.
$E_i = i / |C|$
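Both metrics can be computed in a few lines. The function names and the sets below are invented for illustration; `d_i` stands for the set written $D_i$ in the precision formula.

```python
def precision(d_i, g):
    """Fraction of the correspondences in d_i that belong to the exact match g."""
    d_i, g = set(d_i), set(g)
    return len(d_i & g) / len(d_i) if d_i else 0.0


def user_effort(i, matcher_output):
    """Feedback steps so far relative to the size of the matcher output C."""
    return i / len(matcher_output)


# Hypothetical data: five candidates, three of them in the exact match.
C = {"c1", "c2", "c3", "c4", "c5"}
G = {"c1", "c2", "c5"}
D_2 = {"c1", "c2", "c3", "c4"}  # assumed state after two feedback steps
p_2 = precision(D_2, G)
e_2 = user_effort(2, C)
```

In this made-up state, two of the four correspondences in `D_2` are correct, giving a precision of 0.5 at a user effort of 0.4.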
16. Efficiency of guiding strategy on uncertainty reduction
Goal: compare guiding and non-guiding strategies on uncertainty reduction
Evaluation procedure:
Increase the user effort
Upon each user input, measure the network uncertainty and precision
Interesting finding: the heuristic ordering strategy saves up to 48% of user effort compared to random ordering.
17. Efficiency of guiding strategy on instantiation
Goal: compare guiding and non-guiding strategies on instantiation
Evaluation procedure:
Increase the user effort
Measure the precision and recall of the instantiated matching
Interesting finding: the heuristic ordering strategy outperforms the baseline with an average difference of 15% (precision) and 14% (recall).
18. Conclusions
We introduce the concept of schema matching networks and probabilistic matching
networks
We define a model for pay‐as‐you‐go reconciliation on top of matching networks.
We propose a guiding technique to reduce network uncertainty and a heuristic
approach to instantiate a selective matching.
Experiments with real-world schemas show that our guiding strategy outperforms the baseline:
Saving up to 48% of user effort
Increasing precision (by 15%) and recall (by 14%) on average
19. Future Work
Generalizing pay‐as‐you‐go reconciliation for crowdsourced models:
Business process matching
Ontology alignment