Data Integration
1. Clio: Schema Mapping Creation and Data Exchange
Presented by
Leila Jalali
Information Systems Group Candidacy Exam, Jan. 2010
2. The Clio project
Source schema S → Schema Mapping → Target schema T
Source data “conforms to” S; target data “conforms to” T
The user posing a query Q:
•Wants data from S
•Understands T
•May not understand S
Data Exchange: to transform data from the source to the target
Clio addresses two main problems:
How to generate schema mappings and how to use them for data exchange?
Information Systems Group Leila Jalali, Candidacy Exam
3. Outline
The Motivating Example
1. Schema Mapping Generation
Mapping generation algorithm
2. Data Exchange
Query generation algorithm
Conclusions
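4. A Motivating Example
Schema S:
Companies: Set of Rcd
  Name
  Address
  Year
Grants: Set of Rcd
  Gid
  Recipient
  Amount
  Supervisor
  Manager
Contacts: Set of Rcd
  Cid
  Email
  Phone
Schema T:
Organizations: Set of Rcd
  Code
  Year
  Fundings: Set of Rcd
    FId
    FinId
Finances: Set of Rcd
  FinId
  Budget
  Phone
Constraints:
f1: Grants.Recipient → Companies.Name
f2: Grants.Supervisor → Contacts.Cid
f3: Grants.Manager → Contacts.Cid
f4: Fundings.FinId → Finances.FinId
Correspondences (given by a “schema matcher” or a “user”):
v1: Companies.Name → Organizations.Code
v2: Grants.Gid → Fundings.FId
v3: Grants.Amount → Finances.Budget
v4: Contacts.Phone → Finances.Phone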
5. Correspondences
Using a tuple generating dependency (tgd):
v1: ∀n,d,y Companies(n,d,y) → ∃y',F Organizations(n,y',F)
The same correspondence in the mapping language:
foreach c in companies
exists o in organizations
with o.code = c.name
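A minimal sketch of what correspondence v1 demands of a target instance, assuming flat Python dictionaries stand in for the records. The function name `satisfies_v1` and the two sample targets are hypothetical; the company rows follow the example data used in the backup slides.

```python
# Hypothetical source instance; rows follow the example data in the backup slides.
companies = [
    {"name": "MS", "address": "SA", "year": 1976},
    {"name": "AT&T", "address": "TX", "year": 1980},
    {"name": "IBM", "address": "NY", "year": 1955},
]


def satisfies_v1(companies, organizations):
    """Check: foreach c in companies exists o in organizations with o.code = c.name."""
    codes = {o["code"] for o in organizations}
    return all(c["name"] in codes for c in companies)


# Two hypothetical target instances: one covers every company, one does not.
complete_target = [{"code": "MS"}, {"code": "AT&T"}, {"code": "IBM"}]
partial_target = [{"code": "MS"}]

print(satisfies_v1(companies, complete_target))
print(satisfies_v1(companies, partial_target))
```

The existential variables y' and F of the tgd do not appear in the check: v1 only requires that some organization with the right code exists, whatever its year and fundings are.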
6. More complex mappings
∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) →
∃y',F,f,p Organizations(n,y',F), F(g,f), Finances(f,a,p)
foreach c in companies, g in grants
where c.name = g.recipient
exists o in organizations,
f in o.fundings,
i in finances
where f.finId = i.finId
with o.code = c.name
and f.fId = g.gId
and i.budget = g.amount
7. More complex mappings
Query on the source, QS:
foreach c in companies, g in grants
where c.name = g.recipient
Query on the target, QT:
exists o in organizations, f in o.fundings, i in finances
where f.finId = i.finId
Correspondences between QS and QT:
with o.code = c.name
and f.fId = g.gId
and i.budget = g.amount
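The reading of a mapping as a containment between QS and QT can be sketched as follows. This is an illustration under assumptions, not Clio's implementation: records are Python dictionaries, the function names are hypothetical, and the finId values `k1` to `k3` are invented placeholders for the existential values.

```python
def q_source(companies, grants):
    """QS: foreach c in companies, g in grants where c.name = g.recipient,
    projected on the source expressions of the with clause."""
    return {
        (c["name"], g["gid"], g["amount"])
        for c in companies
        for g in grants
        if c["name"] == g["recipient"]
    }


def q_target(organizations, finances):
    """QT: exists o in organizations, f in o.fundings, i in finances
    where f.finId = i.finId, projected on the target expressions."""
    return {
        (o["code"], f["fid"], i["budget"])
        for o in organizations
        for f in o["fundings"]
        for i in finances
        if f["finid"] == i["finid"]
    }


def satisfies_mapping(source, target):
    """The mapping holds when the result of QS is contained in the result of QT."""
    return q_source(*source) <= q_target(*target)


companies = [{"name": "MS"}, {"name": "AT&T"}, {"name": "IBM"}]
grants = [
    {"gid": 301, "recipient": "MS", "amount": 30},
    {"gid": 302, "recipient": "MS", "amount": 40},
    {"gid": 303, "recipient": "IBM", "amount": 30},
]
# Hypothetical target instance; k1..k3 are made-up finId values.
organizations = [
    {"code": "MS", "fundings": [{"fid": 301, "finid": "k1"}, {"fid": 302, "finid": "k2"}]},
    {"code": "AT&T", "fundings": []},
    {"code": "IBM", "fundings": [{"fid": 303, "finid": "k3"}]},
]
finances = [
    {"finid": "k1", "budget": 30},
    {"finid": "k2", "budget": 40},
    {"finid": "k3", "budget": 30},
]

print(satisfies_mapping((companies, grants), (organizations, finances)))
```

Dropping any Finances row from the target breaks the containment, because the corresponding source tuple no longer has a witness in QT.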
8. Outline
The Motivating Example
1. Schema Mapping Generation
Mapping generation algorithm
2. Data Exchange
Query generation algorithm
Conclusions
9. Mapping Generation
Source Schema Generate all possible associations within the Source
Structural Associations
Target Schema Generate all possible associations within the Target
10. Mapping Generation
Source Schema Generate all possible associations within the Source
Structural Associations
Target Schema Generate all possible associations within the Target
Examples of structural associations:
from p in companies
from g in grants
from o in organizations
11. Mapping Generation
Source Schema Generate all possible associations within the Source
Structural Associations
Target Schema Generate all possible associations within the Target
Logical Associations
Build larger associations in Source (AS) and Target (AT)
12. Mapping Generation
Source Schema Generate all possible associations within the Source
Structural Associations
Target Schema Generate all possible associations within the Target
Logical Associations
Build larger associations in Source (AS) and Target (AT),
starting with a structural association and "chasing" constraints
[Figure: the source schema (Companies, Grants, Contacts) with constraints f1, f2, f3, from which AS is built]
13. Mapping Generation
Source Schema Generate all possible associations within the Source
Structural Associations
Target Schema Generate all possible associations within the Target
Logical Associations
Build larger associations in Source (AS) and Target (AT)
Use a pair <AS, AT> and the Correspondences covered by <AS, AT> to generate a
Clio Mapping: foreach AS exists AT with W
W is the conjunction of equalities h(eS) = h’(eT) (captured from correspondences)
14. Clio mapping, example
Generate a Clio Mapping: foreach AS exists AT with W
W is the conjunction of equalities h(eS) = h’(eT)
AS : from g in grants, c in companies,
s in contacts, m in contacts
where g.recipient = c.name
and g.supervisor = s.cid
and g.manager = m.cid
AT : from o in organizations,
f in o.fundings, i in finances
where f.finId = i.finId
v1, v2, v3 are covered
Resulting mapping:
foreach g in grants, c in companies, s in contacts, m in contacts
where g.recipient = c.name and g.supervisor = s.cid and g.manager = m.cid
exists o in organizations, f in o.fundings, i in finances
where f.finId = i.finId
with c.name = o.code and g.gId = f.fId and g.amount = i.budget
15. Dominance
A2 dominates A1 (A1 ≤ A2 ) if
the from and where clauses of A1 are subsets of those of A2 (after
suitable renaming)
A2 : from g in grants, c in companies, s in contacts, m in contacts
where g.recipient = c.name and
g.supervisor = s.cid and
g.manager = m.cid
A1 : from g in grants, c in companies
where g.recipient = c.name
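The dominance test can be sketched as a brute-force search for the "suitable renaming". This is only an illustration under assumptions: generators are restricted to root-level sets (enough for the relational source used here), a renaming is taken to be any map from the variables of A1 to those of A2, and the names `eq`, `rename` and `dominance_renaming` are hypothetical.

```python
from itertools import product


def eq(left, right):
    """Build an equality such as eq("g.recipient", "c.name") as an unordered pair."""
    return frozenset([tuple(left.split(".")), tuple(right.split("."))])


def rename(equality, h):
    """Apply the variable renaming h to both sides of an equality."""
    return frozenset((h[var], attr) for var, attr in equality)


def dominance_renaming(a1, a2):
    """Return a renaming h witnessing A1 <= A2, or None when A2 does not dominate A1."""
    gens1, where1 = a1
    gens2, where2 = a2
    vars1 = [v for v, _ in gens1]
    vars2 = [v for v, _ in gens2]
    gens2_set = set(gens2)
    # Try every assignment of A1's variables to A2's variables.
    for image in product(vars2, repeat=len(vars1)):
        h = dict(zip(vars1, image))
        from_ok = all((h[v], s) in gens2_set for v, s in gens1)
        where_ok = all(rename(e, h) in where2 for e in where1)
        if from_ok and where_ok:
            return h
    return None


A2 = (
    [("g", "grants"), ("c", "companies"), ("s", "contacts"), ("m", "contacts")],
    {eq("g.recipient", "c.name"), eq("g.supervisor", "s.cid"), eq("g.manager", "m.cid")},
)
A1 = (
    [("g", "grants"), ("c", "companies")],
    {eq("g.recipient", "c.name")},
)

print(dominance_renaming(A1, A2))
print(dominance_renaming(A2, A1))
```

The search is exponential in the number of variables of A1, which is acceptable for a sketch because associations are small in practice.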
16. Coverage of a correspondence
A correspondence v : foreach PS exists PT with eS = eT
is covered by a pair of associations <AS , AT> if PS ≤ AS and PT ≤ AT
with some renaming h, h’
Example:
AS : from c in companies
AT : from o in organizations
v : foreach c in companies
exists o in organizations
with c.name = o.code
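A rough sketch of how the with clause could be assembled from the correspondences that a pair <AS, AT> covers. It simplifies coverage to "the pair contains a generator over the set that each end of the correspondence refers to" and keeps one coverage when several exist; the function name, the tuple encoding and the dotted set paths are assumptions made for the illustration.

```python
def covered_equalities(source_assoc, target_assoc, correspondences):
    """For each correspondence covered by <AS, AT>, emit one with-clause equality."""
    equalities = {}
    for name, (s_set, s_attr, t_set, t_attr) in correspondences.items():
        s_vars = [v for v, s in source_assoc if s == s_set]
        t_vars = [v for v, s in target_assoc if s == t_set]
        if s_vars and t_vars:
            # Keep the first coverage; several may exist (e.g. two contacts variables).
            equalities[name] = f"{s_vars[0]}.{s_attr} = {t_vars[0]}.{t_attr}"
    return equalities


# Generators are (variable, path of the set in the schema).
source_assoc = [("c", "companies"), ("g", "grants")]
target_assoc = [("o", "organizations"), ("f", "organizations.fundings"), ("i", "finances")]

# Correspondences as (source set, source attribute, target set, target attribute).
correspondences = {
    "v1": ("companies", "name", "organizations", "code"),
    "v2": ("grants", "gid", "organizations.fundings", "fid"),
    "v3": ("grants", "amount", "finances", "budget"),
    "v4": ("contacts", "phone", "finances", "phone"),
}

for name, equality in covered_equalities(source_assoc, target_assoc, correspondences).items():
    print(name, ":", equality)
```

With this source association v4 is not covered, because no generator ranges over contacts; only v1, v2 and v3 contribute equalities.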
17. Mapping Generation
Source Schema Generate all possible associations within the Source
Structural Associations
Target Schema Generate all possible associations within the Target
Logical Associations
Build larger associations in Source (AS) and Target (AT)
Use a pair <AS, AT> and the Correspondences covered by <AS, AT> to generate a
Clio Mapping: foreach AS exists AT with W
W is the conjunction of equalities h(eS) = h’(eT) (captured from correspondences)
18. Mapping Generation
Source Schema Generate all possible associations within the Source
Structural Associations
Target Schema Generate all possible associations within the Target
Logical Associations
Build larger associations in Source (AS) and Target (AT)
Use a pair <AS, AT> and the Correspondences covered by <AS, AT> to generate a
Clio Mapping: foreach AS exists AT with W
W is the conjunction of equalities h(eS) = h’(eT) (captured from correspondences)
Add the Clio Mapping to the Set of Mappings
19. Logical associations are meaningful
combinations of correspondences
Finds maximal sets of correspondences
that can be interpreted together
Discard the “larger” mapping
Generate a Clio mapping
20. Outline
The Motivating Example
1. Schema Mapping Generation
Mapping generation algorithm
2. Data Exchange
Query generation algorithm
Conclusions
21. Query generation for data exchange
[Figure: source schema and target schema, connected first by mapping generation and then by query generation]
22. Overview of Query Generation
Input: A Clio Mapping
1. A Query Graph is constructed, which represents the key portions of the query
2. Annotate the graph to generate Skolem terms
3. Traverse the graph and produce the query
Output: the data exchange Query
(in SQL, XQuery, or XSLT)
23. 1. Constructing the Query Graph
Adding a node for each variable in the exists clause
y0 (organizations), y1 (fundings), y2 (finances)
24. 1. Constructing the Query Graph (cont.)
Adding nodes for all the atomic type elements reachable from these nodes via record projection
y0 (organizations): y0.code, y0.year
y1 (fundings): y1.fid, y1.finId
y2 (finances): y2.finId, y2.budget, y2.phone
25. 1. Constructing the Query Graph (cont.)
Add structural edges to reflect the relationships between nodes
y0 (organizations) → y0.code, y0.year, y1 (fundings)
y1 (fundings) → y1.fid, y1.finId
y2 (finances) → y2.finId, y2.budget, y2.phone
26. 1. Constructing the Query Graph (cont.)
Add the source nodes for all source expressions in the with clause
Source nodes: x0.name, x1.gid, x1.amount, x2.phone
27. 1. Constructing the Query Graph (cont.)
Attach the source nodes to the target nodes to which they are “equal”
x0.name to y0.code
x1.gid to y1.fid
x1.amount to y2.budget
x2.phone to y2.phone
28. 1. Constructing the Query Graph (cont.)
Use the equalities in the where clause to add edges between target nodes
y1.finId to y2.finId
29. 2. Annotating the Graph
Each node is annotated with a set of source expressions
Upward propagation: Every expression that a node acquires is propagated
to its parent node, unless the (acquiring) node is a variable.
[Figure: the query graph with annotations after upward propagation]
30. 2. Annotating the Graph (cont.)
Downward propagation: Every expression that a node acquires is
propagated to its children
[Figure: the query graph with annotations after downward propagation]
31. 2. Annotating the Graph (cont.)
Eq. propagation: Every expression that a node acquires is propagated to
the nodes related to it through equality edges.
[Figure: the query graph with annotations after equality propagation]
32. 2. Annotating the Graph (cont.)
Apply the rules until no more rules can be applied
[Figure: the fully annotated query graph]
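The three propagation rules can be sketched as a fixpoint computation over the example graph. The encoding below (a `parent` map, a set of variable nodes, a `source_links` map and a list of equality edges) is an assumption made for the illustration, and skipping children that are attached to a source node follows the speaker notes rather than the slide text.

```python
# Structural edges: child -> parent.
parent = {
    "y0.code": "y0", "y0.year": "y0", "y1": "y0",
    "y1.fid": "y1", "y1.finId": "y1",
    "y2.finId": "y2", "y2.budget": "y2", "y2.phone": "y2",
}
variables = {"y0", "y1", "y2"}
# Target nodes attached to the source node they are "equal" to.
source_links = {
    "y0.code": "x0.name",
    "y1.fid": "x1.gid",
    "y2.budget": "x1.amount",
    "y2.phone": "x2.phone",
}
equality_edges = [("y1.finId", "y2.finId")]


def annotate(parent, variables, source_links, equality_edges):
    """Apply upward, downward and equality propagation until nothing changes."""
    nodes = set(parent) | set(parent.values())
    ann = {n: set() for n in nodes}
    for node, expr in source_links.items():
        ann[node].add(expr)
    children = {n: [c for c, p in parent.items() if p == n] for n in nodes}
    neighbours = {n: set() for n in nodes}
    for a, b in equality_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    changed = True
    while changed:
        changed = False
        for n in nodes:
            targets = []
            if n not in variables and n in parent:
                targets.append(parent[n])  # upward, unless n is a variable
            # downward, skipping children already attached to a source node
            targets += [c for c in children[n] if c not in source_links]
            targets += neighbours[n]  # along equality edges
            for t in targets:
                if not ann[n] <= ann[t]:
                    ann[t] |= ann[n]
                    changed = True
    return ann


annotations = annotate(parent, variables, source_links, equality_edges)
for node in sorted(annotations):
    print(node, sorted(annotations[node]))
```

Because every rule only adds expressions, the loop reaches the same final annotation whatever order the nodes are visited in.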
33. 3. Generation of Transformation Queries
Generate the query fragment:
The foreach clause is converted to a query fragment:
34. 3. Generation of Transformation Queries
Perform a depth-first traversal on the Graph
[Figure: the fully annotated query graph]
35. 3. Generation of Transformation Queries
[Figure: the fully annotated query graph]
36. Finally we have the Query:
37. Clio: Conclusion
Providing tools that help in automating and managing the
problem of Data Conversion
The key contributions of Clio:
Schema mapping generation
Mapping as a query discovery problem
Capable of mapping between relational and nested schemas
Query generation for data exchange
SQL, XQuery, XSLT, generating Skolems,...
39. Back ups
Clio Requirements
Complex mappings: using association
Definitions:
Mapping language
Paths
Schema&Types
Dominance
Query Generation Challenges, the problem of Recursion in XML schema
Nested Referential Integrity (NRI) constraints
The Chase
40. The Clio project: overview of the requirements
[Figure: source schema S and target schema T related by a Schema Mapping; the data on each side “conforms to” its schema; query Q]
No assumptions about the schemas
A general mapping language
Mapping at different levels of granularity
Incremental mapping algorithms
Capable of mapping between relational schemas and nested schemas
41. Formalize correspondences
Using tuple generating dependencies (tgds):
v1: ∀n,d,y Companies(n,d,y) → ∃y',F Organizations(n,y',F)
v2: ∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃y',F,f Organizations(n,y',F), F(g,f)
v3: ∀g,r,a,s,m Grants(g,r,a,s,m) → ∃f,p Finances(f,a,p)
v4: ∀c,e,p Contacts(c,e,p) → ∃f,b Finances(f,b,p)
42. Correspondences alone are not enough
How should individual data values be connected in the target?
Source instance:
Companies
Name  Address  Year
MS    SA       1976
AT&T  TX       1980
IBM   NY       1955
Grants
GId  Rec.t  Amt
301  MS     30
302  MS     40
303  IBM    30
Target instance:
Organizations (Code): MS, AT&T, IBM
Fundings (FId): 301, 302 (not connected to a specific organization)
43. More complex mappings are needed
The "association" between companies and grants in
the source is suggested by f1 (a foreign key)
v2: ∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) →
∃y',F,f Organizations(n,y',F), F(g,f)
Source instance:
Companies
Name  Address  Year
MS    SA       1976
AT&T  TX       1980
IBM   NY       1955
Grants
GId  Rec.t  Amt
301  MS     30
302  MS     40
303  IBM    30
Target instance:
Organizations
Code  Fundings (FId)
MS    301, 302
AT&T  (none)
IBM   303
44. Yet more complex...
v3: ∀g,r,a,s,m Grants(g,r,a,s,m) → ∃f,p Finances(f,a,p)
∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) →
∃y',F,f,p Organizations(n,y',F), F(g,f), Finances(f,a,p)
• Three tuples are generated for each pair of related companies and grants
• The mapping specifies that there exists an f, appearing in two places, without saying what its value must be
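One way to give the shared existential f a concrete value is a Skolem term over the source values, in the spirit of the query generation slides. The sketch below is an illustration under assumptions and is not the query Clio produces: the function names and the Skolem function names `Sk_finId` and `Sk_year` are hypothetical, the argument lists are chosen for the example, and phone is left as a null because this mapping says nothing about it.

```python
def skolem(name, *args):
    """Deterministic Skolem term: equal arguments always yield the same value."""
    return f"{name}({', '.join(map(str, args))})"


def exchange(companies, grants):
    """Materialize a target instance for the companies/grants mapping."""
    organizations = {}
    finances = []
    for c in companies:
        for g in grants:
            if c["name"] != g["recipient"]:
                continue
            # The same term is used in Fundings and in Finances, which links them.
            fin_id = skolem("Sk_finId", c["name"], g["gid"], g["amount"])
            org = organizations.setdefault(
                c["name"],
                {"code": c["name"], "year": skolem("Sk_year", c["name"]), "fundings": []},
            )
            org["fundings"].append({"fid": g["gid"], "finid": fin_id})
            finances.append({"finid": fin_id, "budget": g["amount"], "phone": None})
    return list(organizations.values()), finances


companies = [{"name": "MS"}, {"name": "AT&T"}, {"name": "IBM"}]
grants = [
    {"gid": 301, "recipient": "MS", "amount": 30},
    {"gid": 302, "recipient": "MS", "amount": 40},
    {"gid": 303, "recipient": "IBM", "amount": 30},
]

orgs, fins = exchange(companies, grants)
for org in orgs:
    print(org["code"], [f["fid"] for f in org["fundings"]])
```

Grouping by company name is what nests both MS grants under a single organization record; a company without grants, such as AT&T, produces nothing under this mapping alone.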
45. Yet more complex...
v4: ∀c,e,p Contacts(c,e,p) → ∃f,b Finances(f,b,p)
• How do we obtain the phone to be put in finances?
• Is it the supervisor's one or the manager's?
• FKs suggest either (or even both)
• Human intervention is needed to choose
46. The Mapping Language- Syntax
foreach x1 in g1, . . . , xn in gn
where B1
exists y1 in g'1, . . . , ym in g'm
where B2
with e1 = e'1 and . . . and ek = e'k
xi in gi (generator)
•xi variable
•gi set (either the root or a set nested within it)
B1 conjunction of equalities over the xi variables
e1 = e'1 … equalities between a source expression and a target expression
The example:
foreach c in companies, g in grants
where c.name = g.recipient
exists o in organizations,
f in o.fundings,
i in finances
where f.finId = i.finId
with o.code = c.name
and f.fId = g.gId
and i.budget = g.amount
47. Primary and Relative paths
Primary path (given a schema root R, that is a first level
element in the schema):
x1 in g1, x2 in g2, …, xn in gn
where g1 is an expression on R (just R?), and gi (for i ≥ 2) is an expression
on xi-1
Examples
c in companies
o in organizations, f in o.fundings
Relative path with respect to a variable x
x1 in g1, x2 in g2, …, xn in gn
where g1 is an expression on x, and gi (for i ≥ 2) is an expression on xi-1
Example
f in o.fundings
48. Schema and types
A schema: a sequence of labels(roots) each with associated
type, defined by this grammar:
Complex types
Atomic types A set type
All and choice model-groups
Repeated elements
Instances: associates each schema root a value
A value for atomic types
setID
An unordered tuple of pairs
A pair
50. the data exchange problem
51. Query generation challenges
1. Creation of New Values in the Target
Optional: Null
name
salary
spouse
dateofbirth
Not nullable: one-to-one Skolem function But if it is emp ID
52. Query generation challenges
1. Creation of New Values in the Target
Referential constraints
54. Query generation challenges
3. Value Creation interacts with Grouping
55. Recursion in XML schema
56. the Chase
Given an association, repeatedly apply a chase rule to the "current"
association (initialized as the input one)
If there is an NRI constraint
foreach X exists Y where B
such that the "current" association contains X and does not contain a Y that
satisfies B
then add Y to the generators and B to the where clause
Example. If we start with
from g in grants
then we have to add various components and obtain
from g in grants, c in companies,
s in contacts, m in contacts
where g.recipient = c.name and
g.supervisor = s.cid and
g.manager = m.cid
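The chase step above can be sketched for the relational source as follows. This is a simplified illustration, not Clio's algorithm: constraints are restricted to flat foreign keys of the form (source set, source attribute, target set, target attribute), nested paths are ignored, and the fresh variable names (`c1`, `c2`, ...) are generated by a hypothetical counter.

```python
def chase(generators, where, constraints):
    """Chase an association with foreign-key style constraints until no rule applies."""
    gens = list(generators)
    conds = set(where)
    counter = 0
    changed = True
    while changed:
        changed = False
        for var, rel in list(gens):
            for src, src_attr, dst, dst_attr in constraints:
                if rel != src:
                    continue
                # Is there already a Y in the association that satisfies B?
                satisfied = any(
                    r2 == dst and ((var, src_attr), (v2, dst_attr)) in conds
                    for v2, r2 in gens
                )
                if not satisfied:
                    counter += 1
                    fresh = f"{dst[0]}{counter}"  # hypothetical fresh variable name
                    gens.append((fresh, dst))
                    conds.add(((var, src_attr), (fresh, dst_attr)))
                    changed = True
    return gens, conds


# f1, f2, f3 from the motivating example.
constraints = [
    ("grants", "recipient", "companies", "name"),
    ("grants", "supervisor", "contacts", "cid"),
    ("grants", "manager", "contacts", "cid"),
]

gens, conds = chase([("g", "grants")], set(), constraints)
print("from", ", ".join(f"{v} in {r}" for v, r in gens))
for (lv, la), (rv, ra) in sorted(conds):
    print(f"  {lv}.{la} = {rv}.{ra}")
```

Starting from `from g in grants`, the loop adds one companies variable and two contacts variables, matching the association shown in the example up to variable names. The loop terminates here because the three constraints are acyclic.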
57. Clio: Analysis and Conclusion
Termination and Complexity of the Chase:
the Chase with general dependencies may not terminate
Cyclic dependencies
NRIs: A weakly acyclic set
the number of Chase steps is polynomial
Conclusion
58. Clio mapping
A Clio mapping: for each AS exists AT with E
AS , AT : logical associations (on source and target, resp.)
E a conjunction of equalities:
for each correspondence v in C covered by <AS , AT> ,
E includes the equality h(eS )=h(eT ) which is the result of the coverage,
for one of the coverages
59. Structural Association
Structural association:
− from P (with P primary path)
Starts from the Root of the schema
[Figure: the source schema (Companies, Grants, Contacts) and the target schema (Organizations with nested Fundings, Finances)]
60. Nested Referential Integrity (NRI) constraints
The basis for discovery of associations: capture relational foreign key and
referential constraints as well as XML keyref constraints:
foreach P1 exists P2 where B
P1 is a primary path, e.g. o in organizations, f in o.fundings
P2 is a primary path or a relative path with respect to a
variable in P1, e.g. f in o.fundings
B is a conjunction of equalities
between an expression on a variable of P1
and an expression on a variable of P2
Example (f4):
foreach o in organizations, f in o.fundings
exists i in finances
where f.finId = i.finId
61. Logical Association
Logical association: semantic relationships between schema
elements
Obtained by starting with a structural association
and "chasing" NRI constraints
62. Logical Association- the Chase
start with a structural association
[Figure: the example schemas with constraints f1 to f4 and correspondences v1 to v4]
63. Logical Association Relationships
A2 dominates A1 (A1 ≤ A2 ) if
the from and where clauses of A1 are subsets of those of A2 (after
suitable renaming)
A2 : from g in grants, c in companies, s in contacts, m in contacts
where g.recipient = c.name and
g.supervisor = s.cid and
g.manager = m.cid
A1 : from g in grants, c in companies
where g.recipient = c.name
64. Mapping Generation Algorithm
Inputs: S , T , Correspondences
Logical associations are meaningful combinations of correspondences
Generate all Logical Associations : AS , AT
Which correspondences can be interpreted together?
For each suitable pair <AS , AT>: find the correspondences covered by the pair
with some renaming <h,h’>, Check for dominance
Generate Clio Mapping: foreach AS exists AT with W
W is the equality h(eS) = h’(eT)
Add the Clio Mapping to the Set of Mappings
Output: the set of Schema Mappings
Example:
AS : from c in companies
AT : from o in organizations
M: foreach c in companies
exists o in organizations
with c.name = o.code
Editor's notes
Providing tools that help in automating and managing the problem of data conversion. This relies on schema mappings (specifications that describe the relationship between data in two different schemas), used to transform data between two different representations. Schema mappings are used to generate a view that reformulates queries (Data Integration) or code that transforms data (Data Exchange).
Contributions of the paper
Information about companies and grants. With a nested relational representation one can present both relational and XML schemas. Schema S is a relational schema with 3 tables: companies, grants and contacts. A grant has a grant identifier, a recipient (the name of the company that receives it), and an amount. The green lines are referential constraints: foreign keys or dependencies. The target is the XML schema: the funding that an organization receives is nested within the organization record. Dashed arrows are correspondences: the relationships between the schemas, which may be given by a schema matcher, or we can ask the user to draw these lines. v1: the company name in the first schema refers to the organization code in the second schema. Why is there no line between the two Year elements? They are 2 different concepts: the time the company was founded vs. the time it had its initial public offering. Their approach does not care how these correspondences are created, but takes into account that matchings are incomplete and sometimes incorrect. For simplicity, assume these 4 correspondences are correct.
Correspondence can be formally expressed using tuple generating dependency(tgd) Using shared variables: for each company there must be an organization whose code is the same as companies.name All the shared variables are underlined
foreach xi in gi (generator): xi is a variable and gi is a set (either the root or a set nested within it). where B1: a conjunction of equalities over the xi variables. with e1 = e'1 …: equalities between a source expression and a target expression. The mapping as a source-to-target constraint: "the result of QT (over the target, projected as in the with clause) must contain the result of QS (over the source, projected as in the with clause)".
Logical association: An association obtained by "chasing" constraints (starting with a structural or a user association). Logical associations are meaningful combinations of correspondences. A set of correspondences can be interpreted together if there are two logical associations (one in the source and one in the target) that cover them.
v1: ∀n,d,y Companies(n,d,y) → ∃y',F Organizations(n,y',F); v2: ∀n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃y',F,f Organizations(n,y',F), F(g,f); v3: ∀g,r,a,s,m Grants(g,r,a,s,m) → ∃f,p Finances(f,a,p); v4: ∀c,e,p Contacts(c,e,p) → ∃f,b Finances(f,b,p)
A set of correspondences can be interpreted together if there are two logical associations (one in the source and one in the target) that cover them
The schema mapping specifies how the data of two schemas relate to each other. For data exchange, an instance of the source schema must be transformed to an instance of the target schema. Note that the schema mapping might not contain all the target values, and may not specify the grouping/nesting semantics for target data.
When one schema is XML, Clio can generate a data exchange query in XQuery or XSLT. The paper describes how to generate XQuery; SQL is similar, without the nested elements.
Obvious relationships
finally
Annotation to facilitate generation of Skolem functions These source elements will be the arguments of the potential skolem functions
Every expression that a node acquires is propagated to its children if they do not already have it and if they are not equal to any of the source nodes. Annotation to facilitate generation of Skolem functions These source elements will be the arguments of the potential skolem functions
This is straightforward: Clio binds one variable to each term and adds the conditions in the where clause. The result is denoted Q S M1. It is not the complete query because it does not have the result yet. It will be used repeatedly in the larger query.
It will be used repeatedly in the larger query. The traversal starts at the target schema root in the query graph and proceeds depth first. If a node is a complex type element (like y1 fundings), the element is generated by visiting the children. If the node is an atomic type and is linked to a source node (like y1.fid), a simple element is created with the value equal to the source. Otherwise, if it is an optional element, nothing is generated; if it is a nullable element, a null value is generated; else (like y1.finId) a value is generated using a new Skolem function, with all the arguments that annotate the node (taking care that all the nodes equal to this node receive the same Skolem function name). If the node is a variable, a For-Where-Return query is produced: copy Q S M1 (the query fragment), rename all the variables, and compare the annotation with that of its parent variable; for each common expression a correlated subquery is generated.
The paths in an NRI require matchings to determine the variables in the path. Finding them is exponential in the size of the path, which is usually small, and some matchings are ruled out by schema restrictions. In the worst case a chase step can take exponential time, because there can be multiple ways of matching a variable in a path.
Clio provides tools that help automate and manage the problem of data conversion. It makes no assumptions about the schemas, their relationships, or how they were created. The mapping language is more general than those of TSIMMIS and Information Manifold. It can map between relational schemas and nested schemas. Mappings work at different levels of granularity: fine-grained mappings, such as translating a salary from francs to dollars, and broader concepts (documents from one schema to the other schema). The mapping algorithms are incremental: sometimes the complete mapping is not the goal (we want a single concept to be mapped), or we have only partial knowledge of the schemas, so incomplete mappings are supported as well.
A correspondence can be formally expressed as a tuple generating dependency (tgd) using shared variables: for each company there must be an organization whose code is the same as the company's name. All the shared variables are underlined.
Correspondences alone do not specify how individual values should be connected in the target. For example, fundings is nested inside organizations, which means there is a semantic association between them. To know about this association in the target, we should look for an association between organization information and funding information in the source. One such association is f1: each grant is associated with a company. Thus, in the target, we can associate a set of fundings with each organization. The algorithm uses logical inference to find all associations represented by referential constraints and by a schema's relational and nesting structure.
F is a set identifier: the set of fundings that an Organizations tuple has. This mapping tells us what must be true in the target when a pattern occurs in the source data: if we join a grant and a company, there must be an organization with the name of the company as its code, and a funding inside it with fid equal to gid.
v3 alone does not recognize that grant amounts are associated with specific gids. Using f4, a better mapping is the one shown here.
To complete our example, consider v4. There are two ways to associate the grant amount (budget) with a phone: using f2 (the supervisor's phone) or f3 (the manager's phone).
Consider this simple mapping. An employee in the source has atomic elements A, B, and C. The employee record in the target has A’, B’, C’, and an extra element E’. A and B are mapped to A’ and B’, but C’ and E’ are left unmapped. What should the values of C’ and E’ be? 1. When neither is used in the schema as a constraint, creating a null value is sufficient. 2. If E’ is a key in the target (not nullable and not optional, like an employee id), values are created using a one-to-one Skolem function; E’ depends only on A and B, not on C.
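The second case can be illustrated with a small sketch. It assumes a hypothetical `Skolem` helper and `map_employee` function (neither is part of Clio) and invented sample records. The point is only that the generated E’ is determined by A and B alone.

```python
from itertools import count

class Skolem:
    """One-to-one Skolem function: equal arguments give the same fresh value,
    different arguments give different values."""
    def __init__(self, name):
        self.name = name
        self._values = {}
        self._ids = count(1)

    def __call__(self, *args):
        if args not in self._values:
            self._values[args] = f"{self.name}_{next(self._ids)}"
        return self._values[args]

sk_e = Skolem("E")

def map_employee(src, e_is_key):
    """Map a source employee {A, B, C} to a target record {A', B', C', E'}."""
    target = {"A'": src["A"], "B'": src["B"], "C'": None}  # C' unmapped: null
    # E' is null unless it is a key, in which case it is Skolemized on (A, B) only.
    target["E'"] = sk_e(src["A"], src["B"]) if e_is_key else None
    return target

r1 = map_employee({"A": "Ann", "B": "Sales", "C": 10}, e_is_key=True)
r2 = map_employee({"A": "Ann", "B": "Sales", "C": 99}, e_is_key=True)
r3 = map_employee({"A": "Bob", "B": "Sales", "C": 10}, e_is_key=True)
```

Here `r1` and `r2` differ only in C and therefore share the same E’, while `r3` receives a different one.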
E’ is the reference, page 224
Target schema contains two levels.
One reason for using XSLT is that there is no efficient, robust implementation of XQuery today. I give the size of the largest schemas and some idea of compilation/interpretation times.
Primary path (given a schema root R, that is, a first-level element in the schema): x_1 in g_1, x_2 in g_2, …, x_n in g_n, where g_1 is an expression on R (just R?) and g_i (for i ≥ 2) is an expression on x_{i-1}. Examples: c in companies; o in organizations, f in o.fundings.
Relative path with respect to a variable x: x_1 in g_1, x_2 in g_2, …, x_n in g_n, where g_1 is an expression on x and g_i (for i ≥ 2) is an expression on x_{i-1}. Example: f in o.fundings.
Given an association, repeatedly apply a chase rule to the "current" association (initialized as the input one): if there is an NRI constraint foreach X exists Y where B such that the "current" association contains X and does not contain a Y that satisfies B, then add Y to the generators and B to the where clause. Example: if we start with from g in grants, then we have to add various components and obtain from g in grants, c in companies, s in contacts, m in contacts where g.recipient = c.name and g.supervisor = s.cid and g.manager = m.cid.
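The chase on the grants example can be sketched as follows. This is a deliberately simplified illustration, not Clio's algorithm: each constraint is flattened to a single attribute equality between two top-level collections, and the `NRI` class, its field names, and the fixed variable name per constraint are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NRI:
    src_coll: str   # collection that X ranges over
    src_attr: str
    tgt_coll: str   # collection that Y ranges over
    tgt_attr: str
    var: str        # hypothetical: variable name to use for the added Y

def chase(generators, constraints):
    """Apply the chase rule until no constraint adds a new generator."""
    gens = dict(generators)   # variable -> collection
    conds = []                # accumulated where-clause conditions
    changed = True
    while changed:
        changed = False
        for x, coll in list(gens.items()):
            for nri in constraints:
                cond = f"{x}.{nri.src_attr} = {nri.var}.{nri.tgt_attr}"
                if nri.src_coll != coll or cond in conds:
                    continue  # not applicable, or a satisfying Y already exists
                gens[nri.var] = nri.tgt_coll   # add Y to the generators
                conds.append(cond)             # add B to the where clause
                changed = True
    return gens, conds

constraints = [
    NRI("grants", "recipient", "companies", "name", "c"),   # f1
    NRI("grants", "supervisor", "contacts", "cid", "s"),    # f2
    NRI("grants", "manager", "contacts", "cid", "m"),       # f3
]
gens, conds = chase({"g": "grants"}, constraints)
```

Starting from g in grants, the sketch ends with generators g, c, s, m and the three join conditions from the example. Because variable names are fixed per constraint, it does not handle two variables ranging over the same source collection or nested paths.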
NRIs capture relational foreign key and referential constraints as well as XML keyref constraints. Referential integrity is essential in this approach as the basis for the discovery of "associations". Given the nested model, NRIs need a rather complex definition, built on primary paths and relative paths.
Logical association: an association obtained by "chasing" constraints (starting with a structural or a user association). Logical associations are meaningful combinations of correspondences. A set of correspondences can be interpreted together if there are two logical associations (one in the source and one in the target) that cover them.