Provenance for Nested Subqueries
Boris Glavic
Database Technology Group
Department of Informatics
University of Zurich
glavic@ifi.uzh.ch
Gustavo Alonso
Systems Group
Department of Computer Science
ETH Zurich
alonso@inf.ethz.ch
Overview
1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion
1. Introduction
Query
- Which input data item(s) influenced which output data item(s)?
- Granularity
  - Tuple
  - Attribute value
  - ...
- Contribution semantics
  - Influence (Lineage / Why)
  - Copy (Where)
  - ...
1. Introduction
- Most application domains that benefit from provenance use complex queries
  - Subqueries
    - Correlated
    - Nested
- Not supported by existing systems
  - Semantics not clear
  - Complex computation
1. Introduction
- Steps to solve this problem
  1. Establish sound semantics for the provenance of subqueries
  2. Algorithms for subquery provenance computation
  3. Integrate the algorithms into a provenance management system (Perm)
1. Introduction
- Definition of contribution semantics
  - Why/Influence-provenance
    - Introduced in [Cui, Widom ICDE '00]
    - Provenance represented as a list of subsets of the input relations
    - Defined for a single algebra operator and a single result tuple
1. Introduction
- Definition 1: For a single algebra operator $op$ with input relations $T_1, \ldots, T_n$, a list $(T_1^*, \ldots, T_n^*)$ of maximal subsets of the input relations is the provenance of a tuple $t$ from the result of $op$ iff:
  1. $op(T_1^*, \ldots, T_n^*) = \{t\}$
  2. For all $i$ and all $t^* \in T_i^*$: $op(T_1^*, \ldots, T_{i-1}^*, \{t^*\}, T_{i+1}^*, \ldots, T_n^*) \neq \emptyset$
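The two conditions of Definition 1 can be checked mechanically for a small case. The sketch below is only an illustration under stated assumptions: it handles a single unary operator (a set-semantics projection), the relation `R` and all function names are hypothetical, and it searches by brute force for a largest qualifying subset, which is only practical for tiny inputs.

```python
from itertools import combinations


def project_a(rel):
    """Set-semantics projection on the first attribute."""
    return {(row[0],) for row in rel}


def satisfies_definition_1(op, candidate, t):
    """Check both conditions of Definition 1 for a unary operator op."""
    # Condition 1: the candidate produces exactly t and nothing else.
    produces_exactly_t = op(candidate) == {t}
    # Condition 2: every tuple of the candidate alone yields a non-empty result.
    every_tuple_contributes = all(op({t_star}) for t_star in candidate)
    return produces_exactly_t and every_tuple_contributes


def provenance(op, rel, t):
    """Brute force: return a largest subset of rel that satisfies Definition 1."""
    rows = sorted(rel)
    for size in range(len(rows), 0, -1):
        for subset in combinations(rows, size):
            if satisfies_definition_1(op, set(subset), t):
                return set(subset)
    return set()


# Hypothetical input relation with attributes (a, b).
R = {(1, "x"), (1, "y"), (2, "z")}
witness = provenance(project_a, R, (1,))
print(witness)
```

For the result tuple (1), both input tuples with a = 1 end up in the provenance, while (2, "z") is left out because including it would produce an additional result tuple.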
1. Introduction
- Perm
  - Provenance Extension of the Relational Model
  - Provenance Management System (PMS)
  - "Pure" relational representation of provenance
  - Provenance computation through algebraic query rewrite
  - Implemented as an extension of PostgreSQL
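1. Introduction
- Provenance representation
[Figure: a query over relations 1, 2, ..., n. The provenance result is a single table whose columns are the original result attributes followed by the attributes of relation 1 through relation n.]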
1. Introduction
- Provenance representation

Input relation R: r1, r2
Input relation S: s1
Original result of the query: t1

Original attributes   Relation R attributes   Relation S attributes
t1                    r1                      s1
t1                    r2                      s1
1. Introduction
- Provenance computation through query rewrite:
  - Given a query q, generate a query q+ that computes the provenance of q
    - Representation as defined before
  - Rewrites operate on the algebraic representation of a query
    - Rewrite rules for each operator op that transform op into an algebra statement that propagates the provenance
1. Introduction
- Rewrite rules example:

SELECT agg, G
FROM T
GROUP BY G

is rewritten into

SELECT agg, G, prov(T)
FROM
  (SELECT agg, G FROM T GROUP BY G) AS agg
  LEFT OUTER JOIN
  (SELECT G AS G', prov(T) FROM T+) AS prov
  ON G = G'
1. Introduction
- Rewrite rules example:

SELECT sum(revenue) AS sum, shop
FROM sales
GROUP BY shop

sales
shop    month  revenue
Migros  Jan    100
Migros  Feb    10
Migros  Mar    10
Coop    Jan    25
Coop    Feb    25

result
sum  shop
120  Migros
50   Coop
1. Introduction
SELECT sum, shop, pShop, pMonth, pRevenue
FROM
  (SELECT sum(revenue) AS sum, shop
   FROM sales GROUP BY shop) AS agg
  LEFT OUTER JOIN
  (SELECT shop AS shop', pShop, pMonth, pRevenue
   FROM sales+) AS prov
  ON shop = shop'

sum  shop    pShop   pMonth  pRevenue
120  Migros  Migros  Jan     100
120  Migros  Migros  Feb     10
120  Migros  Migros  Mar     10
50   Coop    Coop    Jan     25
50   Coop    Coop    Feb     25
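The rewritten aggregation query can be tried out on the sales table with SQLite from Python. This is a minimal sketch, not Perm itself: it assumes that for a base relation, sales+ is simply sales with its attributes duplicated under the provenance names pShop, pMonth and pRevenue, and it uses the hypothetical column name `shop2` in place of shop', which is not a valid SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Migros", "Jan", 100),
        ("Migros", "Feb", 10),
        ("Migros", "Mar", 10),
        ("Coop", "Jan", 25),
        ("Coop", "Feb", 25),
    ],
)

# Assumption: the inner SELECT on the right side plays the role of sales+,
# i.e. the base relation with its attributes renamed to provenance attributes.
rewritten = """
SELECT agg."sum", agg.shop, prov.pShop, prov.pMonth, prov.pRevenue
FROM (SELECT sum(revenue) AS "sum", shop FROM sales GROUP BY shop) AS agg
LEFT OUTER JOIN
     (SELECT shop AS shop2, shop AS pShop, month AS pMonth, revenue AS pRevenue
      FROM sales) AS prov
ON agg.shop = prov.shop2
"""
rows = conn.execute(rewritten).fetchall()
for row in rows:
    print(row)
```

Each aggregated result tuple is repeated once per contributing sales tuple, which matches the single-table provenance representation described earlier in the talk.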
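2. The Provenance of Subqueries
- Sublinks
  - Subqueries in e.g. the SELECT clause
  - Example: $\sigma_{a \;\mathrm{IN}\; \sigma_{b=3}(S)}(R)$
- Correlated
  - References outside attributes
  - Example: $\sigma_{a \;\mathrm{IN}\; \sigma_{b=a}(S)}(R)$
- Nested
  - Sublink that contains sublinks
  - Example: $\sigma_{a \;\mathrm{IN}\; \sigma_{b = \mathrm{ANY}(T)}(S)}(R)$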
2. The Provenance of Subqueries
- What is the provenance of a sublink according to Definition 1?
- Sublinks can be used in different contexts
  - Selection
  - Projection
  - ...
- A sublink either
  - produces exactly one value
  - or produces a boolean value
2. The Provenance of Subqueries
- In this talk: single uncorrelated ANY-sublinks in selection conditions
- For other types of sublinks, correlated sublinks, and nested sublinks: read the paper!
2. The Provenance of Subqueries
- Single uncorrelated ANY-sublinks in selection conditions
  - The result of the sublink query is fixed
  - For a given input tuple t, the sublink condition is either true or false
  - Example: $\sigma_{a = \mathrm{ANY}\; \sigma_{b=3}(S)}(R)$
2. The Provenance of Subqueries
- Some terminology
  - $T_{sub}$: the query of a sublink
  - $C_{sub}$: the conditional expression of a sublink
- Example: for $q = \sigma_{a = \mathrm{ANY}\; \Pi_b(S)}(R)$
  - $T_{sub} = \Pi_b(S)$
  - $C_{sub} = (a = \mathrm{ANY}\; \Pi_b(S))$
2. The Provenance of Subqueries
- A sublink condition $C_{sub}$ can play different roles in the condition $C$ of a selection (for one input tuple $t$):
  - Reqtrue: the selection condition is true iff $C_{sub}$ is true
  - Reqfalse: the selection condition is true iff $C_{sub}$ is false
  - Ind: the selection condition is true independent of the result of $C_{sub}$
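The three roles can be told apart by evaluating the selection condition twice for the same input tuple, once with the sublink result fixed to true and once fixed to false. The following is a small sketch of that idea; the function name `sublink_role` and the example conditions are hypothetical and not taken from Perm.

```python
def sublink_role(condition, t):
    """Classify the role of a sublink in selection condition C for tuple t.

    condition(t, csub) evaluates C with the sublink result fixed to csub.
    Returns None when t fails C for both possible sublink results.
    """
    when_true = condition(t, True)
    when_false = condition(t, False)
    if when_true and when_false:
        return "ind"
    if when_true:
        return "reqtrue"
    if when_false:
        return "reqfalse"
    return None


# Hypothetical selection conditions over a tuple t = (a,).
plain = lambda t, csub: csub                    # C = Csub
negated = lambda t, csub: not csub              # C = NOT Csub
disjunct = lambda t, csub: t[0] > 0 or csub     # C = (a > 0) OR Csub

print(sublink_role(plain, (5,)))
print(sublink_role(negated, (5,)))
print(sublink_role(disjunct, (5,)))
print(sublink_role(disjunct, (-1,)))
```

The last two calls show that the role depends on the input tuple: the same sublink is ind for a = 5 but reqtrue for a = -1.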
2. The Provenance of Subqueries
- Some more terminology
  - $T_{sub}^{true}(t)$: all tuples from the sublink query that fulfill the "unquantified" sublink condition
  - $T_{sub}^{false}(t)$: all tuples from the sublink query that do not fulfill the "unquantified" sublink condition
- Example: for $C_{sub} = (a = \mathrm{ANY}\; \sigma_{b=3}(S))$ the unquantified condition is $C_{sub}^{\circ} = (a = b)$
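2. The Provenance of Subqueries
- Back to ANY-sublinks in selections
- Proposition:
$$T_{sub}^*(t) = \begin{cases} T_{sub}^{true}(t) & \text{if } C_{sub} \text{ is reqtrue} \\ T_{sub} & \text{if } C_{sub} \text{ is reqfalse or ind} \end{cases}$$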
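The proposition translates directly into a case distinction. The sketch below assumes relations are Python sets of tuples and that the role of the sublink is already known; the function names are hypothetical, and the data is the relation S from the example used in this section, with $T_{sub} = \Pi_b(S)$ and $C_{sub}^{\circ} = (a = b)$.

```python
def split_sublink_result(t, tsub, unquantified):
    """Split Tsub into Tsub^true(t) and Tsub^false(t)."""
    t_true = {s for s in tsub if unquantified(t, s)}
    return t_true, set(tsub) - t_true


def any_sublink_provenance(t, tsub, unquantified, role):
    """Tsub*(t) for a single uncorrelated ANY-sublink in a selection."""
    t_true, _ = split_sublink_result(t, tsub, unquantified)
    if role == "reqtrue":
        return t_true
    if role in ("reqfalse", "ind"):
        return set(tsub)
    raise ValueError(f"unknown role: {role}")


S = {(1, 100), (2, 10), (4, 24)}
tsub = {(b,) for b, _ in S}                 # Tsub = projection of S on b
unquantified = lambda t, s: t[0] == s[0]    # Csub° = (a = b)

print(any_sublink_provenance((1,), tsub, unquantified, "reqtrue"))
print(any_sublink_provenance((1,), tsub, unquantified, "ind"))
```

For t = (1) and a reqtrue sublink only the matching tuple (1) is kept; for the other two roles the whole sublink result is the provenance.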
2. The Provenance of Subqueries
- Example: $q = \sigma_{a = \mathrm{ANY}\; \Pi_b(S)}(R)$

R
a
1
2
3

S
b  c
1  100
2  10
4  24

Result
a
1
2

Compute the provenance for t = (1).
2. The Provenance of Subqueries
- $q = \sigma_{a = \mathrm{ANY}\; \Pi_b(S)}(R)$
- $T_{sub} = \Pi_b(S)$
- $C_{sub}^{\circ} = (a = b)$
- $T_{sub}^{true}(t) = \{(1)\}$
- $C_{sub}$ is reqtrue
- $T_{sub}^* = T_{sub}^{true}$
2. The Provenance of Subqueries
Compute the provenance for t = (1), with $q = \sigma_{a = \mathrm{ANY}\; \Pi_b(S)}(R)$ and $C_{sub}^{\circ} = (a = b)$.

R
a
1
2
3

$T_{sub}$
b
1
2
4

$T_{sub}^{true}(t) = \{(1)\}$
2. The Provenance of Subqueries
Compute the provenance for t = (1), with $q = \sigma_{a = \mathrm{ANY}\; \Pi_b(S)}(R)$.

R
a
1
2
3

S
b  c
1  100
2  10
4  24

$T_{sub}$
b
1
2
4

Result
a
1
2

$R^*$
a
1

$T_{sub}^*$
b
1
2. The Provenance of Subqueries
- Definition 1 is ambiguous for queries with more than one sublink!

$q = \sigma_{C_1 \vee C_2}(U)$ with $C_1 = (a = \mathrm{ANY}\; R)$ and $C_2 = (a > \mathrm{ALL}\; S)$

U
a
5

R
b
1
5
100

S
c
1
5

Result
a
5

For t = (5): $C_1$ is true, $C_2$ is false.
2. The Provenance of Subqueries
$q = \sigma_{C_1 \vee C_2}(U)$ with $C_1 = (a = \mathrm{ANY}\; R)$ and $C_2 = (a > \mathrm{ALL}\; S)$, t = (5)

Two candidate provenances for t, and the sublink results they produce:

            U* (a)  R* (b)  S* (c)  C1     C2
Solution 1  5       5       1, 5    true   false
Solution 2  5       1, 100  1       false  true
2. The Provenance of Subqueries
- Reasons for this ambiguity:
  - The definition requires the provenance to produce the same result
  - But not to produce the same results for the sublinks
- Consequence: Definition 1 produces false positives
2. The Provenance of Subqueries
- Solution: extend Definition 1
  - Add a third condition:
    - For each sublink: if it is evaluated for one result tuple t and a single tuple from the provenance of the sublink, it produces the same sublink result as in the original query
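The third condition can be tested on the two candidate solutions from the example in this section. The sketch below is an illustration under stated assumptions: the original sublink results (C1 true, C2 false for t = (5)) are taken as given from the slides, relations are modelled as plain Python sets of values, and all function names are hypothetical.

```python
def c1(a, r_star):
    """C1 = (a = ANY R): true if a equals at least one value."""
    return any(a == b for b in r_star)


def c2(a, s_star):
    """C2 = (a > ALL S): true if a is greater than every value."""
    return all(a > c for c in s_star)


def violating_tuples(sublink, a, sublink_provenance, original_result):
    """Tuples that, taken one at a time, change the result of the sublink."""
    return {x for x in sublink_provenance
            if sublink(a, {x}) != original_result}


a = 5  # the result tuple t = (5)
ORIGINAL = {"C1": True, "C2": False}  # sublink results in the original query

# The two solutions that Definition 1 admits.
solution_1 = {"R*": {5}, "S*": {1, 5}}
solution_2 = {"R*": {1, 100}, "S*": {1}}


def check(solution):
    """Report, per sublink, which provenance tuples break the third condition."""
    return {
        "C1": violating_tuples(c1, a, solution["R*"], ORIGINAL["C1"]),
        "C2": violating_tuples(c2, a, solution["S*"], ORIGINAL["C2"]),
    }


print(check(solution_1))
print(check(solution_2))
```

Under this reading, Solution 1 only has to drop the tuple 1 from S*, while every sublink tuple of Solution 2 changes a sublink result, so Solution 2 is rejected as a false positive.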
2. The Provenance of Subqueries
$q = \sigma_{C_1 \vee C_2}(U)$ with $C_1 = (a = \mathrm{ANY}\; R)$ and $C_2 = (a > \mathrm{ALL}\; S)$, t = (5)

            U* (a)  R* (b)  S* (c)
Solution 1  5       5       5
Solution 2  5       1, 100  1
2. The Provenance of Subqueries
- How to compute the provenance according to the extended definition?
  - Use query rewrite
    - Generic strategy (Gen)
    - Specialized strategies
      - Use un-nesting
      - Check: does not change the provenance
2. The Provenance of Subqueries
- Gen-strategy
  - For queries we cannot un-nest
  1. Join the original query with all possible provenance tuples (base relations)
  2. Rewrite the sublink query
  3. Introduce additional correlation to simulate a join between 1) and 2)
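Gen is reserved for queries that cannot be un-nested. For contrast, the simple ANY-sublink from the running example can be un-nested into a join, and the join output then already carries the contributing tuples. The sketch below only illustrates that idea with SQLite; it is not Perm's actual rewrite rule, the `prov_...` column names are hypothetical, and `IN` is used because SQLite does not support `= ANY (subquery)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER, c INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (1, 100), (2, 10), (4, 24);
""")

# Original query with the sublink: a = ANY (SELECT b FROM S).
original = conn.execute(
    "SELECT a FROM R WHERE a IN (SELECT b FROM S) ORDER BY a"
).fetchall()

# Assumed un-nested form: a join that attaches the contributing tuples
# of R and S to each result tuple as provenance attributes.
unnested = conn.execute("""
SELECT R.a, R.a AS prov_r_a, S.b AS prov_s_b, S.c AS prov_s_c
FROM R JOIN S ON R.a = S.b
ORDER BY R.a
""").fetchall()

print(original)
print(unnested)
```

For the result tuple (1) the join attaches the R tuple (1) and the S tuple (1, 100), which agrees with the provenance computed by hand in the example. Whether such an un-nesting preserves the provenance in general has to be checked per rewrite, as the slide on specialized strategies notes.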
3. Experimental Results
 TPC-H benchmark (10 MB size)
3. Experimental Results
 TPC-H benchmark (1 GB size)
4. Conclusion
- Definition 1 fails in the presence of sublinks
  - Can be extended to deal with sublinks
- Provenance computation for sublinks
  - By using query rewrites
  - Implemented in Perm
- Future work
  - Physical provenance-aware operators
Questions?

EDBT 2009 - Provenance for Nested Subqueries


Editor's notes

  • #2 If we have to shorten: remove the query rewrite example in the introduction (-3 pages). Welcome to my presentation... My from ..., together with Gustavo from ... And it's about ....
  • #3 The talk will be organised as follows: first a short introduction to our PMS Perm, then I'll show what the provenance of a subquery looks like and how it can be computed, and, as usual, a conclusion at the end.
  • #4 In the context of relational databases, the main problem can be stated as: Which input... This problem can be solved for different levels of granularity of data items: tuples, attribute values, and so on. (We are looking at tuple-level granularity.) There are also different definitions of what "influences" means (we call this contribution semantics), for example only tuples that have been copied literally from the source to the result. (We are looking at influence contribution semantics, which has also been called Why-provenance.)
  • #5 Most application domains where provenance would be important use complex queries with features like aggregation, user-defined functions, and subqueries in selections and aggregations that are possibly correlated or nested. These are not supported by existing systems. To do: add the Perm introduction before this one, talk about the ICDE paper, give reasons why it is not supported.
  • #8 Let's have a look at the contribution semantics we use for Perm. It was introduced by Cui et al. in 2000; provenance was represented as a tuple of subsets ... The definition defines the provenance of single operators (but is assumed to be transitive).
  • #9 Condition 1 means that the provenance, if used as input of op, produces exactly t and nothing else. Condition 2 means that each tuple from each part of the provenance contributes something to the result (counterexample: for a selection, all tuples that do not fulfill the selection condition obviously do not contribute to the result, but would be in the provenance if we left out condition 2). Change slides: the disadvantage is the non-relational representation of provenance, so we decided to use the same semantics with another representation:
  • #10 Perm means: provenance extension of the relational model. Uses "pure" ... Computes provenance by .... Uses influence contribution semantics with tuple-level granularity. To do: move before the introduction.
  • #11, #12 A single result table that contains all the original result attributes and the attributes from the input relations of the query. Each original result tuple is extended by attaching contributing tuples from the base relations (and thus has to be duplicated if there is more than one contributing tuple from one of the input relations).
  • #13 Perm uses query rewrite techniques to compute the provenance of a query, ... -intro introd.
  • #14-#16 Let's have a short example (the + operator transforms an operator or algebra statement into a provenance computation; P is the list of provenance attributes of an algebra expression). Here we have the rewrite rule for the aggregation operator. -SQL! Fast too to intro
  • #17 So now let's look at what happens if we introduce subqueries.
  • #18 We call them sublinks to distinguish them from "normal" subqueries used in the FROM clause. We use the algebraic representation (we will see later why). Sublinks are called correlated if... Sublinks are called nested if...
  • #19 If we want to know "What is...", we face some problems: sublinks can be used in different contexts. And we can make some observations: for a given input tuple, a sublink produces a constant value, which is either a boolean (e.g. IN, ANY, ...) or a data-type constant (subqueries without a special sublink operator).
  • #21 The definition can be ambiguous.
  • #26 This has been proven, but I will only explain the main idea behind it.
  • #27 Let's walk through an example: we have the following query and relations and are searching for the provenance of result tuple t = (1).
  • #30 The following tuples from Tsub and R fulfill condition one of the definition. (Note that 2 is not included because it would cause an additional result tuple 2.)
  • #31-#38 The definition can be ambiguous.
  • #39, #40 Now we know what the provenance of sublinks looks like, but how can we compute it using query rewrite rules?
  • #41 Exchange with results like in ICDE (whole of TPC-H, focus on sublinks).