Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information, incurring only a small overhead on normal operations.
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model through Query Rewriting
1. Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting
Boris Glavic
Database Technology Group
Department of Informatics
University of Zurich
glavic@ifi.uzh.ch
Gustavo Alonso
Systems Group
Department of Computer Science
ETH Zurich
alonso@inf.ethz.ch
2. 2
Overview
1. Introduction to Perm
2. The Perm Provenance Representation
3. Query Rewriting for Provenance
Computation
4. Perm Implementation
5. Experimental Results
6. Conclusion
3. 3
1. Introduction
Query Transformation
Data items: Result relation
Data items: Base relations
Relational Provenance
4. 4
1. Introduction
Query
Which input data item(s) influenced which output data item(s)?
Granularity
Tuple
Attribute Value
...
Contribution semantics
Influence (Why)
Copy (Where)
...
5. 5
1. Introduction
The problem of computing this type of provenance has been solved before
See e.g. [Cui, Widom ICDE ’00]
but...
Non-relational representation of provenance data
Separation of provenance and “normal” data
Non-relational computation of provenance data
6. 6
1. Introduction
Perm
Provenance Extension of the Relational Model
Provenance Management System
“Pure” relational representation of provenance
Query result tuples and provenance tuples are represented as a single relation
7. 7
1. Introduction
Benefits: Provenance can be...
... Stored in standard DBMS
... Queried using SQL
... Directly interpreted by a user
Direct association between provenance and “normal data”
8. 8
1. Introduction
Provenance Computation
-> Use query rewrite
Given query q
Generate query q+
Computes the provenance of all result tuples from q
9. 9
1. Introduction
Benefits:
Rewritten query is expressed in relational algebra
Can be optimized and executed by an RDBMS
E.g. can be stored as a view
Used as a subquery
10. 10
Overview
1. Introduction to Perm
2. The Perm Provenance Representation
3. Query Rewriting for Provenance
Computation
4. Perm Implementation
5. Results
6. Conclusion
11. 11
2. The Perm Approach
sales
sName   itemId
Migros  1
Migros  2
Migros  2
Coop    3
Coop    3

items
id  price
1   100
2   10
3   25
12. 12
2. The Perm Approach
Compute the sum of sales for each shop
SELECT sName, sum(price)
FROM sales, items
WHERE itemId = id
GROUP BY sName;
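To make the example concrete, the query above can be run against the sample `sales` and `items` instances. The following is a minimal sketch that uses Python's built-in `sqlite3` module instead of PostgreSQL or Perm; the function name `shop_totals` and the `ORDER BY` clause (added only to get a deterministic row order) are assumptions made for illustration.

```python
import sqlite3


def shop_totals():
    """Run the slide's query on the sample instances and return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sName TEXT, itemId INTEGER)")
    conn.execute("CREATE TABLE items (id INTEGER, price INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("Migros", 1), ("Migros", 2), ("Migros", 2), ("Coop", 3), ("Coop", 3)],
    )
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(1, 100), (2, 10), (3, 25)],
    )
    rows = conn.execute(
        """
        SELECT sName, sum(price)
        FROM sales, items
        WHERE itemId = id
        GROUP BY sName
        ORDER BY sName  -- added for a stable order; not part of the slide
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in shop_totals():
        print(row)
```

Each Migros sale joins with exactly one item tuple, so the Migros total is 100 + 10 + 10 and the Coop total is 25 + 25.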
13. 13
2. The Perm Approach
sales
sName   itemId
Migros  1
Migros  2
Migros  2
Coop    3
Coop    3

items
id  price
1   100
2   10
3   25

result
name    sum(price)
Migros  120
Coop    50
16. 16
2. The Perm Approach
Desired result format:
Original attributes | Relation 1 attributes | ... | Relation n attributes
17. 17
2. The Perm Approach
Original result      sales                 items
name    sum(price)   P(sName)  P(itemId)   P(id)  P(price)
Migros  120          Migros    1           1      100
Migros  120          Migros    2           2      10
Migros  120          Migros    2           2      10
Coop    50           Coop      3           3      25
Coop    50           Coop      3           3      25
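A result of this shape can be produced by hand by joining the aggregation result with the base tuples that belong to each group. The sketch below does this in SQLite through Python's `sqlite3` module; it is only an illustration of the representation, not the query Perm generates. The function name `provenance_rows` and the column aliases `total`, `pSName`, `pItemId`, `pId` and `pPrice` are assumptions.

```python
import sqlite3


def provenance_rows():
    """Return each result tuple of the sales query with its contributing base tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sName TEXT, itemId INTEGER)")
    conn.execute("CREATE TABLE items (id INTEGER, price INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("Migros", 1), ("Migros", 2), ("Migros", 2), ("Coop", 3), ("Coop", 3)],
    )
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(1, 100), (2, 10), (3, 25)],
    )
    rows = conn.execute(
        """
        SELECT agg.sName, agg.total, prov.pSName, prov.pItemId, prov.pId, prov.pPrice
        FROM
          -- the original query
          (SELECT sName, sum(price) AS total
           FROM sales, items
           WHERE itemId = id
           GROUP BY sName) AS agg
          LEFT OUTER JOIN
          -- the joined base tuples, renamed to provenance attributes
          (SELECT sName AS pSName, itemId AS pItemId, id AS pId, price AS pPrice
           FROM sales, items
           WHERE itemId = id) AS prov
          ON (agg.sName = prov.pSName)
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in sorted(provenance_rows()):
        print(row)
```

Every original result tuple is repeated once per contributing pair of `sales` and `items` tuples, which is why the output has five rows for two result tuples.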
18. 18
Overview
1. Introduction to Perm
2. The Perm Provenance Representation
3. Query Rewriting for Provenance
Computation
4. Perm Implementation
5. Results
6. Conclusion
19. 19
3. Query Rewriting for
Provenance Computation
Rewrite method basics
Use algebra representation of the query
Replace every algebra operator with an algebra statement that propagates provenance alongside the original results
-> need a rewrite rule for each relational algebra operator
20. 20
3. Query Rewriting for
Provenance Computation
Rewrite process
[Figure: operator tree of the original query with the operators op3, op1 and op2]
21. 21
3. Query Rewriting for
Provenance Computation
Rewrite process
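[Figure: original operator tree (op3, op1, op2) next to the tree after applying the rewrite rule for op1, which replaces op1 with the operators op1a, op1b and op1c]
Apply Rewrite rule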
22. 22
3. Query Rewriting for
Provenance Computation
Rewrite process
[Figure: operator tree after rewriting, with the operators op3, op1a, op1b, op1c and op2]
Apply Rewrite rules
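The rewrite process can be sketched as a recursive traversal of the operator tree that rewrites the inputs of an operator first and then applies the rule registered for that operator. This is a minimal illustration, not Perm's implementation. The class `Op`, the functions `rewrite` and `show`, the chain shape op3(op1(op2)) and the way op1 is expanded into op1a, op1b and op1c are all assumptions, since the slides do not give the tree structure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Op:
    """A node of an algebra expression tree (hypothetical helper type)."""
    name: str
    children: List["Op"] = field(default_factory=list)


Rule = Callable[[Op], Op]


def rewrite(op: Op, rules: Dict[str, Rule]) -> Op:
    """Rewrite the inputs first, then apply the rule for this operator, if any."""
    node = Op(op.name, [rewrite(child, rules) for child in op.children])
    rule = rules.get(op.name)
    return rule(node) if rule else node


def show(op: Op) -> str:
    """Render a tree as a nested expression such as op3(op1(op2))."""
    if not op.children:
        return op.name
    return f"{op.name}({', '.join(show(c) for c in op.children)})"


# Assumed rule: op1 is replaced by a small algebra statement of three operators.
RULES: Dict[str, Rule] = {
    "op1": lambda n: Op(
        "op1b",
        [Op("op1a", list(n.children)), Op("op1c", list(n.children))],
    ),
}

# Assumed shape of the original query: a chain op3 -> op1 -> op2.
QUERY = Op("op3", [Op("op1", [Op("op2")])])

if __name__ == "__main__":
    print(show(QUERY))
    print(show(rewrite(QUERY, RULES)))
```

Because each rule only looks at one operator and its already rewritten inputs, one rule per algebra operator is enough to rewrite arbitrary trees.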
23. 23
3. Query Rewriting for
Provenance Computation
Rewrite rules notation:
T+ : rewritten statement (query)
P(T+) : provenance attributes
24. 24
3. Query Rewriting for
Provenance Computation
Rewrite rules example:
(SELECT agg, G
 FROM T
 GROUP BY G)+
=
SELECT agg, G, P(T)
FROM
  (SELECT agg, G FROM T GROUP BY G) AS agg
  LEFT OUTER JOIN
  (SELECT G AS G', P(T) FROM T) AS prov
  ON (G = G')
25. 25
3. Query Rewriting for
Provenance Computation
Rewrite rules example:
SELECT sum(revenue) AS sum, shop
FROM sales
GROUP BY shop
sales
shop    month  revenue
Migros  Jan    100
Migros  Feb    10
Migros  Mar    10
Coop    Jan    25
Coop    Feb    25

result
sum  shop
120  Migros
50   Coop
26. 26
3. Query Rewriting for
Provenance Computation
Rewritten query:
SELECT sum, shop, pShop, pMonth, pRevenue
FROM
  (SELECT sum(revenue) AS sum, shop
   FROM sales GROUP BY shop) AS agg
  LEFT OUTER JOIN
  (SELECT shop AS shop2, shop AS pShop, month AS pMonth, revenue AS pRevenue
   FROM sales) AS prov
  ON (shop = shop2)

Result of the rewritten query:
sum  shop    pShop   pMonth  pRevenue
120  Migros  Migros  Jan     100
120  Migros  Migros  Feb     10
120  Migros  Migros  Mar     10
50   Coop    Coop    Jan     25
50   Coop    Coop    Feb     25
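The aggregation rewrite can also be sketched as a small function that builds the rewritten SQL text for a single-table aggregation and runs it. This is an illustration under stated assumptions, not Perm's rewriter: it handles only a base table T (so the provenance attributes are simply renamed copies of T's columns), it runs on SQLite through Python's `sqlite3` module, and the names `rewrite_aggregation`, `run_example`, `sum_revenue` and the `p_` and `join_` prefixes are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def rewrite_aggregation(agg_expr: str, agg_alias: str, group_col: str,
                        table: str, columns: List[str]) -> str:
    """Build the provenance query for
    SELECT agg_expr AS agg_alias, group_col FROM table GROUP BY group_col.

    Identifiers are pasted into the SQL text, so they must come from trusted code.
    """
    prov_defs = ", ".join(f"{c} AS p_{c}" for c in columns)
    prov_names = ", ".join(f"p_{c}" for c in columns)
    return (
        f"SELECT {agg_alias}, {group_col}, {prov_names} "
        f"FROM (SELECT {agg_expr} AS {agg_alias}, {group_col} "
        f"FROM {table} GROUP BY {group_col}) AS agg "
        f"LEFT OUTER JOIN "
        f"(SELECT {group_col} AS join_{group_col}, {prov_defs} FROM {table}) AS prov "
        f"ON (agg.{group_col} = prov.join_{group_col})"
    )


def run_example() -> List[Tuple]:
    """Apply the rewrite to the monthly sales example and execute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("Migros", "Jan", 100), ("Migros", "Feb", 10), ("Migros", "Mar", 10),
         ("Coop", "Jan", 25), ("Coop", "Feb", 25)],
    )
    sql = rewrite_aggregation("sum(revenue)", "sum_revenue", "shop",
                              "sales", ["shop", "month", "revenue"])
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in sorted(run_example()):
        print(row)
```

The left outer join keeps every aggregation result tuple and attaches to it all input tuples that fall into the same group.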
29. 29
Overview
1. Introduction to Perm
2. The Perm Provenance Representation
3. Query Rewriting for Provenance
Computation
4. Perm Implementation
5. Results
6. Conclusion
30. 30
4. Perm Implementation
Extension of PostgreSQL DBMS
Implemented inside of PostgreSQL
-> does not affect client applications
Extended SQL language
Perm module
Implements algebraic rewrite rules as query rewrites
31. 31
4. Perm Implementation
SQL-PLE: SQL extension
SELECT PROVENANCE ...
Nice benefits:
CREATE VIEW x AS SELECT PROVENANCE ...
SELECT PROVENANCE ... INTO x ...
SELECT ... FROM (SELECT PROVENANCE ...
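The `SELECT PROVENANCE` keyword exists only in Perm, so it cannot be tried on a stock database. Because the rewritten query is ordinary SQL, though, the view and subquery benefits can be illustrated with a hand-rewritten query. The sketch below assumes SQLite via Python's `sqlite3` module and the monthly sales example; the view name `sales_prov`, the function name and the column aliases are hypothetical.

```python
import sqlite3

# Hand-written stand-in for what a provenance rewrite of
#   SELECT sum(revenue), shop FROM sales GROUP BY shop
# could look like; Perm would generate such a query itself.
PROVENANCE_QUERY = """
    SELECT sum_revenue, shop, p_shop, p_month, p_revenue
    FROM (SELECT sum(revenue) AS sum_revenue, shop
          FROM sales GROUP BY shop) AS agg
         LEFT OUTER JOIN
         (SELECT shop AS p_shop, month AS p_month, revenue AS p_revenue
          FROM sales) AS prov
         ON (agg.shop = prov.p_shop)
"""


def shops_with_contribution_at_least(threshold):
    """Query the stored provenance with plain SQL: which result tuples
    were influenced by a sales tuple with revenue >= threshold?"""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("Migros", "Jan", 100), ("Migros", "Feb", 10), ("Migros", "Mar", 10),
         ("Coop", "Jan", 25), ("Coop", "Feb", 25)],
    )
    # Store the provenance computation as a view ...
    conn.execute("CREATE VIEW sales_prov AS " + PROVENANCE_QUERY)
    # ... and use it like any other relation.
    rows = conn.execute(
        "SELECT DISTINCT shop, sum_revenue FROM sales_prov "
        "WHERE p_revenue >= ? ORDER BY shop",
        (threshold,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(shops_with_contribution_at_least(50))
```

The filter on `p_revenue` is a query over provenance data, while `shop` and `sum_revenue` are normal result data, and both are addressed in the same SQL statement.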
33. 33
Overview
1. Introduction to Perm
2. The Perm Provenance Representation
3. Query Rewriting for Provenance
Computation
4. Perm Implementation
5. Experimental Results
6. Conclusion
34. 34
5. Experimental Results
TPC-H benchmark
35. 35
Overview
1. Introduction to Perm
2. The Perm Provenance Representation
3. Query Rewriting for Provenance
Computation
4. Perm Implementation
5. Experimental Results
6. Conclusion
36. 36
6. Conclusion
Benefits
Compute provenance for SQL
Full SQL query power for provenance data
Lazy or eager computation
Reuse existing database technology
Supports external provenance
37. 37
6. Conclusion
Future work
Physical operators for more efficient provenance computation
Storage compression
Include transformation provenance
Support different contribution semantics
Support various granularities
38. 38
Questions
Editor's notes
ICDE 2009: 22.5 minutes:
Welcome to my talk; my name; I’m from the DBTG, in association with Gustavo Alonso, ETH Systems Group
The outline of the talk is:
1. Introduction: what is provenance
2. Present a relational representation of provenance information and why this is good
3. Show how to produce such provenance information
We are focusing on the problem of provenance for relational database systems, where data items are e.g. tuples and transformations are queries, view definitions, user-defined functions, etc.
The main problem faced can then be stated as:
Which input...
This problem can be solved for
-different levels of granularity of data items: tuples, attribute values and so on. (We are looking at tuple-level granularity.)
-different definitions of what “influences” means (we call this contribution semantics),
for example only tuples that have been copied literally from the source to the result.
(We are looking at influence contribution semantics, which has also been called Why-provenance.)
-add related work slide (my approach)
Problem solved, but
-used a non-relational representation of provenance, which leads to a number of problems
-computation returns only provenance data -> association between provenance and normal data is lost
-not possible to store provenance in an unmodified relational database
-different data model for provenance and normal data -> so cannot reuse the query language and processing of relational systems
-computation requires a completely new system or at least some middleware that computes the provenance and is limited to a subset of SQL
? Leave out other disadvantages that are hard to explain without more preliminaries (e.g. more user-friendly to include complete tuples)
Explain benefits.
Directly interpreted by the user because complete provenance tuples are in the result
So it is nice that we have this beneficial format, but how do we create it?
So let’s see what we get from using this type of computation:
-same language for query and provenance computation
-we can feed this into the optimizer of a normal DBMS
-and thus store it as a view or use it as a subquery (this fulfills the need for querying provenance and data using the same query language!)
Know how to represent provenance information
Let’s have an example: we have a sales database with shops and the items that were sold there, and items with an id and a price
Consider the following query that computes the sum of sales for each shop
Here is the result of this query for the given table instances
So if we want to know which tuples the result tuple (Migros, 120) is derived from, intuitively these are...
Sales with sName “Migros” and all the item tuples for items sold there (we use a formal definition introduced by Cui and Widom to check if a tuple belongs to the provenance)
We want to present the normal results of a query together with the provenance as a single relation, which contains complete result tuples with attached provenance tuples
So back to our example: for each tuple of the result we can directly see which tuples influenced it
Our solution is to use query rewriting for computation
This rewrite is performed on the algebraic representation of query q, by replacing every algebra operator of the original query with
an algebra statement that propagates the provenance alongside the original result.
So we need a rewrite rule for each algebra operator (simplifies things, e.g. incremental computation because of recursive definition)
Before I show you an example of what these rewrite rules look like, some notational preliminaries:
The + operator stands for the rewrite operation
P(T+) of a rewritten statement is the list of provenance attributes in the result of the rewritten statement
Bold characters are used to denote the schema of a relation or algebra statement
a -> b means rename attribute a to b (we use this for lists of attributes too)
Let’s have a short example: (the + operator transforms an operator or algebra statement into a provenance computation) / P is a list of provenance attributes of an algebra expression. Here we have the rewrite rule for the aggregation operator
-SQL! Fast too to intro
Enough about the theory, now to the implementation
-> perm -> perm implementation
Our implementation of these principles is called Perm (which, besides being a geological period, stands for Provenance Extension of the Relational Model). It is implemented as an extension of Postgres. The provenance computation is triggered by the use of a few additional SQL keywords we added. The algebraic rewrite rules are implemented as query rewriting (some pattern matching is needed because we have QGM-like structures here).
Let’s have a short look at the SQL extension we are using. The keyword PROVENANCE after SELECT is used to indicate that this query block (and of course all contained query blocks) should be rewritten. This suffices to, for example, store a provenance computation as a view, store the provenance into a table (using SQL INTO), or use a provenance computation in a subquery.
In the Postgres system the major change we made was to add a new module directly above the planner module. The input to this module is a rewritten query graph (here rewriting is basically view unfolding). The module checks if the incoming query has parts that should be rewritten, applies the rewrites if necessary, and sends the rewritten query graph to the planner, which chooses an execution plan and calls the executor, which executes the query and returns the results to the client.
1. Result slide (TPCH) explain everything works, early version of Perm
I hope I convinced you that the query rewrite techniques implemented in Perm allow .... (benefits)
But there are also disadvantages: the representation we use might store a lot of redundant information
The performance is limited for some types of queries, because for some operators it is not possible to propagate the provenance without introducing extra joins
-more merchandise (whole of SQL); not disadvantages but open issues
So what we are doing now or would like to do:
-implement physical operators for provenance computation (performance)
-test how some of the proposed storage compression mechanisms for provenance data can be integrated into our system and what the ROI is
-store and compute information about the queries or other kinds of transformations and integrate query language support for this
-support different contribution semantics and granularities
-for flexibility
-and as further proof that our approach is feasible