This document proposes using web data provenance for automated quality assessment. It defines provenance as information about the origin and processing of data. The goal is to develop methods to automatically assess quality criteria like timeliness. It outlines a general provenance-based assessment approach involving generating a provenance graph, annotating it with impact values representing how provenance elements influence quality, and calculating a quality score with an assessment function. As an example, it shows how the approach could be applied to assess the timeliness of sensor measurements based on their provenance.
1. Using Web Data Provenance for Quality Assessment
Olaf Hartig (Humboldt-Universität zu Berlin)
Jun Zhao (University of Oxford)
2. Information Quality (IQ)
● Common definition: fitness for use of information
● Multidimensional concept
Category*          Criteria / Dimensions
Intrinsic          Accuracy, Believability, Objectivity, ...
Contextual         Completeness, Relevance, Timeliness, ...
Representational   Conciseness, Understandability, ...
Accessibility      Availability, Security, ...
*Classification by Wang and Strong, 1996
● IQ criteria not independent of each other
● Relevancy of criteria determined by task and preferences
Olaf Hartig - Using Web Data Provenance for Quality Assessment 2
3. IQ Assessment
● Assigning numerical values (IQ scores) to IQ criteria
● It is difficult!
● Precision vs. Practicality
Manual methods:
● Questionnaires
Semi-automatic methods:
● Rating-based
● Reputation-based
4. Automated IQ Assessment
● Literature only outlines ideas for automatic methods
● Content analysis
● Comparison (e.g. outlier detection)
● Application of information retrieval methods
● Analysis of results from data cleansing
● Sampling techniques
● Context analysis
● Analysis of metadata
● Utilization of domain knowledge
5. Our Goal:
Methods to automatically assess IQ criteria of Web data
Primary means: provenance of the assessed data
6. Outline
1. Web Data Provenance
2. General Assessment Approach
3. Development of Assessment Methods
7. Existing Provenance Research
● Main research areas: (scientific) workflows, DBMSs
● General focus: data creation
9. Provenance of Web Data
Web data provenance comprises two dimensions: data creation and data access
10. Model of Web Data Provenance
● Provenance graph describes provenance of a data item
● Nodes: provenance elements – pieces of provenance info
● Edges: relate provenance elements to each other
● Subgraphs for related data items possible
11. Model of Web Data Provenance
● Provenance model defines:
● Types of provenance elements: Actors, Executions, Artifacts
● Relationships
12. Data Access Dimension
[Figure: data access dimension of the provenance model. Element labels: Data Item, Document, Data Access (with Execution Time), Data Accessor (Non-Human), Data Providing Service (Non-Human), Service Provider, Data Publisher (Human). Relationship labels: contains, retrieved by, performs, accessed, controls, uses, Relation to the provided Information Resource.]
13. Data Access Dimension cont.
[Figure: data access dimension, continued. Labels: (Verified) Artifact, Integrity Verification, Verification Result, Signature Verification, Signer, Signature Method, Relation to the signed Data; constraint annotation: {incomplete}.]
14. Data Creation Dimension
[Figure: data creation dimension of the provenance model. Element labels: Data Item, (Encompassing) Data Item, Source Data, Data Creation (with Execution Time), Creation Guidelines, Data Creator (Human or Non-human), Data Creating Device (e.g. Sensor), Data Creating Service (e.g. Software Agent), Data Creating Entity (e.g. Person, Group, Orga.), Provenance Information. Relationship labels: part of, responsible for, Relation to the created Data; constraint annotation: {complete,disjoint}.]
15. Outline
1. Web Data Provenance
2. General Assessment Approach
3. Development of Assessment Methods
16. A General Approach
● Blueprint for actual assessment methods that
● Address a specific scenario
● Focus on a specific IQ criterion
● Provenance elements have an influence on IQ
● Impact values represent these influences
● Knowledge about these influences informs the assessment
● Calculation of the IQ score with an assessment function that combines all impact values
17. General Assessment Procedure
Step 1 – Generate a provenance graph for the data item
Step 2 – Annotate the provenance graph with impact values
Step 3 – Execute the assessment function
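The three steps can be read as a small pipeline. The following is a minimal sketch, not the authors' implementation: the function name `assess_quality`, the dict-based `Graph` type, and the three toy plug-in functions are all hypothetical and only illustrate how the steps chain together.

```python
from typing import Any, Callable, Dict

Graph = Dict[str, Any]  # hypothetical stand-in for a provenance graph


def assess_quality(
    data_item: str,
    generate_graph: Callable[[str], Graph],
    annotate: Callable[[Graph], Graph],
    assessment_function: Callable[[Graph], float],
) -> float:
    """Run the three steps of the general assessment procedure in order."""
    graph = generate_graph(data_item)      # Step 1: provenance graph
    annotated = annotate(graph)            # Step 2: impact values
    return assessment_function(annotated)  # Step 3: IQ score


# Toy plug-ins, purely illustrative (all values are made up).
def toy_generate(item: str) -> Graph:
    return {"item": item, "elements": ["creation", "access"]}


def toy_annotate(graph: Graph) -> Graph:
    # Attach a made-up impact value to every provenance element.
    return {**graph, "impact": {e: 0.5 for e in graph["elements"]}}


def toy_score(graph: Graph) -> float:
    # Combine all impact values; here simply by averaging them.
    values = list(graph["impact"].values())
    return sum(values) / len(values)


score = assess_quality("msr", toy_generate, toy_annotate, toy_score)
```

Passing the three steps in as functions mirrors the idea that the general approach is a blueprint: an actual method supplies its own graph generation, annotation, and assessment function.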
18. Outline
1. Web Data Provenance
2. General Assessment Approach
3. Development of Assessment Methods
20. Designing Assessment Methods
● Developing the general approach into an actual method
● Fundamental design question:
For which IQ criterion do we want to apply the method?
● Timeliness: degree to which the data item is up-to-date with respect to the task at hand
● Representation* as an absolute measure in [0,1]
● 1 – meeting the strictest timeliness standards
● 0 – unacceptable
*Following Ballou et al., 1998
21. Step 1 – Generate the Provenance Graph
What types of provenance elements are necessary?
What level of detail (i.e. granularity) is necessary?
Where and how do we get provenance information?
● Two complementary options:
● Recording
● Analyzing metadata
23. Step 1 – Generate the Provenance Graph
Example:
● Sensors (e.g. sensor1) take a measurement (e.g. msr) every hour
● All measurements are stored in a Web-accessible storage device (store)
● Our system (sys) accesses them for further processing
● sys assesses the timeliness of each measurement
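Provenance graph of msr, nodes:
● msr – type: Data Item
● cExc – type: Data Creation, Execution Time: 10:00
● sensor1 – type: Data Creator
● doc – type: Document
● aExc – type: Data Access, Execution Time: 10:13
● store – type: Data Providing Service
● sys – type: Data Accessor
Provenance graph of msr, edges:
● msr created by cExc
● cExc performed by sensor1
● msr contained by doc
● doc retrieved by aExc
● aExc accessed store
● aExc performed by sys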
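The example graph can be written down as a small data structure. This is a sketch under assumptions: the classes `ProvElement` and `ProvGraph` and the property key `execution_time` are hypothetical names chosen for illustration; only the node names, element types, relationship labels, and times come from the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ProvElement:
    """One node of the provenance graph (hypothetical representation)."""
    name: str
    element_type: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class ProvGraph:
    nodes: Dict[str, ProvElement] = field(default_factory=dict)
    # Each edge is a (source, relationship label, target) triple.
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, name: str, element_type: str, **properties: str) -> None:
        self.nodes[name] = ProvElement(name, element_type, dict(properties))

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))

    def targets(self, source: str, relation: str) -> List[str]:
        """Names of all nodes reached from `source` via `relation`."""
        return [t for s, r, t in self.edges if s == source and r == relation]


graph = ProvGraph()
graph.add_node("msr", "Data Item")
graph.add_node("cExc", "Data Creation", execution_time="10:00")
graph.add_node("sensor1", "Data Creator")
graph.add_node("doc", "Document")
graph.add_node("aExc", "Data Access", execution_time="10:13")
graph.add_node("store", "Data Providing Service")
graph.add_node("sys", "Data Accessor")

graph.add_edge("msr", "created by", "cExc")
graph.add_edge("cExc", "performed by", "sensor1")
graph.add_edge("msr", "contained by", "doc")
graph.add_edge("doc", "retrieved by", "aExc")
graph.add_edge("aExc", "accessed", "store")
graph.add_edge("aExc", "performed by", "sys")
```

With this representation, `graph.targets("msr", "created by")` leads from the data item to its creation execution, which is where the creation time is recorded.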
24. Step 2 – Annotation with Impact Values
How might each provenance element influence the IQ criterion?
● Systematically analyze each type of provenance element
What kind of impact values are necessary?
How do we represent the influences by impact values?
● Impact values not necessarily numerical
● Depends on the assessment function in step 3
How do we determine impact values?
25. Determining Impact Values
● From the provenance information
● From user input
● Configuration options
● Rating-based, Reputation-based
● By content analysis
● Comparison (e.g. outlier detection)
● Adoption of information retrieval methods
● Adoption of data cleansing techniques
● By context analysis
● Further metadata
● Domain knowledge
26. Step 2 – Annotation with Impact Values
How might each provenance element influence the IQ criterion?
Data Creation Dimension:
Prov. Element Type     Impact Values
Data Creation          creation time, weights
Creation Guidelines    -
(Source) Data Item     expiry time
Data Creator           -
28. Step 2 – Annotation with Impact Values
Provenance graph of msr, annotated with the first impact value:
● cExc (type: Data Creation, Execution Time: 10:00) – creation time: 10:00
29. Step 2 – Annotation with Impact Values
Provenance graph of msr, annotated with both impact values:
● cExc (type: Data Creation, Execution Time: 10:00) – creation time: 10:00
● msr (type: Data Item) – expiry time: 11:00
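The annotation step for the example can be sketched as a function that attaches impact values according to element type. This is an illustrative sketch only: the function `annotate` and the dict layouts are hypothetical, and supplying the expiry time as an external input (for instance a configuration option reflecting the hourly measurement cycle) is an assumption, since the slides do not state where the value 11:00 comes from.

```python
from typing import Dict

# Node name -> element type, restricted to the data creation dimension
# of the sensor example.
node_types = {
    "msr": "Data Item",
    "cExc": "Data Creation",
    "sensor1": "Data Creator",
}
# Execution times recorded in the provenance graph.
execution_times = {"cExc": "10:00"}


def annotate(
    node_types: Dict[str, str],
    execution_times: Dict[str, str],
    expiry_times: Dict[str, str],
) -> Dict[str, Dict[str, str]]:
    """Return impact values per node, following the element-type table."""
    annotations: Dict[str, Dict[str, str]] = {}
    for node, element_type in node_types.items():
        if element_type == "Data Creation" and node in execution_times:
            # Creation time is read from the provenance information itself.
            annotations[node] = {"creation time": execution_times[node]}
        elif element_type == "Data Item" and node in expiry_times:
            # Assumption: expiry time is supplied from outside the graph.
            annotations[node] = {"expiry time": expiry_times[node]}
        # Data Creator: no impact value for timeliness in this example.
    return annotations


annotations = annotate(node_types, execution_times, {"msr": "11:00"})
```

The resulting `annotations` holds a creation time for cExc and an expiry time for msr, and nothing for sensor1.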
30. Step 3 – Assessment Function
How do we represent the IQ criterion by an IQ score?
What does the assessment function look like?
● Develop the function together with the impact values
● Take incompleteness into consideration
● Provenance graphs could be fragmentary
● Annotations could be missing
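One simple way to take incompleteness into consideration is to check for missing annotations before executing the assessment function. The sketch below is only one possible policy, not a recommendation from the slides (which list the handling of incompleteness as future work): the names `REQUIRED`, `missing_impact_values`, and `assess_or_none` are hypothetical, and returning no score at all is an assumption.

```python
from typing import Callable, Dict, List, Optional

Annotations = Dict[str, Dict[str, str]]

# Hypothetical requirement list: which impact value is needed on which node.
REQUIRED = {"cExc": "creation time", "msr": "expiry time"}


def missing_impact_values(
    annotations: Annotations, required: Dict[str, str] = REQUIRED
) -> List[str]:
    """List every required impact value that the annotations lack."""
    return [
        f"{node}: {key}"
        for node, key in required.items()
        if key not in annotations.get(node, {})
    ]


def assess_or_none(
    annotations: Annotations,
    assessment_function: Callable[[Annotations], float],
) -> Optional[float]:
    """Execute the assessment function only on complete annotations."""
    if missing_impact_values(annotations):
        return None  # assumption: no score rather than a guessed one
    return assessment_function(annotations)


complete = {"cExc": {"creation time": "10:00"}, "msr": {"expiry time": "11:00"}}
fragmentary = {"cExc": {"creation time": "10:00"}}
```

For `fragmentary`, the check reports that the expiry time of msr is missing, so no score is produced.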
34. Step 3 – Assessment Function
t(msr) = 1 – (10:15 – 10:00) / (11:00 – 10:00)
       = 1 – 0.25h / 1h
       = 0.75
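The calculation can be reproduced in a few lines. This sketch assumes that 10:15 is the time at which the assessment is executed (the graph records the access at 10:13, but the calculation uses 10:15) and that the general form is one minus age divided by lifetime; the function names `timeliness` and `at` are hypothetical, and clamping the result to [0,1] is an added assumption that matches the chosen representation.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Parse a clock time such as '10:15' (the date part is irrelevant here)."""
    return datetime.strptime(hhmm, "%H:%M")


def timeliness(creation: datetime, expiry: datetime, now: datetime) -> float:
    """1 - age / lifetime, clamped to [0, 1] (clamping is an assumption)."""
    lifetime = (expiry - creation).total_seconds()
    if lifetime <= 0:
        raise ValueError("expiry time must be later than creation time")
    age = (now - creation).total_seconds()
    return max(0.0, min(1.0, 1.0 - age / lifetime))


# Impact values from the annotated graph: creation 10:00, expiry 11:00.
score = timeliness(at("10:00"), at("11:00"), now=at("10:15"))
print(score)  # 0.75
```

Once the expiry time has passed, the clamped score is 0, the value the slides label as unacceptable.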
35. Conclusion
● Web Data Provenance (data creation + data access)
● General approach for provenance-based IQ assessment
● Impact values: influence of provenance elements on IQ
● Design decisions for actual assessment methods
● Application to timeliness (more in the paper)
● Future work:
● How do we deal with incompleteness?
● Application of the approach to other IQ criteria
36. These slides have been created by
Olaf Hartig
http://olafhartig.de
This work is licensed under a
Creative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)
Attribution:
● http://www.flickr.com/photos/rrrrred/3809362767/
● http://www.hasslefreeclipart.com