Presentation slides of paper by Shawn Bowers, Timothy McPhillips, and Bertram Ludäscher, given by Shawn at Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, King's College London, UK, July 9-10, 2018.
The paper won the IPAW best paper award: https://twitter.com/kbelhajj/status/1017082775856467968
ABSTRACT. An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage" relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives". In prior work, we defined annotations for specifying detailed dependency relationships between inputs and outputs of computation steps. These annotations are used to define corresponding rules for inferring fine-grained data dependencies from a trace. In this paper, we extend our previous work by considering the impact of dependency annotations on workflow specifications. In particular, we provide a reasoning framework to ensure the set of dependency annotations on a workflow specification is consistent. The framework can also infer a complete set of annotations given a partially annotated workflow. Finally, we describe an implementation of the reasoning framework using answer-set programming.
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
1. Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Shawn Bowers (1), Timothy McPhillips (2), Bertram Ludäscher (2)
(1) Dept. of Computer Science, Gonzaga University
(2) School of Information Sciences, University of Illinois, Urbana-Champaign
IPAW 2018
2. Scientific Workflows and Provenance
A workflow specification modeled as a graph of computation
steps (nodes) and data/control flow (edges)
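[Figure: example workflow graphs. One shows a step gen_boundary_region with data boundary_coordinates, user_map_marker_pos, prism_data, and file:data/112W36N.nc; the other shows steps gen and filter with data blocks d1, d2, d3, and c.]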
Steps are often “black boxes” (invoke external programs)
3. Scientific Workflows and Provenance
During a workflow execution, systems record “provenance” information ...
• invocation of steps
• data received/produced by steps
A workflow trace modeled as a graph of invocations and corresponding data
• a trace is a specification instance
• capturing details of a workflow run
[Figure: different traces of the same specification, showing invocations of gen and filter with their input and output values.]
4. Data Dependency Assumptions and Issues
Traces are used to infer the “lineage” of data products (*)
• e.g., all steps and inputs/outputs that led to an output
• assume all outputs “depend on” all inputs of a step
[Figure: example trace with invocations of gen and filter.]
However, the inferred “dependencies” can be incorrect and vague
1. some outputs might not “depend on” all inputs
2. outputs can depend on inputs differently (derivation, copy, ...)
(*) some systems provide APIs for steps to declare dependencies at runtime
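The "all outputs depend on all inputs" assumption can be illustrated with a small sketch. The step, its port names (in_a, in_b, out_x, out_y), and the "actual" dependencies below are hypothetical and chosen only to show how lineage false positives arise.

```python
from itertools import product

# Hypothetical single invocation with two inputs and two outputs.
inputs = ["in_a", "in_b"]
outputs = ["out_x", "out_y"]

# Default assumption: every output depends on every input.
assumed = set(product(inputs, outputs))

# Hypothetical ground truth: each output uses only one of the inputs.
actual = {("in_a", "out_x"), ("in_b", "out_y")}

# Dependencies that the default assumption reports but that do not exist.
false_positives = sorted(assumed - actual)
print(false_positives)
```

Half of the assumed dependencies in this toy case are spurious, which is the kind of imprecision that declared dependency annotations are meant to remove.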
5. Prospective (Schema-Level) Dependency Annotations
Our approach:
• Allow workflow authors to specify dependency patterns (annotations)
• Support different data dependency types
• Use dependency annotations to infer trace-level dependencies
6. Prospective (Schema-Level) Dependency Annotations
Prior work:
• Allows dependency annotations for individual workflow steps
• Rules for extracting trace-level invocation dependencies
• Requires each step to be (fully) annotated
7. Prospective (Schema-Level) Dependency Annotations
Current contributions focus on workflow design:
1. Allow partially annotated workflow specifications
2. Infer complete sets of (possible) annotations
3. Validate correctness of annotations
8. Workflow Specifications
Minimally, a workflow specification W = (P, D, E) consists of
• a set P of program blocks (computation steps)
• a set D of data blocks (data items or containers)
• a set E ⊆ P × L × D × {in, out} of uniquely labeled edges
[Figure: program blocks p1 and p2 connected through data block d1 by edges labeled x1 and x2.]
We use in(p_i, x_i, d_i) and out(p_j, x_j, d_j) for input and output edges
• where x_i, x_j are labels in L
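As a rough illustration of the definition above, a specification W = (P, D, E) can be held in plain Python sets. The Edge type, the is_well_formed helper, and the direction assigned to each edge are assumptions made for this sketch and do not come from the paper.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    program: str    # element of P
    label: str      # element of L, unique across E
    data: str       # element of D
    direction: str  # "in" or "out"

# Hypothetical two-block workflow: p1 writes d1, p2 reads d1.
P = {"p1", "p2"}
D = {"d1"}
E = {
    Edge("p1", "x1", "d1", "out"),
    Edge("p2", "x2", "d1", "in"),
}

def is_well_formed(P, D, E):
    """Check that edges refer to known blocks and that labels are unique."""
    labels = [e.label for e in E]
    endpoints_ok = all(
        e.program in P and e.data in D and e.direction in {"in", "out"}
        for e in E
    )
    return endpoints_ok and len(labels) == len(set(labels))

print(is_well_formed(P, D, E))
```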
9. Dependency Annotations
Dependency annotations A ⊆ L_out × L_in × T for a workflow W ...
• associate dependency types t ∈ T (more later)
• to input-output edge pairs of W (identified by their labels)
We use dep_rule(x_i, x_j, t) for annotations x_i --t--> x_j (drawn in red)
[Figure: workflow with steps gen and filter over data blocks d1, d2, d3, and c; edge labels n, r, v1, v2, and cutoff; annotations DependsOn, CopyOf, and DependsOn drawn in red.]
• dep_rule(n, r, depends_on), dep_rule(v1, v2, copy_of), dep_rule(cutoff, v2, depends_on)
10. Dependency Types
We consider five different dependency annotation types ... (*,†)
FlowsFrom: input present during invocation (e.g., a trigger)
DependsOn: output has control (statement) dependency on input
DerivedFrom: output has data (read-after-write) dependency on input
ValueOf: input value copied to the output (new data item)
SameAs: input copied to the output (same item “passed through”)
11. Dependency Types
Ordered from weakest to strongest form of dependency ...
FlowsFrom ≺ DependsOn ≺ DerivedFrom ≺ ValueOf ≺ SameAs
12. Dependency Types
Or as subclasses (e.g., FlowsFrom+ as “at least FlowsFrom”) ...
FlowsFrom+ ⊒ DependsOn+ ⊒ DerivedFrom+ ⊒ ValueOf+ ⊒ SameAs+
(*) Plus NotFlowsFrom, described later
(†) A more formal description is given in the paper
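The ordering and the "at least" reading of the types can be sketched as follows. The list ORDER and the helpers at_least and is_weaker are illustrative names for this sketch, not part of the prototype.

```python
# Dependency types from weakest to strongest, as listed on the slide.
ORDER = ["FlowsFrom", "DependsOn", "DerivedFrom", "ValueOf", "SameAs"]

def at_least(t):
    """Return the set of types at least as strong as t (the class written t+)."""
    return set(ORDER[ORDER.index(t):])

def is_weaker(a, b):
    """True if a is strictly weaker than b in the ordering."""
    return ORDER.index(a) < ORDER.index(b)

# FlowsFrom+ contains every type, so it is a superset of DependsOn+.
print(at_least("FlowsFrom") >= at_least("DependsOn"))
print(sorted(at_least("DerivedFrom"), key=ORDER.index))
```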
13. Reasoning using Dependency Composition
Given two “connected” program blocks:
[Figure: chain of d1, p1, d2, p2, d3 with edges x1, x2, x3, x4; annotation t_i on (x1, x2), t_j on (x3, x4), and composite annotation t on (x1, x4).]
A composite (indirect) dependency x1 --t--> x4 is the weaker of the dependencies x1 --t_i--> x2 and x3 --t_j--> x4
dep_rule(x1, x2, t_i) ∧ dep_rule(x3, x4, t_j) ∧ t_i ⪯ t_j ↔ dep_rule(x1, x4, t_i)
dep_rule(x1, x2, t_i) ∧ dep_rule(x3, x4, t_j) ∧ t_j ⪯ t_i ↔ dep_rule(x1, x4, t_j)
This extends to longer “chains” of connected program blocks
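A minimal sketch of the composition rule, assuming the five types are ranked by their position in a list. The names ORDER, RANK, compose, and compose_chain are introduced here for illustration only.

```python
from functools import reduce

# Dependency types from weakest to strongest.
ORDER = ["FlowsFrom", "DependsOn", "DerivedFrom", "ValueOf", "SameAs"]
RANK = {t: i for i, t in enumerate(ORDER)}

def compose(t_i, t_j):
    """Composite type of two connected blocks: the weaker of the two types."""
    return t_i if RANK[t_i] <= RANK[t_j] else t_j

def compose_chain(types):
    """Composite type along a chain of connected blocks."""
    return reduce(compose, types)

print(compose("SameAs", "DerivedFrom"))
print(compose_chain(["ValueOf", "SameAs", "DependsOn"]))
```

Because taking the weaker type is associative, folding over the chain from left to right gives the same result as any other grouping.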
14. Dependency Composition with Multiple Paths
When multiple annotation “paths” exist ...
[Figure: p1 and p4 connected by two paths, one through p2 and one through p3, over data blocks d1 to d5 and edges x1 to x9; block annotations include FlowsFrom, DerivedFrom, and SameAs.]
The composite annotation type is the strongest type of the paths
• the top path implies FlowsFrom
• the bottom path implies DerivedFrom
• the inferred type is DerivedFrom (i.e., “at least DerivedFrom”)
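The multi-path rule can be sketched as "weakest along each path, strongest across paths". The assignment of annotation types to the two paths below is a hypothetical one that reproduces the slide's outcome; the exact placement in the figure is not recoverable from this transcript.

```python
# Dependency types from weakest to strongest.
ORDER = ["FlowsFrom", "DependsOn", "DerivedFrom", "ValueOf", "SameAs"]
RANK = {t: i for i, t in enumerate(ORDER)}

# Hypothetical block annotations along each path from p1 to p4.
top = ["DerivedFrom", "FlowsFrom", "DerivedFrom"]
bottom = ["DerivedFrom", "SameAs", "DerivedFrom"]

def path_type(path):
    """Type implied by one path: the weakest annotation along it."""
    return min(path, key=RANK.get)

def composite_type(paths):
    """Composite type over several paths: the strongest of the path types."""
    return max((path_type(p) for p in paths), key=RANK.get)

print(path_type(top), path_type(bottom), composite_type([top, bottom]))
```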
15. Use Case 1: Infer Composite Dependencies
Given annotations on blocks (steps), find composite annotations
• helps verify intent and construction of workflow
• e.g., that certain outputs are derived from inputs
[Figure: workflow with steps normalize and filter over data blocks d1 to d5; edge labels xrange, x1, x2, x3, x4, and xcutoff; annotations DependsOn, SameAs, and DerivedFrom.]
Inferred annotations shown in blue
16. Use Case 2: Constraining Dependency Annotations
Add annotations to constrain choices
• e.g., may know the output should be derived from the input
• which can guide (constrain) block-level annotation choices
• or guide the workflow design itself
[Figure: chain of p1 and p2 over data blocks d1, d2, d3 with edges x1 to x4; the composite annotation is DerivedFrom, and each block-level annotation is marked “DerivedFrom, ValueOf, or SameAs?”]
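The way a composite annotation narrows block-level choices can be checked by brute force for a two-block chain. This sketch assumes a single path from input to output, so the composite is simply the weaker of the two block types; all names are illustrative.

```python
from itertools import product

# Dependency types from weakest to strongest.
ORDER = ["FlowsFrom", "DependsOn", "DerivedFrom", "ValueOf", "SameAs"]
RANK = {t: i for i, t in enumerate(ORDER)}

def compose(a, b):
    """Weaker of two dependency types."""
    return a if RANK[a] <= RANK[b] else b

required = "DerivedFrom"  # composite annotation stated by the workflow author

# All block-level pairs (type for p1, type for p2) that yield the required composite.
pairs = [(a, b) for a, b in product(ORDER, repeat=2) if compose(a, b) == required]

# Types that remain possible for each block.
choices_p1 = sorted({a for a, _ in pairs}, key=RANK.get)
choices_p2 = sorted({b for _, b in pairs}, key=RANK.get)
print(choices_p1, choices_p2)
```

Each block is left with DerivedFrom, ValueOf, or SameAs, but not every combination is allowed: at least one of the two blocks has to be exactly DerivedFrom.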
17. Use Case 3: Validating Dependency Annotations
Ensure annotations are compatible
• e.g., lower-level (block) annotations are not consistent with composite annotation (shown in purple)
[Figure: a step “generate sample” (data d1, d2, dtype, diter; edges xin, xout, xiter) carrying a DerivedFrom annotation, and its expansion into blocks “initial sample” and “perturb” (p1, p2; data din, dout, d1, d2, dtype, diter; edges n, s, x1, x2, xtype, xiter) annotated with DependsOn and DerivedFrom.]
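A validation check of this kind can be sketched for the single-path case: compose the block-level types and compare the result with the declared composite. The annotation values below are hypothetical and only mimic the kind of mismatch the slide describes.

```python
from functools import reduce

# Dependency types from weakest to strongest.
ORDER = ["FlowsFrom", "DependsOn", "DerivedFrom", "ValueOf", "SameAs"]
RANK = {t: i for i, t in enumerate(ORDER)}

def compose(a, b):
    """Weaker of two dependency types."""
    return a if RANK[a] <= RANK[b] else b

def validate(declared, block_types):
    """Return (is_consistent, implied_type) for one path of block annotations."""
    implied = reduce(compose, block_types)
    return implied == declared, implied

# Hypothetical case: the composite is declared DerivedFrom,
# but one block on the path is only DependsOn.
ok, implied = validate("DerivedFrom", ["DependsOn", "DerivedFrom"])
print(ok, implied)
```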
18. Dependency Reasoning Prototype Implementation
Answer-Set Programming (ASP) prototype in Potassco (clingo)
High level idea: use a generate-and-test algorithm
(i) “guess” annotations for non-annotated input-output pairs
(ii) ensure annotations satisfy composition rules
(iii) ensure annotations satisfy “strongest-path” constraint
Result is all possible and complete annotation sets (possible worlds)
(iv) find all annotations common to all worlds
(v) report possible choices for remaining input-output pairs
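Steps (i) to (v) can be mimicked by brute force in Python for a tiny workflow. This is not the clingo prototype, only a sketch under stated assumptions: a hypothetical two-block chain (p1 with edges x1, x2; p2 with edges x3, x4; x2 feeds x3), a single path, and one given annotation.

```python
from itertools import product

ORDER = ["FlowsFrom", "DependsOn", "DerivedFrom", "ValueOf", "SameAs"]
RANK = {t: i for i, t in enumerate(ORDER)}

def weaker(a, b):
    return a if RANK[a] <= RANK[b] else b

# Input-output pairs of the hypothetical chain, and the one annotation given up front.
PAIRS = [("x1", "x2"), ("x3", "x4"), ("x1", "x4")]
given = {("x1", "x2"): "DependsOn"}

def consistent(world):
    # With a single path, the composition rules and the strongest-path
    # constraint both reduce to: composite == weaker of the two block types.
    return world[("x1", "x4")] == weaker(world[("x1", "x2")], world[("x3", "x4")])

worlds = []
for types in product(ORDER, repeat=len(PAIRS)):          # (i) guess
    world = dict(zip(PAIRS, types))
    respects_given = all(world[p] == t for p, t in given.items())
    if respects_given and consistent(world):             # (ii), (iii) test
        worlds.append(world)

# (iv) annotations shared by every possible world
common = {p: worlds[0][p] for p in PAIRS if len({w[p] for w in worlds}) == 1}
# (v) remaining choices for the other pairs
choices = {p: sorted({w[p] for w in worlds}, key=RANK.get)
           for p in PAIRS if p not in common}
print(len(worlds), common, choices)
```

In this toy case the composite pair (x1, x4) is narrowed to FlowsFrom or DependsOn, because it can never be stronger than the given DependsOn annotation on p1.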
19. Prototype Implementation (cont)
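The following “choice rule” guesses annotations
% guess exactly one dependency type for each upstream input-output pair
{dep_rule(I,O,R) : dep_type(R)} = 1 :- up_stream(I,O).
The up_stream relation finds all possible input-output pairs
% base case: an input and an output of the same program block
up_stream(I,O) :- in(I,P,_), out(O,P,_).
% recursive case: follow an output of the block into a downstream block
up_stream(I,O) :- in(I,P1,_), out(O1,P1,D1),
                  in(I2,P2,D1), up_stream(I2,O).
The following constraint ensures composition rules are satisfied
:- dep_rule(I,O,R), not valid_dep_path(I,O,R).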
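The two up_stream rules amount to a fixpoint computation, which can be sketched in Python. The facts below describe a hypothetical two-block chain and follow the argument order used in the clingo rules (label, program block, data block).

```python
# Hypothetical facts: in(Label, Program, Data) and out(Label, Program, Data).
ins = {("x1", "p1", "d1"), ("x3", "p2", "d2")}
outs = {("x2", "p1", "d2"), ("x4", "p2", "d3")}

def up_stream(ins, outs):
    """Compute all (input, output) pairs reachable through connected blocks."""
    # Base rule: input and output belong to the same program block.
    result = {(i, o) for (i, p, _) in ins for (o, q, _) in outs if p == q}
    changed = True
    while changed:  # repeat the recursive rule until nothing new is derived
        changed = False
        for (i, p1, _) in ins:
            for (_o1, q1, d1) in outs:
                if q1 != p1:
                    continue
                for (i2, _p2, d) in ins:
                    if d != d1:
                        continue
                    for (j, o) in list(result):
                        if j == i2 and (i, o) not in result:
                            result.add((i, o))
                            changed = True
    return result

print(sorted(up_stream(ins, outs)))
```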
20. Prototype Implementation (cont)
The valid_dep_path relation finds valid compositions
valid_dep_path(I,O,R) :- in(I,P,_), out(O,P,_),
                         dep_rule(I,O,R).
valid_dep_path(I,O,R) :- in(I,P,_), out(O1,P,_), O != O1,
                         dep_rule(I,O1,R1), connected(O1,I1),
                         I != I1, valid_dep_path(I1,O,R2),
                         compose(R1,R2,R).
The connected relation ensures an output is connected to an input
connected(O,I) :- out(O,_,D), in(I,_,D).
compose computes composition (where weaker_eq implements ⪯)
compose(R1,R2,R1) :- weaker_eq(R1,R2).
compose(R1,R2,R2) :- weaker_eq(R2,R1).
21. Prototype Implementation (cont)
Finally, the following constraint ensures “strongest” paths
:- dep_rule(I,O,R), valid_dep_path(I,O,R1),
   weaker_eq(R,R1), R != R1.
Recently added NotFlowsFrom type (e.g., for subworkflows)
• Required only minimal changes: NotFlowsFrom ≺ FlowsFrom
• Full subworkflow support not yet implemented (future work)
[Figure: example with program blocks p1 and p2, data blocks d1 to d4, and edges x1 to x4.]
22. Preliminary Performance Results
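(1) Increase the depth of the workflow (2-50 steps) and % of block annotations
[Figure: workflow shape for test (1), with blocks labeled ps, ds, ..., pe, de.]
(2) Increase the width of the workflow (2-50 steps) and % of block annotations
[Figure: workflow shape for test (2), with blocks ending at pe, de.]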
23. Future Work
Add dependency annotations to YesWorkflow’s annotation types
• combine schema-level support and extend trace-level support
Apply schema-level dependency annotations to workflows in YW
• we can now do this, e.g., for paleocar (with NotFlowsFrom)
• extend annotation types as needed
Develop specialized reasoning support (as needed)
• ASP great for prototyping!
• but can improve performance with dedicated implementation
24. Dr. Shawn Bowers presenting the paper on July 10th, 2018 at IPAW, King’s College, London, UK.