See paper:
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience. In Press (2013).
1. Provenance and data differencing for workflow reproducibility analysis
Paolo Missier
School of Computing Science
Newcastle University, UK
Humboldt University, Berlin
March 4, 2013
2. Provenance Metadata
Provenance refers to the sources of information, including entities and processes, involved
in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how they came to be in the state
they are in today (*)
Why does it matter?
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for debugging, improvement, evolution
(*) Definitions proposed by the W3C Incubator Group on provenance:
http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
3. A colourful provenance graph
[Figure: an example PROV provenance graph for a document's life cycle. An editing phase (remote and recent past) shows drafting, commenting, and editing activities producing draft versions v1 and v2 plus comments, with agents Alice and Bob (and his specializations Bob-1, Bob-2) in author and editor roles. A publishing phase shows a guideline update producing publication guidelines v1 and v2, and a publication activity generating the working draft WD1, with agents Alice, Charlie, and the w3c consortium. Nodes carry attributes such as type, role, status, version, and distribution; edges use the PROV relations used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf, and specializationOf.]
4. Motivation: Reproducibility in e-science
• Setting: Collaborative, Open Science
– Increasing rate of data sharing in science
• The stick: both journals and funders demand that data be uploaded
– Multiple data journals, data repositories emerging
• The carrot: data is given a DOI and is citable, scientists get credit
• Thomson’s Data Citation Index
• Dryad data repository for biosciences (*)
• The DataBib repository of research data
• NSF Data Preservation projects: DataONE
• best practices document: notebooks.dataone.org/bestpractices/
• ... and many others
(*) As of Jan 27, 2013, Dryad contains 2585 data packages and 7097 data files, associated
with articles in 187 journals.
5. General problems
• Quality assurance
– from non-malicious errors in method or data, all the way to fraud
– ... leading to retractions in scientific publications
• see eg http://retractionwatch.wordpress.com/
• Repeatability
– If I replicate your experiment / repeat your process on the same data, will I get the
same results?
• Reproducibility -- a more general notion
The ability for a third party who has access to the description of the
original experiment and its results to reproduce those results, using a
possibly different setting, with the goal to confirm or dispute the
original experimenter’s claims.
6. Specifically, in e-science...
• Experimental method → scripts, programs, workflows
• Publication = results + {program, workflow} + evidence of results
• Repeatability, reproducibility
– will I be able to run (my version of) your workflow on (my version of) your input
and compare my results to yours?
• Evidence of result: provenance of {program, workflow} execution
• Side note: portability issues are out of scope
– VMs often solve the problem, with some limitations
• not when workflows depend on third party services
• only for limited size data dependencies
Main issue: Workflow evolution and decay
7. Mapping the reproducibility space
[Figure: a map of the reproducibility space. One axis captures environmental variations (ED → ED', wfms → wfms'), the other experimental variations (wf → wf', d → d'). With everything unchanged (wf, d) one gets repeatability; varying only the data (d → d') gives results confirmation; varying only the method (wf → wf') gives method variation; varying both gives reproducibility. Environmental changes such as service updates and state changes push a workflow into the decay region: dysfunctional or non-functioning workflows, workflow exceptions, and data divergence, calling for divergence analysis and debugging.]
Goal: to help scientists understand the effect of workflow / data / dependencies
evolution on workflow execution results
Approach: compare provenance traces generated during the runs: PDIFF
P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013. In press.
9. Decay
• Workflows that have external dependencies are harder to maintain
– they may become dysfunctional or break altogether
Zhao, Jun, Jose Gomez-Perez, Khalid Belhajjame, Graham Klyne, et al. “Why Workflows Break - Understanding and Combating Decay in Taverna Workflows.” In Procs. e-Science Conference, Chicago, 2012.
11. Workflows and provenance traces
Workflow (structure): directed graph W=(T,E)
T: set of tasks (computational units)
P: set of (input, output) ports associated to each task t ∈ T
E ⊂ T × T: graph edges representing data dependencies
⟨t_i.p_A, t_j.p_B⟩ ∈ E: data produced by t_i on port p_A ∈ P is routed to port p_B ∈ P of t_j
Execution trace: tr = exec(W, ED, d, wfms)
A: activities
D: data items
R = { used, genBy } relations: used ⊂ A × D × P, genBy ⊂ D × A × P
Workflow inputs: tr.I = {d ∈ D | ∀ a ∈ A, p ∈ P: (d, a, p) ∉ genBy}
Workflow outputs: tr.O = {d ∈ D | ∀ a ∈ A, p ∈ P: (a, d, p) ∉ used}
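As an illustration only (a hypothetical Python sketch, not code from the paper or from e-Science Central), the trace structure above and its derived input/output sets might look like this:

# Illustrative sketch of an execution trace as defined above:
# activities A, data items D, and the used / genBy relations over ports.
from dataclasses import dataclass, field

@dataclass
class Trace:
    activities: set = field(default_factory=set)   # A
    data: set = field(default_factory=set)          # D
    used: set = field(default_factory=set)          # {(activity, data, port)} ⊂ A × D × P
    gen_by: set = field(default_factory=set)        # {(data, activity, port)} ⊂ D × A × P

    def inputs(self):
        # tr.I: data items not generated by any activity
        generated = {d for (d, _, _) in self.gen_by}
        return self.data - generated

    def outputs(self):
        # tr.O: data items not used by any activity
        consumed = {d for (_, d, _) in self.used}
        return self.data - consumed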
12. Workflow evolution
tr = exec(W, ED, d, wfms)
Each of the elements in an execution may evolve (semi) independently
from the others:
tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k ≤ t
time   W     ED    d     wfms     trace
t1     W1    ED1   d1    wfms1    tr1 = exec1(W1, ED1, d1, wfms1)
t2     W2    -     -     -        tr2 = exec2(W2, ED1, d1, wfms1)
t3     -     ED3   d3    -        tr3 = exec3(W2, ED3, d3, wfms1)
t4     -     -     -     wfms4    tr4 = exec4(W2, ED3, d3, wfms4)
t5     -     ED5   -     -        tr5 = exec5(W2, ED5, d3, wfms4)
(a row shows only the elements that changed at that time)
Repeatability:
• Can tr_t be computed again at some time t' > t?
• Requires saving ED_t, but this may be impractical (e.g. a large database state)
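For concreteness, a hypothetical sketch (illustrative names, not e-Science Central's API) of how one might record which version of each element a given execution used, so that later runs can be compared against it:

# Hypothetical record of one execution: which version of the workflow,
# external dependencies, input data, and workflow engine produced the trace.
from dataclasses import dataclass

@dataclass(frozen=True)
class RunRecord:
    time: str          # e.g. "t4"
    workflow: str      # e.g. "W2"
    ext_deps: str      # e.g. "ED3" (external dependencies / environment)
    data: str          # e.g. "d3"
    wfms: str          # e.g. "wfms4" (workflow engine version)
    trace_id: str      # identifier of the provenance trace produced

def changed_elements(old: RunRecord, new: RunRecord) -> list[str]:
    """List which elements evolved between two executions."""
    return [f for f in ("workflow", "ext_deps", "data", "wfms")
            if getattr(old, f) != getattr(new, f)]

# Example matching the table above: between t3 and t4 only the engine changed.
t3 = RunRecord("t3", "W2", "ED3", "d3", "wfms1", "tr3")
t4 = RunRecord("t4", "W2", "ED3", "d3", "wfms4", "tr4")
assert changed_elements(t3, t4) == ["wfms"]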
13. Reproducibility
Can a new version tr_{t'} of tr_t be computed at some later time t' > t, after one
or more of the elements has changed?
tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k ≤ t
tr_{t'} = exec_{t'}(W_{i'}, ED_{j'}, d_{h'}, wfms_{k'}), with i', j', h', k' ≤ t'
time   W     ED    d     wfms     trace
t1     W1    ED1   d1    wfms1    tr1 = exec1(W1, ED1, d1, wfms1)
t2     W2    -     -     -        tr2 = exec2(W2, ED1, d1, wfms1)
t3     -     ED3   d3    -        tr3 = exec3(W2, ED3, d3, wfms1)
t4     -     -     -     wfms4    tr4 = exec4(W2, ED3, d3, wfms4)
t5     -     ED5   -     -        tr5 = exec5(W2, ED5, d3, wfms4)
Potential issues:
• W_i may not run with the new ED_{j'}
• W_i may not run with wfms_{k'}
• W_{i'} may not run with d_{h'}
• ...
14. Data divergence analysis using provenance
• All work done with reference to the e-Science Central WFMS
• Assumption: workflow WFj (new version) runs to completion
– thus it produces a new provenance trace
– however, it may be dysfunctional relative to WFi (the original)
• Example: only input data changes: d != d’, WFj == WFi
tr_t = exec_t(W, ED, d, wfms),  tr_{t'} = exec_{t'}(W, ED, d', wfms)
[Figure: example workflow with services S0, S1, S2, S3, S4]
Note: results may diverge even when the input datasets are identical, for example when one
or more of the services exhibits non-deterministic behaviour, or depends on external state
that has changed between executions.
15. Reproducibility requires comparing datasets
• Experimenters may validate results by deliberately altering the
experimental settings (Wi′, dj′)
• The outcomes will not be identical, but
– are they similar enough to the original to conclude that the experiment was
successfully reproduced?
∆D(tr_t.O, tr_{t'}.O)
• Data comparison is type- and format-dependent in general
• Example:
– workflow output: a classification model computed using model builders
– two models may be different but statistically equivalent
• e-Science Central accommodates user-defined data diff blocks
– these are just Java-based workflow blocks
if ∆D(tr_t.O, tr_{t'}.O) > threshold:
why are the results diverging?
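As a sketch only: in e-Science Central such diff blocks are Java workflow blocks; the hypothetical Python function below illustrates one possible ∆D for the classification-model example, comparing held-out accuracy rather than the models themselves:

# Hypothetical, type-specific Delta_D for two classification models:
# compare their predictions on a shared held-out dataset. Two models that
# differ internally but predict equally well score close to 0.
def model_delta(preds_a, preds_b, labels):
    n = len(labels)
    acc_a = sum(p == y for p, y in zip(preds_a, labels)) / n
    acc_b = sum(p == y for p, y in zip(preds_b, labels)) / n
    return abs(acc_a - acc_b)   # other statistics (e.g. a McNemar test) could be used

THRESHOLD = 0.05   # assumed tolerance for "statistically equivalent" results

def reproduced(preds_a, preds_b, labels):
    # mirrors the condition above: divergence beyond the threshold
    # triggers the question "why are the results diverging?"
    return model_delta(preds_a, preds_b, labels) <= THRESHOLD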
16. Provenance traces for two runs
[Figure: provenance traces for two runs of the same workflow, with used and genBy edges. (i) Trace A: inputs d1 and d2 feed S0 and S1, producing intermediates d3, z and w; these feed S3 and S2, producing x and y, which S4 combines into the final output df. (ii) Trace B: the changed inputs d1' and d2' lead to changed intermediates w' and y' and a changed final output df'.]
17. Delta graphs
A graph obtained as the result of a “diff” over two traces, which can be used to explain
observed differences in workflow outputs in terms of differences throughout the two executions.
[Figure: Traces A and B side by side, with (iii) the resulting delta tree: the diverging output pair ⟨df, df'⟩ is explained by ⟨y, y'⟩, which is explained by ⟨w, w'⟩, which in turn is explained by the diverging input pair ⟨d2, d2'⟩. This is the simplest possible delta “graph”: a single chain from a diverging output back to the diverging input.]
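To convey the idea behind the delta graph (a rough sketch, not the actual PDIFF algorithm; the trace layout and names are assumptions): starting from a pair of diverging outputs, walk both traces backwards through genBy and used, recording every pair of corresponding data items or activities that differ.

# Rough sketch of backward trace comparison (hypothetical data layout):
# trace['gen_by'] maps a data item to the activity that generated it;
# trace['used'] maps an activity to its input data items, aligned by port.
def delta_graph(trace_a, trace_b, out_a, out_b):
    deltas = []                      # diverging pairs, output first
    frontier = [(out_a, out_b)]
    seen = set()
    while frontier:
        da, db = frontier.pop()
        if (da, db) in seen:
            continue
        seen.add((da, db))
        if da == db:                 # identical values: nothing to explain
            continue
        deltas.append((da, db))      # record the diverging data pair
        act_a = trace_a['gen_by'].get(da)
        act_b = trace_b['gen_by'].get(db)
        if act_a is None or act_b is None:
            continue                 # reached a workflow input
        if act_a != act_b:           # e.g. service replaced or upgraded
            deltas.append((act_a, act_b))
        # compare, port by port, the inputs the two activities consumed
        for ia, ib in zip(trace_a['used'].get(act_a, []),
                          trace_b['used'].get(act_b, [])):
            frontier.append((ia, ib))
    return deltas

In practice a type-specific ∆D (previous slide), rather than plain equality, would decide whether two data items count as diverging.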
18. More involved workflow differences
[Figure: two versions of a workflow, WA and WB]
• S0 is followed by S0' in WA but not in WB;
• S3 is preceded by S3' in WB but not in WA;
• S2 in WA is replaced by a new version, S2v2, in WB;
• S1 in WA is replaced by S5 in WB.
19. The corresponding traces
tr_t = exec_t(W, ED, d, wfms),  tr_{t'} = exec_{t'}(W', ED', d, wfms)
[Figure: (i) Trace A: d0 → S → d1 → S0 → d2; S0' and S1 then produce w, h and k, which feed S3 and S2; their outputs y and z feed S4, producing x. (ii) Trace B: d0 → Sv2 → d1' → S0 → d2; S3' and S5 produce w', h' and k', which feed S3 and S2v2; their outputs y' and z' feed S4, producing x'.]
20. Delta graph computed by PDIFF
[Figure: the delta graph computed by PDIFF for Traces A and B. The diverging output pair ⟨x, x'⟩ is traced back along the P0 and P1 branches of S4 to ⟨y, y'⟩ and ⟨z, z'⟩, then to ⟨S2, S2v2⟩ (version change) and, via ⟨w, w'⟩, ⟨h, h'⟩ and ⟨k, k'⟩ on the P0 and P1 branches of S2, further back to ⟨S0', S3'⟩, ⟨S1, S5⟩ (service replacement), ⟨d1, d1'⟩ and ⟨S, Sv2⟩ (version change).]
21. Summary
• Setting:
– scientific results computed using workflows
– openness / data sharing has potential to accelerate science
– but requires results validation and reproducibility
• Problem: reproducibility is hard to achieve
– workflow decay
– evolution of data, workflow spec, dependencies, wf engine
• Goal: support divergence analysis
• Approach: PDIFF -- comparing provenance traces generated during
the runs
22. Selected references
Zhao, Jun, Jose Gomez-Perez, Khalid Belhajjame, Graham Klyne, et al. Why Workflows Break - Understanding and Combating Decay in Taverna Workflows. In Procs. e-Science Conference, Chicago, 2012.
Cohen-Boulakia S, Leser U. Search, adapt, and reuse: the future of scientific workflows. SIGMOD Rec. Sep 2011; 40(2):6–16, doi:10.1145/2034863.2034865
Peng RD, Dominici F, Zeger SL. Reproducible Epidemiologic Research. American Journal of Epidemiology 2006; 163(9):783–789, doi:10.1093/aje/kwj093
Drummond C. Replicability is not Reproducibility: Nor is it Good Science. Procs. 4th Workshop on Evaluation Methods for Machine Learning, in conjunction with ICML 2009, Montreal, Canada, 2009.
Peng R. Reproducible Research in Computational Science. Science Dec 2011; 334(6060):1226–1227
Schwab M, Karrenbach M, Claerbout J. Making Scientific Computations Reproducible. Computing in Science & Engineering 2000; 2(6):61–67
P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013. In press.
Mesirov J. Accessible Reproducible Research. Science 2010; 327