This is a keynote that I gave at the PoliWeb workshop on the state of the art in data science reproducibility. In the first part, I review tools that have been developed over the last few years. In the second part, I focus on proposals that I have been involved in to facilitate workflow reproducibility and preservation.
2. “Science is built upon the foundations of theory and experiment validated and improved through open, transparent communication. With the increasingly central role of computation in scientific discovery, this means communicating all details of the computations needed for others to replicate the experiment.”
V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to
reproducible: Reproducibility in computational and experimental mathematics.
Khalid Belhajjame @ PoliWeb Workshop, 2014
3. Basic studies on cancer are unreliable, with grim consequences for producing new medicines in the future.
4. The research result obtained by Stapel and co-workers Roos Vonk (Radboud University) and Marcel Zeelenberg (Tilburg University), showing that meat eaters are more selfish than vegetarians, was widely publicized in Dutch media and is suspected to be based on faked data.
5. • Replication means conducting studies with independent:
  - Investigators
  - Data
  - Methods
  - Laboratories
  - Instruments
• Replication is the ultimate standard for strengthening evidence and trust in scientific findings.
• However, replication is most of the time not possible: expensive (time and money), opportunistic.
6. [Figure: a spectrum running from the scholarly article alone ("is not enough", "(Re)useless"), through reproducibility, to full replication ("way too expensive").]
Reproducible Research: make data and code available so that others may reproduce findings.
8. • The huge increases in performance, at the level of both hardware and software, mean that highly complex analyses are possible.
• However, these same advances mean a higher risk of generating results that cannot be reproduced.
9. • Researchers in experimental biology carefully use lab notebooks to document different aspects of their experiments.
• This is not the case for computational scientists, who tend to run their analyses with no clear record of the exact process they followed or the intermediary datasets (results) they used and generated.
• It is therefore possible that numerous published results may be unreliable or even completely invalid.
10. • Often, there is no record of the process (workflow) that produced the computational results published in scholarly communications.
• Even the code is missing, or has undergone changes.
  - It cannot be used to process the data referred to (if we are lucky).
11. “The reproducible research movement recognizes that traditional scientific research and publication practices now fall short …, and encourages all those involved in the production of computational science ... to facilitate and practice really reproducible research.”
We have recently witnessed the emergence of a number of methods and tools for enabling reproducibility.
V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to
reproducible: Reproducibility in computational and experimental mathematics.
12. • System-level reproducibility: ReproZip, Burrito, ES3
• Scripting-oriented reproducibility: IPython, Knitr, IJulia
• Workflow-oriented reproducibility: Galaxy, Taverna, VisTrails
• Article-centered reproducibility: SOLE, DEEP, SHARE
• Investigation-oriented reproducibility: ISA, Research Object, FuGE
13. Packing experiments (authors) with ReproZip.
[Figure: the experiment is executed in computational environment E; ReproZip captures the provenance of the execution, with processes p and p’, as a provenance tree.]
14. Packing experiments (authors) with ReproZip: capture of provenance.
[Figure: the experiment is executed in computational environment E; ReproZip captures the provenance of processes p and p’.]
Provenance captured for a process (e.g., process p’):
• command-line arguments
• working directory
• files read
• files written
• …
15. Packing experiments (authors) with ReproZip.
[Figure: the experiment is executed in computational environment E; capture of provenance yields a provenance tree, from which the necessary components are identified.]
Identification of necessary components:
• Description of data: input and output files
• Description of experiment: executable programs and steps
• Description of environment: environment variables, dependencies, …
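The capture and identification steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not ReproZip's actual data model or API: the `ProcessRecord` class, its field names, and the `necessary_files` helper are all hypothetical, and the rule "a file that is read but never written by the experiment is an input" is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ProcessRecord:
    """Hypothetical node of a provenance tree: one traced process."""
    argv: List[str]                      # command-line arguments
    cwd: str                             # working directory
    files_read: Set[str] = field(default_factory=set)
    files_written: Set[str] = field(default_factory=set)
    children: List["ProcessRecord"] = field(default_factory=list)


def walk(root: ProcessRecord):
    """Yield every process in the tree, parents before children."""
    yield root
    for child in root.children:
        yield from walk(child)


def necessary_files(root: ProcessRecord) -> Tuple[Set[str], Set[str]]:
    """Split the files touched by the experiment into inputs and outputs."""
    read: Set[str] = set()
    written: Set[str] = set()
    for proc in walk(root):
        read |= proc.files_read
        written |= proc.files_written
    # Assumption: files read but never produced by the run must be packed as inputs.
    return read - written, written


child = ProcessRecord(["sort", "data.csv"], "/exp",
                      files_read={"/exp/data.csv"},
                      files_written={"/exp/sorted.csv"})
root = ProcessRecord(["python", "run.py"], "/exp",
                     files_read={"/exp/run.py", "/exp/sorted.csv"},
                     files_written={"/exp/result.txt"},
                     children=[child])
inputs, outputs = necessary_files(root)
```

Here `inputs` holds the files the package would need to ship, while `outputs` lists what the run produced, including the intermediate `sorted.csv`.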
17. Packing experiments (authors) with ReproZip.
[Figure: execution of the experiment yields a provenance tree; the necessary components are identified (input and output files; executable programs and steps; environment variables, dependencies, …); a workflow is then specified (VisTrails workflow) and a reproducible package is produced.]
Figure taken from Chirigati et al., 2012
19. • IPython provides a rich architecture for interactive computing with:
  - A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
  - Support for interactive data visualization and use of GUI toolkits.
22. • Inputs to computational science are not linked with its outputs.
  - Inputs: large quantities of data, complex data manipulation and/or numerical simulation, use of large and often distributed software stacks.
  - Outputs: research papers (text-based, non-interactive).
• Authors and readers approach computational science from opposite directions.
• The objective of SOLE is to link research papers with auxiliary resources that have been utilized, e.g., datasets, software programs, files, etc.
24. • Assists users in submitting structured content via simple templates and an internal authoring tool
• Performs value-added semantic annotation of the experimental metadata
28. • Data-driven analysis pipelines
• Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving
• Tools for automating frequently performed data-intensive activities
• Provenance for the resulting datasets
  - The method followed
  - The resources used
  - The datasets used
29. • GWAS, pharmacogenomics: association study of Nevirapine-induced skin rash in Thai population
• Trypanosomiasis (sleeping sickness parasite) in African cattle
• Astronomy, heliophysics
• Library document preservation
• Systems biology of micro-organisms
• Observing Systems Simulation Experiments (JPL, NASA)
• Biodiversity: invasive species modelling
[Credit Carole A. Goble]
30. • Scientific workflows are primarily used to specify and enact in silico experiments.
• However, they can also be used as a means to document the experiment that the scientist ran, and even to repurpose it!
[Figure: example workflow in which two KEGG pathway queries, fed with chromosome17 and chromosome37, feed a "Detect common pathways" step that outputs the common pathways.]
Scientific workflows:
• Increasingly adopted in modern sciences
• Transparent documentation of experimental methods
• Repeatable and configurable
31. • Workflow decay: a decayed or reduced ability to be executed or to produce the same results
• To better understand workflow decay, we conducted an empirical analysis to identify its causes.
• To do so, we analyzed a sample of real workflows to determine whether they suffer from decay and what caused it.
32. • Taverna workflows from myExperiment.org
  - Taverna 1
  - Taverna 2
• Selection process
  - By the creation year
  - By the creator
  - By the domain
• Software environment
  - Taverna 2.3
• Experiment metadata
  - June-July 2012
  - 4 researchers
33. Number of Taverna 1 workflows from 2007 to 2011
          2007   2008   2009   2010   2011
Tested      12     10     10     10     4*
Total       74    341    101     26     13

Number of Taverna 2 workflows from 2009 to 2012
          2009   2010   2011   2012
Tested      12     10     15      9
Total       97    308    289    184
35. • 75% of the 92 tested workflows failed either to execute or to produce the same result (if testable)
• Those from early years (2007-2009) had a 91% failure rate
[Figure: results shown separately for Taverna 1 and Taverna 2 workflows.]
36. • Manual analysis
  - By the validation report from the Taverna workbench
  - By interpreting experiment results reported by Taverna
• Identified 4 categories of causes
  - Missing example data
  - Missing execution environment
  - Insufficient descriptions about workflows
  - Volatile third-party resources
• Other unconsidered possible factors
  - Changes in the local operating environment (hardware, OS, middleware, compiler, etc.)
37. Causes, refined causes, and examples:
• Third-party resources are not available
  - Refined cause: the underlying dataset, particularly a locally hosted in-house dataset, is no longer available.
    Example: the researcher hosting the data changed institution, and the server is no longer available.
  - Refined cause: services are deprecated.
    Example: DDBJ web services are no longer provided despite the fact that they are used in many myExperiment workflows.
• Third-party resources are available but not accessible
  - Refined cause: data is available but identified using different IDs than the ones known to the user.
    Example: for scalability reasons the input data is superseded by new data, making the workflow not executable or causing it to provide wrong results.
  - Refined cause: data is available but permission, a certificate, or network access is needed to reach it.
    Example: cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider.
  - Refined cause: services are available but need permission, a certificate, or network access to be invoked.
    Example: the security policies of the execution framework are updated due to new hosting institution rules.
• Third-party resources have changed
  - Refined cause: services are still available under the same identifiers but their functionality has changed.
    Example: the web services are updated.
38. • 50% of the decay was caused by volatility of third-party resources
  - Unavailable
  - Inaccessible
  - Updated
• Missing example data
  - Unable to re-run
• Missing execution environment
  - Such as local plugins
• Insufficient metadata
  - Such as any required dependency libraries or permission information
40. • Some services that compose workflows are annotated using concepts from domain ontologies
• Such annotations can be used to repair workflows
  - Identify available services that can play the same role as an unavailable service within a workflow.
41. Task ontology: captures information about the action carried
out by service operations within a domain of interest, e.g.,
Sequence_alignment and Protein_identification
Domain ontology: captures information about the application
domains covered by operation parameters, e.g., Protein_record
and DNA_sequence
42. Task replaceability: For an operation op2 to be able to substitute
an operation op1, op2 must fulfil a task that is equivalent to or
subsumes the task op1 performs:
43. Parameter replaceability: To be compatible, the domain of the output must be the same as, or a subconcept of, the domain of the subsequent input.
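The two replaceability conditions can be sketched as subsumption checks over a concept hierarchy. The snippet below is a minimal illustration under stated assumptions: the toy ontology (a plain child-to-parent dictionary), the concepts `Pairwise_sequence_alignment`, `Biological_record`, `Task` and `Data`, and the function names are hypothetical; a real setting would query an OWL reasoner rather than a dictionary.

```python
# Hypothetical toy ontology: each concept maps to its direct parent (None for a root).
PARENT = {
    "Task": None,
    "Sequence_alignment": "Task",
    "Pairwise_sequence_alignment": "Sequence_alignment",
    "Protein_identification": "Task",
    "Data": None,
    "Biological_record": "Data",
    "Protein_record": "Biological_record",
    "DNA_sequence": "Data",
}


def subsumes(general, specific, parent=PARENT):
    """True if `general` is the same concept as `specific` or one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = parent[node]
    return False


def task_replaceable(task_op1, task_op2):
    """op2 may substitute op1 if op2's task is equivalent to or subsumes op1's task."""
    return subsumes(task_op2, task_op1)


def parameter_compatible(output_domain, input_domain):
    """An output can feed an input if its domain is the same as or a subconcept of the input's."""
    return subsumes(input_domain, output_domain)
```

For example, `task_replaceable("Pairwise_sequence_alignment", "Sequence_alignment")` holds because the substitute's task is more general, while `parameter_compatible("DNA_sequence", "Protein_record")` does not.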
44. While the method just presented is sound, its practical applicability is hindered by the following facts:
• Semantic annotations of web services are scarce.
• Our experience suggests that a large proportion of existing semantic annotations suffer from inaccuracies.
• As a result, a substitute that is discovered for replacing an unavailable operation using such annotations may turn out to be unsuitable and, inversely, a suitable substitute may be discarded.
46. Formally, let wf1 be a workflow in which the operation op1 is unavailable. The operation op2 can replace the operation op1 in terms of its inputs and outputs if:
47. • In addition to the compatibility in terms of inputs and outputs, we have to check that the candidate substitute performs a task compatible with that of the unavailable operation.
• To perform this test, we exploit the following observation. An operation op2 is able to replace the operation op1 in terms of task if, for every possible input instance that op1 is able to consume, op2 delivers the same output as that obtained by invoking op1.
• To perform the above test, however, we would have to call the missing operation op1!
• The solution that we adopt for overcoming this problem makes use of workflow provenance logs. These are traces that contain the intermediate data that were used as input and delivered as output by the constituent operations of a workflow when enacted.
48. • An operation op2 may be compatible in terms of task with op1 if: op2 delivers the same results that op1 delivered in past executions, as recorded in provenance logs, when fed with the same input values.
• Notice that we say may be compatible. This is because we may not be able to compare the outputs obtained for every possible input value of the operation op1.
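The provenance-based test can be sketched as a replay: feed the candidate with the inputs that the missing operation consumed in the past and compare against the logged outputs. This is an illustrative sketch only. The log is assumed to be a list of (input, output) pairs, operations are assumed to be plain Python callables, and the name `may_replace` is hypothetical.

```python
def may_replace(candidate, provenance_log):
    """Replay the logged inputs of the missing operation against `candidate`.

    `provenance_log` is assumed to be a list of (input_value, output_value)
    pairs recorded while the missing operation was still available.
    A True result only means "may be compatible": the log rarely covers
    every possible input value.
    """
    if not provenance_log:
        return False  # no evidence to support the substitution
    return all(candidate(inp) == out for inp, out in provenance_log)


# Hypothetical log of a missing operation that upper-cased DNA strings.
log_op1 = [("acgt", "ACGT"), ("ttga", "TTGA")]
```

With this log, `may_replace(str.upper, log_op1)` succeeds while `may_replace(str.lower, log_op1)` fails on the first logged pair.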
49. • The condition that we have described for checking the suitability of an operation as a substitute for another one may be stronger than is required in practice.
• There are various parameter representations that are adopted in bioinformatics.
• Because of representation mismatch, a service operation that performs a task similar to the missing operation may be found to be unsuitable.
50. Example of values delivered by two operations using the same input value:
[Figure: Value1 and Value2 shown side by side.]
CosSym(value1, value2) = 0.007
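To see why a raw textual comparison can score so low for two values that describe the same thing, here is a minimal sketch that assumes CosSym denotes a cosine similarity over whitespace-separated token counts. The slide does not say how the score was computed, so both the tokenization and the function name `cosine_similarity` are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the token-count vectors of two strings."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty value shares nothing with anything
    return dot / (norm_a * norm_b)
```

Two records for the same entity in different formats share few tokens, so the score approaches zero even though the values are semantically equivalent.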
51. To overcome this problem, we use a two-step process when comparing the values of parameters:
1. Given a parameter value, we derive its representation.
2. If the representation is associated with a key attribute (identifier), extract the value of that attribute.
If two parameter values are associated with identifiers, then they are compared by comparing their identifiers.
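The two-step comparison can be sketched as follows. The heuristics are deliberately crude and are assumptions made for illustration: a value starting with `>` is treated as a UniProt-style FASTA record (`>db|ACCESSION|NAME ...`), a value starting with an `ID` line is treated as a UniProt flat-file entry whose accession sits on the `AC` line, and all function names are hypothetical.

```python
import re


def derive_representation(value):
    """Step 1: guess the representation of a parameter value."""
    text = value.lstrip()
    if text.startswith(">"):
        return "fasta"
    if text.startswith("ID   "):
        return "uniprot"
    return "unknown"


def extract_identifier(value):
    """Step 2: extract the key attribute (identifier), if the representation has one."""
    representation = derive_representation(value)
    if representation == "fasta":
        match = re.match(r">\w+\|([^|\s]+)\|", value.lstrip())
        return match.group(1) if match else None
    if representation == "uniprot":
        match = re.search(r"^AC\s+([^;\s]+);", value, flags=re.MULTILINE)
        return match.group(1) if match else None
    return None


def same_value(value1, value2):
    """Compare by identifier when both values carry one, else fall back to equality."""
    id1, id2 = extract_identifier(value1), extract_identifier(value2)
    if id1 is not None and id2 is not None:
        return id1 == id2
    return value1 == value2


# Made-up records for illustration; P12345 is a placeholder accession.
fasta = ">sp|P12345|EXAMPLE_HUMAN Hypothetical protein\nMKTAYIAK\n"
flat = ("ID   EXAMPLE_HUMAN   Reviewed;   8 AA.\n"
        "AC   P12345;\n"
        "SQ   SEQUENCE   8 AA;\n"
        "     MKTAYIAK\n//\n")
```

The two sample values differ almost entirely as text, yet `same_value(fasta, flat)` treats them as equal because both carry the same accession.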
52. Example of values delivered by two operations using the same input value:
[Figure: Value1 in Fasta format and Value2 in Uniprot format.]
54. • Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments.
• They tend to be data intensive.
• The data sets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed and used to feed the execution of other workflows.
55. • The datasets obtained as a result of workflow execution often contain duplicates.
• As a result:
  - The analysis and interpretation of workflow results may become tedious.
  - The presence of duplicates also unnecessarily increases the size of workflow results.
56. • Research in duplicate record detection has been active for more than three decades.
  - Elmagarmid et al., 2007 conducted a comprehensive survey of the topic.
• We do not aim to design yet another algorithm for comparing and matching records.
• Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results.
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.
57. • A data-driven workflow can be defined as a directed graph:
$wf = \langle N, E \rangle$
• A node represents an analysis operation, which has a set of input and output parameters:
$\langle op, I_{op}, O_{op} \rangle \in N$
• The edges are dataflow dependencies:
$\langle \langle op, o \rangle, \langle op', i \rangle \rangle \in E$
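The graph definition maps directly onto a small data structure. The sketch below is one possible encoding, not the author's implementation: the class and field names are hypothetical, and the example workflow reuses the IdentifyProtein and GetGOTerm operations that appear later in the talk, with parameters i, o, i' and o'.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Operation:
    """A node <op, I_op, O_op>: an operation with its input and output parameters."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


@dataclass(frozen=True)
class Edge:
    """A dataflow dependency <<op, o>, <op', i>>: output o of op feeds input i of op'."""
    src_op: str
    src_out: str
    dst_op: str
    dst_in: str


@dataclass
class Workflow:
    nodes: Dict[str, Operation]
    edges: List[Edge]

    def dangling_edges(self) -> List[Edge]:
        """Return edges that refer to an unknown operation or parameter."""
        bad = []
        for e in self.edges:
            src, dst = self.nodes.get(e.src_op), self.nodes.get(e.dst_op)
            if (src is None or dst is None
                    or e.src_out not in src.outputs
                    or e.dst_in not in dst.inputs):
                bad.append(e)
        return bad


wf = Workflow(
    nodes={
        "IdentifyProtein": Operation("IdentifyProtein", ("i",), ("o",)),
        "GetGOTerm": Operation("GetGOTerm", ("i'",), ("o'",)),
    },
    edges=[Edge("IdentifyProtein", "o", "GetGOTerm", "i'")],
)
```

Keeping parameters on the node and referring to them by name on the edge mirrors the pair-of-pairs form of the edge definition.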
58. The execution of workflows gives rise to a provenance trace, which we capture using two relations.
• Transformation: to specify that the execution of an operation took as input a given ordered set of records and generated another ordered set of records.
$\underbrace{\langle \langle op, o_1, r_{o_1} \rangle, \ldots, \langle op, o_m, r_{o_m} \rangle \rangle}_{OutB_{op}} \leftarrow \underbrace{\langle \langle op, i_1, r_{i_1} \rangle, \ldots, \langle op, i_n, r_{i_n} \rangle \rangle}_{InB_{op}}$
• Transfer: to specify the transfer of records along the edges of the workflow.
$\langle op', i', r \rangle \leftarrow \langle op, o, r \rangle$
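A provenance trace built from these two relations can be represented with plain tuples. The following sketch is illustrative only: the type names `Binding`, `Transformation` and `Transfer`, the record identifiers, and the direction of the derivation (outputs derived from inputs) are assumptions about how to read the relations.

```python
from typing import NamedTuple, Tuple


class Binding(NamedTuple):
    """A triple <op, parameter, record>: a record bound to a parameter of an operation."""
    op: str
    param: str
    record: str


class Transformation(NamedTuple):
    """One execution of an operation: output bindings derived from input bindings."""
    outputs: Tuple[Binding, ...]   # OutB_op
    inputs: Tuple[Binding, ...]    # InB_op


class Transfer(NamedTuple):
    """A record moved along a workflow edge, unchanged."""
    target: Binding   # input binding of the downstream operation
    source: Binding   # output binding of the upstream operation


# Hypothetical trace for one record flowing through IdentifyProtein -> GetGOTerm.
t1 = Transformation(
    outputs=(Binding("IdentifyProtein", "o", "ro1"),),
    inputs=(Binding("IdentifyProtein", "i", "ri1"),),
)
x1 = Transfer(
    target=Binding("GetGOTerm", "i'", "ro1"),
    source=Binding("IdentifyProtein", "o", "ro1"),
)
```

A transfer carries the same record on both sides, whereas a transformation relates different records through the execution of an operation.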
59. To guide the detection of duplicates in workflow results we exploit the following fact:
• An operation that is known to be deterministic produces identical output bindings given the same input binding.
$deterministic(op) \Rightarrow \big( (OutB_{op} \leftarrow InB_{op}) \in T \wedge (OutB'_{op} \leftarrow InB_{op}) \in T \Rightarrow id(OutB_{op}, OutB'_{op}) \big)$
60. Provenance-Guided Detection of Duplicates: Example
[Figure: workflow in which IdentifyProtein (input i, output o) feeds GetGOTerm (input i’, output o’); the record sets Ri, Ro, R’i and R’o are bound to i, o, i’ and o’ respectively.]
1. The set of records Ri that are bound to the input parameter of the starting operation are compared to identify duplicate records. The result of this phase is a partition of Ri into disjoint sets of identical records:
$R_i = R_i^1 \cup \ldots \cup R_i^n$
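61. Provenance-Guided Detection of Duplicates: Example
[Figure: workflow in which IdentifyProtein (input i, output o) feeds GetGOTerm (input i’, output o’); the record sets Ri, Ro, R’i and R’o are bound to i, o, i’ and o’ respectively.]
2. The sets of records Ro, R’i and R’o are partitioned into sets of identical records based on the partitioning of Ri. For example:
$R_o = R_o^1 \cup \ldots \cup R_o^n$
$R_o^k = \{ r_o \in R_o \ \text{s.t.}\ \exists\, r_i \in R_i^k,\ \langle IdentifyProtein, o, r_o \rangle \leftarrow \langle IdentifyProtein, i, r_i \rangle \}$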
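The two phases of the example can be sketched as follows. This is a simplified illustration, not the algorithm from the talk: it assumes a single-input, single-output deterministic operation, represents the provenance trace as a dictionary from each output record identifier to the input record identifier it was derived from, and uses hypothetical function names and record values.

```python
from collections import defaultdict


def partition_inputs(records, same):
    """Phase 1: group input records into disjoint blocks of identical records."""
    blocks = []
    for record in records:
        for block in blocks:
            if same(record, block[0]):
                block.append(record)
                break
        else:
            blocks.append([record])
    return blocks


def propagate(blocks, derived_from):
    """Phase 2: outputs derived from inputs of the same block are duplicates.

    Only valid if the operation is deterministic; no output values are compared.
    """
    block_of = {rec: k for k, block in enumerate(blocks) for rec in block}
    out_blocks = defaultdict(list)
    for out_rec, in_rec in derived_from.items():
        out_blocks[block_of[in_rec]].append(out_rec)
    return [out_blocks[k] for k in range(len(blocks))]


# Hypothetical input records (id -> value) and provenance (output id -> input id).
values = {"ri1": "PEPTIDE_A", "ri2": "PEPTIDE_A", "ri3": "PEPTIDE_B"}
derived_from = {"ro1": "ri1", "ro2": "ri2", "ro3": "ri3"}

input_blocks = partition_inputs(list(values), lambda a, b: values[a] == values[b])
output_blocks = propagate(input_blocks, derived_from)
```

Record comparison happens only once, on the inputs of the starting operation; every downstream partition is obtained by following provenance links.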
62. Provenance-Guided Detection of Duplicates: Example
• In the example just described, the operations that compose the workflow have exactly one input and one output parameter.
  - However, the algorithm we developed supports operations with multiple input and output parameters.
• Notice that we assume that the analysis operations that compose the workflow are deterministic. This is not always the case.
  - This raises the question of how to determine that a given operation is deterministic.
63. To verify the determinism of operations, we use an approach whereby operations are probed.
1. Given an operation op, we select example values that can be used by the inputs of op, and invoke op using those values multiple times.
2. If op produces identical output values given identical input values, then it is likely to be deterministic; otherwise, it is not deterministic.
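The probing procedure can be sketched in a few lines. This is a minimal illustration assuming operations are Python callables with a single argument; the name `probe_determinism`, the `repetitions` parameter and the sample operations are hypothetical.

```python
import itertools


def probe_determinism(op, example_inputs, repetitions=3):
    """Invoke `op` several times per example value and compare the outputs.

    Returns False as soon as two invocations on the same input disagree.
    A True result only means "likely deterministic": probing cannot prove it.
    """
    for value in example_inputs:
        first = op(value)
        for _ in range(repetitions - 1):
            if op(value) != first:
                return False
    return True


# A stateful operation whose output changes on every call, for illustration.
_counter = itertools.count()


def stamped(value):
    return f"{value}-{next(_counter)}"
```

Probing `str.upper` with a few example strings reports it as likely deterministic, whereas `stamped` is rejected on its second invocation.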
64. To support duplicate detection in collection-based workflows we need to be able to:
• Identify when two collections are identical
  Two collections Ri and Rj are identical if they are of the same size and there is a bijective mapping, map : Ri → Rj, that maps each record ri in Ri to a record rj in Rj such that ri and rj are identical.
• Identify duplicate records between two collections that are known to be identical
  Identify a bijective mapping that maps every ri in Ri to an identical rj in Rj.
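Both requirements reduce to finding a bijective mapping between the two collections. The sketch below uses greedy matching, which is sufficient under the assumption that record identity is an equivalence relation; the function name `find_bijection` and the index-based result format are hypothetical.

```python
def find_bijection(ri, rj, same):
    """Return a dict mapping each index of `ri` to the index of an identical record in `rj`.

    Returns None if the collections differ in size or no bijection exists.
    Greedy matching is enough when `same` is an equivalence relation.
    """
    if len(ri) != len(rj):
        return None
    unmatched = list(range(len(rj)))
    mapping = {}
    for a_idx, a in enumerate(ri):
        for b_idx in unmatched:
            if same(a, rj[b_idx]):
                mapping[a_idx] = b_idx
                unmatched.remove(b_idx)  # each record of rj is used at most once
                break
        else:
            return None  # no identical partner left for this record
    return mapping


def identical_collections(ri, rj, same):
    """Two collections are identical if a bijection of identical records exists."""
    return find_bijection(ri, rj, same) is not None
```

The returned mapping also answers the second requirement: each pair of indices it contains identifies a pair of duplicate records across the two collections.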
66. • Overwhelming for users who are not the developers
• Abstractions required for reporting
• Lineage queries result in very long trails
67. • a.k.a. Shims (D. Hull et al.)
• Dealing with data and protocol heterogeneities
• Local organization of data
~ 60% (Garijo D., Alper P., Belhajjame K. et al.)
68. Process-wise and data-wise abstractions
• Sub-workflows
  - Not always a significant unit of function (e.g. aesthetic purposes)
• Bookmarked data links
  - Cluster the output signature
  - Further complicates workflow
• Components
  - Library dependent
69. • A graph model for representing workflows
• Graph re-write rules for summarization
IF performs certain function (motifs) THEN re-write WF graph (reduction primitives)
71. Pure dataflows:
$W = \langle N, E \rangle$
Operation and port nodes:
$N = N_{op} \cup N_{p}$
Dataflow edges:
$E = E_{op \to p} \cup E_{p \to p} \cup E_{p \to op}$
77. • Strategies as a set of rules for summarization
• Two sample strategies based on an empirical analysis of workflows
• Reporting:
  - Process: significant activities (Retrieval, Analysis, Visualization)
  - Data: reduced cardinality; stripped of protocol-specific payload/formatting
78. • By-Eliminate
  - Minimal annotation effort
  - Single rule
• By-Collapse
  - More specific annotation
  - Multiple rules
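An elimination rule of the form "IF a step performs a certain function THEN re-write the workflow graph" could look like the sketch below. This is one plausible reading, not the strategy presented in the talk: it works on a simplified graph of operations only (no port nodes), assumes the graph is acyclic, and the motif label `DataPreparation`, the step names and the function name `eliminate` are hypothetical.

```python
def eliminate(nodes, edges, motif_of, motif):
    """Remove every node annotated with `motif`, reconnecting predecessors to successors.

    `nodes` is a list of operation names, `edges` a collection of (source, target)
    pairs, and `motif_of` maps an operation name to its motif annotation.
    Assumes an acyclic graph without self-loops.
    """
    edges = set(edges)
    for node in [n for n in nodes if motif_of.get(n) == motif]:
        preds = {a for (a, b) in edges if b == node}
        succs = {b for (a, b) in edges if a == node}
        # Drop all edges touching the node, then bridge the gap it leaves.
        edges = {(a, b) for (a, b) in edges if node not in (a, b)}
        edges |= {(p, s) for p in preds for s in succs}
    kept = [n for n in nodes if motif_of.get(n) != motif]
    return kept, edges


# Hypothetical three-step workflow with one shim-like step in the middle.
nodes = ["Retrieve", "Reformat", "Analyze"]
edges = {("Retrieve", "Reformat"), ("Reformat", "Analyze")}
motif_of = {"Retrieve": "Retrieval", "Reformat": "DataPreparation", "Analyze": "Analysis"}

summary_nodes, summary_edges = eliminate(nodes, edges, motif_of, "DataPreparation")
```

A collapse rule would instead merge the annotated step into a neighbour, which needs more specific annotations to decide which neighbour absorbs it.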
86. • Establishing trust, but also understanding and reusability, in computational science is needed more than ever
• Reproducibility seems to be a cost-effective solution
• A number of tools and methods have been developed for doing so
• However, … that is not enough
• Changing our ways (culture) of doing science is more challenging
87. • Pinar Alper
• Óscar Corcho
• Fernando Chirigati
• Juliana Freire
• David De Roure
• Yolanda Gil
• Daniel Garijo
• Carole Goble
• David Koop
• Stian Soiland-Reyes
• Paolo Missier
• Jun Zhao
• and many others …