Khalid Belhajjame
https://sites.google.com/site/kbelhajjame 
@kbelhajj
“Science is built upon the foundations of theory and experiment validated and improved through open, transparent communication. With the increasingly central role of computation in scientific discovery, this means communicating all details of the computations needed for others to replicate the experiment.”
V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to reproducible: Reproducibility in computational and experimental mathematics.
Khalid Belhajjame @ PoliWeb Workshop, 2014

basic studies on cancer are unreliable, with grim consequences for producing new medicines in the future

A research result obtained by Stapel and co-workers Roos Vonk (Radboud University) and Marcel Zeelenberg (Tilburg University), showing that meat eaters are more selfish than vegetarians and widely publicized in Dutch media, is suspected to be based on faked data.

¡ Replication means conducting studies with independent:
§ Investigators,
§ Data,
§ Methods,
§ Laboratories,
§ Instruments.
¡ Replication is the ultimate standard for strengthening evidence and trust in scientific findings.
¡ However, most of the time replication is not possible: it is expensive (time and money) and opportunistic.

[Figure: Scholarly Article ("is not enough"), Reproducible Research ("make data and code available so that others may reproduce findings") and Replication ("way too expensive"), with the labels Reproducibility, (Re)useless and Workability]

[Figure: Cost against Reproducibility Level, with the labels (Almost) Nothing, Replicability and Reproducibility]
¡ The huge increases in performance, at the level of both hardware and software, mean that highly complex analyses are possible.
¡ However, these same advances bring a higher risk of generating results that cannot be reproduced.

¡ Researchers in experimental biology carefully use lab notebooks to document different aspects of their experiments.
¡ This is not the case for computational scientists, who tend to run their analyses with no clear record of the exact process they followed or of the intermediary datasets (results) they used and generated.
¡ It is therefore possible that numerous published results are unreliable or even completely invalid.

¡ Often, there is no record of the process (workflow) that produced the published computational results in scholarly communications.
¡ Even the code is missing, or has undergone changes.
§ It cannot be used to process the data referred to (if we are lucky).

“The reproducible research movement recognizes that traditional scientific research and publication practices now fall short …, and encourages all those involved in the production of computational science ... to facilitate and practice really reproducible research.”
V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to reproducible: Reproducibility in computational and experimental mathematics.
We have recently witnessed the emergence of a number of methods and tools for enabling reproducibility.

¡ System-level reproducibility: ReproZip, Burrito, ES3
¡ Scripting-oriented reproducibility: IPython, Knitr, IJulia
¡ Workflow-oriented reproducibility: Galaxy, Taverna, VisTrails
¡ Article-centered reproducibility: SOLE, DEEP, SHARE
¡ Investigation-oriented reproducibility: ISA, Research Object, FuGE

Packing experiments with ReproZip (authors' side). Figure taken from Chirigati et al., 2012.
¡ Experiment: execution (processes p, p’) in the computational environment E
¡ Capture of provenance, for each process:
• command-line arguments
• working directory
• files read
• files written
• …
¡ Provenance tree
¡ Description of data, description of experiment, description of environment
¡ Identification of necessary components:
§ Input and output files
§ Executable programs and steps
§ Environment variables, dependencies, …
¡ Specification of workflow: VisTrails workflow
¡ Reproducible package

¡ IPython provides a rich architecture for interactive computing with:
§ A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
§ Support for interactive data visualization and use of GUI toolkits.

¡ Inputs to computational science are not linked with its outputs.
§ Inputs: large quantities of data, complex data manipulation and/or numerical simulation, use of large and often distributed software stacks.
§ Outputs: research papers (text-based, non-interactive).
¡ Authors and readers approach computational science from opposite directions.
¡ The objective of SOLE is to link research papers with the auxiliary resources that have been utilized, e.g., datasets, software programs, files, etc.

¡ Assists users in submitting structured content via simple templates and an internal authoring tool
¡ Performs value-added semantic annotation of the experimental metadata

Scientific Workflow Reproducibility: Combating Decay, Duplicate Detection, Summarization

¡ Data driven analysis pipelines 
¡ Systematic gathering of data and 
analysis tools into computational 
solutions for scientific problem-solving 
¡ Tools for automating frequently 
performed data intensive activities 
¡ Provenance for the resulting datasets 
§ The method followed 
§ The resources used 
§ The datasets used 
Application areas [Credit Carole A. Goble]:
¡ GWAS, pharmacogenomics: association study of Nevirapine-induced skin rash in Thai population
¡ Trypanosomiasis (sleeping sickness parasite) in African cattle
¡ Astronomy and heliophysics
¡ Library document preservation
¡ Systems biology of micro-organisms
¡ Observing Systems Simulation Experiments (JPL, NASA)
¡ Biodiversity: invasive species modelling

¡ Scientific workflows are primarily used to specify and enact in silico experiments.
¡ However, they can also be used as a means to document the experiment that the scientist ran, and even to repurpose it!
[Figure: example workflow with inputs chromosome17 and chromosome37, two Kegg pathway query steps, a Detect common pathways step and the output Common pathways]
¡ Scientific workflows
§ Increasingly adopted in modern sciences.
§ Transparent documentation of experimental methods
§ Repeatable and configurable

¡ Workflow decay: a decayed or reduced ability of a workflow to be executed or to produce the same results.
¡ To better understand workflow decay, we conducted an empirical analysis to identify its causes.
¡ To do so, we analyzed a sample of real workflows to determine whether they suffer from decay and, if so, what caused it.

¡ Taverna workflows from myExperiment.org
§ Taverna 1
§ Taverna 2
¡ Selection process
§ By the creation year
§ By the creator
§ By the domain
¡ Software environment
§ Taverna 2.3
¡ Experiment metadata
§ June-July 2012
§ 4 researchers

Number of Taverna 1 workflows from 2007 to 2011
        2007  2008  2009  2010  2011
Tested    12    10    10    10    4*
Total     74   341   101    26    13
Number of Taverna 2 workflows from 2009 to 2012
        2009  2010  2011  2012
Tested    12    10    15     9
Total     97   308   289   184

¡ 75% of the 92 tested workflows failed either to execute or to produce the same result (if testable)
¡ Those from the early years (2007-2009) had a 91% failure rate
[Figure panels: Taverna 1, Taverna 2]

¡ Manual analysis
§ Using the validation report from the Taverna workbench
§ By interpreting the experiment results reported by Taverna
¡ Identified 4 categories of causes
§ Missing example data
§ Missing execution environment
§ Insufficient descriptions about workflows
§ Volatile third-party resources
¡ Other possible factors not considered
§ Changes in the local operating environment (hardware, OS, middleware, compiler, etc.)

Causes, refined causes and examples:
¡ Third-party resources are not available
§ The underlying dataset, particularly a locally hosted in-house dataset, is no longer available. Example: the researcher hosting the data changed institution and the server is no longer available.
§ Services are deprecated. Example: the DDBJ web services are no longer provided, despite the fact that they are used in many myExperiment workflows.
¡ Third-party resources are available but not accessible
§ Data is available but identified using different IDs than the ones known to the user. Example: for scalability reasons the input data is superseded by new data, making the workflow not executable or making it provide wrong results.
§ Data is available, but permission, a certificate, or network access is needed to reach it. Example: the input cannot be obtained because it is a security token that is only issued to registered users of ChemSpider.
§ Services are available, but permission, a certificate, or network access is needed to access and invoke them. Example: the security policies of the execution framework are updated because of new hosting institution rules.
¡ Third-party resources have changed
§ Services are still available under the same identifiers, but their functionality has changed. Example: the web services are updated.

¡ 50% of the decay was caused by volatility of third-party resources
§ Unavailable
§ Inaccessible
§ Updated
¡ Missing example data
§ Unable to re-run
¡ Missing execution environment
§ Such as local plugins
¡ Insufficient metadata
§ Such as any required dependency libraries or permission information

¡ Some of the services that compose workflows are annotated using concepts from domain ontologies.
¡ Such annotations can be used to repair workflows:
§ Identify available services that can play the same role as an unavailable service within a workflow.

¡ Task ontology: captures information about the action carried out by service operations within a domain of interest, e.g., Sequence_alignment and Protein_identification
¡ Domain ontology: captures information about the application domains covered by operation parameters, e.g., Protein_record and DNA_sequence

Task replaceability: for an operation op2 to be able to substitute for an operation op1, op2 must fulfil a task that is equivalent to or subsumes the task that op1 performs.

Parameter replaceability: to be compatible, the domain of the output must be the same as, or a subconcept of, the domain of the subsequent input.

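As a minimal sketch of the task and parameter replaceability conditions, the Python below models an ontology as a child-to-parent map and tests subsumption by walking up that map. The map, the function names and the concepts Task, Biological_record and Pairwise_alignment are assumptions made for illustration; only Sequence_alignment, Protein_identification, Protein_record and DNA_sequence come from the slides.

```python
# Toy ontology: each concept maps to its direct parent (None for a root).
# Concepts marked "hypothetical" do not appear on the slides.
PARENT = {
    "Task": None,                                  # hypothetical root
    "Sequence_alignment": "Task",
    "Pairwise_alignment": "Sequence_alignment",    # hypothetical
    "Protein_identification": "Task",
    "Biological_record": None,                     # hypothetical root
    "Protein_record": "Biological_record",
    "DNA_sequence": "Biological_record",
}


def subsumes(general, specific):
    """Return True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)
    return False


def task_replaceable(task_op1, task_op2):
    """op2 may stand in for op1 if its task is equivalent to or subsumes op1's task."""
    return subsumes(task_op2, task_op1)


def parameter_compatible(output_domain, input_domain):
    """An output may feed an input if its domain equals or is a subconcept of the input's."""
    return subsumes(input_domain, output_domain)
```

With this toy ontology, an operation annotated Sequence_alignment is accepted as a substitute for one annotated Pairwise_alignment, but not the other way round.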
While the method just presented is sound, its practical applicability is hindered by the following facts:
§ Semantic annotations of web services are scarce.
§ Our experience suggests that a large proportion of existing semantic annotations suffer from inaccuracies.
§ As a result, a substitute that is discovered for replacing an unavailable operation using such annotations may turn out to be unsuitable, and, inversely, a suitable substitute may be discarded.

[Figure labels: Existing workflow specifications; Provenance traces of missing operations]

Formally, let wf1 be a workflow in which the operation op1 is unavailable. The operation op2 can replace the operation op1 in terms of its inputs and outputs if:

¡ In addition to the compatibility in terms of inputs and outputs, we have to check that the candidate substitute performs a task compatible with that of the unavailable operation.
¡ To perform this test, we exploit the following observation. An operation op2 is able to replace the operation op1 in terms of task if, for every possible input instance that op1 is able to consume, op2 delivers the same output as that obtained by invoking op1.
¡ To perform the above test, however, we would have to call the missing operation op1!
¡ The solution that we adopt for overcoming this problem makes use of workflow provenance logs. These are traces that contain the intermediate data that were used as input and delivered as output by the constituent operations of a workflow when enacted.

§ An operation op2 may be compatible in terms of task with op1 if op2 delivers the same results that op1 delivered in past executions, as recorded in provenance logs, when fed the same input values.
§ Notice that we say may be compatible. This is because we may not be able to compare the outputs obtained for every possible input value of the operation op1.

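The provenance-based test can be sketched as a replay of logged invocations. This is only an illustration under stated assumptions: the log is taken to be a list of (inputs, output) pairs, outputs are compared with plain equality, and the names `may_replace` and `log_op1` are hypothetical.

```python
def may_replace(candidate, provenance_log):
    """Replay the logged inputs of a missing operation against a candidate.

    provenance_log: iterable of (inputs, output) pairs recorded for the
    missing operation, where `inputs` is a tuple of argument values.
    Returns True only if at least one run was replayed and the candidate
    reproduced every logged output. A True answer means "may be compatible",
    since the log rarely covers every possible input value.
    """
    replayed = 0
    for inputs, expected_output in provenance_log:
        try:
            actual_output = candidate(*inputs)
        except Exception:
            # A candidate that cannot consume a logged input is rejected.
            return False
        if actual_output != expected_output:
            return False
        replayed += 1
    return replayed > 0


# Hypothetical log of a missing operation that upper-cased DNA sequences.
log_op1 = [(("acgt",), "ACGT"), (("ttga",), "TTGA")]

agrees = may_replace(str.upper, log_op1)
disagrees = may_replace(lambda s: s[::-1].upper(), log_op1)
```

An empty log yields False here, on the assumption that a substitute should not be accepted without any evidence.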
¡ The condition that we have described for checking the suitability of an operation as a substitute for another one may be stronger than is required in practice.
¡ Various parameter representations are adopted in bioinformatics.
¡ Because of representation mismatch, a service operation that performs a task similar to that of the missing operation may be found to be unsuitable.

Example of values delivered by two operations using the same input value: Value1 and Value2, with CosSym(value1, value2) = 0.007

To overcome this problem, we use a two-step process when comparing the values of parameters:
1. Given a parameter value, we derive its representation.
2. If the representation is associated with a key attribute (identifier), we extract the value of that attribute.
If two parameter values are associated with identifiers, then they are compared by comparing their identifiers.

Example of values delivered by two operations using the same input value: Value1 (Fasta format) and Value2 (Uniprot format)

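The two-step comparison can be illustrated for a FASTA record and a UniProt flat-file record. This is a sketch, not the implementation used in the study: it assumes UniProt-style FASTA headers (`>db|ACCESSION|NAME ...`), takes the first accession on the `AC` line as the identifier, and all function names and sample records are made up for the example.

```python
import re


def derive_representation(value):
    """Step 1: guess the representation of a parameter value from its first characters."""
    text = value.lstrip()
    if text.startswith(">"):
        return "fasta"
    if text.startswith("ID "):
        return "uniprot"
    return "unknown"


def extract_identifier(value):
    """Step 2: extract the key attribute (accession) if the representation has one."""
    representation = derive_representation(value)
    if representation == "fasta":
        # Assumed header layout: >db|ACCESSION|ENTRY_NAME description
        match = re.match(r">\w+\|([A-Z0-9]+)\|", value.lstrip())
        return match.group(1) if match else None
    if representation == "uniprot":
        # First accession listed on the AC line of the flat-file record.
        match = re.search(r"^AC\s+([A-Z0-9]+);", value, flags=re.MULTILINE)
        return match.group(1) if match else None
    return None


def same_record(value1, value2):
    """Compare by identifier when both values carry one, else by raw equality."""
    id1, id2 = extract_identifier(value1), extract_identifier(value2)
    if id1 is not None and id2 is not None:
        return id1 == id2
    return value1 == value2


# Illustrative records for the same protein in two representations.
fasta_value = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha\nMVLSPADKTNVKAAWGKVGA\n"
uniprot_value = (
    "ID   HBA_HUMAN               Reviewed;         142 AA.\n"
    "AC   P69905; P01922;\n"
    "DE   RecName: Full=Hemoglobin subunit alpha;\n"
)
```

The two sample strings differ almost entirely as text, yet `same_record` treats them as the same record because both carry the same accession.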
¡ Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments.
¡ They tend to be data intensive.
¡ The data sets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed and used to feed the execution of other workflows.

¡ The datasets obtained as a result of workflow execution often contain duplicates.
¡ As a result:
§ The analysis and interpretation of workflow results may become tedious.
§ The presence of duplicates also unnecessarily increases the size of workflow results.

¡ Research in duplicate record detection has been active for more than three decades.
§ Elmagarmid et al., 2007 conducted a comprehensive survey of the topic.
¡ We do not aim to design yet another algorithm for comparing and matching records.
¡ Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results.
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.

¡ A data-driven workflow can be defined as a directed graph: $wf = \langle N, E \rangle$
¡ A node represents an analysis operation, which has a set of input and output parameters: $\langle op, I_{op}, O_{op} \rangle \in N$
¡ The edges are dataflow dependencies: $\langle \langle op, o \rangle, \langle op', i \rangle \rangle \in E$

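The graph definition maps naturally onto a small data structure. The sketch below is one possible encoding, with class and method names chosen for illustration; the operation names IdentifyProtein and GetGOTerm are borrowed from the example used later in the talk, and the parameter names `i` and `o` are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operation:
    """A node <op, I_op, O_op>: an operation with its input and output parameters."""
    name: str
    inputs: frozenset
    outputs: frozenset


@dataclass
class Workflow:
    """A workflow <N, E>: operations plus dataflow edges between parameters."""
    nodes: dict = field(default_factory=dict)  # operation name -> Operation
    edges: set = field(default_factory=set)    # ((op, o), (op', i)) pairs

    def add_operation(self, name, inputs, outputs):
        self.nodes[name] = Operation(name, frozenset(inputs), frozenset(outputs))

    def connect(self, src, out_param, dst, in_param):
        """Add a dataflow edge from an output of `src` to an input of `dst`."""
        if out_param not in self.nodes[src].outputs:
            raise ValueError(f"{src} has no output {out_param}")
        if in_param not in self.nodes[dst].inputs:
            raise ValueError(f"{dst} has no input {in_param}")
        self.edges.add(((src, out_param), (dst, in_param)))

    def successors(self, name):
        """Names of the operations that consume an output of `name`."""
        return sorted({dst for (src, _), (dst, _) in self.edges if src == name})


wf = Workflow()
wf.add_operation("IdentifyProtein", inputs={"i"}, outputs={"o"})
wf.add_operation("GetGOTerm", inputs={"i"}, outputs={"o"})
wf.connect("IdentifyProtein", "o", "GetGOTerm", "i")
```

Edges are stored between (operation, parameter) pairs rather than between operations, which mirrors the form of E in the definition.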
The execution of workflows gives rise to a provenance trace, which we capture using two relations.
¡ Transformation: specifies that the execution of an operation took as input a given ordered set of records and generated another ordered set of records.
$\underbrace{\langle \langle op, o_1, r_{o_1} \rangle, \ldots, \langle op, o_m, r_{o_m} \rangle \rangle}_{OutB_{op}} \Leftarrow \underbrace{\langle \langle op, i_1, r_{i_1} \rangle, \ldots, \langle op, i_n, r_{i_n} \rangle \rangle}_{InB_{op}}$
¡ Transfer: specifies the transfer of records along the edges of the workflow.
$\langle op', i', r \rangle \Leftarrow \langle op, o, r \rangle$

To guide the detection of duplicates in workflow results, we exploit the following fact:
¡ An operation that is known to be deterministic produces identical output bindings given the same input binding.
$deterministic(op) \wedge (OutB_{op} \Leftarrow InB_{op}) \in T \wedge (OutB'_{op} \Leftarrow InB_{op}) \in T \implies id(OutB_{op}, OutB'_{op})$

Provenance-Guided Detection of Duplicates: Example
[Figure: workflow IdentifyProtein (input i, output o) followed by GetGOTerm (input i’, output o’), with the record sets Ri, Ro, R’i and R’o bound to these parameters]
1. The records in the set Ri, which are bound to the input parameter of the starting operation, are compared to identify duplicate records. The result of this phase is a partition of Ri into disjoint sets of identical records: $R_i = R_i^1 \cup \ldots \cup R_i^n$

2. The sets of records Ro, R’i and R’o are partitioned into sets of identical records based on the partitioning of Ri. For example: $R_o = R_o^1 \cup \ldots \cup R_o^n$, where $R_o^j = \{ r_o \in R_o \ \text{s.t.}\ \exists r_i \in R_i^j,\ \langle IdentifyProtein, o, r_o \rangle \Leftarrow \langle IdentifyProtein, i, r_i \rangle \}$

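The two phases can be sketched as follows, assuming the operation is deterministic and that the provenance trace is available as a mapping from each input record to the output record it produced. Exact value equality stands in for whatever record-matching technique is actually used, and all names and sample data are hypothetical.

```python
from collections import defaultdict


def partition_identical(records, key=lambda value: value):
    """Phase 1: group record ids whose values are identical.

    records: dict mapping record id -> value. Exact equality of `key(value)`
    is used here as a stand-in for a real record-matching technique.
    """
    groups = defaultdict(list)
    for record_id, value in records.items():
        groups[key(value)].append(record_id)
    return [sorted(group) for group in groups.values()]


def propagate(partition, transformations):
    """Phase 2: partition the output records by following the provenance trace.

    transformations: dict mapping an input record id to the output record id
    that a deterministic operation produced from it.
    """
    return [
        sorted(transformations[record_id] for record_id in group
               if record_id in transformations)
        for group in partition
    ]


# Hypothetical input records of IdentifyProtein; ri1 and ri2 are duplicates.
Ri = {"ri1": "MVLSPADK", "ri2": "MVLSPADK", "ri3": "MKWVTFIS"}
# Hypothetical provenance of IdentifyProtein: input record -> output record.
identify_trace = {"ri1": "ro1", "ri2": "ro2", "ri3": "ro3"}

Ri_partition = partition_identical(Ri)
Ro_partition = propagate(Ri_partition, identify_trace)
```

Only the input records are compared by value; the output partition is derived from the trace without comparing any output record.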
¡ In the example just described, the operations that compose the workflow have exactly one input and one output parameter.
§ However, the algorithm we developed supports operations with multiple input and output parameters.
¡ Notice that we assume that the analysis operations that compose the workflow are deterministic. This is not always the case.
§ This raises the question of how to determine whether a given operation is deterministic.

To verify the determinism of operations, we use an approach whereby operations are probed.
1. Given an operation op, we select example values that can be used by the inputs of op, and invoke op using those values multiple times.
2. If op produces identical output values given identical input values, then it is likely to be deterministic; otherwise, it is not deterministic.

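A minimal sketch of the probing procedure, assuming operations can be called as Python functions and outputs compared with equality. The function name, the number of repetitions and the two sample operations are illustrative choices.

```python
import itertools


def probe_determinism(op, example_inputs, repetitions=3):
    """Invoke `op` several times on each example input and compare the outputs.

    Returns False as soon as two invocations on the same input differ.
    A True result only means the operation is *likely* deterministic,
    because the probe covers a finite sample of inputs and runs.
    """
    for args in example_inputs:
        first_output = op(*args)
        for _ in range(repetitions - 1):
            if op(*args) != first_output:
                return False
    return True


# A deterministic operation: same input, same output.
def to_upper(sequence):
    return sequence.upper()


# A non-deterministic operation: the output embeds an invocation counter.
_calls = itertools.count()


def stamped(sequence):
    return (sequence, next(_calls))


examples = [("acgt",), ("ttga",)]
likely_deterministic = probe_determinism(to_upper, examples)
not_deterministic = probe_determinism(stamped, examples)
```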
To support duplicate detection in collection-based workflows, we need to be able to:
¡ Identify when two collections are identical
Two collections Ri and Rj are identical if they are of the same size and there is a bijective mapping $map : R_i \to R_j$ that maps each record ri in Ri to a record rj in Rj such that ri and rj are identical.
¡ Identify duplicate records between two collections that are known to be identical
Identify a bijective mapping that maps every ri in Ri to an identical rj in Rj.

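One way to look for such a bijective mapping is a greedy matching, sketched below. It assumes that "identical" behaves as an equivalence relation, in which case greedy matching finds a bijection whenever one exists; for an arbitrary pairwise predicate a proper bipartite matching would be needed. The function name and the sample collections are hypothetical.

```python
def find_bijection(ri, rj, identical=lambda a, b: a == b):
    """Return a dict mapping positions in `ri` to positions in `rj`, or None.

    Each record of `ri` is paired with a distinct, identical record of `rj`.
    None is returned if the sizes differ or some record has no partner left.
    """
    if len(ri) != len(rj):
        return None
    unmatched = list(range(len(rj)))  # positions in rj not yet paired
    mapping = {}
    for a_index, a in enumerate(ri):
        for slot, b_index in enumerate(unmatched):
            if identical(a, rj[b_index]):
                mapping[a_index] = b_index
                del unmatched[slot]
                break
        else:
            # No remaining record of rj is identical to this record of ri.
            return None
    return mapping


mapping = find_bijection(["P1", "P2", "P1"], ["P2", "P1", "P1"])
```

A non-None result answers both questions at once: the collections are identical, and the mapping lists which records are duplicates of each other.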
¡ Overwhelming for users who are not the developers
¡ Abstractions required for reporting
¡ Lineage queries result in very long trails

¡ a.k.a. Shims (D. Hull et al.)
¡ Dealing with data and protocol heterogeneities
¡ Local organization of data
~ 60% (Garijo D., Alper P., Belhajjame K. et al.)

Process-wise and data-wise abstractions
¡ Sub-workflows
§ Not always a significant unit of function (e.g. aesthetic purposes)
¡ Bookmarked data links
§ Cluster the output signature
§ Further complicates workflow
¡ Components
§ Library dependent

¡ A graph model for representing workflows
¡ Graph re-write rules for summarization
IF performs certain function (motifs) THEN re-write WF graph (reduction primitives)

¡ Domain-independent categorization
§ Data-oriented nature
§ Resource/implementation-oriented nature
¡ Captured in a lightweight OWL ontology
http://purl.org/net/wf-motifs

Pure dataflows: $W = \langle N, E \rangle$
Operation and port nodes: $N = N_{op} \cup N_{p}$
Dataflow edges: $E = E_{op \to p} \cup E_{p \to p} \cup E_{p \to op}$

[Figure labels: DataRetrieval, DataMoving]
motifs(color_pathway_by_objects) = {m1:DataRetrieval}
motifs(Get_Image_From_URL_2) = {m2:DataMoving}

¡ Collapse (Up/Down) 
¡ Eliminate 
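As an illustration of a reduction primitive, the sketch below applies an Eliminate-style rewrite to a workflow graph. The slides do not spell out its semantics, so the behaviour shown (remove the operation and reconnect its predecessors to its successors) is an assumption, as are the function names, the operation-level edge set (ports are ignored), the hypothetical show_image step and the DataVisualization label.

```python
def eliminate(edges, node):
    """Remove `node` and connect each of its predecessors to each successor."""
    predecessors = {src for src, dst in edges if dst == node}
    successors = {dst for src, dst in edges if src == node}
    kept = {(src, dst) for src, dst in edges if node not in (src, dst)}
    return kept | {(p, s) for p in predecessors for s in successors}


def summarize_by_eliminate(edges, motifs, unwanted):
    """Rule: IF an operation carries an unwanted motif THEN eliminate it."""
    for node, node_motifs in motifs.items():
        if node_motifs & unwanted:
            edges = eliminate(edges, node)
    return edges


# Operation-level edges; the link between the two annotated operations and
# the show_image step are made up for the example.
edges = {
    ("color_pathway_by_objects", "Get_Image_From_URL_2"),
    ("Get_Image_From_URL_2", "show_image"),
}
motifs = {
    "color_pathway_by_objects": {"DataRetrieval"},
    "Get_Image_From_URL_2": {"DataMoving"},
    "show_image": {"DataVisualization"},  # hypothetical annotation
}

summary = summarize_by_eliminate(edges, motifs, unwanted={"DataMoving"})
```

The data-moving step disappears from the summary while the dependency between retrieval and the downstream step is preserved.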
¡ Strategies as a set of rules for summarization
¡ Two sample strategies based on an empirical analysis of workflows
¡ Reporting:
§ Process: significant activities (Retrieval, Analysis, Visualization)
§ Data:
§ Reduced cardinality
§ Stripped of protocol-specific payload/formatting

¡ By-Eliminate
§ Minimal annotation effort
§ Single rule
¡ By-Collapse
§ More specific annotation
§ Multiple rules

[Figure: summarization architecture, with the components Workflow Designer, Taverna Workbench, Motif Ontology, WF Description, Summarization Rules, Summarizer and WF Summary]

¡ 30 workflows from the Taverna system
¡ Entire dataset and queries accessible from http://www.myexperiment.org/packs/467.html
¡ Manual annotation using the Motif Vocabulary

[Figure panels: By-Collapse, By-Elimination]
¡ Causal ordering of operations
¡ Reduced depth

¡ Establishing trust, but also understanding and reusability, in computational science is needed more than ever
¡ Reproducibility seems to be a cost-effective solution
¡ A number of tools and methods have been developed for doing so.
¡ However, that is not enough
¡ Changing our ways (culture) of doing science is more challenging

Weitere ähnliche Inhalte

Was ist angesagt?

Towards Incidental Collaboratories; Research Data Services
Towards Incidental Collaboratories; Research Data ServicesTowards Incidental Collaboratories; Research Data Services
Towards Incidental Collaboratories; Research Data ServicesAnita de Waard
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...Carole Goble
 
2017-11-03 Provenance and Research Object
2017-11-03 Provenance and Research Object2017-11-03 Provenance and Research Object
2017-11-03 Provenance and Research ObjectStian Soiland-Reyes
 
SEEK for Science: A Data and Model Management Platform to support Open and Re...
SEEK for Science: A Data and Model Management Platform to support Open and Re...SEEK for Science: A Data and Model Management Platform to support Open and Re...
SEEK for Science: A Data and Model Management Platform to support Open and Re...Carole Goble
 
Reproducible research: First steps.
Reproducible research: First steps. Reproducible research: First steps.
Reproducible research: First steps. Richard Layton
 
From peer-reviewed to peer-reproduced: a role for research objects in scholar...
From peer-reviewed to peer-reproduced: a role for research objects in scholar...From peer-reviewed to peer-reproduced: a role for research objects in scholar...
From peer-reviewed to peer-reproduced: a role for research objects in scholar...Alejandra Gonzalez-Beltran
 
What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)Carole Goble
 
Reproducibility of model-based results: standards, infrastructure, and recogn...
Reproducibility of model-based results: standards, infrastructure, and recogn...Reproducibility of model-based results: standards, infrastructure, and recogn...
Reproducibility of model-based results: standards, infrastructure, and recogn...FAIRDOM
 
Reproducible research: theory
Reproducible research: theoryReproducible research: theory
Reproducible research: theoryC. Tobin Magle
 
Technology and Students: Mix, Match or Miss?
Technology and Students: Mix, Match or Miss?Technology and Students: Mix, Match or Miss?
Technology and Students: Mix, Match or Miss?Jean-Claude Bradley
 
2017-11-03 Provenance and Research Object
2017-11-03 Provenance and Research Object2017-11-03 Provenance and Research Object
2017-11-03 Provenance and Research ObjectStian Soiland-Reyes
 
RDA Scholarly Infrastructure 2015
RDA Scholarly Infrastructure 2015RDA Scholarly Infrastructure 2015
RDA Scholarly Infrastructure 2015William Gunn
 
Data Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tensionData Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tensionPaul Groth
 

Was ist angesagt? (20)

Towards Incidental Collaboratories; Research Data Services
Towards Incidental Collaboratories; Research Data ServicesTowards Incidental Collaboratories; Research Data Services
Reproducibility 1

  • 2. “Science is built upon the foundations of theory and experiment validated and improved through open, transparent communication. With the increasingly central role of computation in scientific discovery, this means communicating all details of the computations needed for others to replicate the experiment.” V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to reproducible: Reproducibility in computational and experimental mathematics. Khalid Belhajjame @ PoliWeb Workshop, 2014
  • 3. Basic studies on cancer are unreliable, with grim consequences for producing new medicines in the future.
  • 4. The research result obtained by Stapel and co-workers Roos Vonk (Radboud University) and Marcel Zeelenberg (Tilburg University), showing that meat eaters are more selfish than vegetarians, which was widely publicized in Dutch media, is suspected to be based on faked data.
  • 5. Replication means conducting studies with independent investigators, data, methods, laboratories, and instruments. Replication is the ultimate standard for strengthening evidence and trust in scientific findings. However, replication is most of the time not possible: it is expensive (time and money) and opportunistic.
  • 6. [Figure labels] Scholarly article: is not enough. Reproducible research: make data and code available so that others may reproduce findings. Replication: way too expensive. Further labels: Reproducibility, (Re)useless.
  • 7. [Figure: axes labeled Workability, Cost, and Reproducibility Level; levels shown: (Almost) Nothing, Replicability, Reproducibility.]
  • 8. Huge performance gains in both hardware and software have made highly complex analyses possible. However, these same advances bring a higher risk of generating results that cannot be reproduced.
  • 9. Researchers in experimental biology carefully keep lab notebooks to document different aspects of their experiments. This is not the case for computational scientists, who tend to run their analyses with no clear record of the exact process they followed or of the intermediary datasets (results) they used and generated. It is therefore possible that numerous published results are unreliable or even completely invalid.
  • 10. Often, scholarly communications keep no record of the process (workflow) that produced the published computational results. Even the code is missing or has undergone changes, so it cannot be used to process the data referred to (if we are lucky).
  • 11. “The reproducible research movement recognizes that traditional scientific research and publication practices now fall short …, and encourages all those involved in the production of computational science ... to facilitate and practice really reproducible research.” V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to reproducible: Reproducibility in computational and experimental mathematics. Recently, a number of methods and tools for enabling reproducibility have emerged.
  • 12. System-level reproducibility: ReproZip, Burrito, ES3. Scripting-oriented reproducibility: IPython, Knitr, IJulia. Workflow-oriented reproducibility: Galaxy, Taverna, VisTrails. Article-centered reproducibility: SOLE, DEEP, SHARE. Investigation-oriented reproducibility: ISA, Research Object, FuGE.
  • 13. Packing experiments (authors' side) with ReproZip. [Figure: the experiment is executed in computational environment E; ReproZip captures provenance and builds a provenance tree of processes (p, p′).]
  • 14. Packing experiments (authors' side) with ReproZip, capture of provenance. For each process (p, p′), the following are recorded: command-line arguments, working directory, files read, files written, …
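The per-process items listed on this slide (command-line arguments, working directory, files read, files written) can be pictured as nodes of a provenance tree. The following is a minimal sketch under stated assumptions, not ReproZip's implementation or API: `ProcessRecord`, `walk`, `external_inputs`, and all file paths are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set


@dataclass
class ProcessRecord:
    """Hypothetical provenance record for one traced process."""
    argv: List[str]
    cwd: str
    files_read: Set[str] = field(default_factory=set)
    files_written: Set[str] = field(default_factory=set)
    children: List["ProcessRecord"] = field(default_factory=list)


def walk(record: ProcessRecord) -> Iterator[ProcessRecord]:
    """Yield a process and all of its descendants, parents first."""
    yield record
    for child in record.children:
        yield from walk(child)


def external_inputs(root: ProcessRecord) -> Set[str]:
    """Files read somewhere in the tree that no traced process wrote."""
    read: Set[str] = set()
    written: Set[str] = set()
    for rec in walk(root):
        read |= rec.files_read
        written |= rec.files_written
    return read - written


# Illustrative tree: a shell script (p) that spawns a plotting step (p').
child = ProcessRecord(
    argv=["python", "plot.py", "clean.csv"],
    cwd="/home/alice/exp",
    files_read={"plot.py", "clean.csv"},
    files_written={"figure.png"},
)
root = ProcessRecord(
    argv=["sh", "run.sh"],
    cwd="/home/alice/exp",
    files_read={"run.sh", "raw.csv"},
    files_written={"clean.csv"},
    children=[child],
)
print(sorted(external_inputs(root)))
```

In this toy tree, `clean.csv` is produced by the parent process, so only the scripts and `raw.csv` would need to be shipped with the experiment.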
  • 15. Packing experiments (authors' side) with ReproZip. [Figure: execution in computational environment E, then capture of provenance, then the provenance tree, then identification of necessary components: input and output files (description of data); executable programs and steps (description of experiment); environment variables, dependencies, … (description of environment).]
  • 16. Packing experiments (authors' side) with ReproZip. [Figure: the same pipeline as on the previous slide, extended with a specification of the workflow (a VisTrails workflow) and the resulting reproducible package.] Figure taken from Chirigati et al., 2012.
  • 19. IPython provides a rich architecture for interactive computing with: a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media; and support for interactive data visualization and use of GUI toolkits.
  • 22. Inputs to computational science are not linked with its outputs. Inputs: large quantities of data, complex data manipulation and/or numerical simulation, and the use of large and often distributed software stacks. Outputs: research papers (text-based, non-interactive). Authors and readers approach computational science from opposite directions. The objective of SOLE is to link research papers with the auxiliary resources that were utilized, e.g., datasets, software programs, files, etc.
  • 24. Assists users in submitting structured content via simple templates and an internal authoring tool. Performs value-added semantic annotation of the experimental metadata.
  • 26. Duplicate Detection; Reproducibility; Summarization; Combating Decay of
  • 28. Data-driven analysis pipelines. Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving. Tools for automating frequently performed data-intensive activities. Provenance for the resulting datasets: the method followed, the resources used, and the datasets used.
  • 29. Application domains: GWAS and pharmacogenomics (association study of Nevirapine-induced skin rash in the Thai population); trypanosomiasis (sleeping sickness parasite) in African cattle; astronomy; heliophysics; library document preservation; systems biology of micro-organisms; observing systems simulation experiments (JPL, NASA); biodiversity (invasive species modelling). [Credit: Carole A. Goble]
  • 30. Scientific workflows ¡ Scientific workflows are primarily used to specify and enact in silico experiments ¡ However, they can also be used as a means to document the experiment that the scientist ran, and even to repurpose it! ¡ Increasingly adopted in modern sciences ¡ Transparent documentation of experimental methods ¡ Repeatable and configurable [Figure: example workflow with inputs chromosome17 and chromosome37, two Kegg pathway query steps, a Detect common pathways step, and a Common pathways output]
  • 31. ¡ Workflow decay: a reduced ability of a workflow to be executed or to produce the same results ¡ To better understand workflow decay, we conducted an empirical analysis to identify its causes. ¡ To do so, we analyzed a sample of real workflows to determine whether they suffer from decay and what caused it
  • 32. ¡ Taverna workflows from myExperiment.org § Taverna 1 § Taverna 2 ¡ Selection process § By the creation year § By the creator § By the domain ¡ Software environment § Taverna 2.3 ¡ Experiment metadata § June-July 2012 § 4 researchers
  • 33. Number of Taverna 1 workflows from 2007 to 2011
            2007  2008  2009  2010  2011
    Tested    12    10    10    10    4*
    Total     74   341   101    26    13
    Number of Taverna 2 workflows from 2009 to 2012
            2009  2010  2011  2012
    Tested    12    10    15     9
    Total     97   308   289   184
  • 35. ¡ 75% of the 92 tested workflows failed to be either executed or produce the same result (if testable) ¡ Those from early years (2007-2009) had 91% failure rate Khalid Belhajjame @ PoliWeb Workshop, 2014 Taverna 1 Taverna 2 34
  • 36. ¡ Manual analysis § By the validation report from Taverna workbench § By interpreting experiment results reported by Taverna ¡ Identified 4 categories of causes § Missing example data § Missing execution environment § Insufficient descriptions about workflows § Volatile third-party Resources ¡ Other unconsidered possible factors § Changes in the local operating environment (hardware, OS, middleware, compiler, etc)
  • 37. Causes, refined causes, and examples of workflow decay:
    ¡ Third-party resources are not available
      § The underlying dataset, particularly a locally hosted in-house dataset, is no longer available. Example: the researcher hosting the data changed institution and the server is no longer available.
      § Services are deprecated. Example: DDBJ web services are no longer provided, despite the fact that they are used in many myExperiment workflows.
    ¡ Third-party resources are available but not accessible
      § Data is available but identified using different IDs than the ones known to the user. Example: for scalability reasons the input data is superseded by new data, making the workflow not executable or causing it to provide wrong results.
      § Data is available, but permission, a certificate, or network access is needed to reach it. Example: cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider.
      § Services are available but need permission, a certificate, or network access to be invoked. Example: the security policies of the execution framework are updated due to new hosting institution rules.
    ¡ Third-party resources have changed
      § Services are still available under the same identifiers, but their functionality has changed. Example: the web services are updated.
  • 38. ¡ 50% of the decay was caused by volatility of 3rd-party resource § Unavailable § Inaccessible § Updated ¡ Missing example data § Unable to re-run ¡ Missing execution environment § Such as local plugins ¡ Insufficient metadata § Such as any required dependency libraries or permission information Khalid Belhajjame @ PoliWeb Workshop, 2014 37
  • 40. ¡ Some services that compose workflows are annotated using concepts from domain ontologies ¡ Such annotations can be used to repair workflows § Identify available services that can play the same role as an unavailable service within a workflow.
  • 41. Task ontology: captures information about the action carried out by service operations within a domain of interest, e.g., Sequence_alignment and Protein_identification Domain ontology: captures information about the application domains covered by operation parameters, e.g., Protein_record and DNA_sequence
  • 42. Task replaceability: For an operation op2 to be able to substitute an operation op1, op2 must fulfil a task that is equivalent to or subsumes the task op1 performs.
  • 43. Parameter replaceability: To be compatible, the domain of the output must be the same as, or a subconcept of, the domain of the subsequent input.
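
The two replaceability conditions can be sketched as follows. This is only a minimal illustration, assuming a toy ontology stored as a child-to-parent dictionary. The concepts `Pairwise_alignment` and `Biological_sequence`, the hierarchy itself, and the helper names `subsumes`, `task_replaceable` and `param_compatible` are hypothetical and do not come from the slides.

```python
# Toy ontologies: each concept maps to its direct parent (None for a root).
# The hierarchy is an assumption made purely for illustration.
TASK_PARENT = {
    "Sequence_alignment": None,
    "Pairwise_alignment": "Sequence_alignment",
    "Protein_identification": None,
}
DOMAIN_PARENT = {
    "Biological_sequence": None,
    "DNA_sequence": "Biological_sequence",
    "Protein_record": None,
}


def subsumes(general, specific, parent):
    """Return True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = parent.get(node)
    return False


def task_replaceable(task_op1, task_op2):
    """op2 may replace op1 if op2's task is equivalent to or subsumes op1's task."""
    return subsumes(task_op2, task_op1, TASK_PARENT)


def param_compatible(output_domain, next_input_domain):
    """An output can feed an input if its domain equals, or is a subconcept of,
    the domain of that input."""
    return subsumes(next_input_domain, output_domain, DOMAIN_PARENT)
```

Under these assumptions, an operation annotated with `Sequence_alignment` could stand in for one annotated with `Pairwise_alignment`, but not the other way round.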
  • 44. While the method just presented is sound, its practical applicability is hindered by the following facts § Semantic annotations of web services are scarce. § Our experience suggests that a large proportion of existing semantic annotations suffer from inaccuracies § As a result, a substitute that is discovered for replacing an unavailable operation using such annotations may turn out to be unsuitable, and, inversely, a suitable substitute may be discarded.
  • 45. Existing Workflow Specifications Provenance traces of missing operations
  • 46. Formally, let wf1 be a workflow in which the operation op1 is unavailable. The operation op2 can replace the operation op1 in terms of its inputs and outputs if:
  • 47. ¡ In addition to the compatibility in terms of inputs and outputs, we have to check that the candidate substitute performs a task compatible with that of the unavailable operation. ¡ To perform this test, we exploit the following observation: an operation op2 is able to replace the operation op1 in terms of task if, for every possible input instance that op1 is able to consume, op2 delivers the same output as that obtained by invoking op1. ¡ To perform the above test, however, we would have to call the missing operation op1! ¡ The solution that we adopt to overcome this problem makes use of workflow provenance logs. These are traces that contain the intermediate data that were used as input and delivered as output by the constituent operations of a workflow when enacted.
  • 48. § An operation op2 may be compatible in terms of task with op1 if: op2 delivers the same results that op1 delivered in past executions, which are logged within provenance logs, when fed the same input values. § Notice that we say may be compatible. This is because we may not be able to compare the outputs obtained for every possible input value of the operation op1.
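
The provenance-based test can be sketched as a replay of logged invocations. This is a minimal illustration, not the authors' implementation: the log format (a list of `(inputs, output)` pairs), the function `may_replace`, and the two toy candidate operations are all assumptions.

```python
def may_replace(candidate, provenance_log):
    """Return True if `candidate` reproduces every logged output of the
    missing operation.

    `provenance_log` is a list of (inputs, output) pairs recorded while the
    missing operation was still available. A True result only means "may be
    compatible": the log rarely covers every possible input value.
    """
    if not provenance_log:
        return False  # no evidence either way, so nothing is established
    for inputs, expected in provenance_log:
        try:
            actual = candidate(*inputs)
        except Exception:
            return False  # candidate cannot consume a logged input
        if actual != expected:
            return False
    return True


# Hypothetical log of a missing "reverse complement" operation.
log = [(("ACGT",), "ACGT"), (("AAGC",), "GCTT")]


def reverse_complement(seq):
    """Toy candidate that complements each base and reverses the sequence."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


def reverse_only(seq):
    """Toy candidate that only reverses the sequence."""
    return seq[::-1]
```

Here `may_replace(reverse_complement, log)` holds, while `reverse_only` is rejected on the first logged invocation.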
  • 49. ¡ The condition that we have described for checking the suitability of an operation as a substitute for another one may be stronger than is required in practice. ¡ There are various parameter representations that are adopted in bioinformatics. ¡ Because of representation mismatch, a service operation that performs a task similar to the missing operation may be found to be unsuitable.
  • 50. Example of values delivered by two operations using the same input value: Value1, Value2. CosSym(value1,value2) = 0.007
  • 51. To overcome this problem, we use a two-step process when comparing the values of parameters: 1. Given a parameter value, we derive its representation. 2. If the representation is associated with a key attribute (identifier), extract the value of such an attribute. If two parameter values are associated with identifiers, then they are compared by comparing their identifiers.
  • 52. Example of values delivered by two operations using the same input value: Value1 (Fasta format), Value2 (Uniprot format)
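
The two-step comparison might look like the sketch below. It assumes heavily simplified patterns for a UniProt-style FASTA header (`>db|ACCESSION|NAME ...`) and for the `AC` line of a UniProt flat-file entry. The function names are hypothetical, and real parsers would need to handle many more cases.

```python
import re

# Simplified patterns, assumed for illustration only.
FASTA_ID = re.compile(r"^>\w+\|([A-Z0-9]+)\|", re.MULTILINE)
UNIPROT_ID = re.compile(r"^AC\s+([A-Z0-9]+);", re.MULTILINE)


def derive_representation(value):
    """Step 1: guess the representation of a parameter value."""
    if value.lstrip().startswith(">"):
        return "fasta"
    if re.search(r"^ID\s+", value, re.MULTILINE):
        return "uniprot"
    return "unknown"


def extract_identifier(value):
    """Step 2: extract the key attribute, if the representation has one."""
    patterns = {"fasta": FASTA_ID, "uniprot": UNIPROT_ID}
    pattern = patterns.get(derive_representation(value))
    match = pattern.search(value.lstrip()) if pattern else None
    return match.group(1) if match else None


def same_value(v1, v2):
    """Compare by identifier when both values carry one, else by raw equality."""
    id1, id2 = extract_identifier(v1), extract_identifier(v2)
    if id1 is not None and id2 is not None:
        return id1 == id2
    return v1 == v2
```

With this approach, two values that look very different as strings can still be recognised as the same record when they share an accession.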
  • 54. ¡ Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments. ¡ They tend to be data intensive ¡ The data sets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed and used to feed the execution of other workflows.
  • 55. ¡ The datasets obtained as a result of workflow execution often contain duplicates. ¡ As a result: § The analysis and interpretation of workflow results may become tedious. § The presence of duplicates also unnecessarily increases the size of workflow results.
  • 56. ¡ Research in duplicate record detection has been active for more than three decades. § Elmagarmid et al., 2007 conducted a comprehensive survey of the topic. ¡ We do not aim to design yet another algorithm for comparing and matching records. ¡ Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.
  • 57. ¡ A data driven workflow can be defined as a directed graph: $wf = \langle N, E \rangle$ ¡ A node represents an analysis operation, which has a set of input and output parameters: $\langle op, I_{op}, O_{op} \rangle \in N$ ¡ The edges are dataflow dependencies: $\langle \langle op, o \rangle, \langle op', i \rangle \rangle \in E$
  • 58. The execution of workflows gives rise to provenance traces, which we capture using two relations. ¡ Transformation: specifies that the execution of an operation took as input a given ordered set of records and generated another ordered set of records. The output binding $OutB_{op} = \langle \langle op, o_1, r_{o_1} \rangle, \ldots, \langle op, o_m, r_{o_m} \rangle \rangle$ is derived from the input binding $InB_{op} = \langle \langle op, i_1, r_{i_1} \rangle, \ldots, \langle op, i_n, r_{i_n} \rangle \rangle$. ¡ Transfer: specifies the transfer of a record $r$ along an edge of the workflow, from $\langle op, o, r \rangle$ to $\langle op', i', r \rangle$.
  • 59. To guide the detection of duplicates in workflow results we exploit the following fact: ¡ An operation that is known to be deterministic produces identical output bindings given the same input binding: if $deterministic(op)$ holds and the trace $T$ contains two transformations that derive $OutB_{op}$ and $OutB'_{op}$ from the same $InB_{op}$, then $id(OutB_{op}, OutB'_{op})$.
  • 60. Provenance-Guided Detection of Duplicates: Example. The workflow chains IdentifyProtein (input i, output o) and GetGOTerm (input i', output o'), with the record sets Ri, Ro, R'i and R'o bound to these parameters. 1. The records in the set Ri, which are bound to the input parameter of the starting operation, are compared to identify duplicate records. The result of this phase is a partition of Ri into disjoint sets of identical records: $R_i = R_i^1 \cup \ldots \cup R_i^n$.
  • 61. Provenance-Guided Detection of Duplicates: Example. The workflow chains IdentifyProtein (input i, output o) and GetGOTerm (input i', output o'). 2. The sets of records Ro, R'i and R'o are partitioned into sets of identical records based on the partitioning of Ri. For example, $R_o = R_o^1 \cup \ldots \cup R_o^n$, where $R_o^k$ contains every $r_o \in R_o$ for which some $r_i \in R_i^k$ exists such that $\langle \mathit{IdentifyProtein}, o, r_o \rangle$ is derived from $\langle \mathit{IdentifyProtein}, i, r_i \rangle$.
  • 62. Provenance-Guided Detection of Duplicates: Example ¡ In the example just described, the operations that compose the workflow have exactly one input and one output parameter. § However, the algorithm we developed supports operations with multiple input and output parameters. ¡ Notice that we assume that the analysis operations that compose the workflow are deterministic. This is not always the case. § This raises the question of how to determine whether a given operation is deterministic.
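
The two phases of the example can be sketched for the single-input, single-output case. This is a minimal sketch: record identity is reduced to plain equality of hashable values, and the provenance trace is reduced to a dictionary from input record index to output record index. Both simplifications, and the function names, are assumptions made for illustration.

```python
from collections import defaultdict


def partition_inputs(records):
    """Phase 1: group the indices of identical input records.

    Identity is plain equality here; a real matcher would be more elaborate.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec].append(idx)
    return list(groups.values())


def propagate(partition, transformation):
    """Phase 2: derive a partition of the outputs from a partition of the inputs.

    `transformation` maps an input record index to the index of the output
    record derived from it. For a deterministic operation, outputs derived
    from identical inputs are identical, so no output values are compared.
    """
    return [[transformation[i] for i in group] for group in partition]
```

The same `propagate` step can then be repeated along the transfer and transformation relations to partition R'i and R'o.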
  • 63. To verify the determinism of operations, we use an approach whereby operations are probed. 1. Given an operation op, we select example values that can be used by the inputs of op, and invoke op using those values multiple times. 2. If op produces identical output values given identical input values, then it is likely to be deterministic; otherwise, it is not deterministic.
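
A probing routine along these lines could look as follows. This is a sketch under the assumption that operations are plain single-argument callables. The function name `probe_determinism` and the `repeats` parameter are hypothetical.

```python
def probe_determinism(op, example_inputs, repeats=3):
    """Return False if `op` ever yields different outputs for the same input.

    A True result only means "likely deterministic": probing cannot rule out
    non-determinism for inputs or invocations that were not tried.
    """
    for value in example_inputs:
        first = op(value)
        for _ in range(repeats - 1):
            if op(value) != first:
                return False
    return True
```

A single mismatch is enough to conclude non-determinism, whereas agreement across all probes remains evidence rather than proof.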
  • 64. To support duplicate detection in collection-based workflows we need to be able to: ¡ Identify when two collections are identical: two collections Ri and Rj are identical if they are of the same size and there is a bijective mapping $map : R_i \rightarrow R_j$ that maps each record ri in Ri to a record rj in Rj such that ri and rj are identical ¡ Identify duplicate records between two collections that are known to be identical: identify a bijective mapping that maps every ri in Ri to an identical rj in Rj.
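
Both requirements can be met by one routine that tries to build the bijective mapping. The sketch below assumes that the record comparison `identical` is an equivalence relation, in which case greedy matching is sufficient. The function name and the index-based mapping are illustrative choices.

```python
def match_collections(ri, rj, identical=lambda a, b: a == b):
    """Return a bijective mapping {index in ri: index in rj} between identical
    records, or None if the two collections are not identical.

    Greedy matching is enough when `identical` is an equivalence relation,
    which is an assumption of this sketch.
    """
    if len(ri) != len(rj):
        return None
    unmatched = list(range(len(rj)))
    mapping = {}
    for i, rec in enumerate(ri):
        for j in unmatched:
            if identical(rec, rj[j]):
                mapping[i] = j
                unmatched.remove(j)  # safe: we leave the loop immediately
                break
        else:
            return None  # no identical partner left for ri[i]
    return mapping
```

A non-`None` result both confirms that the collections are identical and pairs up the duplicate records.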
  • 66. ¡ Overwhelming for users who are not the developers ¡ Abstractions required for reporting ¡ Lineage queries result in very long trails
  • 67. ¡ a.k.a. Shims (D. Hull et al.) ¡ Dealing with data and protocol heterogeneities ¡ Local organization of data ~60% (Garijo D., Alper P., Belhajjame K. et al.)
  • 68. Process-wise and data-wise abstractions ¡ Sub-workflows § Not always a significant unit of function (e.g. aesthetic purposes) ¡ Bookmarked data links § Cluster the output signature § Further complicates workflow ¡ Components § Library dependent
  • 69. ¡ A graph model for representing workflows ¡ Graph re-write rules for summarization: IF an operation performs a certain function (motifs) THEN re-write the WF graph (reduction primitives)
  • 70. ¡ Domain-independent categorization § Data-oriented nature § Resource/implementation-oriented nature ¡ Captured in a lightweight OWL ontology http://purl.org/net/wf-motifs
  • 71. Pure dataflows: $W = \langle N, E \rangle$. Operation and port nodes: $N = N_{op} \cup N_p$. Dataflow edges: $E = E_{op \rightarrow p} \cup E_{p \rightarrow p} \cup E_{p \rightarrow op}$.
  • 72. Motif annotations (DataRetrieval, DataMoving): motifs(color_pathway_by_objects) = {m1:DataRetrieval}; motifs(Get_Image_From_URL_2) = {m2:DataMoving}
  • 73. ¡ Collapse (Up/Down) ¡ Eliminate
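
A rule of the form "IF an operation carries a given motif THEN eliminate it" can be sketched on a plain edge set. The slides do not spell out the semantics of Eliminate, so this sketch assumes that it removes the operation and reconnects its predecessors directly to its successors. The function names, the node names in the usage note, and the simplification to operation-only nodes (no ports) are hypothetical.

```python
def eliminate(edges, node):
    """Remove `node` from a dataflow graph and reconnect its predecessors
    directly to its successors, so the data dependencies are preserved.

    `edges` is a set of (source, target) pairs.
    """
    preds = {s for s, t in edges if t == node}
    succs = {t for s, t in edges if s == node}
    kept = {(s, t) for s, t in edges if node not in (s, t)}
    return kept | {(p, s) for p in preds for s in succs}


def summarize_by_eliminate(edges, motifs, unwanted):
    """Apply one rule: IF an operation carries an unwanted motif THEN eliminate it.

    `motifs` maps each operation name to the set of motifs annotated on it.
    """
    for node, node_motifs in motifs.items():
        if node_motifs & unwanted:
            edges = eliminate(edges, node)
    return edges
```

For instance, with edges `{("get_pathway", "get_image"), ("get_image", "render")}` and `get_image` annotated with `DataMoving`, eliminating `DataMoving` steps leaves the single edge `("get_pathway", "render")`.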
  • 77. ¡ Strategies as a set of rules for summarization ¡ Two sample strategies based on an empirical analysis of workflows ¡ Reporting: § Process: Significant activities (Retrieval, Analysis, Visualization) § Data: § Reduced cardinality § Stripped of protocol specific payload/formatting
  • 78. ¡ By-Eliminate § Minimal annotation effort § Single rule ¡ By Collapse § More specific annotation § Multiple rules
  • 79. Workflow Designer Taverna Workbench Motif Ontology WF Summary WF Description Summarizer Summarization Rules
  • 80. ¡ 30 Workflows from the Taverna system ¡ Entire dataset queries accessible from http://www.myexperiment.org/packs/467.html ¡ Manual Annotation using Motif Vocabulary
  • 81. By-Collapse ¡ Causal Ordering of operations ¡ Reduced depth By-Elimination
  • 86. ¡ Establishing trust, but also understanding and reusability, in computational science is needed more than ever ¡ Reproducibility seems to be a cost-effective solution ¡ A number of tools and methods have been developed for doing so. ¡ However, that is not enough ¡ Changing our ways (culture) of doing science is more challenging
  • 87. ¡ Pinar Alper ¡ Óscar Corcho ¡ Fernando Chirigati ¡ Juliana Freire ¡ David De Roure ¡ Yolanda Gil ¡ Daniel Garijo ¡ Carole Goble ¡ David Koop ¡ Stian Soiland-Reyes ¡ Paolo Missier ¡ Jun Zhao ¡ and many others …