Workflow systems support the design, configuration and execution of repetitive, multi-step pipelines and analytics. They are well established in many disciplines, notably biology and chemistry, but less so in biodiversity and ecology. From an experimental perspective, workflows are a means to handle the work of accessing an ecosystem of software and platforms, managing data and security, and handling errors. From a reporting perspective, they are a means to accurately document methodology for reproducibility, comparison, exchange and reuse, and to trace the provenance of results for review, credit, workflow interoperability and impact analysis. Workflows operate in an evolving ecosystem and are assemblages of components in that ecosystem; their provenance trails are snapshots of intermediate and final results. Taking a lifecycle perspective, what are the challenges in workflow design and use with different stakeholders? What needs to be tackled in evolution, resilience and preservation? And what are the "mitigate or adapt" strategies adopted by workflow systems in the face of changes in the ecosystem/environment, for example when tools are deprecated or datasets become inaccessible through funding shortfalls?
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
1. Workflows, Provenance & Reporting
A Lifecycle Perspective
Professor Carole Goble FREng FBCS
The University of Manchester, UK
carole.goble@manchester.ac.uk
3rd – 6th September 2013, Rome, Italy
2. The Scientific and Technical Ecosystem
Mobilising Big and Broad Data
• Streaming
• Sweeps through models
• Integrative analysis
• Results synthesis
• Heavy compute
Interoperability, plugging together
• Multi step chains, Multi software / data
• Mixed resources / platforms
• Incompatibility smoothing
• Trans-disciplinary, Alien processes
[DataONE]
3. BioSTIF
[Diagram: inputs (data, parameters, configurations) flow through the workflow to outputs]
Workflow in a nutshell:
• A series of automated / interactive data analysis steps
• Process data at scale
• Import data / codes from one's own research and/or from existing libraries
• Pipelines & analytic and synthesis procedures
• Chains of components
• Bridges between resources
• Shield from change and operational complexity
• Releasing capacity
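The workflow nutshell above (a chain of automated steps, each feeding the next) can be sketched in a few lines of Python; the step names and data are illustrative, not from any particular workflow system.

```python
def fetch_records(source):
    # stand-in for importing data from one's own research or an existing library
    return [{"id": i, "value": v} for i, v in enumerate(source)]

def clean(records):
    # incompatibility smoothing: drop malformed entries
    return [r for r in records if r["value"] is not None]

def analyse(records):
    # a simple analytic / synthesis step: the mean of the surviving values
    return sum(r["value"] for r in records) / len(records)

def run_pipeline(data, steps):
    # the chain of components: each step consumes the previous step's output
    for step in steps:
        data = step(data)
    return data

result = run_pipeline([1.0, None, 3.0], [fetch_records, clean, analyse])
# (1.0 + 3.0) / 2 = 2.0
```

The point is the shape, not the steps: the chain shields the user from how each component is invoked, so steps can be swapped without rewriting the pipeline.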
7. Workflows: maturing approach
Underpin integrative platforms.
Established in many disciplines, notably chemistry and biology, esp. 'omics: assembly, synthesis, annotation, analytics.
Overlaps with metagenomics, phylogenetics and genetic ecology.
Powering service-based science and science as a service.
http://www.globus.org/genomics/solution
Sandve, Nekrutenko, Taylor, Hovig, Ten Simple Rules for Reproducible In Silico Research, PLoS Comp Biol, submitted
8. Ecological Niche Modelling, Population Modelling, Metagenomics and Phylogenetics
'omics pipelines and analytic workflows: http://www.biovel.eu
Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis: http://camera.calit2.net/index.shtm
Combine species occurrence data with global climate, terrain and land cover information to identify environmental correlates of species ranges: http://www.lifemapper.org/species
BioDiversity
9. Taxonomic Data Refinement
www.biovel.eu
• Synonym expansion
• Taxonomic name resolution
• Occurrence retrieval
• Spell checking
• Geographic and taxonomic cleaning
• Temporal refinement
• Data processing log
[Matthias Obst, INTECOL 2013]
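The refinement steps listed above can be sketched as composable functions. The synonym table, name normalisation and coordinate check below are toy stand-ins for the real name-resolution and cleaning services behind the BioVeL workflow.

```python
SYNONYMS = {"Pilumnus hirtellus": ["Cancer hirtellus"]}  # illustrative toy table

def resolve_name(raw):
    # taxonomic name resolution / spell checking, reduced here to
    # trimming whitespace and normalising capitalisation
    return raw.strip().capitalize()

def expand_synonyms(name):
    # synonym expansion: the accepted name plus any recorded synonyms
    return [name] + SYNONYMS.get(name, [])

def clean_occurrences(occurrences):
    # geographic cleaning: drop records with impossible coordinates
    return [(lat, lon) for lat, lon in occurrences
            if -90 <= lat <= 90 and -180 <= lon <= 180]

log = []  # the data processing log: one entry per refinement action

def refine(raw_name, occurrences):
    name = resolve_name(raw_name)
    log.append(f"resolved {raw_name!r} to {name!r}")
    names = expand_synonyms(name)
    log.append(f"expanded to {len(names)} name(s)")
    kept = clean_occurrences(occurrences)
    log.append(f"kept {len(kept)} of {len(occurrences)} occurrences")
    return names, kept
```

Note how the log is built alongside the data: every refinement action leaves a record, which is exactly what makes the processing reportable afterwards.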
10. Data Operations in Workflows in the Wild
Analysis of 260 publicly available workflows in Taverna, WINGS, Galaxy and VisTrails
Garijo et al., Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, in press
11. Large Scale Ecological Niche Modeling Workflow
Step 1: Explorative modeling
- Use unfiltered data
- Use fixed parameters: Mahalanobis distance (Farber and Kadmon 2003)
- Native projections
- Test the model, distribution of points, number of points
Step 2: Deep modeling
- Filter environmentally unique points with the BioClim algorithm (Nix 1986)
- ENM with Support Vector Machine (Cristianini & Shawe-Taylor 2000) and Maximum Entropy (Phillips 2004)
- Parameter optimization (if necessary) on the model test results
- 2 masks (model generate, model project)
[Diagram: the analytical cycle runs from data discovery, through data assembly, cleaning and refinement, to ecological niche modeling and statistical analysis]
Pilumnus hirtellus
Enclosed sea problem (Ready et al., 2010)
[Matthias Obst, INTECOL 2013]
13. Repeated model sweeps
Ten insect species were modelled:
European spruce bark beetle – Ips typographus L.
Bordered white moth (syn. pine looper) – Bupalus piniarius L. (syn. B. piniaria L.)
Pine-tree lappet – Dendrolimus pini L.
Mottled umber – Erannis defoliaria Clerck
Nun moth – Lymantria monacha L.
Winter moth – Operophtera brumata L.
Pine beauty moth – Panolis flammea Den. & Schiff.
Green oak tortrix – Tortrix viridana L.
European pine sawfly – Neodiprion sertifer Geoffr.
Common pine sawfly – Diprion pini L.
[Images: Tortrix viridana and Lymantria monacha, by Kimmo & Seppo Silvonen]
[Diagram: the sweep varies data, configuration, parameters and steps]
[Päivi Lyytikäinen-Saarenmaa presentation, INTECOL 2013]
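A repeated sweep like this, with the same workflow run over every species and parameter combination, can be sketched as follows; the species subset and algorithm names are illustrative placeholders.

```python
from itertools import product

species = ["Ips typographus", "Bupalus piniarius", "Dendrolimus pini"]
algorithms = ["bioclim", "svm", "maxent"]  # illustrative algorithm names

def run_model(sp, algo):
    # stand-in for a single ecological niche modelling run
    return {"species": sp, "algorithm": algo}

# the sweep: fixed steps, varied data and parameters,
# one run per species/algorithm combination
results = [run_model(sp, algo) for sp, algo in product(species, algorithms)]
# 3 species x 3 algorithms = 9 runs
```

The ensemble of results can then be compared systematically, which is exactly the "ensembles, comparisons, what ifs" use case workflows are built for.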
15. Provenance
the link between computation and results
W3C PROV model standard
record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
partial repeat/reproduce
carry attributions
compute credits
compute data quality/trust
select data to keep/release
optimisation and debugging
[Diagram: two provenance traces compared side by side. Trace A: d1 → S0 → d2 → S1 → w → S2 → y → S4 → df. Trace B: d1' → S0 → d2 → S1 → z, w → S'2 → y' → S4 → df']
PDIFF: comparing provenance traces to
diagnose divergence across experimental
results [Woodman et al, 2011]
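A toy version of comparing two provenance traces to find where results first diverge, in the spirit of PDIFF; the (step, output) pair representation is invented for this example and is not the PDIFF data model.

```python
# two simplified traces, echoing the diagram above: same steps,
# but trace B started from a changed input d1'
trace_a = [("S0", "d1"), ("S1", "d2"), ("S2", "w"), ("S4", "df")]
trace_b = [("S0", "d1'"), ("S1", "d2"), ("S'2", "w"), ("S4", "df'")]

def first_divergence(a, b):
    # walk both traces in lockstep and report the first differing entry;
    # None means the traces agree on every compared step
    for i, (ea, eb) in enumerate(zip(a, b)):
        if ea != eb:
            return i, ea, eb
    return None

print(first_divergence(trace_a, trace_b))
```

Real trace comparison must handle branching graphs rather than linear lists, but the principle is the same: locate the earliest discrepancy to diagnose why final results differ.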
18. Summary: Infrastructure Productivity
• Environment: legacy, others' and your own software, datasets, services, codes and platforms; optimise and manage use of computing infrastructure, HPC, clouds and platforms.
• WFMS middleware: support the design, configuration and execution of workflows; manage utility actions for data, logging, security, compute, errors; shield incompatibilities, complexity and change.
• Workflow: parameterised, integrative, multi-step (data) pipelines, analytics and computational protocols that can be repetitively reused; dependency-rich interoperability.
• Apps: domain/task-specific apps that incorporate (an ecosystem of) workflows.
19. Summary: User Productivity: Capability Raising
• Access: framework to access and leverage heterogeneous legacy applications, services, datasets and codes, and combine them with your own. Shielding from complexity.
• Customise: rapid development; flexibility, extensibility, adaptability, reuse. Reusable workflow components.
• Process: automated plumbing + interaction. Systematic, repetitive and unbiased analysis, processing and error handling. Ensembles, comparisons, "what ifs".
• Record: process reporting. Citation tracking. Reproducibility, provenance, audit. Quality control. Standard operating procedures.
20. Workflow Commodities
building cohorts, capturing traits,
explicit reporting, clear instructions
• Workflow templates
• Workflow sets
• Libraries of sub workflow parts
• Design practices for mix, match
and reuse
• Future proofed design predicting
need to adapt
• Discovery and exchange
• Workflow engineers
• Workflow custodians
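A minimal sketch of a library of sub-workflow parts: named fragments registered once, then mixed and matched into pipelines. All names here are illustrative, not from any real workflow library.

```python
LIBRARY = {}

def register(name):
    # decorator that files a fragment in the shared library
    def wrap(fn):
        LIBRARY[name] = fn
        return fn
    return wrap

@register("normalise")
def normalise(values):
    # scale values into [0, 1] relative to the maximum
    top = max(values)
    return [v / top for v in values]

@register("threshold")
def threshold(values, cut=0.5):
    # keep only values at or above the cut-off
    return [v for v in values if v >= cut]

def compose(*names):
    # mix and match: build a workflow by chaining named fragments
    def run(data):
        for n in names:
            data = LIBRARY[n](data)
        return data
    return run

pipeline = compose("normalise", "threshold")
# pipeline([1, 2, 4]) -> normalise -> [0.25, 0.5, 1.0] -> threshold -> [0.5, 1.0]
```

Registering fragments by name is what makes discovery and exchange possible: a new workflow is a recipe of names, not a copy of code.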
23. Katy’s student’s 200 hours
Tracking where data went
Workflow Commodities
getting credit, capability,
engineers and custodians
24. Application Building
user variety, outcome focused
• Right apps, right users.
• Commodity apps:
– Web. Spreadsheets. R.
• Customisation
• Mixed workflow / scripting
• Deployment / Portability
– Web based / desktop
– Virtualised deployments
– Cloud hosted service
– A cloud-enabled local host
• Local ownership
• Capability building
[Chart: user types plotted by workflow visibility (low to high) and versatility (low to high), ranging from policy makers and domain scientists using custom specific apps, through technical specialists, to computational scientists using general toolkits, spanning concept knowledge and technology/infrastructure]
25. Who are the users?
• Policy makers?
• Biodiversity researcher?
• Computational scientist?
• Tool developer?
• Service provider?
• Infrastructure provider?
• Digital custodian?
26. Workflow management systems
• Integrated into community frameworks,
coupled into tools
• Virtualised (Web) Services
• Scaling, Optimisation
• Interoperability, Using provenance
• No one workflow language/system
• Specialisation & its cost
• Plug-ins for common community
platforms and resources
• Mitigating and adapting to changes in
infrastructures and resources.
• Sustainability and engineering
Generic
Specific
http://www.erflow.eu/
27. Population dynamics
The life cycle of infrastructures
• Dynamics: mitigate, adapt, disperse, die
• Standard and maintained programmatic interfaces (APIs)
• Standard formats and ids
• Stability, reliability, repair
• Interoperability
• Semantic descriptions
• Sustainability of services and infrastructure
• Instrument resources for citation & microattribution
• Coupled services and infrastructure
29. Summary
Scale.
Standard data formats, programmatic interfaces.
Governance.
Workflow commodities
Design practices
Credit
A seamless, pluggable service.
Scale. Adaptability. Specific-Generic tension.
Putting provenance to use for data credit.
Embedding workflows in common applications
Integration into reporting and publishing lifecycles
The Technical Environment: Challenging Areas and Promising Technologies. Workflows, provenance and reporting: a lifecycle perspective. Bio: Carole Goble is a full professor in Computer Science at the University of Manchester, UK, and a partner of the Software Sustainability Institute UK. She has an international reputation in Semantic technologies, Distributed computing and Social Computing for scientific collaboration through eLabs.
She directs the myGrid project, which produces the widely used open source Taverna workflow management system; myExperiment, a social web site for sharing scientific workflows; the BioDiversityCatalogue of web services; and the SEEK for storing, sharing and preserving Systems Biology outcomes, which is part of the ERANet e-infrastructure for EU-based Systems Biology. Her technical infrastructure underpins the EU BioVeL Project e-Laboratory. In 2008 Carole was awarded the Microsoft Jim Gray award for outstanding contributions to e-Science. In 2010 she was elected a Fellow of the Royal Academy of Engineering. In 2012 she was nominated for the Benjamin Franklin award for open science in Biology. She serves on the UK BBSRC funding agency governance Council and is the Deputy Director of the UK's Node of the ESFRI ELIXIR programme.
Katy Willis's talk on Wednesday shows the value of automation: data integration, standardised pipelines, an automatic record of the experiment and set-up, report and variant reuse. Systematically capture, coordinate, run and record the steps. Buffered infrastructure: platform libraries, plugins, infrastructure components and services.
Aimed at different layers of the software stack. "The Many Faces of IT as Service", Foster & Tuecke, 2005. Provisioning: from reservation to configuration, making sure a resource will do what I want it to do, with the right qualities of service. Virtualization: separation of concerns between provider and consumer of "content" (client and service; service provider and resource provider). Provisioning: assemble and configure resources to meet user needs. Management: sustain desired qualities of service despite a dynamic environment.
Just in time interoperability by papering over the cracks.
Scale of data, from Matthias's talk. Geographic: we can build models in China and project them into Europe. Taxonomic: we can build models for plants (phytoplankton), animals (birds), and within a year hopefully even microbial communities. Environmental: sea and land; still very difficult for lakes and rivers.
Analysis factories. Typical variations in workflows: local and global workflow population variations, at micro and macro level.
Came up in the policy session. Reporting perspective: accurately document methodology for reproducibility, comparison, exchange and reuse; trace the provenance of results for review, credit, workflow interoperability and impact analysis.
Simplify. Track: versions and retractions, error propagation, contributions and credits. Fix: workflow repair, alternate component discovery, black box annotation. Rerun and replay: partial reproducibility, replaying some of the workflow; a verifiable, reviewable trace in people terms. Analyse: calculate data quality and trust, decide what data to keep or release, compare to find differences and discrepancies. S. Woodman, H. Hiden, P. Watson, P. Missier, Achieving Reproducibility by Combining Provenance with Service and Workflow Versioning, 6th Workshop on Workflows in Support of Large-Scale Science, 2011, Seattle.
Workflow templates; workflow sets; libraries of sub-workflow parts; design practices for mix, match and reuse; future-proofed design: mitigate or adapt; discovery and exchange; life cycle management; curation; packaging, credit and publishing; workflow engineers; workflow custodians.
Local level or EU hosted.
Reducing sensitivity; robustness to loss. SHIWA and ER-flow factories.
Reducing mortality, invasion, predation. Black boxes: poor metadata; incompatibility of data formats and identifiers; poor awareness of, or adherence to, standards. Poor methodology: unrepeatable or unknown experimental method; black boxes; incorrect interpretations and poor quality. Poor service/tool/resource ethic: service decay, service palpability and complexity, service reliability and stability, poor diagnostics. GEO, GEOSS, ecosystems, earth observations; NextData c2012.org; Encyclopedia of Life; Global BioDiversity Informatics Conference www.gbic2012.org; Dawn and Cynthia Parr (EOL).
A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. Virtual machines fall into two major classifications, based on their use and degree of correspondence to any real machine: system VMs and process VMs. Zhao, Gomez-Perez, Belhajjame, Klyne, Garcia-Cuesta, Garrido, Hettne, Roos, De Roure and Goble, Why Workflows Break: Understanding and Combating Decay in Taverna Workflows, 8th Intl Conf on e-Science, 2012. "Reproducibility success is proportional to the number of dependent components and your control over them." Many reasons why: change and availability; updates to public datasets; changes to services and codes; availability of, and access to, components and the execution environment; platform differences on simulations and code ports. Volatile third-party resources (50%): not available, available but inaccessible, or changed. Prevent, detect, repair.
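The prevent/detect/repair pattern for volatile third-party resources can be sketched as a fallback wrapper; the service names and registry here are hypothetical, standing in for real service endpoints.

```python
def call_service(name, registry):
    # look up a service endpoint; a missing entry models a decayed service
    fn = registry.get(name)
    if fn is None:
        raise LookupError(f"service {name!r} is not available")
    return fn()

def run_with_fallback(primary, fallback, registry):
    # detect: the primary resource has decayed; repair: swap in an alternate
    try:
        return call_service(primary, registry)
    except LookupError:
        return call_service(fallback, registry)

# hypothetical registry in which the primary "public_blast" service has decayed
registry = {"mirror_blast": lambda: "hits from mirror"}
result = run_with_fallback("public_blast", "mirror_blast", registry)
# the workflow repairs itself by discovering the alternate component
```

Real repair also involves checking that the substitute is semantically equivalent to the original, which is where the semantic descriptions and standard interfaces discussed above come in.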
Logbook data. Capacity, services, collaboration. Variation, diversity and change at all levels. Modularity: plugins; separate services from the underlying infrastructure; ensure service networks are built using standard Web 2.0 technologies; separate applications, workflows and VREs from the services.