Programming scientific data pipelines with
the Taverna workflow management system
              Dr. Paolo Missier
         School of Computing Science
           Newcastle University, UK



              Newcastle, Feb. 2011




        With thanks to the myGrid team in
        Manchester for contributing their
                material and time
Outline
        Objective:
                  to provide a practical introduction to workflow
                        systems for scientific applications

    •   Workflows in context: lifecycle and the workflow eco-system
    •   Workflows for data integration
    •   The user experience: from services and scripts to workflows
    •   Extensibility I: importing services
        – using R scripts
    •   Extensibility II: plugins
    •   The Taverna computation model: a closer look
        – functional model, dataflow parallelism
    •   Performance




2
Workflows in science
    High level programming models for scientific applications


                              • Specification of the orchestrated execution
                                of services / components

                              • Handles cross-cutting concerns like
                                error handling, service invocation,
                                data movement, data streaming,
                                provenance tracking, ...

                              • A workflow is a specification
                                configured for each run




3
What are Workflows used for?
    Earth Sciences | Life Sciences




4
Taverna
    • First released 2004
    • Current version Taverna 2.2
    • Currently 1500+ users per month, 350+ organizations, ~40
      countries, 80,000+ downloads across versions

    • Freely available, open source LGPL
    • Windows, Mac OS, and Linux

    •   http://www.taverna.org.uk
    •   User and developer workshops
    •   Documentation
    •   Public Mailing list and direct email support


           http://www.taverna.org.uk/introduction/taverna-in-use/

5
Who else is in this space?
                          Trident




                                             Triana
              VisTrails
                                                          Kepler




    Taverna
                                                      Pegasus (ISI)




6
Example: the BioAID workflow
    Purpose:
    The workflow extracts protein names from documents retrieved from
    MedLine, based on a user query (cf. Apache Lucene syntax).
    The protein names are filtered by checking whether a valid UniProt
    ID exists for each candidate name.

    Credits:
    - Marco Roos (workflow),
    - text mining services by Sophia Katrenko and Edgar Meij (AID), and
    Martijn Schuemie (BioSemantics, Erasmus University Rotterdam).



    Available from myExperiment:
    http://www.myexperiment.org/workflows/154.html




7
The workflows eco-system in myGrid
A process-centric science lifecycle




Service discovery and import

Data:     inputs, parameters, results
Metadata: provenance, annotations
Methods:  the workflow
Workflow as data integrator

[Diagram: QTL genomic regions → genes in QTL → metabolic pathways (KEGG)]
Taverna computational model (very briefly)
                                  List-structured
                                  KEGG gene ids:

                                  [ [ mmu:26416 ],
                                    [ mmu:328788 ] ]

                                            • Collection processing
                                            • Simple type system
                                              • no record / tuple structure
                                            • data driven computation
                                              • with optional processor synchronisation
                                            • parallel processor activation
                                              • greedy (no scheduler)

                                           [ path:mmu04010 MAPK signaling,
                                             path:mmu04370 VEGF signaling ]

[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...],
  [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]
From services and scripts to workflows
     • the BioAID workflow again:
       – http://www.myexperiment.org/workflows/154.html
     • overall composition:
     • 15 Beanshell and other local scripts
       – mostly for data formatting
     • 4 WSDL-based service operations:

                           operation         service

                        getUniprotID   synsetServer

                        queryToArray   tokenize

                        apply          applyCRFService

                        search         SearcherWSService




11
Service composition requires adapters
Example: SBML model optimisation workflow -- designed by Peter Li
http://www.myexperiment.org/workflows/1201

                                                                   String[] lines = inStr.split("\n");
                                                                   StringBuffer sb = new StringBuffer();
                                                                   // strip the first and last lines and the <result> wrapper tags
                                                                   for (int i = 1; i < lines.length - 1; i++)
                                                                   {
                                                                     String str = lines[i];
                                                                     str = str.replaceAll("<result>", "");
                                                                     str = str.replaceAll("</result>", "");
                                                                     sb.append(str.trim() + "\n");
                                                                   }

                                                                   String outStr = sb.toString();


                                      Url -> content (built-in shell script)


                                           import java.util.regex.Pattern;
                                           import java.util.regex.Matcher;

                                           sb = new StringBuffer();
                                           p = "CHEBI:[0-9]+";

                                           Pattern pattern = Pattern.compile(p);
                                           Matcher matcher = pattern.matcher(sbrml);
                                           while (matcher.find())
                                           {
                                             sb.append("urn:miriam:obo.chebi:" + matcher.group() + ",");
                                           }
                                           String out = sb.toString();
                                           //Clean up
                                           if(out.endsWith(","))
                                             out = out.substring(0, out.length()-1);

                                           chebiIds = out.split(",");
Building workflows from existing services
     • Large collection of available services
       – default but extensible palette of services in the workbench
       – mostly third party
        – all the major providers: NCBI, DDBJ, EBI, ...




              A plethora of providers:



        For an example of how to build a simple workflow, please follow
        Exercise 3 from this tutorial.


13
Incorporating R scripts into Taverna
                                  Requirements for using R in a local installation:
                                  - install R from main archive site:
                                      http://cran.r-project.org/
                                  - install Rserve:
                                      http://www.rforge.net/Rserve/
                                  - start Rserve locally:
                                      - start the R console and type the commands:
                                          library(Rserve)
                                          Rserve(args="--no-save")

     Taverna can display graphical output from R.

     The following R script simply produces a PNG image that is
     displayed as a Taverna output:
      png(g);
      plot(rnorm(1:100));
      dev.off();

     To use it, create an R Taverna workflow with an output port g
        - of type png image
     See also: http://www.mygrid.org.uk/usermanual1.7/rshell_processor.html
14
Integration between Taverna and eScience Central
     • An example of integration between
       – Taverna workflows (desktop)
       – the eScience Central cloud environment
     • Facilitated by Taverna’s plugin architecture

     • See http://www.cs.man.ac.uk/~pmissier/T-eSC-integration.svg




15
Plugin: Excel spreadsheets as workflow input
     • Third-party plugin code can later be bundled in a distribution
     • Ex.: importing input data from a spreadsheet
       – see: http://www.myexperiment.org/workflows/1417.html
       – and example input spreadsheet: http://www.myexperiment.org/files/410.html




16
Taverna Model of Computation: a closer look
     • Arcs between two ports define data dependencies
       – processors with inputs on all their (connected) ports are ready
       – no active scheduling: admission control is simply the size of the thread pool
       – processors fire as soon as they are ready and a thread is available in
         the pool
     • No control structures
       – no explicit branching or loop constructs
     • but explicit dependencies between processors can be added:
       – end(P1) ➔ begin(P2)



                                                    coordination link semantics:
                                                    “fetch_annotations can only start after
                                                    ImprintOutputAnnotator has completed”

                                                         Typical pattern:
                                                         writer ➔ reader
                                                         (eg to external DB)

17
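The readiness rule above can be sketched in a few lines of Java. This is an illustration under our own names and structure (Processor, deliver, ready), not Taverna's API:

```java
import java.util.*;

// Minimal sketch of the data-driven firing rule: a processor becomes ready
// once a value has arrived on every *connected* input port; ready processors
// then fire as soon as the bounded thread pool has a free thread -- there is
// no active scheduler.
public class FiringRule {
    public static class Processor {
        public final String name;
        private final Set<String> connectedPorts;
        private final Map<String, Object> received = new HashMap<>();
        public Processor(String name, String... ports) {
            this.name = name;
            this.connectedPorts = new LinkedHashSet<>(Arrays.asList(ports));
        }
        public void deliver(String port, Object value) { received.put(port, value); }
        // Ready = inputs present on all connected ports; unconnected ports
        // do not block activation.
        public boolean ready() { return received.keySet().containsAll(connectedPorts); }
    }
    public static void main(String[] args) {
        Processor p = new Processor("fetch_annotations", "geneID", "query");
        p.deliver("geneID", "mmu:26416");
        System.out.println(p.ready());   // false: "query" is still missing
        p.deliver("query", "apoptosis");
        System.out.println(p.ready());   // true: fires when a pool thread is free
    }
}
```

The port names here are illustrative; the point is only that readiness is a purely local, data-driven condition.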
List processing model
     • Consider the gene-enzymes workflow from the previous demo:




     Values can be either atomic or (nested) lists
       - values are of simple types (string, number, ...)
       - but also MIME types for images (see the R example above)

     What happens if the input to our workflow is a list of gene IDs?
        geneID = [ mmu:26416, mmu:19094 ]

     we need to declare the input geneID to be of depth 1
     - depth n in general, for a generic n-deep list
18
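The depth rule above can be made concrete with a short sketch (illustrative code, not the Taverna implementation): compute the actual depth (ad) of a value, compare it with the declared port depth (dd), and a positive difference δ means the engine iterates implicitly:

```java
import java.util.*;

// Sketch of depth checking: atomic values have depth 0, a list has depth
// 1 + the depth of its elements. delta = ad - dd; delta > 0 triggers
// implicit iteration over the collection.
public class Depths {
    public static int actualDepth(Object v) {
        if (!(v instanceof List)) return 0;          // atomic value
        List<?> l = (List<?>) v;
        return l.isEmpty() ? 1 : 1 + actualDepth(l.get(0));
    }
    public static int delta(Object v, int declaredDepth) {
        return actualDepth(v) - declaredDepth;
    }
    public static void main(String[] args) {
        Object geneID = Arrays.asList("mmu:26416", "mmu:19094"); // depth-1 list
        // dd = 0 (service expects an atom), ad = 1, so delta = 1:
        // the engine iterates once over the list.
        System.out.println(delta(geneID, 0));
    }
}
```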
Implicit iteration over lists
     Demo:
     – reload KEGG-genes-enzymes-atomicInput.t2flow
     – declare input geneID to be of depth 1
     – input two genes, run




      – Each processor is activated once for each element in the list
          – this is because each is designed to accept an atomic value
      – the result is a nested list of results, one for each gene in the input list

19
Functional model for collection processing /1

     Simple processing: the service expects atomic values and receives atomic values.
     [Diagram: processor P with input ports X1, X2, X3 bound to atomic values
     v1, v2, v3, and output ports Y1, Y2 producing w1, w2]

     Simple iteration: the service expects atomic values but receives an input list.
     The input port X has declared depth dd = 0; the input v = [v1 ... vn] has
     actual depth ad = 1, so δ = 1.
     [Diagram: P is activated once per list element: activation Pi consumes vi
     and produces wi; the results are reassembled in order into w = [w1 ... wn]]

     Extension: the service expects atomic values (dd = 0) but receives a nested
     input list v = [[...], ... [...]] (ad = 2, δ = 2); the implicit iteration
     recurses into the nesting, producing an output w = [[...] ... [...]] of the
     same depth.

20
Functional model /2

     The simple iteration model generalises by induction to a generic δ = n − m:
     the input v = [[...], ... [...]] has actual depth ad = n, the port has
     declared depth dd = m, and the output w = [[...] ... [...]] has depth
     n − m ≥ 0.

     This leads to a recursive functional formulation for simple collection
     processing:

         (eval_l P v) = (P v)                    if l = 0
                        (map (eval_{l−1} P) v)   if l > 0

21
Functional model - multiple inputs /3

     [Diagram: processor P with input ports X1, X2, X3 receiving
     v1 = [v11 ... v1n], v2 = [v21 ... v2k], v3 = [v31 ... v3m],
     and output port Y]

         dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
         dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
         dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1

         w = [ [w11 ... w1n],
                 ...
               [wm1 ... wmn] ]

     Cross-product involving v1 and v3 (but not v2):
         v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ]
     and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ]

22
Generalised cross product

     Binary product, δ = 1:
         a × b = [ [ <ai, bj> | bj ← b ] | ai ← a ]
         (eval_2 P <a, b>) = (map (eval_1 P) (a × b))

     Generalised to arbitrary depths:

         (v, d1) ⊗ (w, d2) =
             [ [ <vi, wj> | wj ← w ] | vi ← v ]   if d1 > 0, d2 > 0
             [ <vi, w> | vi ← v ]                 if d1 > 0, d2 = 0
             [ <v, wj> | wj ← w ]                 if d1 = 0, d2 > 0
             <v, w>                               if d1 = 0, d2 = 0

     ...and to n operands: ⊗_{i:1..n} (vi, di)

     Finally, the general functional semantics for collection-based processing:

         (eval_l P <(v1, d1), ..., (vn, dn)>) =
             (P <v1, ..., vn>)                          if l = 0
             (map (eval_{l−1} P) ⊗_{i:1..n} (vi, di))   if l > 0
23
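The binary product for depth-1 lists can be sketched directly. Pairs are modelled as two-element lists and the names are illustrative, not Taverna's implementation:

```java
import java.util.*;

// Sketch of the binary cross product from the slide, for depth-1 inputs:
// a x b = [ [ <ai, bj> | bj <- b ] | ai <- a ].
// The result is one level deeper than the inputs: an n-element and an
// m-element list produce an n x m nested structure.
public class CrossProduct {
    public static List<List<List<Object>>> cross(List<Object> a, List<Object> b) {
        List<List<List<Object>>> out = new ArrayList<>();
        for (Object ai : a) {
            List<List<Object>> row = new ArrayList<>();
            for (Object bj : b)
                row.add(Arrays.asList(ai, bj));   // the pair <ai, bj>
            out.add(row);
        }
        return out;
    }
    public static void main(String[] args) {
        // 2-element x 3-element inputs -> 2 rows of 3 pairs each
        System.out.println(cross(Arrays.asList("v1", "v2"),
                                 Arrays.asList("w1", "w2", "w3")));
    }
}
```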
Parallelism in the dataflow model
     • The data-driven model with implicit iterations provides
       opportunities for parallel processing of workflows
     two types of parallelism:
       • intra-processor: implicit iteration over list data
       • inter-processor: pipelining

                                  [ id1, id2, id3, ...]

                          [Diagram: one pipeline per list element,
                          SFH1 → getDS1, SFH2 → getDS2, SFH3 → getDS3,
                          running in parallel; implicit assumption of
                          independence amongst the threads that operate
                          on elements of a list]

                                 [ DS1, DS2, DS3, ...]

24
Exploiting latent parallelism



     [ a, b, c,...]

     [ (echo_1 a),     (echo_1 b),   (echo_1 c)]


     (echo_2 (echo_1 a))
                      (echo_2 (echo_1 b))
                             (echo_2 (echo_1 c))




           See also:
           http://www.myexperiment.org/workflows/1372.html

25
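The staggered evaluation shown above, where echo_2 starts on (echo_1 a) while echo_1 is still processing b and c, can be sketched with two threads and a queue. This is a simplification of the engine's behaviour, not its implementation:

```java
import java.util.*;
import java.util.concurrent.*;

// Two-stage pipeline sketch: stage 2 consumes each element as soon as
// stage 1 emits it, so the stages overlap instead of running
// list-at-a-time. FIFO queue order preserves the list order.
public class Pipeline {
    public static List<String> run(List<String> input) {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        // Stage 1 (echo_1): emits each transformed element immediately
        Thread stage1 = new Thread(() -> {
            for (String x : input) q.add("echo_1(" + x + ")");
        });
        stage1.start();
        // Stage 2 (echo_2): starts consuming before stage 1 has finished
        List<String> out = new ArrayList<>();
        try {
            for (int i = 0; i < input.size(); i++)
                out.add("echo_2(" + q.take() + ")");
            stage1.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return out;
    }
    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b", "c")));
        // [echo_2(echo_1(a)), echo_2(echo_1(b)), echo_2(echo_1(c))]
    }
}
```

In the real engine the per-element threads are drawn from each processor's bounded pool, which is what the performance experiments below vary.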
Performance - experimental setup
                           • previous version of Taverna engine used as baseline
                           • objective: to measure incremental improvement

                           [Diagram: a list generator feeding multiple
                           parallel pipelines]

                           Parameters:
                           - byte size of list elements (strings)
                           - size of input list
                           - length of linear chain

                           Main insight: when the workflow is designed for
                           pipelining, parallelism is exploited effectively.




26
Performance study: Experimental setup - I
     • Programmatically generated dataflows

     – the “T-towers”

     parameters:
     - size of the lists involved
     - length of the paths
     - includes one cross product




27
caGrid workflow for performance analysis
     Goal: perform cancer diagnosis using microarray analysis
     - learn a model for lymphoma type prediction based on
     samples from different lymphoma types

                                               lymphoma samples
     source: caGrid                            ➔ hybridization data




        process microarray
      data as training dataset
                                          learn
                                   predictive model



28   http://www.myexperiment.org/workflows/746
Results I - Memory usage

     [Chart: memory usage over execution time for three configurations:
     T2 main-memory data management, T2 embedded Derby back-end, and the
     T1 baseline; T2 shows shorter execution time due to pipelining]

     list size: 1,000 strings of 10K chars each
     no intra-processor parallelism (1 thread/processor)


29
Results II - Available processors pool




     pipelining in T2 makes up for smaller pools of threads/processor




30
Results III - Bounded main memory usage
     Separation of data and process spaces ensures scalable data management




                       varying data element size:
                       10K, 25K, 100K chars




31
Ongoing effort: Taverna on the Cloud
     • Early experiments on running multiple instances of Taverna
       workflows in a cloud environment
     • Coarse-grained cloud deployment: workflow-at-a-time
       – data partitioning ➔ each partition is allocated to a workflow instance




      For more details, please see: Paul Fisher, ECCB talk slides, October, 2010

32
Summary
     •   Workflows: a high-level programming paradigm
     •   Bridge the gap between scientists and developers
     •   Many workflow models available (commercial/open source)
     •   Taverna implements a dataflow model
         – has proven useful for a broad variety of scientific applications


       Strengths:
     • Rapid prototyping given a base of third-party or own services
     • Explicit modelling of data integration processes
     • Extensibility:
         – for workflow designers: easy to import third-party services (SOAP, REST)
         – accepts scripts in a variety of languages
         – for developers: easy to add functionality using a plugin model
     • Good potential for parallelisation
     • Early experiments on cloud deployment: workflow-at-a-time
         – ongoing study for finer-grain deployment of portions of the workflow
33
ADDITIONAL MATERIAL
     • Provenance of workflow data
     • Provenance and Trust of Web data




34
Example workflow (Taverna)


chr: 17                          QTL →
start: 28500000
end: 3000000                  Ensembl Genes



                  Ensembl Gene →          Ensembl Gene →
                    Uniprot Gene            Entrez Gene



                  Uniprot Gene →           Entrez Gene →
                    Kegg Gene               Kegg Gene


                               merge gene IDs



                             Gene → Pathway            path:mmu04210 Apoptosis,
                                                       path:mmu04010 MAPK, ...
Baseline provenance of a workflow run
                         QTL →                                mmu:12575
                      Ensembl Genes
                                                                          v1     ...   vn               w

     Ensembl Gene →                     Ensembl Gene →
       Uniprot Gene                       Entrez Gene
                                                                      path:mmu04012
                                                          exec
     Uniprot Gene →                     Entrez Gene →                     a1     ...   an         b1    ...   bm
       Kegg Gene                         Kegg Gene
                                                                                                  mmu:26416

                       merge gene IDs




                                                                 path:mmu04010   y11                    ymn
                      Gene → Pathway
                                                                                            ...

                                                         path:mmu04010→derives_from→mmu:26416
                                                         path:mmu04012→derives_from→mmu:12575

         • The graph encodes all direct data dependency relations
         • Baseline query model: compute paths amongst sets of nodes
           • Transitive closure over data dependency relations
36
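The baseline query model can be sketched as a plain graph traversal; the edge table below reuses identifiers from the example run, with `ENSMUSG:x` as a made-up upstream ID:

```python
# Baseline provenance query as graph traversal: the run records direct
# derives_from edges; answering a query means computing their transitive
# closure. "ENSMUSG:x" is an invented upstream identifier for illustration.

edges = {
    "path:mmu04010": ["mmu:26416"],
    "path:mmu04012": ["mmu:12575"],
    "mmu:26416": ["ENSMUSG:x"],
}

def lineage(node, edges):
    """All data items that 'node' transitively derives from."""
    seen, stack = set(), [node]
    while stack:
        for parent in edges.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

On a real run the provenance graph holds one node per data value produced, so this closure can become expensive; that cost motivates the indexed approach described later.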
Motivation for fine-grained provenance
                                List-structured
                                KEGG gene ids:

                                [ [ mmu:26416 ],
                                  [ mmu:328788 ] ]

          [diagram: each input gene ID maps to the subset of
           output pathways derived from it]

                                         [ path:mmu04010 MAPK signaling,
                                           path:mmu04370 VEGF signaling ]

[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...],
  [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ]
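A minimal sketch of the per-element view; `pathways_for` is a stand-in for the KEGG Gene → Pathway service, and the pathway lists are abridged:

```python
# Per-element provenance for list-structured data: each inner output list
# derives from exactly one input gene ID. Coarse provenance would only
# relate the whole output list to the whole input list; fine-grained
# provenance keeps the element-level link.

gene_ids = [["mmu:26416"], ["mmu:328788"]]

def pathways_for(gene):
    """Stand-in for the KEGG Gene -> Pathway service (abridged results)."""
    table = {
        "mmu:26416":  ["path:mmu04210 Apoptosis", "path:mmu04010 MAPK"],
        "mmu:328788": ["path:mmu04010 MAPK", "path:mmu04620 Toll-like"],
    }
    return table[gene]

# iteration over the nested list, retaining element indexes, so each
# output list can be traced back to the input element it came from
fine_grained = {(i,): pathways_for(inner[0]) for i, inner in enumerate(gene_ids)}
```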
Efficient query processing: main result
Workflow graph                                        Provenance graph


           X                      X                 v1   ...    vn              w

           Q                      R
           Y                      Y

                                                    a1    ...   an         b1   ...   bm


                     X1 X2 X3

                              P
                                                         y11         ...        ymn
                              Y

      y = [ [y11 ... y1n],
              ...
            [ym1 ... ymn] ]

    • Query the provenance of individual collection elements
    • But avoid computing transitive closures on the provenance graph
    • Use the workflow graph as an index instead
    • Exploit workflow model semantics to statically predict dependencies
      on individual tree elements
    • This results in substantial performance improvements for typical queries
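The indexing idea can be sketched as follows; the index functions below are a deliberate simplification of Taverna's actual index algebra:

```python
# Sketch of using the workflow graph as an index. Processors transform
# list indexes in a statically predictable way, so an output element's
# index can be rewritten into the input index it depends on by walking
# the small workflow graph, without traversing the large run-time
# provenance graph. (Simplified model, not Taverna's exact algebra.)

def identity(ix):
    """Processor consumes and produces whole values: index unchanged."""
    return ix

def drop_last(ix):
    """Processor iterated over the innermost list level: drop that index."""
    return ix[:-1]

# index transformations along the path from the output back to the input
workflow_path = [drop_last, identity]

def source_index(out_ix, path):
    """Rewrite an output index into the input index it depends on."""
    for step in path:
        out_ix = step(out_ix)
    return out_ix
```

For example, output element y[1][3] would be traced to input element [1] by index rewriting alone, regardless of how many data nodes the run produced.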
Trust and provenance for Web data
     • Testimonials: http://www.w3.org/2005/Incubator/prov/
        – "At the toolbar (menu, whatever) associated with a document there is a button marked
          "Oh, yeah?". You press it when you lose that feeling of trust. " - Tim Berners-Lee, Web
          Design Issues, September 1997
        – "Provenance is the number one issue we face when publishing government data as
          linked data for data.gov.uk" - John Sheridan, UK National Archives, data.gov.uk,
          February 2010




                                                                              how exactly is
                                                                              provenance-based quality
                                                                              checking going to work?




                                                                            Upcoming W3C Working Group
                                                                            on Provenance for Web data

                                                                            - a European initiative: co-chaired
                                                                            by Luc Moreau (Southampton) and
                                                                            Paul Groth (NL)


39
Provenance graphs and belief networks
       Intuition:
       As news items propagate, so do trust and quality judgments about them
     • Is there a principled way to model this?
     • Idea: explore conceptual similarities between provenance graphs and
       belief networks (i.e. Bayesian networks)

       Standard Bayesian network example:




40
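As a toy illustration of the propagation idea (the conditional probability values below are invented, not from the talk):

```python
# Toy illustration: a provenance edge d1 -> d2 treated as a two-node
# belief network. A quality judgment on d1, available at a quality
# control point, is propagated to d2 through a conditional probability
# table. All numbers are invented for the example.

p_d1_good = 0.9                      # judgment recorded at the QC point
cpt = {True: 0.95, False: 0.20}      # P(d2 good | d1 good), P(d2 good | d1 bad)

# marginalise over d1 to obtain the propagated belief about d2
p_d2_good = cpt[True] * p_d1_good + cpt[False] * (1 - p_d1_good)
```

Where such tables would come from for real provenance graphs is exactly the open question raised on the next slides.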
From process graph to provenance graph
                                                                      Quality Control points
                                                 dx


               A1                       C1
          P1           d1          S1            d1’             P2          d2         S2        d2’




                            data production             data publishing

                                                                             provenance
                                       “used”                                graph for dx
                                                            P(dx)
                                                                                  “was generated by”
        d2’           S2          d2            P2

                                                             d1’             S1         d1        P1

                    “published”

     “was published by”
                                                                             C1                   A1
41
                                                       curator
                                                                                      author
From provenance graph to belief network
                                                                                provenance
                                               “used”                           graph for dx
                                                                   P(dx)
                                                                                     “was generated by”
         d2’          S2                  d2            P2

                                                                    d1’         S1          d1         P1

                    “published”

      “was published by”
                                                                                C1                     A1
                                                              curator
                                                                                          author

                    CPT
                                                QCP     CPT
        CPT
               P1             A1


                    d1                   S1             C1

                                                              - assume judgments are available at QCPs
          Pdx         A2                 d1’                  - where do the remaining conditional
                                                                probabilities come from?

                         d2                    S2
                                                              - can judgments be
                                                                propagated here?
                                   d2’

42
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Kürzlich hochgeladen

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Kürzlich hochgeladen (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Internal seminar @Newcastle University, Feb 2011

  • 1. Programming scientific data pipelines with the Taverna workflow management system Dr. Paolo Missier School of Computing Science Newcastle University, UK Newcastle, Feb. 2011 With thanks to the myGrid team in Manchester for contributing their material and time
  • 2. Outline Objective: to provide a practical introduction to workflow systems for scientific applications • Workflows in context: lifecycle and the workflow eco-system • Workflows for data integration • The user experience: from services and scripts to workflows • Extensibility I: importing services – using R scripts • Extensibility II: plugins • The Taverna computation model: a closer look – functional model, dataflow parallelism • Performance 2
  • 3. Workflows in science High level programming models for scientific applications • Specification of service / components execution orchestration • Handles cross cutting concerns like error handling, service invocation, data movement, data streaming, provenance tracking….. • A workflow is a specification configured for each run 3
  • 4. What are Workflows used for? EarthSciences Life Sciences 4
  • 5. Taverna • First released 2004 • Current version Taverna 2.2 • Currently 1500+ users per month, 350+ organizations, ~40 countries, 80,000+ downloads across versions • Freely available, open source LGPL • Windows, Mac OS, and Linux • http://www.taverna.org.uk • User and developer workshops • Documentation • Public Mailing list and direct email support http://www.taverna.org.uk/introduction/taverna-in-use/ 5
  • 6. Who else is in this space? Trident Triana VisTrails Kepler Taverna Pegasus (ISI) 6
  • 8. Example: the BioAID workflow Purpose: The workflow extracts protein names from documents retrieved from MedLine based on a user Query (cf Apache Lucene syntax). The protein names are filtered by checking if there exists a valid UniProt ID for the given protein name. Credits: - Marco Roos (workflow), - text mining services by Sophia Katrenko and Edgar Meij (AID), and Martijn Schuemie (BioSemantics, Erasmus University Rotterdam). Available from myExperiment: http://www.myexperiment.org/workflows/154.html 7
  • 9–12. The workflows eco-system in myGrid A process-centric science lifecycle. Service discovery and import. Data: inputs, parameters. Metadata: provenance, annotations. Methods: the workflow, the results.
  • 13–15. Workflow as data integrator: QTL genomic regions → genes in QTL → metabolic pathways (KEGG)
  • 16–17. Taverna computational model (very briefly) List-structured KEGG gene ids: geneIDs = [ [ mmu:26416 ], [ mmu:328788 ] ] map to pathways: [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ] and [ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ] • Collection processing • Simple type system: no record / tuple structure • Data-driven computation, with optional processor synchronisation • Parallel processor activation, greedy (no scheduler)
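The nested-list behaviour described on this slide can be sketched in Python. The `lookup` dictionary below is a hypothetical stand-in for the KEGG gene-to-pathways service, not Taverna's actual API; the point is how a service expecting an atomic value is iterated over a list-structured input while preserving its nesting:

```python
# Hypothetical stand-in for the KEGG gene -> pathways service.
lookup = {
    "mmu:26416": ["path:mmu04210 Apoptosis", "path:mmu04010 MAPK signaling"],
    "mmu:328788": ["path:mmu04010 MAPK signaling", "path:mmu04620 Toll-like receptor"],
}

# List-structured input, as on the slide: a list of singleton lists.
gene_ids = [["mmu:26416"], ["mmu:328788"]]

# The engine invokes the atomic service once per element,
# preserving the nesting of the input collection in the output.
pathways = [[lookup[g] for g in inner] for inner in gene_ids]
```

Each gene id yields a list of pathways, so the output is one level deeper than the input, matching the nested result shown on the slide.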
  • 18. From services and scripts to workflows • the BioAID workflow again: – http://www.myexperiment.org/workflows/154.html • overall composition: • 15 beanshell and other local scripts – mostly for data formatting • 4 WSDL-based service operations (operation / service): getUniprotID / synsetServer; queryToArray / tokenize; apply / applyCRFService; search / SearcherWSService 11
  • 19–21. Service composition requires adapters Example: SBML model optimisation workflow – designed by Peter Li, http://www.myexperiment.org/workflows/1201 Adapter script (Url -> content, built-in shell script): String[] lines = inStr.split("\n"); StringBuffer sb = new StringBuffer(); for(i = 1; i < lines.length - 1; i++) { String str = lines[i]; str = str.replaceAll("<result>", ""); str = str.replaceAll("</result>", ""); sb.append(str.trim() + "\n"); } String outStr = sb.toString(); A second adapter extracts ChEBI identifiers: import java.util.regex.Pattern; import java.util.regex.Matcher; sb = new StringBuffer(); p = "CHEBI:[0-9]+"; Pattern pattern = Pattern.compile(p); Matcher matcher = pattern.matcher(sbrml); while (matcher.find()) { sb.append("urn:miriam:obo.chebi:" + matcher.group() + ","); } String out = sb.toString(); /* Clean up */ if(out.endsWith(",")) out = out.substring(0, out.length()-1); chebiIds = out.split(",");
  • 22. Building workflows from existing services • Large collection of available services – default but extensible palette of services in the workbench – mostly third party – All the major providers: NCBI, DDBJ, EBI … A plethora of providers: For an example of how to build a simple workflow, please follow Exercise 3 from this tutorial 13
  • 23. Incorporating R scripts into Taverna Requirements for using R in a local installation: - install R from main archive site: http://cran.r-project.org/ - install Rserve: http://www.rforge.net/Rserve/ - start Rserve locally: - start the R console and type the commands: library(Rserve) Rserve(args="--no-save") Taverna can display graphical output from R The following R script simply produces a png image that is displayed on the Taverna output: png(g); plot(rnorm(1:100)); dev.off(); To use it, create an R Taverna workflow with output port g - of type png image See also: http://www.mygrid.org.uk/usermanual1.7/rshell_processor.html 14
  • 24. Integration between Taverna and eScience Central • An example of integration between – Taverna workflows (desktop) – the eScience Central cloud environment • Facilitated by Taverna’s plugin architecture • See http://www.cs.man.ac.uk/~pmissier/T-eSC-integration.svg 15
  • 25. Plugin: Excel spreadsheets as workflow input • Third-party plugin code can later be bundled in a distribution • Ex.: importing input data from a spreadsheet – see: http://www.myexperiment.org/workflows/1417.html – and example input spreadsheet: http://www.myexperiment.org/files/410.html 16
  • 28. Taverna Model of Computation: a closer look • Arcs between two ports define data dependencies – processors with inputs on all their (connected) ports are ready – no active scheduling: admission control is simply by the size of threads pool – processors fire as soon as they are ready and there are available threads in the pool • No control structures – no explicit branching or loop constructs • but dependencies between processors can be added: – end(P1) ➔ begin(P2) coordination link semantics: “fetch_annotations can only start after ImprintOutputAnnotator has completed” Typical pattern: writer ➔ reader (eg to external DB) 17
  • 29. List processing model • Consider the gene-enzymes workflow from the previous demo: Values can be either atomic or (nested) lists - values are of simple types (string, number,...) - but also mime types for images (see R example above) What happens if the input to our workflow is a list of gene IDs? geneID = [ mmu:26416, mmu:19094 ] we need to declare the input geneID to be of depth 1 - depth n in general, for a generic n-deep list 18
  • 30. Implicit iteration over lists Demo: – reload KEGG-genes-enzymes-atomicInput.t2flow – declare input geneID to be of depth 1 – input two genes, run – Each processor is activated once for each element in the list – this is because each is designed to accept an atomic value – the result is a nested list of results, one for each gene in the input list 19
  • 31–34. Functional model for collection processing /1 Simple processing: the service expects atomic values and receives atomic values v1, v2, v3 on ports X1, X2, X3, producing w1, w2 on Y. Simple iteration: the service expects atomic values but receives an input list v = [v1 ... vn]; the processor P is invoked once per element (P1 ... Pn), producing w = [w1 ... wn]. Here the declared depth of the port is dd = 0, the actual depth of the value is ad = 1, so the iteration depth is δ = 1. Extension: if the service receives a nested input list (ad = 2, dd = 0, δ = 2), the map is applied recursively and w is a nested list as well. 20
  • 35–36. Functional model /2 The simple iteration model generalises by induction to a generic δ = n − m: for an input of actual depth ad = n on a port of declared depth dd = m, δ = n − m ≥ 0, and the output nesting depth grows by n − m. This leads to a recursive functional formulation for simple collection processing, with v = [a1 ... an]: (eval_l P v) = (P v) if l = 0, and (eval_l P v) = (map (eval_{l−1} P) v) if l > 0. 21
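The recursive formulation on this slide can be mirrored almost literally in Python. `eval_l` below is an illustrative rendering of the semantics, not Taverna code:

```python
def eval_l(process, value, level):
    """(eval_l P v) = (P v) if l = 0, else (map (eval_{l-1} P) v)."""
    if level == 0:
        return process(value)
    # Map the (level-1)-evaluation over every element of the list.
    return [eval_l(process, v, level - 1) for v in value]

# A service expecting an atomic value, applied to a depth-2 list
# (delta = 2): the map is applied recursively, two levels deep.
doubled = eval_l(lambda x: x * 2, [[1, 2], [3]], 2)  # -> [[2, 4], [6]]
```

The output has the same list structure as the input, with the process applied at the leaves, which is exactly the implicit-iteration behaviour demonstrated earlier with the KEGG gene list.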
  • 37–38. Functional model - multiple inputs /3 Inputs: v1 = [v11 ... v1n] on port X1, v2 = [v21 ... v2k] on X2, v3 = [v31 ... v3m] on X3, with dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1; dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0; dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1. Output: w = [ [w11 ... w1n], ... [wm1 ... wmn] ]. A cross-product is formed over the inputs that require iteration, v1 and v3 (but not v2): v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ], and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ]. 22
  • 39–41. Generalised cross product Binary product, δ = 1: a × b = [ [ (ai, bj) | bj ← b ] | ai ← a ], so that (eval_2 P <a, b>) = (map (eval_1 P) a × b). Generalised to arbitrary depths: (v, d1) ⊗ (w, d2) = [ [ (vi, wj) | wj ← w ] | vi ← v ] if d1 > 0, d2 > 0; = [ (vi, w) | vi ← v ] if d1 > 0, d2 = 0; = [ (v, wj) | wj ← w ] if d1 = 0, d2 > 0; = (v, w) if d1 = 0, d2 = 0; ...and to n operands: ⊗i:1..n (vi, di). Finally, the general functional semantics for collection-based processing: (eval_l P <(v1, d1), ..., (vn, dn)>) = (P v1, ..., vn) if l = 0, and = (map (eval_{l−1} P) ⊗i:1..n (vi, di)) if l > 0. 23
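The four cases of the generalised cross product can be sketched as follows. This is an illustrative rendering of the definition, with d1 and d2 standing for the residual iteration depths of the two inputs, not an excerpt of the Taverna engine:

```python
def cross(v, d1, w, d2):
    # Pair up two inputs according to how many levels of iteration
    # each still requires -- the four cases of the definition above.
    if d1 > 0 and d2 > 0:
        return [[(vi, wj) for wj in w] for vi in v]
    if d1 > 0:  # d2 == 0: only v is iterated, w is passed whole
        return [(vi, w) for vi in v]
    if d2 > 0:  # d1 == 0: only w is iterated
        return [(v, wj) for wj in w]
    return (v, w)  # both atomic

pairs = cross([1, 2], 1, ["a", "b"], 1)
# -> [[(1, 'a'), (1, 'b')], [(2, 'a'), (2, 'b')]]
```

Note how an input with residual depth 0 (like v2 on the previous slide) is not iterated over but carried along into every tuple.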
  • 42. Parallelism in the dataflow model • The data-driven model with implicit iterations provides opportunities for parallel processing of workflows. Two types of parallelism: • intra-processor: implicit iteration over list data • inter-processor: pipelining. There is an implicit assumption of independence amongst the threads that operate on elements of a list. (Diagram: [ id1, id2, id3, ...] flows through processors SFH1..SFH3 and getDS1..getDS3 to [ DS1, DS2, DS3, ...].) 24
  • 43. Exploiting latent parallelism [ a, b, c,...] [ (echo_1 a), (echo_1 b), (echo_1 c)] (echo_2 (echo_1 a)) (echo_2 (echo_1 b)) (echo_2 (echo_1 c)) See also: http://www.myexperiment.org/workflows/1372.html 25
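Inter-processor pipelining, as in the echo_1/echo_2 chain above, can be illustrated with queue-connected worker threads: each stage forwards a result downstream as soon as it is computed, so list elements flow through the chain without any stage waiting for the whole list. This is a generic sketch of the idea, not how Taverna's engine is implemented:

```python
import queue
import threading

def stage(fn, inq, outq):
    # One pipeline stage: consume items as they arrive and emit
    # results downstream immediately; None is the end-of-stream sentinel.
    while True:
        item = inq.get()
        if item is None:
            outq.put(None)
            return
        outq.put(fn(item))

# Two chained stages, analogous to echo_1 -> echo_2.
q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x + "-a", q0, q1)).start()
threading.Thread(target=stage, args=(lambda x: x + "-b", q1, q2)).start()

for item in ["id1", "id2", "id3"]:
    q0.put(item)
q0.put(None)

results = []
while (item := q2.get()) is not None:
    results.append(item)
# results == ["id1-a-b", "id2-a-b", "id3-a-b"]
```

Because each stage here is a single thread reading from a FIFO queue, output order matches input order; adding intra-processor parallelism (several threads per stage) would trade that ordering for throughput, which is why the engine re-assembles list results by index.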
  • 44. Performance - experimental setup • previous version of Taverna engine used as baseline • objective: to measure incremental improvement list generator Parameters: multiple parallel pipelines - byte size of list elements (strings) - size of input list - length of linear chain main insight: when the workflow is designed for pipelining, parallelism is exploited effectively 26
  • 45. Performance study: Experimental setup - I • Programmatically generated dataflows – the “T-towers” parameters: - size of the lists involved - length of the paths - includes one cross product 27
  • 46–50. caGrid workflow for performance analysis Goal: perform cancer diagnosis using microarray analysis – learn a model for lymphoma type prediction based on samples from different lymphoma types. Steps: lymphoma samples ➔ hybridization data; process microarray data as training dataset; learn predictive model. Source: caGrid, http://www.myexperiment.org/workflows/746 28
  • 51. Results I - Memory usage shorter execution time due to pipelining T2 main memory data management T2 embedded Derby back-end T1 baseline list size: 1,000 strings of 10K chars each no intra-processor parallelism (1 thread/processor) 29
  • 53. Results II - Available processors pool pipelining in T2 makes up for smaller pools of threads/processor 30
  • 54. Results III - Bounded main memory usage Separation of data and process spaces ensures scalable data management varying data element size: 10K, 25K, 100K chars 31
  • 55. Ongoing effort: Taverna on the Cloud • Early experiments on running multiple instances of Taverna workflows in a cloud environment • Coarse-grained cloud deployment: workflow-at-a-time – data partitioning ➔ each partition is allocated to a workflow instance For more details, please see: Paul Fisher, ECCB talk slides, October, 2010 32
  • 56. Summary • Workflows: high-level programming paradigm • Bridges the gap between scientists and developers • Many workflow models available (commercial/open source) • Taverna implements a dataflow model – has proven useful for a broad variety of scientific applications Strengths: • Rapid prototyping given a base of third-party or own services • Explicit modelling of data integration processes • Extensibility: – for workflow designers: easy to import third-party services (SOAP, REST) – accepts scripts in a variety of languages – for developers: easy to add functionality using a plugin model • Good potential for parallelisation • Early experiments on cloud deployment: workflow-at-a-time – ongoing study for finer-grain deployment of portions of the workflow 33 Back to the start
  • 57. ADDITIONAL MATERIAL • Provenance of workflow data • Provenance and Trust of Web data 34
  • 58. Example workflow (Taverna) chr: 17 QTL → start: 28500000 end: 3000000 Ensembl Genes Ensembl Gene → Ensembl Gene → Uniprot Gene Entrez Gene Uniprot Gene → Entrez Gene → Kegg Gene Kegg Gene merge gene IDs Gene → Pathway path:mmu04210 Apoptosis, path:mmu04010 MAPK, ...
  • 59. Baseline provenance of a workflow run Example recorded derivations: path:mmu04010 → derives_from → mmu:26416; path:mmu04012 → derives_from → mmu:12575 • The graph encodes all direct data dependency relations • Baseline query model: compute paths amongst sets of nodes • Transitive closure over data dependency relations 36
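The baseline query model, a transitive closure over derives_from edges, can be sketched as a plain reachability search. The edge data below is hypothetical, chosen to echo the slide's examples:

```python
def derives_from_closure(edges, item):
    """All data items `item` transitively derives from, given a map of
    direct derives_from edges (item -> list of immediate sources)."""
    seen, frontier = set(), [item]
    while frontier:
        node = frontier.pop()
        for src in edges.get(node, []):
            if src not in seen:
                seen.add(src)
                frontier.append(src)
    return seen

# Hypothetical direct-dependency edges recorded during a run.
edges = {
    "path:mmu04010": ["mmu:26416"],    # pathway derives from a KEGG gene id
    "mmu:26416": ["qtl-region-17"],    # hypothetical upstream QTL region
}
sources = derives_from_closure(edges, "path:mmu04010")
# -> {"mmu:26416", "qtl-region-17"}
```

Computing this closure on the run's provenance graph is exactly the cost that the "efficient query processing" result later avoids by using the workflow graph as an index.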
  • 60. Motivation for fine-grained provenance List-structured KEGG gene ids: [ [ mmu:26416 ], [ mmu:328788 ] ] [ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ] [ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]
• 65.-67. Efficient query processing: main result (animation build, slides 65-67)
  (figure: the small static workflow graph, with processors Q, R and P over
  ports X and Y, shown next to the provenance graph of one run; collection
  elements carry index annotations such as [1] and [n];
  y = [ [y11 ... y1n], ... [ym1 ... ymn] ])
  • Query the provenance of individual collection elements
  • But avoid computing transitive closures on the provenance graph
  • Use the workflow graph as an index instead
  • Exploit the workflow model semantics to statically predict dependencies on individual tree elements
  • This results in a substantial performance improvement for typical queries
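The "workflow graph as index" idea can be sketched as follows: before touching the large, per-item provenance graph, consult the small static processor-level graph to decide which dependency paths are possible at all. This is a toy sketch under assumed data structures, not the actual Taverna implementation.

```python
# Static processor-level workflow graph (tiny, fixed per workflow).
# The per-run provenance graph, by contrast, has one node per data item.
workflow_edges = {"X": ["Q"], "Q": ["R"], "R": ["Y"]}

def processor_paths(src, dst, graph):
    """All processor-level paths from src to dst (DFS over the static graph)."""
    paths, stack = [], [[src]]
    while stack:
        path = stack.pop()
        if path[-1] == dst:
            paths.append(path)
        else:
            stack.extend(path + [n] for n in graph.get(path[-1], []))
    return paths

# Only when such a path exists do we need to inspect fine-grained, per-element
# provenance records; if the static graph has no X -> Y path, the answer is
# "no dependency" without any transitive closure over the run's data.
print(processor_paths("X", "Y", workflow_edges))  # [['X', 'Q', 'R', 'Y']]
```

Because the static graph typically has tens of nodes while a run can produce thousands of data items, pruning queries this way is what yields the performance improvement claimed on the slide.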
• 68. Trust and provenance for Web data
  • Testimonials: http://www.w3.org/2005/Incubator/prov/
    – "At the toolbar (menu, whatever) associated with a document there is a button marked 'Oh, yeah?'. You press it when you lose that feeling of trust." - Tim Berners-Lee, Web Design Issues, September 1997
    – "Provenance is the number one issue we face when publishing government data as linked data for data.gov.uk" - John Sheridan, UK National Archives, data.gov.uk, February 2010
  • But how exactly is provenance-based quality checking going to work?
  • Upcoming W3C Working Group on Provenance for Web data - a European initiative, chaired by Luc Moreau (Southampton) and Paul Groth (NL)
• 69. Provenance graphs and belief networks
  Intuition: as news items propagate, so do trust and quality judgments about them
  • Is there a principled way to model this?
  • Idea: explore the conceptual similarities between provenance graphs and belief networks (i.e., Bayesian networks)
  (figure: a standard Bayesian network example)
• 70.-72. From process graph to provenance graph (animation build, slides 70-72)
  (figure: a process graph over data items dx, d1, d1', d2, d2', processes P1
  and P2, publishing services S1 and S2, author A1 and curator C1, spanning
  data production and data publishing, with quality control points marked;
  the corresponding provenance graph for dx links these items through "used",
  "was generated by", "published" and "was published by" relations)
• 73. From provenance graph to belief network
  (figure: the provenance graph for dx, annotated with conditional probability
  tables (CPTs) and a quality control point, QCP)
  • Assume judgments are available at the QCPs
  • Where do the remaining conditional probabilities come from?
  • Can judgments be propagated here?
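The propagation question on this slide can be made concrete with a single-step numeric example: given a quality judgment recorded at one quality control point, a conditional probability table lets us compute a belief about the derived item. All probability values here are invented for illustration.

```python
# Toy propagation of a quality judgment along one provenance edge d1 -> d2.
# P(d1 good) is the judgment recorded at the QCP; the CPT for the deriving
# process P2 is assumed (this is exactly the open question on the slide:
# where do these conditional probabilities come from?).
p_d1_good = 0.9                      # judgment at the quality control point

p_d2_good_given_d1_good = 0.95       # assumed CPT entries for process P2
p_d2_good_given_d1_bad  = 0.10

# Law of total probability: marginalize over the quality of d1
p_d2_good = (p_d2_good_given_d1_good * p_d1_good
             + p_d2_good_given_d1_bad * (1 - p_d1_good))
print(round(p_d2_good, 3))           # 0.865
```

A full treatment would repeat this step along every edge of the provenance graph, which is precisely how a belief network evaluates evidence; the structural analogy between the two graph types is what the slide proposes to exploit.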

Editor's Notes

  10. Workflow walk-through:
    - run using 10 max hits and default inputs
    - open processors and Beanshell boxes as a preview of what the workflow contains
    - data dependencies; no control dependencies
    - structural nesting / modularization
    - simple input is a query string
    - processors are either service operation invocations or shell-type scripts
    - execution is all local except the calls out to the services; the shell interpreters are local as well
    - note the areas of the workbench
    - zoom into any of the nested workflows
    - show intermediate values
  11.-13. In scope:
    - design: features available through the workbench
    - execution: local mode and server-based execution
    - BioCatalogue, myExperiment if time permits or on demand
  14.-15. Taverna workflows are essentially programmable service orchestrations; Taverna as a data integration model
  26. Key observation: one can add one's own services, but then there is very little support for connecting their ports (no type system, for example)
  28.-30. The task becomes a processor when it is added to a workflow; the processor has one port for each operation
  31. Have an Rserve running locally; start it like so:
    library(Rserve)
    Rserve(args="--no-save")
    Then start a new workflow and add an R script to it with this content:
    png(g);
    plot(rnorm(1:100));
    dev.off();
    OR: load R-simple-graphics.t2flow
  32. Load the example weather workflow in T2.2.0: example_workflow_for_rest_and_xpath_activities_650957.t2flow (it won't work in earlier versions, as these are new plugins)
  33. Show the BIOAid plugin in my 2.1.2
  42.-43. Run the workflow spreadsheed_data_import_example_492836.t2flow in 2.2.0
  47. Reload KEGG-genes-enzymes-atomicInput.t2flow; declare the input geneID to be of depth 1, and input two genes:
    mmu:26416
    mmu:328788
  59. Demo: show this workflow in action: generatedLargeList.t2flow, with I1 = 10 and list size = 10
  74. On a typical QTL region with ~150k base pairs, one execution of this workflow finds about 50 Ensembl genes. These could correspond to about 30 genes in the UniProt database and 60 in the NCBI Entrez Gene database. Each gene may be involved in a number of pathways; for example, the mouse genes Mapk13 (mmu:26415) and Cdkn1a (mmu:12575) participate in 13 and 9 pathways, respectively.
  84. http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/#News_Aggregator_Scenario