SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Downloaden Sie, um offline zu lesen
, reloaded



           Paolo Missier, Stian Soiland-Reyes, Stuart Owen, Alex Nenadic,
                        Ian Dunlop, Alan Williams, Carole Goble
                               School of Computer Science
                               University of Manchester, UK

         Wei Tan
                                                                       Tom Oinn
Argonne National Laboratory
     Argonne, IL, USA                                                   EBI, UK


                                     SSDBM
                               Heidelberg, Germany
                               June 30 - July 2, 2010
Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types


source: caGrid




http://www/myexperiment.org/workflows/746                  2
Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types

                                          lymphoma samples
source: caGrid                            ➔ hybridization data




http://www/myexperiment.org/workflows/746                        2
Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types

                                          lymphoma samples
source: caGrid                            ➔ hybridization data




   process microarray
 data as training dataset




http://www/myexperiment.org/workflows/746                        2
Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types

                                          lymphoma samples
source: caGrid                            ➔ hybridization data




   process microarray
 data as training dataset
                                     learn
                              predictive model


http://www/myexperiment.org/workflows/746                        2
Taverna: scientific workflow management
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types

                                          lymphoma samples
source: caGrid                            ➔ hybridization data




http://www/myexperiment.org/workflows/746                        2
In this (short) talk...




• Taverna for data-intensive science
  – architectural goals... and how they are being achieved


• Performance figures on benchmark workflows




                                                             3
Target -ilities
• Scalability
  – size of input data / size of input collections

• Configurability
  – each processor in the workflow individually tunable to
    adjust to the underlying platform

• Extensibility
  – plugin architecture to accommodate specific service
    groups
     • caGrid, BioMoby

  – extensible set of operators that define the semantics of the
    model


                                                                   4
Opportunities for parallel processing
• intra-processor: implicit iteration over list data
• inter-processor: pipelining
               [ id1, id2, id3, ...]




                                                5
Opportunities for parallel processing
• intra-processor: implicit iteration over list data
• inter-processor: pipelining
                  [ id1, id2, id3, ...]


           SFH1        SFH2          SFH3

            ...          ...              ...



         getDS1       getDS2         getDS3




                                                5
Opportunities for parallel processing
• intra-processor: implicit iteration over list data
• inter-processor: pipelining
                   [ id1, id2, id3, ...]


           SFH1         SFH2          SFH3

            ...           ...              ...



         getDS1        getDS2         getDS3


                  [ DS1, DS2, DS3, ...]




                                                 5
Configurable processor behaviour


                                    allocates concurrent
                                                  threads to list elements
             Default dispatch stack




Request
                   Parallelise

                   Provenance
                                                 captures service
                 Error bounce                    invocation events
                    Failover




                                      Response
                     Retry                       persistent server error:
                                                 tries each available
                  Stop/pause
                                                 service implementation
                     Invoke                      in turn

                                                 transient errors:
          interacts with                         retries one service
          underlying activity                    invocation
          - service operation
          - shell script execution
Extensible processor semantics
          

            Default dispatch stack
Request



                  Parallelise

                  Provenance

                Error bounce

                   Failover




                                     Response
                    Retry

                 Stop/pause

                    Invoke




                                                                         7
Extensible processor semantics
          

            Default dispatch stack                            Default dispatch stack                        Extended dispatch stack




                                                   Request




                                                                                                  Request
Request



                  Parallelise
                                                Enhancing dataflow model                                           Parallelise
                                                with limited form of loops to
                                                                Parallelise
                  Provenance                                                                                      Error bounce
                                                model busy waiting
                                                              Error bounce
                Error bounce                    (asynchronous service                                           Loop/Branching
                                                interaction)     Failover
                   Failover                                                                                         Failover




                                     Response
                                                                      Retry




                                                                                       Response




                                                                                                                                      Response
                    Retry                                                                                            Retry
                                                                     Invoke
                 Stop/pause                                                                                         Invoke
                    Invoke
                                                                       (a)                                           (b)




                                                                                                                                 7
Extensible processor semantics
          

            Default dispatch stack                                                     Default dispatch stack                               Extended dispatch stack




                                                                   Request




                                                                                                                                  Request
Request



                  Parallelise
                                                         Enhancing dataflow model                                                                  Parallelise
                                                         with limited form of loops to
                                                                         Parallelise
                  Provenance                                                                                                                      Error bounce
                                                         model busy waiting
                                                                       Error bounce
                Error bounce                             (asynchronous service                                                                  Loop/Branching
                                                         interaction)     Failover
                   Failover                                                                                                                         Failover




                                          Response
                                                                                                      Retry




                                                                                                                       Response




                                                                                                                                                                      Response
                    Retry                                                                                                                            Retry
                                                                                                      Invoke
                 Stop/pause                                                                                                                         Invoke
                                                         ."4&04&             7';'3%
                    Invoke                                    ;(4)4&0;=;"2.'4

                                                        '&&'./304&5,%&              +"#,-               (a)                                          (b)
                                !"#$#%&'()(*+,-%$.*//$0*1

                                                          2*(34/*56#%.8&9
                                                                                              +"#,-           &670

          Busy waiting                                         !"#$)*                          80&$9:5$;0%(<&


          using an
                                                                                            ;0%(<&    '&&'./304&5,%&


          explicit                                              +"#,-

          workflow                                        ./0.12&'&(%

                                                     %&'&(%     '&&'./304&5,%&
          pattern                2*(34/*5678&.8&9
                                                                    +"#$%&'&(%

                                      !"#$%&'&(%                        )%$-"40

                                                                        ,%$-"40


                                                                             &0%&

                                                                        2(..0%%



                                                                                                                                                                 7
Extensible processor semantics
          

            Default dispatch stack                                                     Default dispatch stack                                   Extended dispatch stack




                                                                   Request




                                                                                                                                     Request
Request



                  Parallelise
                                                         Enhancing dataflow model                                                                           Parallelise
                                                         with limited form of loops to
                                                                         Parallelise
                  Provenance                                                                                                                              Error bounce
                                                         model busy waiting
                                                                       Error bounce
                Error bounce                             (asynchronous service                                                                          Loop/Branching
                                                         interaction)     Failover
                   Failover                                                                                                                                      Failover




                                          Response
                                                                                                      Retry




                                                                                                                          Response




                                                                                                                                                                                            Response
                    Retry                                                                                                                                          Retry
                                                                                                      Invoke
                 Stop/pause                                                                                                                                       Invoke
                                                         ."4&04&             7';'3%
                    Invoke                                    ;(4)4&0;=;"2.'4

                                                        '&&'./304&5,%&              +"#,-               (a)                                                         (b)
                                !"#$#%&'()(*+,-%$.*//$0*1

                                                          2*(34/*56#%.8&9
                                                                                              +"#,-           &670                             (3/360      4"7&)7&

          Busy waiting                                         !"#$)*                          80&$9:5$;0%(<&                                    /1787&)/9/":437

          using an
                                                                                            ;0%(<&    '&&'./304&5,%&                           3&&3456)7&.$0&    !"#$%


          explicit                                              +"#,-                                                                                                     !"#$%

          workflow                                        ./0.12&'&(%

                                                     %&'&(%     '&&'./304&5,%&
                                                                                                                       Busy waiting                                  45)4;:&3&10

                                                                                                                                                                3&&3456)7&.$0&     0&3&10
          pattern                                                                                                      using an
                                 2*(34/*5678&.8&9
                                                                    +"#$%&'&(%                                         loop                                 !"#$%           &'()

                                      !"#$%&'&(%                        )%$-"40

                                                                        ,%$-"40
                                                                                                                       processor                            *)&+,-.+/)012&

                                                                                                                                                         /)012&     3&&3456)7&.$0&


                                                                             &0%&

                                                                        2(..0%%



                                                                                                                                                                                      7
Performance experimental setup
• previous version of Taverna engine used as baseline
• objective: to measure incremental improvement

                                 list generator
                                                      Parameters:
                                                      - byte size of list elements (strings)
                                                      - size of input list
                                                      - length of linear chain
   multiple parallel pipelines




                                          main insight: when the workflow is designed for
                                          pipelining, parallelism is exploited effectively




                                                                                               8
I - Memory usage

                                      shorter execution
                                      time due to pipelining
              T2 main memory
              data management


            T2 main embedded Derby back-end



                        T1 baseline




list size: 1,000 strings of 10K chars each
no intra-processor parallelism (1 thread/processor)


                                                                   9
I - Memory usage

                                      shorter execution
                                      time due to pipelining
              T2 main memory
              data management


            T2 main embedded Derby back-end



                        T1 baseline




list size: 1,000 strings of 10K chars each
no intra-processor parallelism (1 thread/processor)


                                                                   9
Available processors pool




pipelining in T2 makes up for smaller pools of threads/processor




                                                                   10
Bounded main memory usage
Separation of data and process spaces ensures scalable data management




                  varying data element size:
                  10K, 25K, 100K chars




                                                                         11
Summary
• Taverna 2 re-engineered for scalability on data-
  intensive pipelined dataflows

• separation of data and process spaces
• generic stack pattern for principled extensions
  – limited while loop, if-then-else
  – provenance capture
  – ... open plugin architecture accommodates further
    extensions



      For further details and a nice chat:
               please visit our poster!!


                                                           12

Weitere ähnliche Inhalte

Ähnlich wie Paper presentation: Taverna, reloaded

Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate...
Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate...Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate...
Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate...Eric Stephan
 
Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Aditya Varun Chadha
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the DatabusAmy W. Tang
 
[RAT資安小聚] Study on Automatically Evading Malware Detection
[RAT資安小聚] Study on Automatically Evading Malware Detection[RAT資安小聚] Study on Automatically Evading Malware Detection
[RAT資安小聚] Study on Automatically Evading Malware DetectionAj MaChInE
 
20111104 s4 overview
20111104 s4 overview20111104 s4 overview
20111104 s4 overviewLeo Neumeyer
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in ScalaAlex Payne
 
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Shirshanka Das
 
Wolstencroft K - Workflows on the Cloud: scaling for national service
Wolstencroft K - Workflows on the Cloud: scaling for national serviceWolstencroft K - Workflows on the Cloud: scaling for national service
Wolstencroft K - Workflows on the Cloud: scaling for national serviceJan Aerts
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
 
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...netvis
 
Web Sphere Problem Determination Ext
Web Sphere Problem Determination ExtWeb Sphere Problem Determination Ext
Web Sphere Problem Determination ExtRohit Kelapure
 
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...Jan Aerts
 
#lspe: Dynamic Scaling
#lspe: Dynamic Scaling #lspe: Dynamic Scaling
#lspe: Dynamic Scaling steveshah
 
Rails Application Optimization Techniques & Tools
Rails Application Optimization Techniques & ToolsRails Application Optimization Techniques & Tools
Rails Application Optimization Techniques & Toolsguest05c09d
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
 
Testing concurrent java programs - Sameer Arora
Testing concurrent java programs - Sameer AroraTesting concurrent java programs - Sameer Arora
Testing concurrent java programs - Sameer AroraIndicThreads
 
Service Mesh vs. Frameworks: Where to put the resilience?
Service Mesh vs. Frameworks: Where to put the resilience?Service Mesh vs. Frameworks: Where to put the resilience?
Service Mesh vs. Frameworks: Where to put the resilience?Michael Hofmann
 
Service Mesh vs. Frameworks: Where to put the resilience?
Service Mesh vs. Frameworks: Where to put the resilience?Service Mesh vs. Frameworks: Where to put the resilience?
Service Mesh vs. Frameworks: Where to put the resilience?Michael Hofmann
 

Ähnlich wie Paper presentation: Taverna, reloaded (20)

Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate...
Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate...Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate...
Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate...
 
Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the Databus
 
2013-01-17 Research Object
2013-01-17 Research Object2013-01-17 Research Object
2013-01-17 Research Object
 
[RAT資安小聚] Study on Automatically Evading Malware Detection
[RAT資安小聚] Study on Automatically Evading Malware Detection[RAT資安小聚] Study on Automatically Evading Malware Detection
[RAT資安小聚] Study on Automatically Evading Malware Detection
 
20111104 s4 overview
20111104 s4 overview20111104 s4 overview
20111104 s4 overview
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in Scala
 
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
 
Wolstencroft K - Workflows on the Cloud: scaling for national service
Wolstencroft K - Workflows on the Cloud: scaling for national serviceWolstencroft K - Workflows on the Cloud: scaling for national service
Wolstencroft K - Workflows on the Cloud: scaling for national service
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
 
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
 
Web Sphere Problem Determination Ext
Web Sphere Problem Determination ExtWeb Sphere Problem Determination Ext
Web Sphere Problem Determination Ext
 
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
 
#lspe: Dynamic Scaling
#lspe: Dynamic Scaling #lspe: Dynamic Scaling
#lspe: Dynamic Scaling
 
Rails Application Optimization Techniques & Tools
Rails Application Optimization Techniques & ToolsRails Application Optimization Techniques & Tools
Rails Application Optimization Techniques & Tools
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
Testing concurrent java programs - Sameer Arora
Testing concurrent java programs - Sameer AroraTesting concurrent java programs - Sameer Arora
Testing concurrent java programs - Sameer Arora
 
C4Bio paper talk
C4Bio paper talkC4Bio paper talk
C4Bio paper talk
 
Service Mesh vs. Frameworks: Where to put the resilience?
Service Mesh vs. Frameworks: Where to put the resilience?Service Mesh vs. Frameworks: Where to put the resilience?
Service Mesh vs. Frameworks: Where to put the resilience?
 
Service Mesh vs. Frameworks: Where to put the resilience?
Service Mesh vs. Frameworks: Where to put the resilience?Service Mesh vs. Frameworks: Where to put the resilience?
Service Mesh vs. Frameworks: Where to put the resilience?
 

Mehr von Paolo Missier

(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?Paolo Missier
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...Paolo Missier
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...Paolo Missier
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 

Mehr von Paolo Missier (20)

(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 

Kürzlich hochgeladen

Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Paige Cruz
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...FIDO Alliance
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...ScyllaDB
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)Wonjun Hwang
 
How to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in PakistanHow to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in Pakistandanishmna97
 
الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهMohamed Sweelam
 
Navigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiNavigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiRaviKumarDaparthi
 
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...SOFTTECHHUB
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdfMuhammad Subhan
 

Kürzlich hochgeladen (20)

Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)
 
How to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in PakistanHow to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in Pakistan
 
الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهله
 
Navigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiNavigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi Daparthi
 
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 

Paper presentation: Taverna, reloaded

  • 1. , reloaded Paolo Missier, Stian Soiland-Reyes, Stuart Owen, Alex Nenadic, Ian Dunlop, Alan Williams, Carole Goble School of Computer Science University of Manchester, UK Wei Tan Tom Oinn Argonne National Laboratory Argonne, IL, USA EBI, UK SSDBM Heidelberg, Germany June 30 - July 2, 2010
  • 2. Taverna: scientific workflow management Goal: perform cancer diagnosis using microarray analysis - learn a model for lymphoma type prediction based on samples from different lymphoma types source: caGrid http://www/myexperiment.org/workflows/746 2
  • 3. Taverna: scientific workflow management Goal: perform cancer diagnosis using microarray analysis - learn a model for lymphoma type prediction based on samples from different lymphoma types lymphoma samples source: caGrid ➔ hybridization data http://www/myexperiment.org/workflows/746 2
  • 4. Taverna: scientific workflow management Goal: perform cancer diagnosis using microarray analysis - learn a model for lymphoma type prediction based on samples from different lymphoma types lymphoma samples source: caGrid ➔ hybridization data process microarray data as training dataset http://www/myexperiment.org/workflows/746 2
  • 5. Taverna: scientific workflow management Goal: perform cancer diagnosis using microarray analysis - learn a model for lymphoma type prediction based on samples from different lymphoma types lymphoma samples source: caGrid ➔ hybridization data process microarray data as training dataset learn predictive model http://www/myexperiment.org/workflows/746 2
  • 6. Taverna: scientific workflow management Goal: perform cancer diagnosis using microarray analysis - learn a model for lymphoma type prediction based on samples from different lymphoma types lymphoma samples source: caGrid ➔ hybridization data http://www/myexperiment.org/workflows/746 2
  • 7. In this (short) talk... • Taverna for data-intensive science – architectural goals... and how they are being achieved • Performance figures on benchmark workflows 3
  • 8. Target -ilities • Scalability – size of input data / size of input collections • Configurability – each processor in the workflow individually tunable to adjust to the underlying platform • Extensibility – plugin architecture to accommodate specific service groups • caGrid, BioMoby – extensible set of operators that define the semantics of the model 4
  • 9. Opportunities for parallel processing • intra-processor: implicit iteration over list data • inter-processor: pipelining [ id1, id2, id3, ...] 5
  • 10. Opportunities for parallel processing • intra-processor: implicit iteration over list data • inter-processor: pipelining [ id1, id2, id3, ...] SFH1 SFH2 SFH3 ... ... ... getDS1 getDS2 getDS3 5
  • 11. Opportunities for parallel processing • intra-processor: implicit iteration over list data • inter-processor: pipelining [ id1, id2, id3, ...] SFH1 SFH2 SFH3 ... ... ... getDS1 getDS2 getDS3 [ DS1, DS2, DS3, ...] 5
  • 12. Configurable processor behaviour  allocates concurrent threads to list elements Default dispatch stack Request Parallelise Provenance captures service Error bounce invocation events Failover Response Retry persistent server error: tries each available Stop/pause service implementation Invoke in turn transient errors: interacts with retries one service underlying activity invocation - service operation - shell script execution
  • 13. Extensible processor semantics  Default dispatch stack Request Parallelise Provenance Error bounce Failover Response Retry Stop/pause Invoke 7
  • 14. Extensible processor semantics  Default dispatch stack Default dispatch stack Extended dispatch stack Request Request Request Parallelise Enhancing dataflow model Parallelise with limited form of loops to Parallelise Provenance Error bounce model busy waiting Error bounce Error bounce (asynchronous service Loop/Branching interaction) Failover Failover Failover Response Retry Response Response Retry Retry Invoke Stop/pause Invoke Invoke (a) (b) 7
  • 15. Extensible processor semantics  Default dispatch stack Default dispatch stack Extended dispatch stack Request Request Request Parallelise Enhancing dataflow model Parallelise with limited form of loops to Parallelise Provenance Error bounce model busy waiting Error bounce Error bounce (asynchronous service Loop/Branching interaction) Failover Failover Failover Response Retry Response Response Retry Retry Invoke Stop/pause Invoke ."4&04& 7';'3% Invoke ;(4)4&0;=;"2.'4 '&&'./304&5,%& +"#,- (a) (b) !"#$#%&'()(*+,-%$.*//$0*1 2*(34/*56#%.8&9 +"#,- &670 Busy waiting !"#$)* 80&$9:5$;0%(<& using an ;0%(<& '&&'./304&5,%& explicit +"#,- workflow ./0.12&'&(% %&'&(% '&&'./304&5,%& pattern 2*(34/*5678&.8&9 +"#$%&'&(% !"#$%&'&(% )%$-"40 ,%$-"40 &0%& 2(..0%% 7
  • 16. Extensible processor semantics  Default dispatch stack Default dispatch stack Extended dispatch stack Request Request Request Parallelise Enhancing dataflow model Parallelise with limited form of loops to Parallelise Provenance Error bounce model busy waiting Error bounce Error bounce (asynchronous service Loop/Branching interaction) Failover Failover Failover Response Retry Response Response Retry Retry Invoke Stop/pause Invoke ."4&04& 7';'3% Invoke ;(4)4&0;=;"2.'4 '&&'./304&5,%& +"#,- (a) (b) !"#$#%&'()(*+,-%$.*//$0*1 2*(34/*56#%.8&9 +"#,- &670 (3/360 4"7&)7& Busy waiting !"#$)* 80&$9:5$;0%(<& /1787&)/9/":437 using an ;0%(<& '&&'./304&5,%& 3&&3456)7&.$0& !"#$% explicit +"#,- !"#$% workflow ./0.12&'&(% %&'&(% '&&'./304&5,%& Busy waiting 45)4;:&3&10 3&&3456)7&.$0& 0&3&10 pattern using an 2*(34/*5678&.8&9 +"#$%&'&(% loop !"#$% &'() !"#$%&'&(% )%$-"40 ,%$-"40 processor *)&+,-.+/)012& /)012& 3&&3456)7&.$0& &0%& 2(..0%% 7
  • 17. Performance experimental setup • previous version of Taverna engine used as baseline • objective: to measure incremental improvement list generator Parameters: - byte size of list elements (strings) - size of input list - length of linear chain multiple parallel pipelines main insight: when the workflow is designed for pipelining, parallelism is exploited effectively 8
  • 18. I - Memory usage shorter execution time due to pipelining T2 main memory data management T2 main embedded Derby back-end T1 baseline list size: 1,000 strings of 10K chars each no intra-processor parallelism (1 thread/processor) 9
  • 19. I - Memory usage shorter execution time due to pipelining T2 main memory data management T2 main embedded Derby back-end T1 baseline list size: 1,000 strings of 10K chars each no intra-processor parallelism (1 thread/processor) 9
  • 20. Available processors pool pipelining in T2 makes up for smaller pools of threads/processor 10
  • 21. Bounded main memory usage Separation of data and process spaces ensures scalable data management varying data element size: 10K, 25K, 100K chars 11
  • 22. Summary • Taverna 2 re-engineered for scalability on data- intensive pipelined dataflows • separation of data and process spaces • generic stack pattern for principled extensions – limited while loop, if-then-else – provenance capture – ... open plugin architecture accommodates further extensions For further details and a nice chat: please visit our poster!! 12