Provenance and data differencing for
       workflow reproducibility analysis




       Paolo Missier
School of Computing Science
  Newcastle University, UK



  Humboldt University, Berlin
      March 4, 2013
Provenance Metadata
    Provenance refers to the sources of information, including entities and processes, involved
    in producing or delivering an artifact (*)

    Provenance is a description of how things came to be, and how they came to be in the state
    they are in today (*)


    Why does it matter?

    • To establish quality, relevance, trust
    • To track information attribution through complex transformations
    • To describe one’s experiment to others, for understanding / reuse
    • To provide evidence in support of scientific claims
    • To enable post hoc process analysis for debugging, improvement, evolution




    (*) Definitions proposed by the W3C Incubator Group on provenance:
    http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
A colourful provenance graph
    [Figure: a provenance graph spanning an editing phase and a publishing phase, laid out from
    remote past to recent past. Entities: paper3, draft v1, draft comments, draft v2, pub guidelines
    v1 and v2, WD1. Activities: reading, drafting, commenting, editing, publication, guideline update.
    Agents: Bob (with specializations Bob-1 and Bob-2), Alice, Charlie, w3c:consortium. Relations:
    used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf,
    specializationOf.]
Motivation: Reproducibility in e-science
    • Setting: Collaborative, Open Science
       – Increasing rate of data sharing in science


    • The stick: both journals and funders demand that data be uploaded
       – Multiple data journals, data repositories emerging
    • The carrot: data is given a DOI and is citable, scientists get credit
         • Thomson’s Data Citation Index
         • Dryad data repository for biosciences(*)
         • The DataBib repository of research data
         • NSF Data Preservation projects: DataONE
            • best practices document: notebooks.dataone.org/bestpractices/
         •... and many others

     (*) As of Jan 27, 2013, Dryad contains 2585 data packages and 7097 data files, associated
     with articles in 187 journals.
General problems
    • Quality assurance
      – from non-malicious errors in method or data, all the way to fraud
      – ... leading to retractions in scientific publications
          • see e.g. http://retractionwatch.wordpress.com/


    • Repeatability
      – If I replicate your experiment / repeat your process on the same data, will I get the
        same results?


    • Reproducibility -- a more general notion

       The ability of a third party who has access to the description of the
       original experiment and its results to reproduce those results, using a
       possibly different setting, with the goal of confirming or disputing the
       original experimenter’s claims.




Specifically, in e-science...
    • Experimental method → scripts, programs, workflows

    • Publication = results + {program, workflow} + evidence of results

    • Repeatability, reproducibility
       – will I be able to run (my version of) your workflow on (my version of) your input
         and compare my results to yours?


    • Evidence of result: provenance of {program, workflow} execution

    • Side note: portability issues are out of scope
       – VMs often solve the problem, with some limitations
          • not when workflows depend on third party services
          • only for limited size data dependencies



                          Main issue: Workflow evolution and decay

Mapping the reproducibility space
    [Figure: the reproducibility space, spanned by two axes]

    Experimental variations:
      wf, d               repeatability: results confirmation
      wf → wf', d         method variation
      wf, d → d'          data variation
      wf → wf', d → d'    reproducibility: data and method variation

    Environmental variations:
      ED, wfs                 unchanged environment
      ED → ED', wfs → wfs'    decay region
         - dysfunctional workflow (service updates, state changes): divergence analysis
         - non-functioning workflow: exception analysis, debugging


    Goal: to help scientists understand the effect of workflow / data / dependencies
    evolution on workflow execution results
    Approach: compare provenance traces generated during the runs: PDIFF


    P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow
    reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013. In press.
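
    The experimental axis of the diagram can be read as a small decision rule. The sketch below
    is only an illustration of that reading: the function name and its boolean parameters are
    assumptions, not part of PDIFF or e-Science Central.

```python
def classify_variation(wf_changed: bool, data_changed: bool) -> str:
    """Return the diagram label for a pair of runs (hypothetical helper).

    wf_changed:   the workflow differs between the two runs (wf -> wf')
    data_changed: the input data differs between the two runs (d -> d')
    """
    if wf_changed and data_changed:
        return "data and method variation"
    if wf_changed:
        return "method variation"
    if data_changed:
        return "data variation"
    # Same workflow on the same data: the repeatability case.
    return "repeatability"


print(classify_variation(wf_changed=False, data_changed=True))
```

    The environmental axis (ED, wfs) is deliberately left out of this sketch.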
Decay
    • Workflows that have external dependencies are harder to maintain
       – they may become dysfunctional or break altogether




     Zhao, Jun, Jose Gomez-Perez, Khalid Belhajjame, Graham Klyne, et al. “Why Workflows Break -
       Understanding and Combating Decay in Taverna Workflows.” In Procs. E-science Conference.
       Chicago, 2012.
Workflows and provenance traces
    Workflow (structure): directed graph W=(T,E)
    T: set of tasks (computational units)
    P: set of (input, output) ports associated with each task t ∈ T
    E ⊂ T × T: graph edges representing data dependencies

    ⟨ti.pA,tj.pB⟩ ∈ E: data produced by ti on port pA ∈ P is routed to port pB ∈ P of tj



    Execution trace: tr = exec(W, d, ED, wfms)
    (W: workflow, d: input data, ED: dependencies, wfms: workflow engine)
    A: activities
    D: data items
    R = { used, genBy } relations: used ⊂ A × D × P, genBy ⊂ D × A × P


    Workflow inputs: tr.I = {d ∈ D | ∀a ∈ A, ∀p ∈ P: (d,a,p) ∉ genBy}

    Workflow outputs: tr.O = {d ∈ D | ∀a ∈ A, ∀p ∈ P: (a,d,p) ∉ used}
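
    The two definitions above can be sketched in a few lines of Python. This is an illustration
    only: the tuple layout follows used ⊂ A × D × P and genBy ⊂ D × A × P, while the function
    names and the tiny two-activity trace are hypothetical.

```python
def trace_inputs(data, gen_by):
    """tr.I: data items that no activity generated."""
    generated = {d for (d, _activity, _port) in gen_by}
    return data - generated


def trace_outputs(data, used):
    """tr.O: data items that no activity used."""
    consumed = {d for (_activity, d, _port) in used}
    return data - consumed


# Hypothetical trace: a1 reads d_in and writes d_mid; a2 reads d_mid and writes d_out.
data = {"d_in", "d_mid", "d_out"}
used = {("a1", "d_in", "P0"), ("a2", "d_mid", "P0")}
gen_by = {("d_mid", "a1", "P0"), ("d_out", "a2", "P0")}

print(trace_inputs(data, gen_by))   # the workflow input
print(trace_outputs(data, used))    # the workflow output
```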

Workflow evolution
                            tr = exec(W, d, ED, wfms)
     Each of the elements in an execution may evolve (semi) independently
     from the others:
                tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k < t

               time  W     ED     d     wfms     trace
               t1    W1    ED1    d1    wfms1    tr1 = exec1(W1,ED1,d1,wfms1)
               t2    W2                          tr2 = exec2(W2,ED1,d1,wfms1)
               t3          ED3    d3             tr3 = exec3(W2,ED3,d3,wfms1)
               t4                       wfms4    tr4 = exec4(W2,ED3,d3,wfms4)
               t5          ED5                   tr5 = exec5(W2,ED5,d3,wfms4)



      Repeatability:
      • Can tr_t be computed again at some time t’ > t?
      • Requires saving ED_t, which may be impractical (e.g. a large DB state)
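
      The table of changes can be replayed to recover the full configuration used by each run.
      A minimal sketch, assuming each row records only the elements that changed at that time
      and every other element keeps its latest version (the dictionary layout and the function
      name are hypothetical):

```python
# Elements that changed at each time step, copied from the table.
changes = {
    1: {"W": "W1", "ED": "ED1", "d": "d1", "wfms": "wfms1"},
    2: {"W": "W2"},
    3: {"ED": "ED3", "d": "d3"},
    4: {"wfms": "wfms4"},
    5: {"ED": "ED5"},
}


def configurations(changes):
    """Carry the latest version of each element forward through time."""
    current = {}
    result = {}
    for t in sorted(changes):
        current = {**current, **changes[t]}  # new dict, so earlier rows stay intact
        result[t] = current
    return result


for t, cfg in configurations(changes).items():
    print(f"tr{t} = exec{t}({cfg['W']},{cfg['ED']},{cfg['d']},{cfg['wfms']})")
```

      The printed lines correspond to the trace column of the table.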
Reproducibility
     Can a new version tr_t’ of tr_t be computed at some later time t’ > t, after one
     or more of the elements have changed?

                 tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k < t
                 tr_t’ = exec_t’(W_i’, ED_j’, d_h’, wfms_k’)

                  time  W     ED     d     wfms     trace
                  t1    W1    ED1    d1    wfms1    tr1 = exec1(W1,ED1,d1,wfms1)
                  t2    W2                          tr2 = exec2(W2,ED1,d1,wfms1)
                  t3          ED3    d3             tr3 = exec3(W2,ED3,d3,wfms1)
                  t4                       wfms4    tr4 = exec4(W2,ED3,d3,wfms4)
                  t5          ED5                   tr5 = exec5(W2,ED5,d3,wfms4)


     Potential issues:
               • W_i may not run with the new ED_j’
               • W_i may not run with wfms_k’
               • W_i’ may not run with d_h’
               • ...
Data divergence analysis using provenance
     • All work done with reference to the e-Science Central WFMS
     • Assumption: workflow WF_j (new version) runs to completion
        – thus it produces a new provenance trace
        – however, it may be dysfunctional relative to WF_i (the original)


     • Example: only input data changes: d ≠ d’, WF_j = WF_i
               tr_t = exec_t(W, ED, d, wfms), tr_t’ = exec_t’(W, ED, d’, wfms)


     [Figure: example workflow with services S0, S1, S2, S3 and S4]




     Note: results may diverge even when the input datasets are identical, for example when one
     or more of the services exhibits non-deterministic behaviour, or depends on external state
     that has changed between executions.
Reproducibility requires comparing datasets
     • Experimenters may validate results by deliberately altering the
       experimental settings (W_i′, d_j′)
     • The outcomes will not be identical, but
       – are they similar enough to the original to conclude that the experiment was
         successfully reproduced?

                               ∆_D(tr_t.O, tr_t′.O)

     • Data comparison is type- and format-dependent in general
     • Example:
       – workflow output: a classification model computed using model builders
       – two models may be different but statistically equivalent

     • e-Science Central accommodates user-defined data diff blocks
       – these are just Java-based workflow blocks



                          if ∆_D(tr_t.O, tr_t′.O) > threshold:
                          why are results diverging?
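
     The threshold test can be sketched as a pluggable comparison. This is an illustration in
     Python, not an e-Science Central diff block (those are Java-based workflow blocks); the
     mean absolute difference and both function names are assumptions chosen for the example.

```python
from typing import Callable, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """One possible type-specific diff: mean absolute difference of numeric outputs."""
    if len(a) != len(b) or not a:
        raise ValueError("outputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def diverges(out_old, out_new, diff: Callable, threshold: float) -> bool:
    """True when the user-supplied diff exceeds the threshold."""
    return diff(out_old, out_new) > threshold


old_output = [1.0, 2.0, 3.0]
new_output = [1.0, 2.5, 3.0]
if diverges(old_output, new_output, mean_abs_diff, threshold=0.1):
    print("outputs diverge: inspect the provenance traces")
```

     Passing the diff as a function keeps the comparison type- and format-dependent, as the
     slide requires; a model-equivalence test could be substituted without changing diverges.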
Provenance traces for two runs
     [Figure: the example workflow (services S0 to S4, ports P0 and P1) and the provenance traces
     of two runs, with edges labelled used and genBy.
     (i) Trace A: data items d1, d2, d3, z, w, x, y, df.
     (ii) Trace B: data items d1', d2', d3, z, w', x, y', df'.]
Delta graphs
     A delta graph is obtained by “diffing” two traces. It can be used to explain observed
     differences in workflow outputs in terms of differences throughout the two executions.
     [Figure: (i) Trace A and (ii) Trace B as on the previous slide; (iii) the delta tree, a
     single chain of data item pairs: (d2, d2'), (w, w'), (y, y'), (dF, dF').]
     This is the simplest possible delta “graph”!
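
     The idea behind the delta tree can be sketched as a backward walk over both traces that
     follows only the inputs whose values differ. This is a simplified illustration, not the
     PDIFF algorithm itself: the Trace class, the numeric values and the edges (read off the
     figure) are all assumptions, and both runs are assumed to share the same workflow.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    values: dict  # data item id -> value (hypothetical values)
    gen_by: dict  # data item id -> activity that generated it
    used: dict    # activity -> {port: data item id}


def delta(a, b, da, db, out=None):
    """Collect pairs of differing data items, walking both traces upstream."""
    out = [] if out is None else out
    if a.values[da] == b.values[db]:
        return out  # identical data: nothing upstream needs explaining
    out.append((da, db))
    act_a, act_b = a.gen_by.get(da), b.gen_by.get(db)
    if act_a is None or act_a != act_b:
        return out  # workflow input reached, or the traces no longer align
    for port in sorted(a.used[act_a].keys() & b.used[act_b].keys()):
        delta(a, b, a.used[act_a][port], b.used[act_b][port], out)
    return out


gen_by = {"z": "S0", "w": "S1", "x": "S3", "y": "S2", "df": "S4"}
trace_a = Trace(
    values={"d1": 1, "d2": 2, "d3": 3, "z": 10, "w": 20, "x": 30, "y": 40, "df": 50},
    gen_by=gen_by,
    used={"S0": {"P0": "d1"}, "S1": {"P0": "d2"}, "S3": {"P0": "d3"},
          "S2": {"P0": "z", "P1": "w"}, "S4": {"P0": "x", "P1": "y"}},
)
trace_b = Trace(
    values={"d1'": 99, "d2'": 7, "d3": 3, "z": 10, "w'": 21, "x": 30, "y'": 41, "df'": 51},
    gen_by={"z": "S0", "w'": "S1", "x": "S3", "y'": "S2", "df'": "S4"},
    used={"S0": {"P0": "d1'"}, "S1": {"P0": "d2'"}, "S3": {"P0": "d3"},
          "S2": {"P0": "z", "P1": "w'"}, "S4": {"P0": "x", "P1": "y'"}},
)

print(delta(trace_a, trace_b, "df", "df'"))
```

     In this example d1' differs from d1, but z is the same in both runs, so the walk stops at z
     and the pair (d1, d1') never enters the result.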
More involved workflow differences


     [Figure: two versions of a workflow, WA and WB]



           • S0 is followed by S0' in WA but not in WB;

           • S3 is preceded by S3' in WB but not in WA;

           • S2 in WA is replaced by a new version, S2v2, in WB;

           • S1 in WA is replaced by S5 in WB.
The corresponding traces
     tr_t = exec_t(W, ED, d, wfms), tr_t’ = exec_t’(W’, ED’, d, wfms)

     [Figure: provenance traces of the two workflow versions.
     (i) Trace A: services S, S0, S0', S1, S2, S3, S4; data items d0, d1, d2, w, h, k, y, z, x.
     (ii) Trace B: services Sv2, S0, S3', S5, S2v2, S3, S4; data items d0, d1', d2, w', h', k', y', z', x'.]
Delta graph computed by PDIFF
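     [Figure: delta graph relating the output pair (x, x') to its causes, with one branch per
     input port of S4.
     P0 branch of S4: (y, y'), (w, w'), and nodes for S0' and S3'.
     P1 branch of S4: (z, z'), then (S2, S2v2) (version change), which splits into the
     P0 branch of S2 with (h, h') and the P1 branch of S2 with (k, k') and (S1, S5)
     (service replacement).
     Further upstream: (d1, d1') and (S, Sv2) (version change).]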
Summary
     • Setting:
       – scientific results computed using workflows
       – openness / data sharing has potential to accelerate science
       – but requires results validation and reproducibility


     • Problem: reproducibility is hard to achieve
       – workflow decay
       – evolution of data, workflow spec, dependencies, wf engine


     • Goal: support divergence analysis

     • Approach: PDIFF -- comparing provenance traces generated during
       the runs




Selected references
     Zhao, Jun, Jose Gomez-Perez, Khalid Belhajjame, Graham Klyne, et al. Why Workflows
        Break - Understanding and Combating Decay in Taverna Workflows. In Procs. E-science
        Conference. Chicago, 2012.

     Cohen-Boulakia S, Leser U. Search, adapt, and reuse: the future of scientific workflows.
     SIGMOD Rec. Sep 2011; 40(2):6–16, doi:10.1145/2034863.2034865

     Peng RD, Dominici F, Zeger SL. Reproducible Epidemiologic Research. American Journal of
     Epidemiology 2006; 163(9):783–789, doi:10.1093/aje/kwj093.

     Drummond C. Replicability is not Reproducibility: Nor is it Good Science. Procs. 4th
     workshop on Evaluation Methods for Machine Learning, in conjunction with ICML 2009,
     Montreal, Canada, 2009.

     Peng R. Reproducible Research in Computational Science. Science Dec 2011; 334(6060):
     1226–1227

     Schwab M, Karrenbach M, Claerbout J. Making Scientific Computations Reproducible.
     Computing in Science & Engineering 2000; 2(6):61–67

     P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow
     reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013. In press.

     Mesirov J. Accessible Reproducible Research. Science 2010; 327

Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
 
ReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for Health
 

Repro pdiff-talk (invited, Humboldt University, Berlin)

  • 1. Provenance and data differencing for workflow reproducibility analysis. Paolo Missier, School of Computing Science, Newcastle University, UK. Humboldt University, Berlin, March 4, 2013.
  • 2. Provenance Metadata. Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*). Provenance is a description of how things came to be, and how they came to be in the state they are in today (*). Why does it matter? • To establish quality, relevance, trust • To track information attribution through complex transformations • To describe one’s experiment to others, for understanding / reuse • To provide evidence in support of scientific claims • To enable post hoc process analysis for debugging, improvement, evolution. (*) Definitions proposed by the W3C Incubator Group on provenance: http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
  • 3. A colourful provenance graph. [Figure: provenance graph of a document's history, split into an editing phase (remote past) and a publishing phase (recent past). Agents: Alice, Bob (with specializations Bob-1 and Bob-2), Charlie and the w3c:consortium institution. Activities: reading, drafting, commenting, editing, update and publication. Entities: paper3, draft v1, draft comments, draft v2, pub guidelines v1 and v2, and WD1. Relations: used, wasGeneratedBy, wasDerivedFrom, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf and specializationOf.]
  • 4. Motivation: Reproducibility in e-science • Setting: Collaborative, Open Science – Increasing rate of data sharing in science • The stick: both journals and funders demand that data be uploaded – Multiple data journals, data repositories emerging • The carrot: data is given a DOI and is citable, scientists get credit • Thomson’s Data Citation Index • Dryad data repository for biosciences (*) • The DataBib repository of research data • NSF Data Preservation projects: DataONE • Best practices document: notebooks.dataone.org/bestpractices/ • ... and many others. (*) As of Jan 27, 2013, Dryad contains 2585 data packages and 7097 data files, associated with articles in 187 journals.
  • 5. General problems • Quality assurance – from non-malicious errors in method or data, all the way to fraud – ... leading to retractions in scientific publications • see e.g. http://retractionwatch.wordpress.com/ • Repeatability – If I replicate your experiment / repeat your process on the same data, will I get the same results? • Reproducibility, a more general notion: the ability of a third party who has access to the description of the original experiment and its results to reproduce those results, using a possibly different setting, with the goal of confirming or disputing the original experimenter’s claims.
  • 6. Specifically, in e-science... • Experimental method → scripts, programs, workflows • Publication = results + {program, workflow} + evidence of results • Repeatability, reproducibility – will I be able to run (my version of) your workflow on (my version of) your input and compare my results to yours? • Evidence of result: provenance of {program, workflow} execution • Side note: portability issues are out of scope – VMs often solve the problem, with some limitations • not when workflows depend on third-party services • only for limited-size data dependencies. Main issue: Workflow evolution and decay
  • 7. Mapping the reproducibility space. [Figure: map of the reproducibility space along two axes. Environmental variations: dependencies and workflow system unchanged (ED, wfs) or changed (ED ≠ ED′, wfs ≠ wfs′). Experimental variations: workflow and data unchanged (wf, d) or changed (wf ≠ wf′, d ≠ d′). Labelled regions: repeatability, results confirmation, method variation, data variation, data and method variation, reproducibility, dysfunctional workflow, non-functioning workflow (service updates, state changes, wf exception), divergence analysis, debugging, and the decay region.] Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results. Approach: compare provenance traces generated during the runs: PDIFF. P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013. In press.
  • 9. Decay • Workflows that have external dependencies are harder to maintain – they may become dysfunctional or break altogether. Zhao, Jun, Jose Gomez-Perez, Khalid Belhajjame, Graham Klyne, et al. “Why Workflows Break - Understanding and Combating Decay in Taverna Workflows.” In Procs. E-science Conference. Chicago, 2012.
  • 11. Workflows and provenance traces. Workflow (structure): directed graph W = (T, E). T: set of tasks (computational units). P: set of (input, output) ports associated with each task t ∈ T. E ⊂ T × T: graph edges representing data dependencies; ⟨t_i.p_A, t_j.p_B⟩ ∈ E means that data produced by t_i on port p_A ∈ P is routed to port p_B ∈ P of t_j. Execution trace: tr = exec(W, d, ED, wfms). A: activities. D: data items. R = {used, genBy} relations: used ⊂ A × D × P, genBy ⊂ D × A × P. Workflow inputs: tr.I = {d ∈ D | ∀a ∈ A, p ∈ P: (d, a, p) ∉ genBy}. Workflow outputs: tr.O = {d ∈ D | ∀a ∈ A, p ∈ P: (a, d, p) ∉ used}.
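The trace definitions above can be made concrete with a small sketch. It assumes a trace is stored as two sets of tuples shaped like the used and genBy relations; the class name `Trace`, its methods, and the sample activity, data and port names are hypothetical and are not taken from e-Science Central.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    # used  ⊂ A × D × P: (activity, data, port), the activity read the data on that port
    # genBy ⊂ D × A × P: (data, activity, port), the data was produced by the activity on that port
    used: set = field(default_factory=set)
    gen_by: set = field(default_factory=set)

    def data_items(self):
        """All data items mentioned anywhere in the trace."""
        return {d for (_, d, _) in self.used} | {d for (d, _, _) in self.gen_by}

    def inputs(self):
        """tr.I: data items that no activity generated."""
        generated = {d for (d, _, _) in self.gen_by}
        return self.data_items() - generated

    def outputs(self):
        """tr.O: data items that no activity used."""
        consumed = {d for (_, d, _) in self.used}
        return self.data_items() - consumed


# Hypothetical two-step run: S0 reads d1 and writes w, S1 reads w and writes df.
tr = Trace(
    used={("S0", "d1", "in0"), ("S1", "w", "in0")},
    gen_by={("w", "S0", "out0"), ("df", "S1", "out0")},
)
print(tr.inputs(), tr.outputs())
```

In this toy run, d1 is the only workflow input and df the only workflow output; w is neither, because it is both generated and used.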
  • 12. Workflow evolution. tr = exec(W, d, ED, wfms). Each of the elements in an execution may evolve (semi-)independently of the others: tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k ≤ t. Example, listing the elements (W, ED, d, wfms) that change at each time: • t1: W1, ED1, d1, wfms1: tr1 = exec1(W1,ED1,d1,wfms1) • t2: W2: tr2 = exec2(W2,ED1,d1,wfms1) • t3: ED3, d3: tr3 = exec3(W2,ED3,d3,wfms1) • t4: wfms4: tr4 = exec4(W2,ED3,d3,wfms4) • t5: ED5: tr5 = exec5(W2,ED5,d3,wfms4). Repeatability: • Can tr_t be computed again at some time t′ > t? • Requires saving ED_t, but this may be impractical (e.g. large DB state)
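The evolution example above can be replayed with a short sketch. It assumes the changes are kept as a simple log of (time, element, version) entries, with the values copied from the slide's table; the function names `configuration_at` and `describe` are hypothetical.

```python
# Change log copied from the slide's table: (time, element, new version).
changes = [
    (1, "W", 1), (1, "ED", 1), (1, "d", 1), (1, "wfms", 1),
    (2, "W", 2),
    (3, "ED", 3), (3, "d", 3),
    (4, "wfms", 4),
    (5, "ED", 5),
]


def configuration_at(t, changes):
    """Return the latest version of each element recorded at or before time t."""
    current = {}
    for time, element, version in sorted(changes):
        if time <= t:
            current[element] = version
    return current


def describe(t, changes):
    """Format the execution at time t in the slide's notation."""
    c = configuration_at(t, changes)
    return f"tr{t} = exec{t}(W{c['W']},ED{c['ED']},d{c['d']},wfms{c['wfms']})"


for t in range(1, 6):
    print(describe(t, changes))
```

Each element keeps its last recorded version until it changes again, which is why the version indices of an execution at time t are never greater than t.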
  • 13. Reproducibility. Can a new version tr_t′ of tr_t be computed at some later time t′ > t, after one or more of the elements has changed? tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k ≤ t; tr_t′ = exec_t′(W_i′, ED_j′, d_h′, wfms_k′). Example evolution of (W, ED, d, wfms): tr1 = exec1(W1,ED1,d1,wfms1); tr2 = exec2(W2,ED1,d1,wfms1); tr3 = exec3(W2,ED3,d3,wfms1); tr4 = exec4(W2,ED3,d3,wfms4); tr5 = exec5(W2,ED5,d3,wfms4). Potential issues: • W_i may not run with the new ED_j′ • W_i may not run with wfms_k′ • W_i′ may not run with d_h′ • ...
  • 14. Data divergence analysis using provenance • All work done with reference to the e-Science Central WFMS • Assumption: workflow WF_j (new version) runs to completion – thus it produces a new provenance trace – however, it may be dysfunctional relative to WF_i (the original) • Example: only the input data changes, d ≠ d′ and WF_j = WF_i: tr_t = exec_t(W, ED, d, wfms), tr_t′ = exec_t′(W, ED, d′, wfms). [Figure: example workflow with services S0, S1, S2, S3 and S4.] Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
  • 15. Reproducibility requires comparing datasets • Experimenters may validate results by deliberately altering the experimental settings (W_i′, d_j′) • The outcomes will not be identical, but – are they similar enough to the original to conclude that the experiment was successfully reproduced? ΔD(tr_t.O, tr_t′.O) • Data comparison is type- and format-dependent in general • Example: – workflow output: a classification model computed using model builders – two models may be different but statistically equivalent • e-Science Central accommodates user-defined data diff blocks – these are just Java-based workflow blocks. If ΔD(tr_t.O, tr_t′.O) > threshold: why are results diverging?
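The threshold test on ΔD can be sketched as follows. The slide states that e-Science Central diff blocks are Java-based; Python is used here only for illustration. The functions `numeric_diff` and `diverges`, the choice of mean absolute difference as the diff, and the sample values and threshold are all hypothetical.

```python
def numeric_diff(a, b):
    """Mean absolute difference between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def diverges(out_a, out_b, diff, threshold):
    """True when the type-specific diff of the two outputs exceeds the threshold."""
    return diff(out_a, out_b) > threshold


# Hypothetical outputs of an original run and two later runs.
original = [0.90, 0.80, 0.70]
close_rerun = [0.91, 0.79, 0.70]
far_rerun = [0.50, 0.80, 0.70]

print(diverges(original, close_rerun, numeric_diff, threshold=0.05))
print(diverges(original, far_rerun, numeric_diff, threshold=0.05))
```

Passing the diff function as a parameter mirrors the point that data comparison is type- and format-dependent: a classification model would need a different `diff` than a list of numbers.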
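  • 16. Provenance traces for two runs. [Figure: (i) Trace A and (ii) Trace B of the same workflow, with services S0 to S4 connected through ports P0 and P1 by used and genBy edges. Trace A: inputs d1 and d2, intermediate data d3, z, w, x and y, output df. Trace B: inputs d1′ and d2′, intermediate data d3, z, w′, x and y′, output df′.]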
  • 17. Delta graphs. A graph obtained as a result of a “diff” of the traces, which can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions. This is the simplest possible delta “graph”! [Figure: (i) Trace A and (ii) Trace B as on the previous slide, and (iii) the delta tree with nodes ⟨dF, dF′⟩, ⟨y, y′⟩, ⟨w, w′⟩ and ⟨d2, d2′⟩.]
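Under strong simplifying assumptions, the idea of walking two traces backwards from their outputs can be sketched as below. This is not the PDIFF algorithm itself: it assumes both runs used the same workflow structure, so corresponding data items can be paired by port position, and it compares values by plain equality. The `Gen` record, the `delta` function, the wiring of the five services and all data values are hypothetical, loosely modelled on the two traces in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gen:
    activity: str   # activity that generated the data item
    inputs: tuple   # names of the data items it used, in port order


def delta(gen_a, val_a, gen_b, val_b, out_a, out_b):
    """Collect pairs of corresponding data items whose values differ,
    walking backwards from the outputs towards the inputs."""
    pairs = []

    def walk(da, db):
        if val_a[da] == val_b[db]:
            return                      # identical data: nothing to explain upstream
        pairs.append((da, db))
        ga, gb = gen_a.get(da), gen_b.get(db)
        if ga is None or gb is None:
            return                      # reached a workflow input
        for ia, ib in zip(ga.inputs, gb.inputs):
            walk(ia, ib)

    walk(out_a, out_b)
    return pairs


# Hypothetical wiring: S0(d1) -> d3 and z, S1(d2) -> w, S3(d3) -> x, S2(z, w) -> y, S4(x, y) -> df.
gen_a = {"d3": Gen("S0", ("d1",)), "z": Gen("S0", ("d1",)), "w": Gen("S1", ("d2",)),
         "x": Gen("S3", ("d3",)), "y": Gen("S2", ("z", "w")), "df": Gen("S4", ("x", "y"))}
gen_b = {"d3": Gen("S0", ("d1",)), "z": Gen("S0", ("d1",)), "w'": Gen("S1", ("d2'",)),
         "x": Gen("S3", ("d3",)), "y'": Gen("S2", ("z", "w'")), "df'": Gen("S4", ("x", "y'"))}
val_a = {"d1": 10, "d2": 20, "d3": 11, "z": 12, "w": 21, "x": 13, "y": 33, "df": 46}
val_b = {"d1": 10, "d2'": 25, "d3": 11, "z": 12, "w'": 26, "x": 13, "y'": 38, "df'": 51}

result = delta(gen_a, val_a, gen_b, val_b, "df", "df'")
print(result)
```

With these made-up values the walk stops at x and z, which are identical in both runs, and follows the differing branch back to the changed input, giving the same chain of pairs as the delta tree in the figure.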
  • 18. More involved workflow differences between W_A and W_B: • S0 is followed by S0′ in W_A but not in W_B; • S3 is preceded by S3′ in W_B but not in W_A; • S2 in W_A is replaced by a new version, S2v2, in W_B; • S1 in W_A is replaced by S5 in W_B. [Figure: workflows W_A and W_B side by side.]
  • 19. The corresponding traces. tr_t = exec_t(W, ED, d, wfms), tr_t′ = exec_t′(W′, ED′, d, wfms). [Figure: (i) Trace A, through services S, S0, S0′, S1, S3, S2 and S4, with data items d0, d1, d2, w, h, k, y, z and output x; (ii) Trace B, through services Sv2, S0, S3′, S5, S3, S2v2 and S4, with data items d0, d1′, d2, w′, h′, k′, y′, z′ and output x′.]
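  • 20. Delta graph computed by PDIFF. [Figure: delta graph whose nodes pair corresponding elements of the two traces: ⟨x, x′⟩, with a P0 branch and a P1 branch of S4; ⟨y, y′⟩ and ⟨z, z′⟩; ⟨S2, S2v2⟩ (version change), with a P0 branch and a P1 branch of S2; ⟨h, h′⟩, ⟨k, k′⟩ and ⟨w, w′⟩; ⟨S1, S5⟩ (service replacement); nodes involving S0, S0′ and S3′; ⟨d1, d1′⟩; and ⟨S, Sv2⟩ (version change).]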
  • 21. Summary • Setting: – scientific results computed using workflows – openness / data sharing has potential to accelerate science – but requires results validation and reproducibility • Problem: reproducibility is hard to achieve – workflow decay – evolution of data, workflow spec, dependencies, wf engine • Goal: support divergence analysis • Approach: PDIFF, comparing provenance traces generated during the runs
  • 22. Selected references • Zhao, Jun, Jose Gomez-Perez, Khalid Belhajjame, Graham Klyne, et al. Why Workflows Break - Understanding and Combating Decay in Taverna Workflows. In Procs. E-science Conference. Chicago, 2012. • Cohen-Boulakia S, Leser U. Search, adapt, and reuse: the future of scientific workflows. SIGMOD Rec. Sep 2011; 40(2):6–16, doi:http://doi.acm.org/10.1145/2034863.2034865 • Peng RD, Dominici F, Zeger SL. Reproducible Epidemiologic Research. American Journal of Epidemiology 2006; 163(9):783–789, doi:10.1093/aje/kwj093. • Drummond C. Replicability is not Reproducibility: Nor is it Good Science. Procs. 4th Workshop on Evaluation Methods for Machine Learning, in conjunction with ICML 2009, Montreal, Canada, 2009. • Peng R. Reproducible Research in Computational Science. Science Dec 2011; 334(6060):1226–1227. • Schwab M, Karrenbach M, Claerbout J. Making Scientific Computations Reproducible. Computing in Science & Engineering 2000; 2(6):61–67. • P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013. In press. • Mesirov J. Accessible Reproducible Research. Science 2010; 327