Granular workflow provenance in Taverna




Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK

Symposium on Provenance in Scientific Workflows
Salt Lake City, Oct. 2008
Outline
• Collection values in [bioinformatics] workflows are important
• Granular provenance over collections: model and issues
• Measuring “provenance friendliness” of dataflows
• Increasing friendliness of existing dataflows
• Extending the Open Provenance Model graph to describe granular data derivations
• Provenance service architecture - brief description
Example (Taverna) dataflow


QTL -> genes -> KEGG pathways




                                IPAW'08 – Salt Lake City, Utah, June 2008
Collections example: from genes to SNPs
• See myexperiment.org: http://www.myexperiment.org/workflows/166

Input:
[ ENSG00000139618 , ENSG00000083093 ]

Workflow steps, in order:
  – gene -> genomic region
  – extend region
  – retrieve SNPs in the region
  – rearrange SNP details

Output:
[[<1,23554512,16,rs45585833>,
  <1,23554712,16,rs45594034>,
  ...
 ],
 [<1,31820153,13,ENSSNP10730823>,
  <1,31818497,13,ENSSNP10730820>,
  ...
 ]]
Computational model for collections

        Depth mismatch between declared / offered type:
        type(P4:X1) = s but type(a) = list(s)
        type(P4:X2) = type(c) = list(s)
        type(P4:X3) = s but type(b) = list(s)

        Execution at P4:

        Y = (map P4 <(a ⊗ b) , c>) // cross product

        Y = [ (P4 <a1,b1,c>) ... (P4 <an,bm,c>) ]
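
The execution expression above can be sketched in Python. This is only an illustration of the implicit iteration, not Taverna's implementation: `run_p4` is a hypothetical stand-in for the processor, and the flat result list mirrors the notation on the slide.

```python
from itertools import product

def run_p4(x1, x2, x3):
    """Hypothetical stand-in for processor P4; it just records its arguments."""
    return (x1, x2, x3)

def iterate_cross(proc, a, b, c):
    """Apply proc to every pair in the cross product of a and b.

    Following the execution expression, a and b are iterated over item
    by item, while c is passed whole to every invocation.
    """
    return [proc(ai, bj, c) for ai, bj in product(a, b)]

y = iterate_cross(run_p4, ["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"])
print(len(y))   # 6 invocations: 2 items of a times 3 items of b
print(y[0])     # ('a1', 'b1', ['c1', 'c2'])
```

The number of invocations grows as n times m, which is why a single depth mismatch on two ports can multiply the size of the provenance trace.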
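Collections and iterations

[Figure: the genes-to-SNPs dataflow annotated with processor signatures and sample values]

Processor signatures, top to bottom:
   l(s) → l(s)
   l(s) → l(s)
   s → s
   s → l(s)
   s → s

Sample values along the dataflow, top to bottom:
   [139618, 83093]
   [139618, 83093]
   [16,13]   [23520984, 31786617]
   [16,13]   [23560179, 31871809]
   Dot product:   <16, 23560179,..>   <13, 31871809,...>
   [ <1,23553692,16,rs152451>, ... ]   [ <1,31840948,13,rs169546>, ... ]

Individual input items highlighted: 139618, 83093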
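
For contrast with the cross product, the dot product shown on this slide pairs items by position. A minimal sketch, assuming both lists have the same length; `pair_up` is a hypothetical placeholder for the downstream processor:

```python
def pair_up(x, y):
    """Hypothetical stand-in for a processor; it only pairs its inputs."""
    return (x, y)

def iterate_dot(proc, xs, ys):
    """Apply proc to items of xs and ys paired by position (dot product)."""
    if len(xs) != len(ys):
        raise ValueError("dot product needs lists of equal length")
    return [proc(x, y) for x, y in zip(xs, ys)]

pairs = iterate_dot(pair_up, [16, 13], [23560179, 31871809])
print(pairs)  # [(16, 23560179), (13, 31871809)]
```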
Tracing granular lineage
• Provenance traces are most useful when they are
  granular
  – trace individual items in a collection
  – “which geneID is responsible for the presence of SNP
    rs169546 in the output?”


• Curse of black box processors:
  – M-M (many-many) and M-1 (many-one) processors
    destroy granularity




Granular lineage I: no loss of precision

[Figure: P0 (inputs X1, X2) outputs Y1:l(s) = [a1...ai...an] and Y2:l(s) = [b1...bi...bm].
 Y1 feeds P1 (X:s, Y:s), giving [a1^2...ai^2...an^2].
 Y2 feeds P2 (X:s, Y:s), giving [2b1...2bj...2bm].
 Both feed P3 (X1:s, X2:s) through a cross product, giving Y = [a1^2+2b1...ai^2+2bi...an^2+2bm].]

P1 ≡ λ X . X^2
P2 ≡ λ X . 2X
P3 ≡ λ X1 . λ X2 . X1 + X2

Let
  P0:Y1 = [a1...an],
  P0:Y2 = [b1...bm]
Then,
  P1:Y = [a1^2...an^2],
  P2:Y = [2b1...2bm]
  P3:Y = [a1^2+2b1...an^2+2bm]
And
  lineage(P3:Y[i], {P0}) = { P0:Y1[i], P0:Y2[j] }
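
A small Python sketch of why no precision is lost here. It assumes that each cross-product output is addressed by the index pair (i, j) in a nested list, whereas the slide writes the output as a flat list; the tuple encoding of lineage items is also only illustrative.

```python
def lineage_cross(n, m):
    """Map each output index (i, j) of P3 to the P0 items it depends on.

    P1 and P2 are one-to-one on items, so an item keeps its index through
    them; the cross product at P3 pairs index i of P0:Y1 with index j of
    P0:Y2.
    """
    return {(i, j): {("P0:Y1", i), ("P0:Y2", j)}
            for i in range(n) for j in range(m)}

a, b = [1, 2, 3], [10, 20]
# P1 squares each a, P2 doubles each b, P3 adds every pair.
y = [[ai ** 2 + 2 * bj for bj in b] for ai in a]
lin = lineage_cross(len(a), len(b))

print(y[2][1])              # 3**2 + 2*20 = 49
print(sorted(lin[(2, 1)]))  # [('P0:Y1', 2), ('P0:Y2', 1)]
```

Nothing about the values themselves is needed to answer the lineage query; the index bookkeeping alone is enough.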
Granular lineage II: loss of precision

[Figure: P0 (inputs X1, X2) outputs Y1 = [a1...ai...an] and Y2 = [b1...bi...bm].
 Y1 feeds P1 (X:s, Y:s), giving [a1^2...ai^2...an^2].
 Y2 feeds P2 (X:l(s), Y:s), giving c.
 Both feed P3 (X1:s, X2:s), giving Y = [a1^2+c...ai^2+c...an^2+c].]

P1 ≡ λ X . X^2
P2 ≡ λ X . min X
P3 ≡ λ X1 . λ X2 . X1 + X2

Let
  P0:Y1 = [a1...an],
  P0:Y2 = [b1...bm]
Then,
  P1:Y = [a1^2...an^2],
  P2:Y = c = min {b1...bm}
  P3:Y = [a1^2+c...an^2+c]
And
  lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }
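
The same dataflow in a few lines of Python, as a sketch under the assumption that P2 is an opaque many-to-one processor. Using `None` as the index to mean "the whole list" is a convention made up for this example.

```python
a, b = [1, 2, 3], [10, 20, 5]
c = min(b)                      # P2 is many-to-one: list in, single value out
y = [ai ** 2 + c for ai in a]   # P1 squares each item, P3 adds c

def lineage(i):
    """Lineage of P3:Y[i] when P2 is treated as a black box.

    The a-side keeps its index. The b-side can only name the whole list
    P0:Y2 (index None), because the trace does not record which items of
    b the processor actually used.
    """
    return {("P0:Y1", i), ("P0:Y2", None)}

print(y)  # [6, 9, 14]
```

Only b[2] determines the result in this run, but that fact lives inside `min` and is invisible to the provenance trace.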
III: recoverable loss of precision

[Figure: P0 (inputs X1, X2) outputs Y1 = [a1...ai...an] and Y2 = [b1...bi...bm].
 Y1 feeds P1 (X:s, Y:s), giving [a1^2...ai^2...an^2].
 Y2 feeds P2 (X:l(s), Y:l(s)), giving [c1...ci...cm].
 Both feed P3 (X1:s, X2:s), giving Y = [a1^2+c1...ai^2+ci...am^2+cm].]

P1 ≡ λ X . X^2
P2 ≡ λ X . f X     “f is index-preserving”
P3 ≡ λ X1 . λ X2 . X1 + X2

Let P0:Y1 = [a1...an], P0:Y2 = [b1...bm]
Then, P1:Y = [a1^2...an^2], P2:Y = c = [c1...cm]
      P3:Y = [a1^2+c1...am^2+cm]
And
  lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }       (f treated as a black box)
  lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2[i] }    (given that f is index-preserving)
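
How the annotation changes the answer can be sketched as a lookup that either forwards the index or gives up. The function names and the boolean flag are hypothetical; they stand in for whatever annotation mechanism the provenance service uses.

```python
def lineage_through(index, index_preserving):
    """Trace output item `index` of a list-to-list processor to its input.

    Returns the input index the item depends on, or None when only the
    whole input list can be named.
    """
    return index if index_preserving else None

def lineage_p3(i, f_index_preserving):
    """Lineage of P3:Y[i] with respect to P0 for the dataflow on this slide."""
    return {("P0:Y1", i),
            ("P0:Y2", lineage_through(i, f_index_preserving))}

coarse = lineage_p3(1, False)   # P2 treated as a black box
fine = lineage_p3(1, True)      # P2 annotated as index-preserving
print(("P0:Y2", None) in coarse)  # True
print(("P0:Y2", 1) in fine)       # True
```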
Multi-level nesting and lineage precision




Adding annotations to the original workflow

[Figure: the genes-to-SNPs dataflow with sample values; the two l(s) → l(s) processors carry the annotation “f is index-preserving”]

Processor signatures, top to bottom:
   l(s) → l(s)     “f is index-preserving”
   l(s) → l(s)     “f is index-preserving”
   s → s
   s → l(s)
   s → s

Sample values along the dataflow, top to bottom:
   geneIdList: [139618, 83093]
   [139618, 83093]
   [16,13]   [23520984, 31786617]
   [16,13]   [23560179, 31871809]
   CR:result[0,i]:  [ <1,23553692,16,rs152451>, ... ]
   CR:result[1,j]:  [ <1,31840948,13,rs169546>, ... ]

Lineage without the annotations:
   lineage(CR:result[0,i]) = { geneIdList }
   lineage(CR:result[1,j]) = { geneIdList }

Lineage with the annotations:
   lineage(CR:result[0,i]) = { geneIdList[0] }
   lineage(CR:result[1,j]) = { geneIdList[1] }
Granular lineage: recap
• Lineage query model accounts for granular traces
  over nested collections
• arbitrary nesting levels:
  – values are trees in general
  – lineage query identifies the correct sub-trees



• Lineage queries are efficient
  – recursion problem “compiled away” by query rewriting
  – (shameless claim - details omitted)


• But:
  – One single M-* processor can destroy granularity
  – in some cases annotations are a remedy

                                                           13
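The recap's point that values are trees and that a lineage query picks out sub-trees can be sketched as follows. This is only an illustration: the `subtree` helper and the shape of `result` are assumptions modelled on the CR:result example, not part of Taverna.

```python
from typing import Any, Sequence


def subtree(value: Any, path: Sequence[int]) -> Any:
    """Return the sub-tree of a nested list reached by following `path`."""
    node = value
    for index in path:
        node = node[index]
    return node


# Hypothetical nested value shaped like CR:result:
# one list of SNP records per input gene.
result = [
    ["<1,23553692,16,rs152451>"],
    ["<1,31840948,13,rs169546>"],
]

print(subtree(result, []))      # empty path: the whole value
print(subtree(result, [1]))     # the sub-list produced for the second gene
print(subtree(result, [1, 0]))  # a single leaf
```

A shorter path names a larger sub-tree, which is why losing path components along a trace means losing granularity.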
Towards provenance-friendly workflows
1. Define metrics for workflow provenance precision
   – how well is granularity preserved over a lineage trace?
   – what is the impact of M-* processors?
   – use to prioritize remedial actions

2. Make workflows more provenance-friendly:
   – Add knowledge (static):
      • “lightweight annotations” [MBZ+08] -- see IPAW08
   – Add knowledge (dynamic):
      • provenance-active workflow processors
   – Redesign processors / workflow
      • general guidelines, provenance-friendly patterns


[MBZ+08] P. Missier, K. Belhajjame, J. Zhao, C. Goble. Data lineage model for Taverna workflows with
lightweight annotation requirements. Proc. International Provenance and Annotation Workshop (IPAW 2008).

                                                                                                    14
Lineage precision: example

[Figure: a dataflow over the values a = [a1, a2], b = [b1, b2], c = [c1, c2, c3],
 d = [d1, d2], e = [e1, e2] and f]

lineage(P4:Y1[1.2.2], {P0, P2, P3}) = { P0:Y[1] = a1, P2:X = c, P3:X = e }

precision = (1 + .5 + .5) / 3 = 2/3
                                                                         15
Precision relative to a sub-graph
• Refining the previous idea:
  – precision relative to a set O of output variables and a set I of input variables
     • because not all variables are equally interesting...
     • weights W_I, W_O account for the relative importance of variables

$$\mathrm{prec}(I, W_I, O, W_O) = \sum_{j=1}^{|O|} W_O(O_j) \sum_{X_i(p_i) \in \mathrm{lin}(O_j, I)} W_I(X_i) \cdot \frac{\mathrm{len}(p_i)}{\mathrm{nl}(X_i)}$$

$$\sum_{w_i \in W_I} w_i = \sum_{w_j \in W_O} w_j = 1$$

[Figure: a dataflow sub-graph with inputs I1, I2 and outputs O1, O2, O3]
                                                                                          16
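A minimal sketch of the weighted precision metric, assuming the formula is a sum over outputs of a weighted sum over the lineage bindings Xi(pi), each contributing len(pi) / nl(Xi). The data layout, the variable names and all numbers are hypothetical, and every input is assumed to have nesting level nl ≥ 1.

```python
from typing import Dict, List, Tuple

# One lineage binding: (input variable name, path into its value).
Binding = Tuple[str, Tuple[int, ...]]


def precision(
    lineage: Dict[str, List[Binding]],
    w_in: Dict[str, float],
    w_out: Dict[str, float],
    nesting: Dict[str, int],
) -> float:
    """Weighted precision of a set of lineage answers.

    lineage[o] lists the bindings returned for output variable o;
    nesting[x] is the nesting level nl(x) of input variable x (assumed >= 1).
    """
    total = 0.0
    for out_var, bindings in lineage.items():
        inner = sum(w_in[x] * len(path) / nesting[x] for x, path in bindings)
        total += w_out[out_var] * inner
    return total


# Hypothetical sub-graph: two inputs of nesting level 2, two outputs.
w_in = {"I1": 0.5, "I2": 0.5}
w_out = {"O1": 0.5, "O2": 0.5}
nesting = {"I1": 2, "I2": 2}
lineage = {
    "O1": [("I1", (0, 1)), ("I2", (0,))],  # fully granular on I1, half on I2
    "O2": [("I1", ()), ("I2", (1, 0))],    # whole value of I1, a leaf of I2
}

score = precision(lineage, w_in, w_out, nesting)
print(score)
```

An empty path means the query could only return the whole input value, so that binding contributes nothing to the score.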
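Impact of M-* processors on precision
Count the number of variables in O that can be reached from P
• weighted sum

$$\mathrm{impact}(P, O) = \sum_{o \in O} W(o) \cdot \mathrm{reach}(P, o)$$

$$\mathrm{reach}(P, v) = \begin{cases} 1 & \text{if } v \text{ is reachable from } P \\ 0 & \text{otherwise} \end{cases}$$

[Figure: processor P in a dataflow with inputs I1, I2 and outputs O1, O2, O3]
                                                           17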
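The impact measure can be sketched with a plain breadth-first reachability test. The graph, the second processor Q and the weights below are hypothetical; only the definitions of impact and reach come from the slide.

```python
from collections import deque
from typing import Dict, List, Set


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every node reachable from `start` (including `start` itself)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen


def impact(graph: Dict[str, List[str]], processor: str,
           weights: Dict[str, float]) -> float:
    """Sum the weights W(o) of the output variables reachable from `processor`."""
    reached = reachable(graph, processor)
    return sum(w for o, w in weights.items() if o in reached)


# Hypothetical dataflow: P feeds O1 and O2, Q feeds O3.
graph = {
    "I1": ["P"],
    "I2": ["P", "Q"],
    "P": ["O1", "O2"],
    "Q": ["O3"],
}
weights = {"O1": 0.5, "O2": 0.25, "O3": 0.25}

print(impact(graph, "P", weights))
print(impact(graph, "Q", weights))
```

Under these assumed weights P affects more of the interesting outputs than Q, so it would be the first candidate for annotation or refactoring.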
Improving provenance precision
• Impact used to prioritize user actions on processors
• Precision used to assess improvement


• add index-preserving annotations
  ✓illustrated earlier
• refactor M-* processors
• make processors provenance-active



                                                    18
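Refactoring M-* → 1-1
[Figure: the workflow with input [139618, 83093] and the processor signature labels,
 top to bottom: l(s) → l(s), l(s) → l(s), s → s, s → s, s → l(s), s → s.
 One step is labelled “Dot product”.
 List values on the links: [139618, 83093]; [16,13] with [23520984, 31786617];
 [16,13] with [23560179, 31871809].]

Per-element values shown alongside the lists:
  139618                           83093
  <16, 23520984>                   <13, 31786617>
  <16, 23560179>                   <13, 31871809>
  [ <1,23553692,16,rs152451>,      [ <1,31840948,13,rs169546>,
    ... ]                            ... ]
                                                                                            19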
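The effect of iterating a 1-1 processor under a dot product can be sketched as below, assuming the dot product pairs the elements of its input lists by position, as the per-element values on the slide suggest. `dot_iterate` and `make_region` are hypothetical stand-ins, not Taverna code.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")
R = TypeVar("R")


def dot_iterate(f: Callable[[A, B], R], xs: Sequence[A],
                ys: Sequence[B]) -> List[Tuple[R, int]]:
    """Apply f to positionally paired elements and keep each pair's index.

    The recorded index is the granular lineage: output i depends only on
    xs[i] and ys[i], never on the whole lists.
    """
    if len(xs) != len(ys):
        raise ValueError("dot product needs lists of equal length")
    return [(f(x, y), i) for i, (x, y) in enumerate(zip(xs, ys))]


def make_region(chromosome: int, position: int) -> str:
    # Hypothetical 1-1 processor standing in for an s → s step.
    return f"<{chromosome}, {position}>"


chromosomes = [16, 13]
positions = [23520984, 31786617]

pairs = dot_iterate(make_region, chromosomes, positions)
for value, index in pairs:
    print(index, value)
```

Because each invocation sees one element per input, the index of every output is known without any annotation, which is the gain expected from replacing an M-* processor with an iterated 1-1 processor.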
Provenance-active processors
– Passive processors do not contribute explicit provenance info
– Provenance-active processors actively feed metadata to the lineage service

                       Case 1                         Case 2
Processor P            X: l(s) = [a1, a2, a3]         X: l(s) = [a1, a2, a3]
                       Y: s = b                       Y: l(s) = [b1, b2]
Static annotations     aggregation f()                P is index-preserving
Dynamic annotations    b = X[i]                       sorting: Y = Π(X)
                       b = f(X[1]...X[k])
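The sorting case, Y = Π(X), can be sketched as a processor that returns the permutation it applied along with its result. This is an assumed illustration of a dynamic annotation; the function name and the way the permutation is reported are hypothetical.

```python
from typing import List, Sequence, Tuple


def sort_with_permutation(xs: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Sort xs and also return the permutation that was applied.

    perm[j] is the index in xs of the element that ended up at position j,
    so the lineage of Y[j] is exactly X[perm[j]].
    """
    perm = sorted(range(len(xs)), key=lambda i: xs[i])
    return [xs[i] for i in perm], perm


values, perm = sort_with_permutation(["b", "c", "a"])
print(values)
print(perm)
```

A passive sorting processor would look like an M-* step to the lineage service; reporting the permutation at run time restores element-level lineage.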
Open Provenance Model
• A graph notation to represent process provenance
  – independent of the provenance producers
  – suitable for exchanging provenance across different workflow
    systems
• State: draft 1.01 (July 2008)




                                                             21
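Mapping to OPM - granularity issue

[Figure, left: dataflow in which processor P0 has inputs X1 = a, X2 = b and outputs
 Y1 = c, Y2 = d; c feeds P1 (X:s, Y:s, output e) and d feeds P2 (X:s, Y:s, output f).
 Figure, right: the corresponding OPM graph, with used edges between P0 and a, b,
 wasGeneratedBy (wgb) edges between c, d and P0, used edges between P1 and c and
 between P2 and d, a wasDerivedFrom edge between b and d, and, at the granular level,
 a wasDerivedFrom edge between b[p] and d[p’].]

How can this granular dependency be described for arbitrary paths p?
It currently cannot be expressed in OPM.

                                                                                         22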
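The granularity issue can be made concrete by writing the OPM graph down as plain edges. The sketch assumes a reading of the figure in which d is derived from b, and the tuple encoding is a hypothetical one chosen for illustration, not an OPM serialization.

```python
from typing import List, Tuple

# (effect, relation, cause): OPM edges point from effect to cause.
Edge = Tuple[str, str, str]

opm_edges: List[Edge] = [
    ("P0", "used", "a"),
    ("P0", "used", "b"),
    ("c", "wasGeneratedBy", "P0"),
    ("d", "wasGeneratedBy", "P0"),
    ("P1", "used", "c"),
    ("P2", "used", "d"),
    ("d", "wasDerivedFrom", "b"),  # assumed direction, see lead-in
]


def causes(node: str, relation: str) -> List[str]:
    """Return the causes linked to `node` by edges of the given relation."""
    return sorted(c for e, r, c in opm_edges if e == node and r == relation)


print(causes("d", "wasDerivedFrom"))
print(causes("P0", "used"))
```

Nodes here are whole artifacts: nothing in the encoding can say that one element d[p’] was derived from one element b[p] without adding a separate node and edge for every pair of paths.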
Path mapping rules

[Figure: OPM graph with used edges between P1 and a, b, wasGeneratedBy (wgb) edges
 between c, d and P1, used edges between P2 and c and between P3 and d, a
 wasDerivedFrom edge between b and d, and an “actual lineage” edge between b[p] and d[p’].]

Static graph structure is sufficient to provide this (in Taverna).
But the actual lineage is only known at query time
(extensional enumeration is not an option).

Observation:
• only need to consider individual processor transformations
• exploit local processor rules for propagating granular lineage
Hint:
• granularity is only determined by the depth of the path
• at query time, the Taverna lineage query algorithm encodes a path
  mapping rule to compute p’ given p
                                                                                  23
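One possible path mapping rule is sketched below, for the simplest case only: a processor with a single input that is iterated implicitly when the actual value is nested more deeply than the input port declares. The rule and the name `map_path` are assumptions for illustration, not the Taverna lineage query algorithm itself.

```python
from typing import Tuple

Path = Tuple[int, ...]


def map_path(output_path: Path, actual_depth: int, declared_depth: int) -> Path:
    """Map a path p on a processor's output to a path p' on its input.

    Assumption: the processor runs once per input element found at depth
    (actual_depth - declared_depth), so only that many leading indices of
    the output path identify the input element that was consumed.
    """
    iteration_depth = max(actual_depth - declared_depth, 0)
    return output_path[:iteration_depth]


# An s → l(s) processor fed a list (actual depth 1, declared depth 0):
print(map_path((1, 3), actual_depth=1, declared_depth=0))

# An l(s) → l(s) processor fed a list (no iteration): granularity is lost.
print(map_path((1,), actual_depth=1, declared_depth=1))
```

The second call returns the empty path, that is, the whole input list, which matches the coarse answer lineage(CR:result[1,j]) = { geneIdList } seen earlier for an un-annotated list processor.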
Architecture: provenance-active processors

[Figure: the Taverna workflow engine exchanges inputs and outputs with external services
 and sends provenance events to the provenance manager, which uses the provenance
 information repository. A p-active API is shown next to the external services.
 The provenance manager exposes the lineage query interface: lin( P:Y, , Psel, E(D)).]

Provenance events:
1. Common content:
   – processor execution details
   – binding of input/output variables to values
   – completion status
2. Optional content for provenance-active processors:
   – explicit output → input dependency assertions:
     let I, O be the sets of input and output variables, respectively
     depends(Y, X[p], <depType>), X ∈ I, Y ∈ O
                                                                                        24
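The two kinds of event content can be sketched as a small data structure. Class names, field names and the dependency type "selection" are all hypothetical; the slide only fixes the shape depends(Y, X[p], <depType>).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Depends:
    """One assertion depends(Y, X[p], depType), with X an input and Y an output."""
    output_var: str
    input_var: str
    input_path: Tuple[int, ...]
    dep_type: str


@dataclass
class ProvenanceEvent:
    # Common content, sent for every processor execution.
    processor: str
    bindings: Dict[str, Any]
    status: str
    # Optional content, filled in only by provenance-active processors.
    depends: List[Depends] = field(default_factory=list)


# A passive processor reports only the common content.
passive = ProvenanceEvent("P", {"X": ["a1", "a2", "a3"], "Y": "a2"}, "completed")

# A provenance-active processor also asserts which input element it used.
active = ProvenanceEvent(
    "P",
    {"X": ["a1", "a2", "a3"], "Y": "a2"},
    "completed",
    depends=[Depends("Y", "X", (1,), "selection")],
)

print(passive.depends)
print(active.depends[0])
```

Keeping the assertions optional lets passive and provenance-active processors share one event format.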
Ongoing work
• Experimental evaluation:
  – to what extent is granularity a real practical problem?
  – Quantify provenance friendliness by analysing a large
    collection of workflows from myExperiment
  – Quantify available improvements (e.g. by refactoring)


• Compare collection management in Taverna with
  other workflow models
  – can we successfully exchange provenance graphs?

• Integration of the provenance service with the new
  version of Taverna
  – to be released before end of year



                                                              25

Ensemble of Heterogeneous Flexible Neural Tree for the approximation and feat...
 

Mehr von Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...Paolo Missier
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...Paolo Missier
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...Paolo Missier
 

Mehr von Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Kürzlich hochgeladen

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 

Kürzlich hochgeladen (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008

  • 1. Granular workflow provenance in Taverna. Paolo Missier, Information Management Group, School of Computer Science, University of Manchester, UK. Symposium on Provenance in Scientific Workflows, Salt Lake City, Oct. 2008.
  • 2. Outline
      • Collection values in [bioinformatics] workflows are important
      • Granular provenance over collections: model and issues
      • Measuring “provenance friendliness” of dataflows
      • Increasing friendliness of existing dataflows
      • Extending the Open Provenance Model graph to describe granular data derivations
      • Provenance service architecture - brief description
  • 3. Example (Taverna) dataflow: QTL -> genes -> KEGG pathways. IPAW'08 – Salt Lake City, Utah, June 2008
  • 4. Example (Taverna) dataflow
  • 5–10. Collections example: from genes to SNPs
      • See myexperiment.org: http://www.myexperiment.org/workflows/166
      • Input: [ ENSG00000139618 , ENSG00000083093 ]
      • Steps: gene -> genomic region; extend region; retrieve SNPs in the region; rearrange SNP details
      • Output: [[<1,23554512,16,rs45585833>, <1,23554712,16,rs45594034>, ... ], [<1,31820153,13,ENSSNP10730823>, <1,31818497,13,ENSSNP10730820>, ... ]]
  • 11. Computational model for collections
      • Depth mismatch between declared / offered type:
          – type(P4:X1) = s but type(a) = list(s)
          – type(P4:X2) = type(c) = list(s)
          – type(P4:X3) = s but type(c) = list(s)
      • Execution at P4: Y = (map P1 <(a ⊗ b) , c>) // cross product
      • Y = [ (P1 <a1,b1,c>) ... (P1 <an,bm,c>) ]
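
The execution rule on this slide can be sketched in Python. This is a minimal illustration, not Taverna's implementation: `depth`, `run_with_cross_product` and the processor `p4` are hypothetical names, ports are assumed to declare depth 0 (scalar) or 1 (list), and only a one-level mismatch is handled. Following the execution line, the first two ports are treated as scalar ports and the third as a list port.

```python
from itertools import product


def depth(value):
    """Nesting depth: 0 for a scalar, 1 for a flat list, 2 for a list of lists."""
    if not isinstance(value, list):
        return 0
    return 1 + max((depth(item) for item in value), default=0)


def run_with_cross_product(processor, declared_depths, args):
    """Invoke `processor` once per combination of the arguments that are too deep."""
    # Positions whose offered value is deeper than the declared port type.
    iterated = [i for i, (declared, arg) in enumerate(zip(declared_depths, args))
                if depth(arg) > declared]
    if not iterated:
        return processor(*args)
    results = []
    for combo in product(*(args[i] for i in iterated)):
        call_args = list(args)
        for position, item in zip(iterated, combo):
            call_args[position] = item
        results.append(processor(*call_args))
    return results


# Hypothetical three-port processor: two scalar ports and one list port.
def p4(x1, x2, x3):
    return (x1, x2, tuple(x3))


a, b, c = ["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"]
y = run_with_cross_product(p4, [0, 0, 1], [a, b, c])
print(len(y))  # one call per (a_i, b_j) pair; c is passed whole to every call
```

Because `c` already matches its declared depth, it is never iterated over, which mirrors the `<(a ⊗ b) , c>` form of the rule.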
  • 12–23. Collections and iterations
      • Processor signatures, in order: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s
      • Values along the dataflow, in order: [139618, 83093]; [139618, 83093]; [16,13] [23520984, 31786617]; [16,13] [23560179, 31871809]
      • Dot product: <16, 23560179,..> <13, 31871809,...>
      • Results: [ <1,23553692,16,rs152451>, ... ] [ <1,31840948,13,rs169546>, ... ]
      • Individual gene IDs shown on the final build: 139618, 83093
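
The dot-product step can be illustrated with a short Python sketch. The helper `run_with_dot_product` and the stand-in processor `snps_in_region` are hypothetical; the real processor takes further fields, which the slide elides, so only the chromosome and the end coordinate are paired here.

```python
def run_with_dot_product(processor, *lists):
    """Pair the input lists position by position and call `processor` per pair."""
    if len({len(values) for values in lists}) != 1:
        raise ValueError("dot product needs lists of equal length")
    return [processor(*items) for items in zip(*lists)]


# Hypothetical stand-in for the s -> l(s) processor that retrieves SNPs.
def snps_in_region(chromosome, end):
    return [f"snp-on-chr{chromosome}-before-{end}"]


chromosomes = [16, 13]
region_ends = [23560179, 31871809]

# Two invocations, one per position; each returns a list, so the result is l(l(s)).
result = run_with_dot_product(snps_in_region, chromosomes, region_ends)
```

Unlike the cross product, the dot product yields one invocation per position rather than one per combination, which is why the two sub-lists of SNPs line up with the two input regions.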
  • 24. Tracing granular lineage
      • Provenance traces are most useful when they are granular
          – trace individual items in a collection
          – “which geneID is responsible for the presence of SNP rs169546 in the output?”
      • Curse of black box processors:
          – M-M (many-many) and M-1 (many-one) processors destroy granularity
  • 25–26. Granular lineage I: no loss of precision
      • Dataflow: P0 emits Y1: l(s) and Y2: l(s); P1 (X: s, Y: s) reads Y1 and P2 (X: s, Y: s) reads Y2; P3 reads X1: s from P1 and X2: s from P2, combined by cross product.
      • Processors: $P_1 \equiv \lambda X.\, X^2$, $P_2 \equiv \lambda X.\, 2X$, $P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$
      • Let P0:Y1 = $[a_1 \ldots a_n]$, P0:Y2 = $[b_1 \ldots b_m]$
      • Then P1:Y = $[a_1^2 \ldots a_n^2]$, P2:Y = $[2b_1 \ldots 2b_m]$, P3:Y = $[a_1^2 + 2b_1 \ldots a_n^2 + 2b_m]$
      • And lineage(P3:Y[i], {P0}) = { P0:Y1[i], P0:Y2[j] }
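
A small Python sketch of this case, under stated assumptions: each value carries a set of references to the P0 output elements it was derived from, and the helper names `source`, `map_one` and `cross` are invented for illustration. Because P1 and P2 are element-wise, the references pass through unchanged, and the cross product at P3 merges exactly one reference from each side.

```python
from itertools import product


def source(port, values):
    """Tag each element of a P0 output with its own (port, index) reference."""
    return [(value, {(port, index)}) for index, value in enumerate(values)]


def map_one(fn, items):
    """Element-wise s -> s processor: lineage passes through unchanged."""
    return [(fn(value), lineage) for value, lineage in items]


def cross(fn, xs, ys):
    """Cross-product iteration: every (x, y) pair, with the two lineages merged."""
    return [(fn(x, y), lx | ly) for (x, lx), (y, ly) in product(xs, ys)]


y1 = source("P0:Y1", [1, 2, 3])   # a_1..a_n, illustrative values
y2 = source("P0:Y2", [10, 20])    # b_1..b_m, illustrative values

p1 = map_one(lambda x: x * x, y1)         # P1: X^2
p2 = map_one(lambda x: 2 * x, y2)         # P2: 2X
p3 = cross(lambda u, v: u + v, p1, p2)    # P3: X1 + X2

value, lineage = p3[3]   # fourth pair: second a with second b
```

Here `lineage` contains one element of `P0:Y1` and one element of `P0:Y2`, so no precision is lost on the way back to P0.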
  • 27–28. Granular lineage II: loss of precision
      • Dataflow: P0 emits Y1 and Y2; P1 (X: s, Y: s) reads Y1; P2 (X: l(s), Y: s) reads the whole list Y2 and emits a single value c; P3 reads X1: s from P1 and X2: s from P2.
      • Processors: $P_1 \equiv \lambda X.\, X^2$, $P_2 \equiv \lambda X.\, \min X$, $P_3 \equiv \lambda X_1.\, \lambda X_2.\, X_1 + X_2$
      • Let P0:Y1 = $[a_1 \ldots a_n]$, P0:Y2 = $[b_1 \ldots b_m]$
      • Then P1:Y = $[a_1^2 \ldots a_n^2]$, P2:Y = $c = \min \{b_1 \ldots b_m\}$, P3:Y = $[a_1^2 + c \ldots a_n^2 + c]$
      • And lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }
  • 29. Granular lineage III: recoverable loss of precision
    Processors:
      P0: inputs X1, X2; outputs Y1, Y2
      P1 ≡ λX. X² (X: s → Y: s)
      P2 ≡ λX. f X (X: l(s) → Y: l(s)), annotated “f is index-preserving”
      P3 ≡ λX1. λX2. X1 + X2 (X1: s, X2: s → Y)
    Let P0:Y1 = [a1, ..., an] and P0:Y2 = [b1, ..., bm].
    Then P1:Y = [a1², ..., an²], P2:Y = c = [c1, ..., cm], and P3:Y = [a1² + c1, ..., am² + cm].
    Without the annotation: lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }.
    With the annotation: lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2[i] }.
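How an “index-preserving” annotation recovers granularity can be sketched as follows. The function `run_list_processor` and its `index_preserving` flag are hypothetical names chosen for illustration, not part of Taverna. A list-to-list processor is treated as a black box: without the annotation every output element is attributed to the whole input list, and with it output element i is attributed to input element i (0-based here).

```python
from typing import Callable, List, Optional, Set, Tuple

Source = Tuple[str, Optional[int]]  # (variable, index); None = whole list


def run_list_processor(
    f: Callable[[List[float]], List[float]],
    var: str,
    xs: List[float],
    index_preserving: bool = False,
) -> List[Tuple[float, Set[Source]]]:
    """Run a black-box l(s) -> l(s) processor and attach lineage to each output."""
    ys = f(xs)
    if index_preserving:
        # The annotation asserts that ys[i] was computed from xs[i] only.
        if len(ys) != len(xs):
            raise ValueError("an index-preserving processor must keep the list length")
        return [(y, {(var, i)}) for i, y in enumerate(ys)]
    # No annotation: the only safe answer is the whole input list.
    return [(y, {(var, None)}) for y in ys]


def shift(xs: List[float]) -> List[float]:
    """Stand-in for the opaque function f on the slide."""
    return [x + 0.5 for x in xs]


coarse = run_list_processor(shift, "P0:Y2", [1.0, 2.0, 3.0])
fine = run_list_processor(shift, "P0:Y2", [1.0, 2.0, 3.0], index_preserving=True)
```

The annotation is a claim supplied by the workflow designer; the sketch only checks the list length and otherwise trusts it.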
  • 33. Multi-level nesting and lineage precision
  • 34. Adding annotations to the original workflow
    Input geneIdList: [139618, 83093]
    Processor signatures, in workflow order: l(s) → l(s), l(s) → l(s), s → s, s → l(s), s → s. The two l(s) → l(s) processors are annotated “f is index-preserving”.
    Intermediate values, in workflow order: [139618, 83093]; [16, 13] and [23520984, 31786617]; [16, 13] and [23560179, 31871809].
    Output CR:result: [ [ <1,23553692,16,rs152451>, ... ], [ <1,31840948,13,rs169546>, ... ] ], with elements addressed as CR:result[0,i] and CR:result[1,j].
    Without annotations:
      lineage(CR:result[0,i]) = { geneIdList }
      lineage(CR:result[1,j]) = { geneIdList }
    With annotations:
      lineage(CR:result[0,i]) = { geneIdList[0] }
      lineage(CR:result[1,j]) = { geneIdList[1] }
  • 40. Granular lineage: recap
    • The lineage query model accounts for granular traces over nested collections
    • Arbitrary nesting levels:
      – values are trees in general
      – a lineage query identifies the correct sub-trees
    • Lineage queries are efficient
      – the recursion problem is “compiled away” by query rewriting
      – (shameless claim; details omitted)
    • But:
      – a single M-* processor can destroy granularity
      – in some cases annotations are a remedy
  • 42. Towards provenance-friendly workflows
    1. Define metrics for workflow provenance precision
      – how well is granularity preserved over a lineage trace?
      – what is the impact of M-* processors?
      – use the metrics to prioritize remedial actions
    2. Make workflows more provenance friendly:
      – Add knowledge (static): “lightweight annotations” [MBZ+08] (see IPAW08)
      – Add knowledge (dynamic): provenance-active workflow processors
      – Redesign processors / workflow: general guidelines, provenance-friendly patterns
    [MBZ+08] Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble. Data lineage model for Taverna workflows with lightweight annotation requirements. Procs. International Provenance and Annotation Workshop (IPAW 2008).
  • 44. Lineage precision: example
    Values in the example workflow: a = [a1, a2], b = [b1, b2], c = [c1, c2, c3], d = [d1, d2], e = [e1, e2], f
    lineage(P4:Y1[1.2.2], {P0, P2, P3}) = { P0:Y[1] = a1, P2:X = c, P3:X = e }
    precision = (1 + 0.5 + 0.5) / 3 = 2/3
  • 51. Precision relative to a sub-graph
    • Refining the previous idea:
      – precision is relative to a set O of output variables and a set I of input variables
        • because not all variables are equally interesting...
        • weights WI, WO account for the relative importance of variables
    $$\mathrm{prec}(I, W_I, O, W_O) = \sum_{j=1}^{|O|} W_O(O_j) \sum_{X_i(p_i) \in \mathrm{lin}(O_j, I)} W_I(X_i) \cdot \frac{\mathrm{len}(p_i)}{\mathrm{nl}(X_i)}$$
    $$\sum_{w_i \in W_I} w_i = \sum_{w_j \in W_O} w_j = 1$$
    [Figure: sub-graph with input variables I1, I2 and output variables O1, O2, O3]
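The weighted precision formula can be evaluated with a short function. The sketch below is one possible reading, not the authors' implementation: the slide does not define len(p) or nl(X), so the code simply takes both numbers as given for each lineage entry (a plausible reading is "number of index components in the path" and "nesting level of the variable", but that is an assumption). The function name `precision` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# One lineage entry for an output variable: (input variable, len(p), nl(X)).
LineageEntry = Tuple[str, int, int]


def precision(
    lineage: Dict[str, List[LineageEntry]],
    w_out: Dict[str, float],
    w_in: Dict[str, float],
) -> float:
    """Weighted precision: sum over outputs of W_O * sum over lineage of W_I * len/nl."""
    for weights in (w_out, w_in):
        if abs(sum(weights.values()) - 1.0) > 1e-9:
            raise ValueError("each weight set must sum to 1")
    total = 0.0
    for out_var, entries in lineage.items():
        inner = sum(w_in[x] * path_len / nesting for x, path_len, nesting in entries)
        total += w_out[out_var] * inner
    return total


# Made-up numbers: one output whose lineage reaches I1 at full depth
# and I2 at half of its nesting depth.
example = precision(
    lineage={"O1": [("I1", 1, 1), ("I2", 1, 2)]},
    w_out={"O1": 1.0},
    w_in={"I1": 0.5, "I2": 0.5},
)
```

With equal input weights the result is the mean of the per-variable ratios, which matches the shape of the earlier worked example (three ratios averaged).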
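  • 53. Impact of M-* processors on precision
    Count the number of variables in O that can be reached from P, as a weighted sum:
    $$\mathrm{impact}(P, O) = \sum_{o \in O} W(o) \cdot \mathrm{reach}(P, o)$$
    $$\mathrm{reach}(P, v) = \begin{cases} 1 & \text{if } v \text{ is reachable from } P \\ 0 & \text{otherwise} \end{cases}$$
    [Figure: sub-graph with input variables I1, I2, processor P, and output variables O1, O2, O3]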
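The impact measure reduces to a graph reachability check followed by a weighted sum. A minimal sketch, assuming the dataflow is available as an adjacency list from each node to its downstream nodes; the graph, the node names and the weights below are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def reachable_from(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Breadth-first search over downstream edges, returning every node reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact(graph: Dict[str, List[str]], processor: str, w_out: Dict[str, float]) -> float:
    """Sum the weights of the output variables reachable from the processor."""
    reached = reachable_from(graph, processor)
    return sum(w for o, w in w_out.items() if o in reached)


# Hypothetical dataflow: P feeds O1 and O2, Q feeds O3.
dataflow = {"I1": ["P"], "I2": ["Q"], "P": ["O1", "O2"], "Q": ["O3"]}
weights = {"O1": 0.5, "O2": 0.25, "O3": 0.25}

impact_p = impact(dataflow, "P", weights)
impact_q = impact(dataflow, "Q", weights)
```

Ranking M-* processors by this value gives a simple ordering for deciding which ones to annotate or refactor first.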
  • 54. Improving provenance precision
    • Impact is used to prioritize user actions on processors
    • Precision is used to assess improvement
    • add index-preserving annotations (✓ illustrated earlier)
    • refactor M-* processors
    • make processors provenance-active
  • 55. Refactoring M-* → 1-1
    Processor signatures before refactoring, in workflow order: l(s) → l(s), l(s) → l(s), s → s, s → l(s), s → s
    After refactoring, the two l(s) → l(s) processors become s → s processors whose inputs are combined by dot product.
    Original list-valued flow: [139618, 83093] → [16, 13], [23520984, 31786617] → [16, 13], [23560179, 31871809]
    Refactored element-wise flow:
      139618 → <16, 23520984> → <16, 23560179> → [ <1,23553692,16,rs152451>, ... ]
      83093 → <13, 31786617> → <13, 31871809> → [ <1,31840948,13,rs169546>, ... ]
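The effect of this refactoring can be sketched in Python. The lookup table stands in for a remote service and reuses the values shown on the slides; reading the pairs as chromosome and position is an assumption, and the function names `regions_for_list` and `region_for_gene` are hypothetical.

```python
from typing import Dict, List, Tuple

# Values taken from the slides; the table replaces a real service call.
REGION: Dict[int, Tuple[int, int]] = {
    139618: (16, 23520984),
    83093: (13, 31786617),
}


def regions_for_list(gene_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Original style, l(s) -> l(s): whole list in, two parallel lists out."""
    chromosomes = [REGION[g][0] for g in gene_ids]
    positions = [REGION[g][1] for g in gene_ids]
    return chromosomes, positions


def region_for_gene(gene_id: int) -> Tuple[int, int]:
    """Refactored style, s -> s: one gene in, one pair out."""
    return REGION[gene_id]


gene_ids = [139618, 83093]

# Black-box call: from the outside, each output list can only be traced
# back to gene_ids as a whole.
chromosomes, positions = regions_for_list(gene_ids)

# Element-wise iteration: each result is recorded with the index it came from.
per_gene = [(i, region_for_gene(g)) for i, g in enumerate(gene_ids)]

# Dot product: pair up the elements found at the same position of two lists.
paired = list(zip(chromosomes, positions))
```

Both routes compute the same values; the difference is that the element-wise version makes the index correspondence observable to the workflow engine instead of hiding it inside the processor.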
  • 63. Provenance-active processors
    – Passive processors do not contribute explicit provenance info
    – Provenance-active processors actively feed metadata to the lineage service
    Two example processors P, each with input X: l(s) = [a1, a2, a3]:
      – one with output Y: s = b
      – one with output Y: l(s) = [b1, b2]
    Static annotations: P is index-preserving
    Dynamic annotations:
      – aggregation f(): b = f(X[1]...X[k])
      – b = X[i]
      – sorting: Y = Π(X)
  • 64. Open Provenance Model
    • A graph notation to represent process provenance
      – independent of the provenance producers
      – suitable for exchanging provenance across different workflow systems
    • State: draft 1.01 (July 2008)
  • 65. Mapping to OPM - granularity issue
    [Figure: a dataflow in which P0 (inputs X1, X2 bound to a, b; outputs Y1, Y2 bound to c, d) feeds P1 and P2 (each X: s → Y: s, with outputs e, f), shown next to its OPM graph of used and wasGeneratedBy (wgb) edges]
    A wasDerivedFrom edge relates whole values; the granular case is a wasDerivedFrom dependency between sub-values b[p] and d[p’].
    How can this granular dependency be described for all arbitrary paths p? Currently it cannot be expressed using OPM.
  • 70. Path mapping rules
    [Figure: OPM graph with used and wgb edges linking a, b, c, d and processors P1, P2, P3; below it, a wasDerivedFrom edge labelled “actual lineage” between b[p] and d[p’]]
    – The used / wgb graph: static graph structure is sufficient to provide this (in Taverna)
    – The granular wasDerivedFrom edge: this is only known at query time (extensional enumeration is not an option)
    Observation:
    • only need to consider individual processor transformations
    • exploit local processor rules for propagating granular lineage
    Hint: granularity is only determined by the depth of the path
    At query time, the Taverna lineage query algorithm encodes a path mapping rule to compute p’ given p
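One way a local path mapping rule could look is sketched below. It is an illustration under stated assumptions, not the actual Taverna query algorithm: it assumes each input port has a known iteration depth (how many list levels the engine iterates over for that port), that inputs are combined by cross product, and that the output index path is therefore the concatenation of the per-input index paths. The function name `map_output_path` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def map_output_path(
    output_path: Sequence[int],
    iteration_depths: List[Tuple[str, int]],
) -> Dict[str, Tuple[int, ...]]:
    """Split an output index path into one index path per input port.

    Only the depth of each segment matters: the first `depth` components
    go to the first port, the next ones to the second port, and so on.
    """
    needed = sum(depth for _, depth in iteration_depths)
    if len(output_path) < needed:
        raise ValueError("output path is shorter than the total iteration depth")
    paths: Dict[str, Tuple[int, ...]] = {}
    pos = 0
    for port, depth in iteration_depths:
        paths[port] = tuple(output_path[pos:pos + depth])
        pos += depth
    # Components beyond `needed` index into the value produced by a single
    # invocation and carry no information about the inputs.
    return paths


two_lists = map_output_path([1, 2, 0], [("X1", 1), ("X2", 1)])
whole_value = map_output_path([3], [("X1", 1), ("X2", 0)])
```

A port with iteration depth 0 maps to the empty path, which corresponds to the coarse case where lineage can only name the whole input value.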
  • 73. Architecture
    Components: Taverna workflow engine (inputs, outputs), external services, provenance-active processors with a p-active API, provenance events, provenance manager, provenance information repository, and a lineage query interface: lin( P:Y, , Psel, E(D))
    1. Common content:
      – processor execution details
      – binding of input/output variables to values
      – completion status
    2. Optional content for provenance-active processors:
      – explicit output → input dependency assertions: let I, O be the sets of input and output variables, respectively; then
        depends(Y, X[p], <depType>), X ∈ I, Y ∈ O
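The two kinds of event content can be pictured as a small record type. This is a sketch only: the class names `ProvenanceEvent` and `Depends`, the field names, and the dependency type string "selection" are hypothetical, since the slides give only the shape depends(Y, X[p], <depType>).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Depends:
    """Explicit assertion: output_var depends on input_var at input_path."""
    output_var: str
    input_var: str
    input_path: Tuple[int, ...]
    dep_type: str


@dataclass
class ProvenanceEvent:
    # Common content, sent for every processor execution.
    processor: str
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]
    status: str
    # Optional content, sent only by provenance-active processors.
    depends: List[Depends] = field(default_factory=list)

    def validate(self) -> None:
        """Check that every assertion names a bound input and a bound output."""
        for d in self.depends:
            if d.input_var not in self.inputs:
                raise ValueError(f"unknown input variable {d.input_var!r}")
            if d.output_var not in self.outputs:
                raise ValueError(f"unknown output variable {d.output_var!r}")


# A processor that picks the smallest element and reports which one it was.
event = ProvenanceEvent(
    processor="P",
    inputs={"X": [3, 1, 2]},
    outputs={"Y": 1},
    status="completed",
    depends=[Depends("Y", "X", (1,), "selection")],
)
event.validate()
```

A passive processor would send the same record with an empty `depends` list, leaving the lineage service to fall back on static annotations or whole-value lineage.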
  • 76. Ongoing work
    • Experimental evaluation:
      – to what extent is granularity a real practical problem?
      – quantify provenance friendliness by analysing a large collection of workflows from myExperiment
      – quantify available improvements (i.e., by refactoring)
    • Compare collection management in Taverna with other workflow models
      – can we successfully exchange provenance graphs?
    • Integration of the provenance service with the new version of Taverna
      – to be released before end of year