Presented at the Open Knowledge Conference 2011 in Berlin.
This work is being done under the heading of DataONE. More information can be found at http://notebooks.dataone.org/workflows
Workflow Classification and Open-Sourcing Methods
1. Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model Richard Littauer, Karthik Ram, Bertram Ludäscher, William Michener, Rebecca Koskela DataONE 1
5. Scientific Workflows. Tools that help scientists: automate repetitive or difficult work; make their experiments reproducible; track provenance; share their data with other scientists.
17. Workflow Workbenches. Not all scientists are coders. By using front-end visualizations and eliminating the need for lower-level coding (e.g., shell scripts), workbenches make it easier for scientists to do and share their work. http://www.flickr.com/photos/wouterverhelst/362538835/
21. Workflow Workbenches. This is how workflows are commonly ‘sold’. However, the reality isn't quite there yet: often one style of coding (conventional) is simply replaced with another (workflows). We’re trying to find out how well these promises hold up in practice. http://www.flickr.com/photos/amagill/3366720659/
22. Our Study. However, few studies have looked at how these workflows work. http://www.flickr.com/photos/eleaf/2536358399
27. Our Study. How do we classify workflows? Where do existing workflow systems fall short? How can the process of creating workflows be improved? What about executing them? And sharing them? http://www.flickr.com/photos/eleaf/2536358399
30. Our Study. Some studies have been done. For example, as many as 30% of workflow components have been assessed to be so-called data conversion shims [4], that is, components whose only job is to convert data between the formats used by neighbouring steps. This large percentage and the difficulty of developing custom shims suggest that workflow design technology can still be improved.
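To make the idea of a shim concrete, here is a minimal sketch in Python. It is not taken from any workflow system or from [4]; the three component functions, the sample values, and the `PIPELINE` structure are all hypothetical and only illustrate a conversion step sitting between two "real" steps.

```python
import csv
import io


def fetch_records():
    """Upstream component (hypothetical): returns its results as CSV text."""
    return "site,temp_c\nA,12.5\nB,14.0\n"


def csv_to_rows(text):
    """Shim: converts CSV text into the list of dicts the next step expects."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"site": r["site"], "temp_c": float(r["temp_c"])} for r in reader]


def mean_temperature(rows):
    """Downstream component (hypothetical): averages a numeric column."""
    return sum(r["temp_c"] for r in rows) / len(rows)


# Each entry: (name, function, is_shim). The flag is set by hand here.
PIPELINE = [
    ("fetch_records", fetch_records, False),
    ("csv_to_rows", csv_to_rows, True),
    ("mean_temperature", mean_temperature, False),
]


def run(pipeline):
    """Runs the components in order, feeding each output to the next step."""
    value = None
    for index, (_name, func, _is_shim) in enumerate(pipeline):
        value = func() if index == 0 else func(value)
    return value


def shim_fraction(pipeline):
    """Share of components that do nothing but convert data."""
    return sum(1 for _, _, is_shim in pipeline if is_shim) / len(pipeline)
```

In this toy pipeline one of the three components is a shim, which is the kind of ratio a classification of real workflows would have to establish by inspecting each component.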
33. Our Study. Most importantly, these studies have not significantly changed the way we use workflows. In some cases, analyses run on the same data produced different results, which suggests that open data alone does not lead to reproducible science [5]. A better understanding of workflows, and of how best to integrate them into open science, is therefore called for.
39. Our Study. We are analyzing a wide variety of workflow systems and publicly available workflows. Our main repository is myExperiment (http://www.myexperiment.org): established 2007; 4500+ users; 1850+ workflows (mostly Taverna 1, Taverna 2, and RapidMiner); minable via SPARQL.
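As a rough illustration of what mining a repository via SPARQL can look like, the Python sketch below only builds a query and a request URL; it does not contact any server. The endpoint address and the workflow class URI are assumptions made for this example and should be checked against the repository's own RDF documentation. The `query` request parameter follows the standard SPARQL protocol.

```python
from urllib.parse import urlencode

# Both values below are ASSUMPTIONS for illustration, not confirmed by the slides.
ENDPOINT = "http://rdf.myexperiment.org/sparql"
WORKFLOW_CLASS = "http://rdf.myexperiment.org/ontologies/contributions/Workflow"


def build_query(limit=10):
    """Builds a SPARQL query listing workflow resources and their titles."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT ?workflow ?title WHERE {\n"
        f"  ?workflow rdf:type <{WORKFLOW_CLASS}> .\n"
        "  ?workflow dcterms:title ?title .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(query, endpoint=ENDPOINT):
    """Encodes the query as the 'query' parameter of an HTTP GET request."""
    return endpoint + "?" + urlencode({"query": query})
```

Calling `build_request_url(build_query(5))` yields a URL that could be fetched with any HTTP client, assuming the endpoint and vocabulary above are correct.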
41. Our Study. Methods: For each workflow, we’re gathering three tiers of information: Metadata, Description, and ‘Worth’. http://www.flickr.com/photos/jpvargas/83258973/
42. Tier 1, Metadata: workflow source; workflow system; works on run; area of research; type; description; user; user total uploads; published citations; downloads; date uploaded.
43. Tier 2 Description: Foreign components QA/QC steps Visual Output Number of inputs Intermediate input Linear Embedded Embedded details Number of databases Type conversion Tag conversion Multiple outputs Processing Stats Scalable Smart reruns provenance retained Multipurpose research mining Query Loop Grid Accounts necessary External results DataONE 43
44. Tier 3, ‘Worth’: sufficiency of metadata; sufficiency of natural-language description; reuse in published articles; relevant issues based on the system the workflow was created in.
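One way to hold the three tiers for a single workflow is a nested record, sketched below in Python. The field names follow the tier lists above, but the types, the choice of which Tier 2 fields to include (only a subset is shown), and the sample values are assumptions made for illustration, not the study's actual coding sheet.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Metadata:
    """Tier 1. Types are assumed; the slides only name the fields."""
    workflow_source: str
    workflow_system: str
    works_on_run: Optional[bool]  # None if the workflow was not test-run
    area_of_research: str
    workflow_type: str
    description: str
    user: str
    user_total_uploads: int
    published_citations: int
    downloads: int
    date_uploaded: date


@dataclass
class Description:
    """Tier 2. Only a subset of the listed fields is modelled here."""
    number_of_inputs: int
    number_of_databases: int
    linear: bool
    embedded: bool
    multiple_outputs: bool
    scalable: bool


@dataclass
class Worth:
    """Tier 3. Sufficiency is reduced to yes/no for this sketch."""
    metadata_sufficient: bool
    description_sufficient: bool
    reused_in_published_articles: bool
    system_specific_issues: List[str] = field(default_factory=list)


@dataclass
class WorkflowRecord:
    metadata: Metadata
    description: Description
    worth: Worth


# Hypothetical example record; every value is made up.
example = WorkflowRecord(
    metadata=Metadata(
        "myExperiment", "Taverna 2", True, "bioinformatics", "data acquisition",
        "Fetches sequences by identifier", "example_user", 4, 0, 120,
        date(2010, 5, 1),
    ),
    description=Description(2, 1, True, False, False, False),
    worth=Worth(True, False, False),
)
```

Keeping the tiers as separate classes mirrors the slides and lets each tier be filled in at a different stage of the survey.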
48. Research Hypotheses. Most workflows perform simple but repetitive data acquisition tasks rather than complex operations. Workflows are becoming more complex over time. Workflows are becoming more powerful over time. Workflows become more complex as their creator gains experience. http://www.flickr.com/photos/nauright/5391995939/
52. Research Hypotheses. Workflow reuse is proportional to the complexity of the tasks the workflow performs. Workflow reuse is proportional to the sufficiency of the documentation. Workflow reuse is proportional to the age of the workflow. Workflow reuse is proportional to the proficiency of the creator. http://www.flickr.com/photos/nauright/5391995939/
54. Data. Still being gathered and analysed. We’re using the myExperiment download rate as a proxy for workflow reuse.
57. Data. One issue with this proxy is the number of workflows being created by each user. However, it should still allow for a diachronic analysis.
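The sketch below shows, under stated assumptions, how a reuse hypothesis could be checked once the data are in. It is not the study's analysis: the download rate is assumed to mean downloads per day since upload, component count is used as a stand-in for complexity, a Spearman rank correlation is swapped in as a simple test of monotone association (weaker than strict proportionality), and the three sample workflows are made up.

```python
from datetime import date


def download_rate(downloads, uploaded, as_of):
    """Downloads per day since upload (assumed definition of the proxy)."""
    days = max((as_of - uploaded).days, 1)
    return downloads / days


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up sample: (number of components, downloads, upload date).
AS_OF = date(2011, 6, 30)
SAMPLE = [
    (3, 100, date(2010, 6, 30)),
    (8, 400, date(2010, 6, 30)),
    (5, 200, date(2010, 6, 30)),
]

complexity = [components for components, _, _ in SAMPLE]
rates = [download_rate(d, uploaded, AS_OF) for _, d, uploaded in SAMPLE]
rho = spearman(complexity, rates)
```

Dividing by the time since upload keeps older workflows from looking more reused simply because they have been available longer, which matters for any comparison over time.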
59. Conclusion. Old publishing model: Write paper. Submit paper. Drink wine. New publishing model: Write paper. Submit paper. Get feedback. Submit data. Replication (?). http://www.flickr.com/photos/joelmontes/4762384399/
62. Conclusion. Better publishing model: Write paper using workflows. Submit paper. Get feedback. Submit data. Submit workflows. Replication that works. As this is done, questions of how effective workflows are, and how they can be utilized in the new research and publishing paradigm, might be answered. http://www.flickr.com/photos/mactitioner/5595830505
63. References. [1] Kepler Project. http://www.kepler-project.org [2] Taverna. http://www.taverna.org.uk/ [3] VisTrails. http://www.vistrails.org/ [4] Cui Lin, Shiyong Lu, Xubo Fei, Darshan Pai, and Jing Hua. 2009. A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows. In Proceedings of the 2009 IEEE International Conference on Services Computing (SCC '09). IEEE Computer Society, Washington, DC, USA. http://dx.doi.org/10.1109/SCC.2009.77 [5] Coombes, K. R., Wang, J. & Baggerly, K. A. Microarrays: retracing steps. Nature Med. 13, 1276–1277 (2007). DataONE Workflows Project: http://notebooks.dataone.org/workflows Mendeley Research Group: http://www.mendeley.com/groups/1189721/scientific-workflows-and-workflow-systems/ http://www.flickr.com/photos/wwworks/4759535950/