3. SCAPE
Background
• What is a scientific workflow?
• "The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules."
• Background: eSciences, in particular Life Sciences
• Two approaches:
  • Data driven (what)
  • Control driven (how)
4. SCAPE
Background II
• Why use scientific workflows?
  • Automation of repetitive processes
  • Chaining of distinct components (interoperability)
  • "In-silico experimentation"
  • Documented experiment configuration
  • Re-usable by others (encapsulation)
5. SCAPE
Background III
• Scientific Workflow Management Systems
  • Taverna (myGrid, UK)
  • Kepler (Kepler, USA)
  • Meandre (SEASR, USA)
  • and there are many more…
• Why Taverna?
  • Good experience in IMPACT; open source
  • European partner (University of Manchester, UK)
  • Widely used (> 4000 active users)
  • Shields the end user from complexity
6. SCAPE
Excursus: IMPACT
• EU FP7 project on OCR, coordinated by the KB
• Prototyped the use of scientific workflows in digitisation
• Some components are being further developed in SCAPE
• See also http://impact.kbresearch.nl/
7. SCAPE
…but back to digital preservation
• Example use cases for scientific workflows
  • File format identification/migration/validation
  • Tool evaluation
  • Quality assurance
  • …and many more!
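To make the first use case concrete: format identification at its simplest means inspecting a file's leading "magic bytes". The toy sketch below is purely illustrative (real preservation workflows delegate this to dedicated tools such as DROID or FIDO); the class name and the handful of signatures are my own choices, not part of SCAPE.

```java
/** Toy file format identification by magic bytes. Illustrative only:
 *  production workflows use dedicated identification tools. */
public class MagicBytes {
    public static String identify(byte[] header) {
        // PDF files start with the ASCII bytes "%PDF"
        if (header.length >= 4 && header[0] == '%' && header[1] == 'P'
                && header[2] == 'D' && header[3] == 'F') return "application/pdf";
        // PNG files start with 0x89 followed by "PNG"
        if (header.length >= 4 && (header[0] & 0xFF) == 0x89 && header[1] == 'P'
                && header[2] == 'N' && header[3] == 'G') return "image/png";
        // JPEG files start with FF D8 FF
        if (header.length >= 3 && (header[0] & 0xFF) == 0xFF
                && (header[1] & 0xFF) == 0xD8
                && (header[2] & 0xFF) == 0xFF) return "image/jpeg";
        return "application/octet-stream"; // unknown
    }
}
```

Migration and validation steps can then be chained after such an identification step, which is exactly the kind of component wiring a workflow system automates.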
8. SCAPE
Enter Taverna
• Web services (SOAP, REST)
• Beanshell scripts (Java scripting, libraries)
• R (statistics)
• Local tools (SH/SSH)
• Excel/CSV
• Plugins
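For a flavour of the Beanshell option: a Taverna Beanshell service is a Java-like script in which the service's input and output ports appear as plain variables. The sketch below shows the same logic as ordinary Java; the port names (`fileName` in, `extension` out) are made up for illustration and are not from the slides.

```java
/** Sketch of the logic of a Taverna Beanshell service. In Taverna the
 *  input port (here "fileName") and output port ("extension") would be
 *  plain script variables; the port names are hypothetical. */
public class ExtensionService {
    public static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot, or dot only at position 0 -> treat as no extension
        return dot > 0 ? fileName.substring(dot + 1).toLowerCase() : "";
    }
}
```

Because the script is plain Java syntax, existing Java libraries can be dropped onto the classpath and called directly, which is what makes Beanshell services a convenient glue layer.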
17. SCAPE
Scalability
• Taverna workflows on Hadoop
  • Hadoop = open-source Map/Reduce implementation, originally developed at Yahoo
  • Idea: execute workflows on a Hadoop cluster
  • Mainly responsible: AIT, UMAN
  • Clusters: IMF, ONB, KB, SB
• Some problems:
  • Scheduling: Hadoop (1 big jar) or Taverna (many small jars)?
  • Error handling (long-running automated workflows)
  • List handling (cross product vs. dot product)
  • "Small files problem" → Hadoop SequenceFile
• OPF blog: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
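The cross-product vs. dot-product issue mentioned above refers to how a workflow system combines two input lists for a service that takes two inputs. A minimal sketch of the two strategies (method and class names are my own, not Taverna API):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the two list-combination strategies for a two-input service:
 *  cross product pairs every element with every element, dot product
 *  pairs elements index by index. */
public class ListStrategies {
    public static List<String> crossProduct(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        for (String x : a)
            for (String y : b)
                out.add(x + "+" + y);           // |a| * |b| invocations
        return out;
    }

    public static List<String> dotProduct(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(a.size(), b.size()); i++)
            out.add(a.get(i) + "+" + b.get(i)); // min(|a|, |b|) invocations
        return out;
    }
}
```

Picking the wrong strategy either explodes the number of Hadoop jobs (cross product) or silently drops inputs (dot product over lists of unequal length), which is why this matters for long-running automated workflows.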
18. SCAPE
Examples
• Workflow for preparing large document collections for data analysis
• Different types of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive) are used (ONB)
• Processing time for 60,000 books / 24 million pages: 6 h
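As a back-of-the-envelope check on the ONB figures quoted above (24 million pages in 6 hours), the sustained cluster throughput works out to roughly 1,100 pages per second:

```java
/** Back-of-the-envelope throughput for the figures quoted above:
 *  24,000,000 pages processed in 6 hours. */
public class Throughput {
    public static long pagesPerSecond(long pages, long hours) {
        return pages / (hours * 3600L); // integer division is fine for a rough estimate
    }
}
```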