3. SCAPE
Background
• What is a scientific workflow?
• "The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules."
• Background: eSciences, in particular Life Sciences
• Two approaches:
  • Data driven (what)
  • Control driven (how)
4. SCAPE
Background II
• Why use scientific workflows?
  • Automation of repetitive processes
  • Chaining of distinct components (interoperability)
  • "In-silico experimentation"
  • Documented experiment configuration
  • Re-usable by others (encapsulation)
5. SCAPE
Background III
• Scientific Workflow Management Systems
  • Taverna (myGrid, UK)
  • Kepler (Kepler, USA)
  • Meandre (SEASR, USA)
  • and there are many more…
• Why Taverna?
  • Good experience in IMPACT; open source
  • European partner (University of Manchester, UK)
  • Widely used (> 4000 active users)
  • Shields the end user from complexity
6. SCAPE
Excursus: IMPACT
• EU FP7 project on OCR, coordinated by the KB
• Prototyped the use of scientific workflows in digitisation
• Some components are being further developed in SCAPE
• See also http://impact.kbresearch.nl/
7. SCAPE
…but back to digital preservation
• Example use cases for scientific workflows
  • File format identification/migration/validation
  • Tool evaluation
  • Quality assurance
  • …and many more!
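To make the first use case concrete: format identification at its simplest means inspecting a file's leading "magic bytes". The toy sketch below is purely illustrative (real preservation workflows delegate this to dedicated tools such as DROID or FIDO); the class name and the handful of signatures are my own choices, not part of SCAPE.

```java
/** Toy file format identification by magic bytes. Illustrative only:
 *  production workflows use dedicated identification tools. */
public class MagicBytes {
    public static String identify(byte[] header) {
        // PDF files start with the ASCII bytes "%PDF"
        if (header.length >= 4 && header[0] == '%' && header[1] == 'P'
                && header[2] == 'D' && header[3] == 'F') return "application/pdf";
        // PNG files start with 0x89 followed by "PNG"
        if (header.length >= 4 && (header[0] & 0xFF) == 0x89 && header[1] == 'P'
                && header[2] == 'N' && header[3] == 'G') return "image/png";
        // JPEG files start with FF D8 FF
        if (header.length >= 3 && (header[0] & 0xFF) == 0xFF
                && (header[1] & 0xFF) == 0xD8
                && (header[2] & 0xFF) == 0xFF) return "image/jpeg";
        return "application/octet-stream"; // unknown
    }
}
```

Migration and validation steps can then be chained after such an identification step, which is exactly the kind of component wiring a workflow system automates.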
8. SCAPE
Enter Taverna
• Web services (SOAP, REST)
• Beanshell scripts (Java scripting, libraries)
• R (statistics)
• Local tools (SH/SSH)
• Excel/CSV
• Plugins
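For a flavour of the Beanshell option: a Taverna Beanshell service is a Java-like script in which the service's input and output ports appear as plain variables. The sketch below shows the same logic as ordinary Java; the port names (`fileName` in, `extension` out) are made up for illustration and are not from the slides.

```java
/** Sketch of the logic of a Taverna Beanshell service. In Taverna the
 *  input port (here "fileName") and output port ("extension") would be
 *  plain script variables; the port names are hypothetical. */
public class ExtensionService {
    public static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot, or dot only at position 0 -> treat as no extension
        return dot > 0 ? fileName.substring(dot + 1).toLowerCase() : "";
    }
}
```

Because the script is plain Java syntax, existing Java libraries can be dropped onto the classpath and called directly, which is what makes Beanshell services a convenient glue layer.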
17. SCAPE
Scalability
• Taverna workflows on Hadoop
  • Hadoop = open-source Map/Reduce implementation, originally developed at Yahoo
  • Idea: execute workflows on a Hadoop cluster
  • Mainly responsible: AIT, UMAN
  • Clusters: IMF, ONB, KB, SB
• Some problems:
  • Scheduling: Hadoop (1 big jar) or Taverna (many small jars)?
  • Error handling (long-running automated workflows)
  • List handling (cross product vs. dot product)
  • "Small files problem" → Hadoop SequenceFile
• OPF blog: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
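The cross-product vs. dot-product issue mentioned above refers to how a workflow system combines two input lists for a service that takes two inputs. A minimal sketch of the two strategies (method and class names are my own, not Taverna API):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the two list-combination strategies for a two-input service:
 *  cross product pairs every element with every element, dot product
 *  pairs elements index by index. */
public class ListStrategies {
    public static List<String> crossProduct(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        for (String x : a)
            for (String y : b)
                out.add(x + "+" + y);           // |a| * |b| invocations
        return out;
    }

    public static List<String> dotProduct(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(a.size(), b.size()); i++)
            out.add(a.get(i) + "+" + b.get(i)); // min(|a|, |b|) invocations
        return out;
    }
}
```

Picking the wrong strategy either explodes the number of Hadoop jobs (cross product) or silently drops inputs (dot product over lists of unequal length), which is why this matters for long-running automated workflows.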
18. SCAPE
Examples
• Workflow for preparing large document collections for data analysis
• Different types of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive) are used (ONB)
• Processing time for 60,000 books / 24 million pages: 6 h
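As a back-of-the-envelope check on the ONB figures quoted above (24 million pages in 6 hours), the sustained cluster throughput works out to roughly 1,100 pages per second:

```java
/** Back-of-the-envelope throughput for the figures quoted above:
 *  24,000,000 pages processed in 6 hours. */
public class Throughput {
    public static long pagesPerSecond(long pages, long hours) {
        return pages / (hours * 3600L); // integer division is fine for a rough estimate
    }
}
```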