SCAPE
Preservation Workflows with Taverna
Clemens Neudecker, Research Department, Koninklijke Bibliotheek
I&O knowledge session, 28 November 2012
SCAPE: SCAlable Preservation Environments

Background
• What is a scientific workflow?
  • “The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.”
• Background: eSciences, in particular Life Sciences
• Two approaches
  • Data driven (what)
  • Control driven (how)

Background II
• Why use scientific workflows?
  • Automation of repetitive processes
  • Chaining of distinct components (interoperability)
  • “In-silico experimentation”
  • Documented experiment configuration
  • Re-usable by others (encapsulation)

Background III
• Scientific Workflow Management Systems
  • Taverna (myGrid, UK)
  • Kepler (Kepler, USA)
  • Meandre (SEASR, USA)
  • and there are many more…
• Why Taverna?
  • Good experience in IMPACT; open source
  • European partner (University of Manchester, UK)
  • Widely used (> 4000 active users)
  • Hides complexity from the end user

Excursus: IMPACT
• EU FP7 project on OCR, coordinated by the KB
• Prototyped the use of scientific workflows in digitization
• Some components are being developed further in SCAPE
• See also http://impact.kbresearch.nl/

…but back to digital preservation
• Example use cases for scientific workflows
  • File format identification/migration/validation
  • Tool evaluation
  • Quality assurance
  • …and many more!

Enter Taverna
• Web services (SOAP, REST)
• Beanshells (Java scripting, libraries)
• R (statistics)
• Local tools (SH/SSH)
• Excel/CSV
• Plugins

Components I
• Taverna Workbench

Components II
• Taverna Server

Components III
• SCAPECatalogue

Components IV
• myExperiment

Examples
Validate JPEG2000 files with Jpylyzer, re-create invalid JP2s from their TIFF masters, and validate the derived JP2s again with Jpylyzer
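
The control flow of this example (validate, re-create from the master, validate again) can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual Taverna workflow: the function name `validate_and_repair` and both callables are hypothetical stand-ins, where `is_valid_jp2` would wrap a Jpylyzer call and `convert_tiff_to_jp2` would wrap whatever TIFF-to-JP2 converter is in use.

```python
from typing import Callable, Dict


def validate_and_repair(
    jp2_to_tiff: Dict[str, str],
    is_valid_jp2: Callable[[str], bool],
    convert_tiff_to_jp2: Callable[[str, str], None],
) -> Dict[str, str]:
    """Return a status per JP2 file: 'valid', 'repaired' or 'failed'.

    jp2_to_tiff maps each JP2 path to the path of its TIFF master.
    Both callables are assumptions standing in for external tools.
    """
    status: Dict[str, str] = {}
    for jp2_path, tiff_path in jp2_to_tiff.items():
        # Step 1: validate the existing JP2.
        if is_valid_jp2(jp2_path):
            status[jp2_path] = "valid"
            continue
        # Step 2: re-create the invalid JP2 from its TIFF master.
        convert_tiff_to_jp2(tiff_path, jp2_path)
        # Step 3: validate the derived JP2 again.
        status[jp2_path] = "repaired" if is_valid_jp2(jp2_path) else "failed"
    return status
```

Passing the tools in as callables keeps the sketch independent of any particular command line; in a workflow system each callable corresponds to one component in the chain.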

Examples
Apply Matchbox duplicate detection for book page images to a list of books from the Google Books Project

Examples
Take a list of ARC files as input and create a MIME type report per ARC, plus a summary report over all ARCs, using Tika
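
The reporting step of this example boils down to counting MIME types twice: once per ARC file and once overall. A minimal Python sketch, assuming the detection itself (done by Tika in the workflow) has already produced `(arc_name, mime_type)` pairs; the function name and the record layout are hypothetical, not taken from the workflow:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def mime_reports(
    records: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, Counter], Counter]:
    """Build one MIME type count per ARC file and one summary over all ARCs.

    records: (arc_name, mime_type) pairs, one per archived resource.
    """
    per_arc: Dict[str, Counter] = {}
    summary: Counter = Counter()
    for arc_name, mime_type in records:
        # Count the type in the report of the ARC it came from...
        per_arc.setdefault(arc_name, Counter())[mime_type] += 1
        # ...and in the summary across all ARCs.
        summary[mime_type] += 1
    return per_arc, summary
```

Because the per-ARC counts are independent of each other, they can be computed in parallel and merged afterwards, which is what makes this kind of report a natural fit for Map/Reduce.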

Examples
Validate the WAV file format using the JHOVE2 web service

Scalability
• Taverna workflows on Hadoop
  • Hadoop = open-source Map/Reduce implementation (Apache project, developed largely at Yahoo)
  • Idea: execute workflows on a Hadoop cluster
  • Mainly responsible: AIT, UMAN
  • Clusters: IMF, ONB, KB, SB
• Some problems:
  • Scheduling: Hadoop (1 big jar) or Taverna (many small jars)?
  • Error handling (long-running automated workflows)
  • List handling (cross product vs. dot product)
  • “Small files problem” (Hadoop SequenceFile)
• OPF Blog: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
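
The list handling problem is easier to see with a small example. When a component receives two input lists, a cross product runs it on every combination of elements, while a dot product pairs the elements by position. A minimal Python sketch of the two strategies; the function names and the sample values are illustrative only and are not Taverna API:

```python
from itertools import product
from typing import List, Sequence, Tuple


def cross_product(files: Sequence[str], tools: Sequence[str]) -> List[Tuple[str, str]]:
    """Every file combined with every tool: len(files) * len(tools) invocations."""
    return list(product(files, tools))


def dot_product(files: Sequence[str], tools: Sequence[str]) -> List[Tuple[str, str]]:
    """The i-th file paired with the i-th tool: one invocation per position."""
    if len(files) != len(tools):
        raise ValueError("dot product needs lists of equal length")
    return list(zip(files, tools))


if __name__ == "__main__":
    files = ["a.tif", "b.tif"]
    tools = ["toolX", "toolY"]
    print(cross_product(files, tools))  # 4 pairs
    print(dot_product(files, tools))    # 2 pairs
```

The difference matters at scale: two lists of 1,000 items each give 1,000 invocations as a dot product but 1,000,000 as a cross product.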

Examples
Workflow for preparing large document collections for data analysis.
Different types of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive) are used (ONB).
Processing time for 60,000 books / 24 million pages: 6 h
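
To put the processing time in perspective, the three figures on the slide can be turned into throughput rates. The sketch below uses only those figures; cluster size and hardware are not given in the slides, so nothing per node can be derived. The function name is hypothetical.

```python
from typing import Dict


def throughput(books: int, pages: int, hours: float) -> Dict[str, float]:
    """Derive simple rates from a collection size and a wall-clock time."""
    seconds = hours * 3600
    return {
        "pages_per_book": pages / books,
        "books_per_hour": books / hours,
        "pages_per_second": pages / seconds,
    }


if __name__ == "__main__":
    # Figures from the slide: 60,000 books, 24 million pages, 6 hours.
    rates = throughput(books=60_000, pages=24_000_000, hours=6)
    for name, value in rates.items():
        print(f"{name}: {value:,.1f}")
```

With the slide's numbers this works out to 400 pages per book on average, 10,000 books per hour, and roughly 1,111 pages per second.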

Demo(s)

Want some more?
• SCAPE source code on GitHub: github.com/openplanets/scape
• SCAPE for Developers: SCAPE Developer's Guide
• SCAPE Platform: SCAPE Preservation Execution Platform
• SCAPE workshops, hackathons: check with us! http://www.scape-project.eu/events
