A Service-Oriented Architecture 
for Collaborative Workflow 
Development and 
Experimentation 
eHumanities Seminar 2012 
University of Leipzig 
10-10-2012 
Clemens Neudecker, KB @cneudecker 
Zeki Mustafa Dogan, SUB-DL 
Sven Schlarb, ÖNB @SvenSchlarb 
Juan Garcés, GCDH @juan_garces
Idea 
• Provide web-based versions of tools 
(web services) 
• Package web services, data and 
documentation into ready-to-run 
“components” (encapsulation) 
• Chain the components to create workflows 
via drag-and-drop operation 
• Share and use workflows to re-run 
experiments and to demonstrate results
Background 
• High degree of diversity in research topics, 
but also tools and frameworks being used 
• Technical resources should be easy to 
use, well documented, accessible from 
anywhere 
• Avoid reinventing the wheel
Requirements 
• Interoperability = connect different resources 
• Flexibility = easy to deploy and adapt 
• Modularity = allow different combinations of tools 
• Usability = simple to use for non-technical users 
• Re-usability = easy to share with others 
• Scalability = suitable for large-scale processing 
• Sustainability = resources are simple to preserve 
• Transparency = tools evaluated separately 
• Distributed development and deployment
Interoperability Framework (IIF) 
• Modules: 
- Java Wrapper for command line tools 
- Web Services (incl. format converters) 
- Taverna Workflow Engine 
- Client interfaces 
- Repository connectors
Sources 
https://github.com/impactcentre/interoperability-framework
IIF Command Line Wrapper 
• Java project, built with Maven 2 
• Creates a web service project from 
a given tool description (XML) 
• Web service exposes SOAP & REST 
endpoints and Java API interface 
• Requirements: command line call, 
no direct user interaction
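
The wrapping idea can be sketched in a few lines of Python. This is not the IIF Java wrapper itself: the `ToolDescription` class is a hypothetical stand-in for the XML tool description, and the `{placeholder}` template syntax is an assumption made for illustration. It only shows why the two requirements matter: the tool must be callable as a single command line, and it must run without asking the user anything.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class ToolDescription:
    """Hypothetical stand-in for the XML tool description."""
    name: str
    command: str                      # command template with {placeholders}
    defaults: dict = field(default_factory=dict)


def build_command(tool: ToolDescription, **params: str) -> list[str]:
    """Fill the template; explicit parameters override the defaults."""
    values = {**tool.defaults, **params}
    # Split first, then substitute, so a value with spaces stays one argument.
    return [part.format(**values) for part in shlex.split(tool.command)]


def run_tool(tool: ToolDescription, **params: str) -> str:
    """Run the tool non-interactively and return its standard output."""
    result = subprocess.run(build_command(tool, **params),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # The Python interpreter itself serves as the wrapped "tool" here.
    demo = ToolDescription("hello", "{python} -c {code}",
                           defaults={"python": sys.executable})
    print(run_tool(demo, code="print('wrapped call')"), end="")
```

A web service layer would then only need to map request parameters onto the keyword arguments of `run_tool`.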
IIF Web Services 
• Web services are described by a WSDL 
• Input/output data structures 
• Data is referenced by URL 
• Annotations 
• Default values
REST
SOAP
IIF Workflows 
• What is a workflow? (Yahoo Pipes, etc.) 
• Different kinds of workflows: for a single 
command, application, chain of processes 
• Main benefits: encapsulation, reuse 
• Workflows as “components”: include link 
to WS endpoint, sample input data and 
documentation = ready-to-use resource 
• Web 2.0 workflow registry: myExperiment
Why workflows? 
• “In-silico experimentation” 
• Good structuring of experiment setup: 
– Challenge/Research question 
– Dataset definition 
– Processing with algorithms 
– Evaluation/Provenance 
– Presentation of results 
• All this can be modelled into a workflow
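
As a rough illustration of how such an experiment setup maps onto a workflow, the sketch below models a workflow as a research question plus an ordered list of named steps, and records a simple provenance log while it runs. The `Workflow` class and the toy steps are assumptions made for this example; Taverna workflows are defined graphically and stored in Taverna's own format, not like this.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Workflow:
    """Toy model: research question, processing steps, provenance log."""
    question: str
    steps: list[tuple[str, Callable[[Any], Any]]] = field(default_factory=list)
    provenance: list[str] = field(default_factory=list)

    def add(self, name: str, func: Callable[[Any], Any]) -> "Workflow":
        self.steps.append((name, func))
        return self                      # allow chaining of add() calls

    def run(self, dataset: Any) -> Any:
        data = dataset
        for name, func in self.steps:
            data = func(data)            # output of one step feeds the next
            self.provenance.append(f"{name}: {data!r}")
        return data


if __name__ == "__main__":
    wf = (Workflow("How long are the texts on average?")
          .add("tokenize", lambda docs: [d.split() for d in docs])
          .add("count", lambda tokens: [len(t) for t in tokens])
          .add("evaluate", lambda counts: sum(counts) / len(counts)))
    print(wf.run(["a b c", "d e"]))      # dataset definition is the input
    print("\n".join(wf.provenance))      # presentation of intermediate results
```

Because the steps are stored as data, the same workflow object can be shared and re-run on a different dataset.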
Integration into Taverna 
• Web Services (SOAP and REST) 
• Command line tools (SH and SSH) 
• Beanshell scripts (can import Java libraries) 
• R (statistics) 
• Excel, CSV 
• Additional service types can be added 
through dedicated plug-ins
Taverna flavours 
• Workbench – local GUI client for Linux, 
Windows, OSX 
• Command line tool – run workflows from 
the command line 
• Server – Webapp with REST API and 
Java/Ruby client libs 
• Web-Wf-Designer – JavaScript version for 
designing workflows in a browser
Workbench
Webapp
Workflow registry
Client interfaces 
• Web service client: create a simple HTML 
form from a given web service description 
• Taverna client: create a simple HTML form 
from a given Taverna workflow description 
→ Integration into production and 
presentation environments via iframes
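
The form-generation idea can be sketched as follows. The real clients read a WSDL or a Taverna workflow description; here the description is reduced to a plain dictionary of input names and default values, which is an assumption made for illustration, as are the function name and the example URL.

```python
from html import escape


def form_from_description(action_url: str, inputs: dict[str, str]) -> str:
    """Build a simple HTML form with one text field per declared input."""
    fields = "\n".join(
        f'  <label>{escape(name)} '
        f'<input type="text" name="{escape(name)}" value="{escape(default)}"></label>'
        for name, default in inputs.items()
    )
    return (
        f'<form action="{escape(action_url)}" method="post">\n'
        f"{fields}\n"
        '  <input type="submit" value="Run">\n'
        "</form>"
    )


if __name__ == "__main__":
    # Hypothetical service with two inputs; the data is passed by URL.
    print(form_from_description("http://example.org/service",
                                {"input_url": "", "language": "eng"}))
```

The resulting markup is small enough to be embedded in another page through an iframe.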
WS-client
T2-client
Repositories 
• Accessible via web service API 
– Fedora Commons 
– WebDAV 
– PRImA
Architecture
Examples 
• Use case 1: OCR (IMPACT) 
• Start: Images (scanned documents) 
• Processing: OCR, NLP, Evaluation 
• Result: Full text, Entities, Sentiments
Examples 
• Use case 2: Preservation (SCAPE) 
• Start: Document collection preparation 
• Processing: Hadoop, Hive 
• Result: Statistics
Reading image metadata 
• Steps: Jp2PathCreator (find), then HadoopStreamingExiftoolRead 
• Input: JP2 files read from NAS, e.g. /NAS/Z119585409/00000001.jp2 
• Output: one line per page with page ID and value, e.g. Z119585409/00000001 2345 
• Data sizes: 1.4 GB and 1.2 GB 
• Run time: ~5 h + ~38 h = ~43 h 
• Collection: 60,000 books, 24 million pages
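
The path-creation step amounts to listing all page files below the storage root and deriving a page ID from each path. A minimal sketch, assuming the directory layout shown on the slide (one folder per book, one file per page); the function names are made up for this example, and the metadata reading itself (Exiftool via Hadoop Streaming) is not reproduced here.

```python
from pathlib import Path


def list_page_files(root: str, suffix: str = ".jp2") -> list[Path]:
    """Return all files with the given suffix below root, sorted by path."""
    return sorted(p for p in Path(root).rglob(f"*{suffix}") if p.is_file())


def page_id(path: str, root: str) -> str:
    """Turn <root>/<book>/<page>.<ext> into the page ID <book>/<page>."""
    relative = Path(path).relative_to(root)
    return relative.with_suffix("").as_posix()


if __name__ == "__main__":
    print(page_id("/NAS/Z119585409/00000001.jp2", "/NAS"))
```

Writing the path list to a text file first means the later Hadoop job can split the list across workers instead of scanning the storage itself.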
Sequence file creation 
• Steps: HtmlPathCreator (find), then SequenceFileCreator 
• Input: HTML files read from NAS, e.g. /NAS/Z119585409/00000707.html 
• Output: sequence file entries, e.g. Z119585409/00000707 
• Data sizes: 1.4 GB and 997 GB (uncompressed) 
• Run time: ~5 h + ~24 h = ~29 h 
• Collection: 60,000 books, 24 million pages
HTML parsing 
• Step: HadoopAvBlockWidthMapReduce (SequenceFile to Textfile) 
• Input: sequence file entries, e.g. Z119585409/00000001 
• Map: several (page ID, width) pairs per page, e.g. 
Z119585409/00000001 2100 
Z119585409/00000001 2200 
Z119585409/00000001 2300 
Z119585409/00000001 2400 
• Reduce: one average width per page, e.g. 
Z119585409/00000001 2250 
• Run time: ~6 h 
• Collection: 60,000 books, 24 million pages
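
The averaging step can be reproduced in plain Python to show what the map and reduce phases do. This is a single-process sketch, not the Hadoop job: it assumes the widths have already been extracted from the HTML, and the function names are chosen for this example only.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(pages: Iterable[tuple[str, list[int]]]) -> Iterator[tuple[str, int]]:
    """Emit one (page ID, width) pair per block found on a page."""
    for page, widths in pages:
        for width in widths:
            yield page, width


def reduce_phase(pairs: Iterable[tuple[str, int]]) -> dict[str, float]:
    """Group the pairs by page ID and average the widths of each group."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for page, width in pairs:
        grouped[page].append(width)
    return {page: sum(w) / len(w) for page, w in grouped.items()}


if __name__ == "__main__":
    # Sample values taken from the slide.
    pages = [("Z119585409/00000001", [2100, 2200, 2300, 2400]),
             ("Z119585409/00000002", [2100, 2200, 2300, 2400])]
    for page, average in reduce_phase(map_phase(pages)).items():
        print(page, average)
```

With the slide's sample values (2100, 2200, 2300, 2400) the average per page is 2250, which matches the reduce output shown.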
Analytic Queries 
HiveLoadExifData & HiveLoadHocrData 
CREATE TABLE htmlwidth 
(hid STRING, hwidth INT) 
CREATE TABLE jp2width 
(jid STRING, jwidth INT) 
htmlwidth 
hid                   hwidth 
Z119585409/00000001   1870 
Z119585409/00000002   2100 
Z119585409/00000003   2015 
Z119585409/00000004   1350 
Z119585409/00000005   1700 
jp2width 
jid                   jwidth 
Z119585409/00000001   2250 
Z119585409/00000002   2150 
Z119585409/00000003   2125 
Z119585409/00000004   2125 
Z119585409/00000005   2250 
• Run time: ~6 h 
• Collection: 60,000 books, 24 million pages
Analytic Queries 
HiveSelect 
jp2width 
jid                   jwidth 
Z119585409/00000001   2250 
Z119585409/00000002   2150 
Z119585409/00000003   2125 
Z119585409/00000004   2125 
Z119585409/00000005   2250 
htmlwidth 
hid                   hwidth 
Z119585409/00000001   1870 
Z119585409/00000002   2100 
Z119585409/00000003   2015 
Z119585409/00000004   1350 
Z119585409/00000005   1700 
jid                   jwidth   hwidth 
Z119585409/00000001   2250     1870 
Z119585409/00000002   2150     2100 
Z119585409/00000003   2125     2015 
Z119585409/00000004   2125     1350 
Z119585409/00000005   2250     1700 
• Run time: ~6 h 
• Collection: 60,000 books, 24 million pages 
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
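
The inner join from the HiveSelect step can be tried on the five sample rows without a Hadoop cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, which is an assumption made purely for illustration: the column types differ (TEXT/INTEGER instead of STRING/INT), and the function name is invented, but the join itself is the query from the slide.

```python
import sqlite3

JP2 = [("Z119585409/00000001", 2250), ("Z119585409/00000002", 2150),
       ("Z119585409/00000003", 2125), ("Z119585409/00000004", 2125),
       ("Z119585409/00000005", 2250)]
HTML = [("Z119585409/00000001", 1870), ("Z119585409/00000002", 2100),
        ("Z119585409/00000003", 2015), ("Z119585409/00000004", 1350),
        ("Z119585409/00000005", 1700)]


def join_widths(jp2_rows, html_rows):
    """Load both tables into an in-memory database and join them on page ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
    conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")
    conn.executemany("INSERT INTO jp2width VALUES (?, ?)", jp2_rows)
    conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", html_rows)
    rows = conn.execute(
        "SELECT jid, jwidth, hwidth FROM jp2width "
        "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in join_widths(JP2, HTML):
        print(*row)
```

Pages that appear in only one of the two tables are dropped by the inner join, so the result contains only pages for which both widths are known.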
Examples 
• Use case 3: Curation (GDZ) 
• Start: Get documents from repository 
• Processing: Enrichment 
(OCR, Entities, GeoNames) 
• Result: Online presentation
ROPEN 
(= Resource Oriented Presentation ENvironment)
Scalability 
• Multiple options: 
- Service parallelization 
- Cloud 
- Grid 
- Hadoop
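
Of these options, service parallelization is the simplest to picture: the same service is called for many pages at once instead of one after another. A minimal sketch using a thread pool follows; `process_page` is a hypothetical placeholder for a real web service call and only returns a dummy value here.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_id: str) -> tuple[str, int]:
    """Hypothetical placeholder for a call to a web service."""
    return page_id, len(page_id)


def process_all(page_ids: list[str], workers: int = 4) -> list[tuple[str, int]]:
    """Process pages concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_page, page_ids))


if __name__ == "__main__":
    ids = [f"Z119585409/{n:08d}" for n in range(1, 6)]
    for page, value in process_all(ids):
        print(page, value)
```

Threads suit this case because the time is spent waiting for remote services; for the collection sizes quoted in the examples, the Hadoop route shown earlier is the more realistic choice.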
Compatibility 
• Taverna ↔ UIMA 
• Taverna ↔ Galaxy 
• Taverna ↔ Kepler 
• Taverna ↔ WebLicht 
• Taverna ↔ SEASR
But… 
• Multi-layered approach increases 
complexity (debugging, maintenance) 
• Diverse set of endpoints (OS, CPU, etc.) 
• Multiple dependencies 
• Shared responsibilities 
• Authentication & Authorization 
• Error handling / Fail-over / Monitoring
Demo(s)
Discussion 
• Potential use cases in DH? 
• Tools/features to make available? 
• Questions, comments or remarks?
Thank you!

Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Collaborative Workflow Development and Experimentation in the Digital Humanities

  • 1. A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation
      • eHumanities Seminar 2012, University of Leipzig, 10-10-2012
      • Clemens Neudecker, KB @cneudecker
      • Zeki Mustafa Dogan, SUB-DL
      • Sven Schlarb, ÖNB @SvenSchlarb
      • Juan Garcés, GCDH @juan_garces
  • 2. Idea
      • Provide web-based versions of tools (web services)
      • Package web services, data and documentation into ready-to-run “components” (encapsulation)
      • Chain the components to create workflows via drag-and-drop operation
      • Share and use workflows to re-run experiments and to demonstrate results
  • 3. Background
      • High degree of diversity in research topics, but also in the tools and frameworks being used
      • Technical resources should be easy to use, well documented, accessible from anywhere
      • Prevent reinventing the wheel
  • 4. Requirements
      • Interoperability = connect different resources
      • Flexibility = easy to deploy and adapt
      • Modularity = allow different combinations of tools
      • Usability = simple to use for non-technical users
      • Re-usability = easy to share with others
      • Scalability = apt for large-scale processing
      • Sustainability = resources simple to preserve
      • Transparency = tools evaluated separately
      • Distributed development and deployment
  • 5. Interoperability Framework (IIF)
      • Modules:
          - Java Wrapper for command line tools
          - Web Services (incl. format converters)
          - Taverna Workflow Engine
          - Client interfaces
          - Repository connectors
  • 7. IIF Command Line Wrapper
      • Java project, builds using Maven2
      • Creates a web service project from a given tool description (XML)
      • Web service exposes SOAP & REST endpoints and Java API interface
      • Requirements: command line call, no direct user interaction
  • 12. IIF Web Services
      • Web services are described by a WSDL
      • Input/output data structures
      • Data is referenced by URL
      • Annotations
      • Default values
  • 13. REST
  • 14. SOAP
  • 15. IIF Workflows
      • What is a workflow? (Yahoo Pipes, etc.)
      • Different kinds of workflows: for a single command, an application, a chain of processes
      • Main benefit: Encapsulation, Reuse
      • Workflows as “components”: include link to WS endpoint, sample input data and documentation = ready-to-use resource
      • Web 2.0 workflow registry: myExperiment
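The idea of a workflow as a chain of self-describing components can be sketched in a few lines. This is only an illustration, not the IIF or Taverna API: the `Component` class, the `run_workflow` helper, and the `example.org` endpoints are all hypothetical, and local functions stand in for the remote web service calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Component:
    """Hypothetical 'component': endpoint link, sample input and documentation."""
    name: str
    endpoint: str          # placeholder URL, not a real service
    sample_input: str
    documentation: str
    run: Callable[[str], str]  # local stand-in for calling the web service


def run_workflow(components: List[Component], data: str) -> str:
    """Feed the output of each component into the next one."""
    for component in components:
        data = component.run(data)
    return data


normalise = Component(
    name="normalise",
    endpoint="http://example.org/services/normalise",
    sample_input="  Some   TEXT ",
    documentation="Collapses whitespace and lower-cases the text.",
    run=lambda text: " ".join(text.split()).lower(),
)
tokenise = Component(
    name="tokenise",
    endpoint="http://example.org/services/tokenise",
    sample_input="some text",
    documentation="Separates tokens with a pipe character.",
    run=lambda text: "|".join(text.split(" ")),
)

# Because every component ships its own sample input, the chain is runnable as-is.
result = run_workflow([normalise, tokenise], normalise.sample_input)
print(result)
```

Bundling the sample input and documentation with the endpoint is what makes such a component a ready-to-use resource: anyone who receives the workflow can run it without further setup.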
  • 17. Why workflows?
      • “In-silico experimentation”
      • Good structuring of experiment setup:
          - Challenge/Research question
          - Dataset definition
          - Processing with algorithms
          - Evaluation/Provenance
          - Presentation of results
      • All this can be modelled into a workflow
  • 18. Integration into Taverna
      • Web Services (SOAP and REST)
      • Command line tools (SH and SSH)
      • Beanshells (can import Java libraries)
      • R (statistics)
      • Excel, CSV
      • Additional service types can be added through dedicated plug-ins
  • 19. Taverna flavours
      • Workbench – local GUI client for Linux, Windows, OSX
      • Command line tool – run workflows from the command line
      • Server – Webapp with REST API and Java/Ruby client libs
      • Web-Wf-Designer – Javascript version for designing workflows in a browser
  • 23. Client interfaces
      • Web service client: create a simple HTML form from a given web service description
      • Taverna client: create a simple HTML form from a given Taverna workflow description
      • → Integration into production and presentation environments via iframes
  • 26. Repositories
      • Accessible via web service API
          - Fedora Commons
          - WebDAV
          - PRImA
  • 28. Examples
      • Use case 1: OCR (IMPACT)
      • Start: Images (scanned documents)
      • Processing: OCR, NLP, Evaluation
      • Result: Full text, Entities, Sentiments
  • 33. Examples
      • Use case 2: Preservation (SCAPE)
      • Start: Document collection preparation
      • Processing: Hadoop, Hive
      • Result: Statistics
  • 36. Reading image metadata
      • Workflow steps: Jp2PathCreator, HadoopStreamingExiftoolRead
      • Jp2PathCreator: find lists the JP2 files (reading files from NAS), e.g.
          /NAS/Z119585409/00000001.jp2
          /NAS/Z119585409/00000002.jp2
          /NAS/Z119585409/00000003.jp2
          …
      • HadoopStreamingExiftoolRead: one line per page with page identifier and value, e.g.
          Z119585409/00000001 2345
          Z119585409/00000002 2340
          Z119585409/00000003 2543
          …
      • Data sizes shown: 1.4 GB, 1.2 GB
      • Runtime: ~ 5 h + ~ 38 h = ~ 43 h
      • Collection: 60,000 books, 24 million pages
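The path-listing step of slide 36 (the `find` call behind Jp2PathCreator) can be approximated as follows. This is a minimal sketch, not the workflow component itself: the `list_files` helper is a hypothetical name, and a temporary directory stands in for the NAS.

```python
import tempfile
from pathlib import Path


def list_files(root: str, suffix: str) -> list:
    """Recursively collect all files below root that end in suffix, sorted by path."""
    return sorted(str(p) for p in Path(root).rglob(f"*{suffix}") if p.is_file())


# Build a tiny stand-in for the NAS layout: one book folder with two page images.
demo_root = tempfile.mkdtemp()
book_dir = Path(demo_root) / "Z119585409"
book_dir.mkdir()
for file_name in ("00000001.jp2", "00000002.jp2", "00000001.html"):
    (book_dir / file_name).write_bytes(b"")

jp2_paths = list_files(demo_root, ".jp2")
for path in jp2_paths:
    print(path)
```

The resulting list, one path per line, is the kind of plain-text input that a streaming job can then process line by line.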
  • 38. Sequence file creation
      • Workflow steps: HtmlPathCreator, SequenceFileCreator
      • HtmlPathCreator: find lists the HTML files (reading files from NAS), e.g.
          /NAS/Z119585409/00000707.html
          /NAS/Z119585409/00000708.html
          /NAS/Z119585409/00000709.html
          …
      • SequenceFileCreator: entries keyed by page identifier, e.g.
          Z119585409/00000707
          Z119585409/00000708
          Z119585409/00000709
          …
      • Data sizes shown: 1.4 GB, 997 GB (uncompressed)
      • Runtime: ~ 5 h + ~ 24 h = ~ 29 h
      • Collection: 60,000 books, 24 million pages
  • 40. HTML parsing
      • Workflow step: HadoopAvBlockWidthMapReduce
      • Input (SequenceFile): pages Z119585409/00000001 to Z119585409/00000005, ...
      • Map output: several width values per page, e.g.
          Z119585409/00000001 2100
          Z119585409/00000001 2200
          Z119585409/00000001 2300
          Z119585409/00000001 2400
          …
      • Reduce output (Textfile): one value per page, e.g.
          Z119585409/00000001 2250
          Z119585409/00000002 2250
          Z119585409/00000003 2250
          Z119585409/00000004 2250
          Z119585409/00000005 2250
      • Runtime: ~ 6 h
      • Collection: 60,000 books, 24 million pages
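The map and reduce steps of slide 40 can be mimicked in plain Python. This sketch assumes, going by the name HadoopAvBlockWidthMapReduce and the sample values, that the job averages the block widths of each page; the function names `map_page` and `reduce_widths` are hypothetical, and the real job runs on Hadoop rather than in a single process.

```python
from collections import defaultdict
from statistics import mean


def map_page(page_id, block_widths):
    """Map step: emit one (page_id, width) pair per text block on the page."""
    for width in block_widths:
        yield page_id, width


def reduce_widths(pairs):
    """Reduce step: group the pairs by page id and average the widths."""
    grouped = defaultdict(list)
    for page_id, width in pairs:
        grouped[page_id].append(width)
    return {page_id: mean(widths) for page_id, widths in grouped.items()}


# Sample values taken from the slide: four block widths per page.
pages = {
    "Z119585409/00000001": [2100, 2200, 2300, 2400],
    "Z119585409/00000002": [2100, 2200, 2300, 2400],
}
pairs = [pair for page_id, widths in pages.items() for pair in map_page(page_id, widths)]
averages = reduce_widths(pairs)
for page_id, average in sorted(averages.items()):
    print(page_id, average)
```

With the slide's sample widths of 2100, 2200, 2300 and 2400, each page averages to 2250, which matches the reduce output shown on the slide.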
  • 42. Analytic Queries
      • Workflow steps: HiveLoadExifData & HiveLoadHocrData
      • CREATE TABLE htmlwidth (hid STRING, hwidth INT)
          hid                  hwidth
          Z119585409/00000001  1870
          Z119585409/00000002  2100
          Z119585409/00000003  2015
          Z119585409/00000004  1350
          Z119585409/00000005  1700
      • CREATE TABLE jp2width (jid STRING, jwidth INT)
          jid                  jwidth
          Z119585409/00000001  2250
          Z119585409/00000002  2150
          Z119585409/00000003  2125
          Z119585409/00000004  2125
          Z119585409/00000005  2250
      • Runtime: ~ 6 h
      • Collection: 60,000 books, 24 million pages
  • 43. Analytic Queries
      • Workflow step: HiveSelect
      • Query: select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
      • Input table jp2width:
          jid                  jwidth
          Z119585409/00000001  2250
          Z119585409/00000002  2150
          Z119585409/00000003  2125
          Z119585409/00000004  2125
          Z119585409/00000005  2250
      • Input table htmlwidth:
          hid                  hwidth
          Z119585409/00000001  1870
          Z119585409/00000002  2100
          Z119585409/00000003  2015
          Z119585409/00000004  1350
          Z119585409/00000005  1700
      • Result:
          jid                  jwidth  hwidth
          Z119585409/00000001  2250    1870
          Z119585409/00000002  2150    2100
          Z119585409/00000003  2125    2015
          Z119585409/00000004  2125    1350
          Z119585409/00000005  2250    1700
      • Runtime: ~ 6 h
      • Collection: 60,000 books, 24 million pages
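The join on slide 43 can be tried out locally with the sample rows from the slides. This is a sketch that uses an in-memory SQLite database as a stand-in for Hive, so the column types (TEXT, INTEGER) are SQLite's rather than Hive's STRING and INT; the added ORDER BY is only there to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

# Sample rows copied from the slides.
conn.executemany(
    "INSERT INTO jp2width VALUES (?, ?)",
    [
        ("Z119585409/00000001", 2250),
        ("Z119585409/00000002", 2150),
        ("Z119585409/00000003", 2125),
        ("Z119585409/00000004", 2125),
        ("Z119585409/00000005", 2250),
    ],
)
conn.executemany(
    "INSERT INTO htmlwidth VALUES (?, ?)",
    [
        ("Z119585409/00000001", 1870),
        ("Z119585409/00000002", 2100),
        ("Z119585409/00000003", 2015),
        ("Z119585409/00000004", 1350),
        ("Z119585409/00000005", 1700),
    ],
)

# Same join as on the slide, one row per page with both widths side by side.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
).fetchall()
for jid, jwidth, hwidth in rows:
    print(jid, jwidth, hwidth)
```

Having both widths in one row per page is what makes follow-up comparisons between the image metadata and the HTML output straightforward.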
  • 44. Examples
      • Use case 3: Curation (GDZ)
      • Start: Get documents from repository
      • Processing: Enrichment (OCR, Entities, GeoNames)
      • Result: Online presentation
  • 52. ROPEN (= Resource Oriented Presentation ENvironment)
  • 53. Scalability
      • Multiple options:
          - Service parallelization
          - Cloud
          - Grid
          - Hadoop
  • 54. Compatibility
      • Taverna ↔ UIMA
      • Taverna ↔ Galaxy
      • Taverna ↔ Kepler
      • Taverna ↔ Weblicht
      • Taverna ↔ Seasr
  • 55. But…
      • Multi-layered approach increases complexity (debugging, maintenance)
      • Diverse set of endpoints (OS, CPU, etc.)
      • Multiple dependencies
      • Shared responsibilities
      • Authentication & Authorization
      • Error handling / Fail-over / Monitoring
  • 57. Discussion
      • Potential/use cases in DH?
      • Tools/features to make available?
      • Questions, comments or remarks?