IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 
An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
Clemens Neudecker, KB National Library of the Netherlands 
International workshop on Historical Document Imaging and Processing, Beijing, 17 September 2011
Background
• IMPACT – Improving Access to Text (2008–2011)
Large-scale integrating research project, funded by the EC
Main objectives:
- Innovate OCR technology
- Capacity building in mass-digitisation
• From a technical perspective:
More than 20 software toolkits for solving specific issues
Prototyping new algorithms
“One ring to rule them all…”
• IMPACT Interoperability Framework (IIF)
Main requirements
Behavioural:
• Minimize integration effort
• Minimize deployment effort
• Maximize usability
• Maximize scalability
Functional:
• Modular
• Transparent
• Expandable
• Open source
• Platform independent
Architecture
• IMPACT Interoperability Framework: Technologies
- Java 6
- Generic Web Service Wrapper
- Apache Ant/Maven
- Apache Tomcat/httpd
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine
• IMPACT Interoperability Framework: Dataset
- more than 500,000 images from digital libraries
- more than 25,000 ground truth transcriptions
So how does it work? 
1. Digitisation/OCR challenges registered and tagged in database 
2. Database contains 99.99% correct results: the “ground truth”
3. Researcher develops new method to tackle a problem
4. Research prototype is wrapped as a web service
5. Web service is integrated as a workflow module 
6. Workflow module can be evaluated, combined, etc.
Framework integration
• Easy-to-use generic command-line wrapper (open source)
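The idea of a generic command-line wrapper can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IIF wrapper itself: `make_service`, `ToolResult` and the upper-casing one-liner standing in for a research prototype are all hypothetical names, and the sketch exposes the tool as a local function rather than as a web service.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Outcome of one wrapped tool invocation."""
    returncode: int
    stdout: str
    stderr: str


def make_service(command_template):
    """Turn a command template into a callable 'service' function.

    command_template is a list of arguments; placeholders such as
    "{text}" are filled in from the keyword arguments of each call.
    """
    def service(**params):
        argv = [part.format(**params) for part in command_template]
        done = subprocess.run(argv, capture_output=True, text=True, check=False)
        return ToolResult(done.returncode, done.stdout, done.stderr)
    return service


# Hypothetical stand-in for a research prototype: a Python one-liner
# that upper-cases its argument. A real tool binary would go here.
upper = make_service([sys.executable, "-c",
                      "import sys; print(sys.argv[1].upper())", "{text}"])

result = upper(text="impact")
print(result.returncode, result.stdout.strip())
```

The wrapper needs no knowledge of the tool beyond its command line, which is what keeps the integration effort low for each new prototype.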
Workflow development
• OCR workflow = data pipeline
• Building blocks = processing steps (nodes)
• Integration = interaction between nodes (mashup)
• Collaboration with
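The view of an OCR workflow as a data pipeline of nodes can be illustrated with a small Python sketch. The three steps (`binarise`, `recognise`, `normalise`) and the dictionary used to represent a page are assumptions made for illustration; they do not perform real image processing or OCR and are not taken from the IMPACT toolkits.

```python
from functools import reduce


def binarise(page):
    """Hypothetical step: mark the page image as binarised."""
    return {**page, "binarised": True}


def recognise(page):
    """Hypothetical step: attach placeholder text instead of real OCR."""
    return {**page, "text": "placeholder   text for " + page["id"]}


def normalise(page):
    """Hypothetical step: collapse runs of whitespace in the text."""
    return {**page, "text": " ".join(page["text"].split())}


def run_workflow(steps, page):
    """Pass the page through each node in order; each output feeds the next."""
    return reduce(lambda data, step: step(data), steps, page)


workflow = [binarise, recognise, normalise]
result = run_workflow(workflow, {"id": "page-0001"})
print(result)
```

Because every node takes and returns the same kind of object, nodes can be reordered, swapped or combined without changing the others.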
Workflow management
• Web 2.0-style registry: myExperiment
• Local client: Taverna Workbench
• Web client: project website
Compute cluster
• Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes
• Main effects: process parallelization, load distribution, failover
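The three effects named above (parallel processing, load distribution and failover) can be shown in a toy dispatcher. This is a sketch under stated assumptions and not how Apache Synapse is configured: the workers are plain Python functions, `WorkerDown`, `make_worker` and `dispatch` are hypothetical names, and the node and job names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


class WorkerDown(Exception):
    """Raised by a worker that cannot take the job."""


def make_worker(name, healthy=True):
    """Build a hypothetical worker node as a plain function."""
    def worker(job):
        if not healthy:
            raise WorkerDown(name)
        return f"{job} done by {name}"
    return worker


def dispatch(job, index, workers):
    """Try workers starting at a round-robin offset; fail over on error."""
    for attempt in range(len(workers)):
        worker = workers[(index + attempt) % len(workers)]
        try:
            return worker(job)
        except WorkerDown:
            continue  # fail over to the next node
    raise RuntimeError(f"no worker available for {job}")


workers = [make_worker("node-a"),
           make_worker("node-b", healthy=False),
           make_worker("node-c")]
jobs = ["page-1", "page-2", "page-3", "page-4"]

# Jobs run in parallel threads; the job index spreads them over the nodes.
with ThreadPoolExecutor(max_workers=len(workers)) as pool:
    results = list(pool.map(lambda pair: dispatch(pair[1], pair[0], workers),
                            enumerate(jobs)))
print(results)
```

In this run the job assigned to the unhealthy `node-b` is picked up by `node-c`, so the caller never sees the failure.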
Dataset
• Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities
Evaluation features
• Text-based comparison of result with ground truth, using the Levenshtein distance method
• Layout-based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) framework
• Example:
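As a minimal sketch of the text-based comparison (not the evaluation tool used in IMPACT), the Levenshtein distance between an OCR result and its ground truth can be computed with the standard dynamic-programming recurrence. The `character_accuracy` helper and its formula, one minus distance divided by ground-truth length, are an assumption added for illustration.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Assumed metric: share of ground-truth characters left unedited."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))


ocr_output = "Hiftorical"      # long s misread as f
ground_truth = "Historical"
print(levenshtein(ocr_output, ground_truth))
print(character_accuracy(ocr_output, ground_truth))
```

Only two rows of the distance table are kept in memory at a time, which matters when whole pages rather than single words are compared.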
Community
• Web 2.0-style workflow registry
• Community of experts
• Sharing of resources
• Knowledge exchange
• A central meeting point for users and researchers
Summary 
Benefits: 
- Availability of resources (images, ground truth and tools) to the international research community
- A common baseline for transparent evaluation and comparison 
- Sharing of results and know-how 
- Enable new research through scalable computing 
- Consolidation of support and maintenance 
Thank you! 
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

agINFRA 5BOAC Presentation
agINFRA 5BOAC PresentationagINFRA 5BOAC Presentation
agINFRA 5BOAC PresentationBenjamin Cave
 
Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage Europeana Newspapers
 
Up2U Worskshop at the TNC18 conference
Up2U Worskshop at the TNC18 conferenceUp2U Worskshop at the TNC18 conference
Up2U Worskshop at the TNC18 conferenceUp2Universe
 
17. kb.nederlab.20150324
17. kb.nederlab.2015032417. kb.nederlab.20150324
17. kb.nederlab.20150324ingeangevaare
 
European Open Science Cloud: History and Status
European Open Science Cloud: History and StatusEuropean Open Science Cloud: History and Status
European Open Science Cloud: History and StatusMatthew Dovey
 
Gergely Sipos (EGI): Exploiting scientific data in the international context ...
Gergely Sipos (EGI): Exploiting scientific data in the international context ...Gergely Sipos (EGI): Exploiting scientific data in the international context ...
Gergely Sipos (EGI): Exploiting scientific data in the international context ...Gergely Sipos
 
The value of EOSC from a user perspective: Key themes and actions from Day 1
The value of EOSC from a user perspective: Key themes and actions from Day 1The value of EOSC from a user perspective: Key themes and actions from Day 1
The value of EOSC from a user perspective: Key themes and actions from Day 1EOSCpilot .eu
 
RNP Cloud Infrastructure model, services and challenges
RNP Cloud Infrastructure model, services and challengesRNP Cloud Infrastructure model, services and challenges
RNP Cloud Infrastructure model, services and challengesEUBrasilCloudFORUM .
 
Up2U Workshop at TNC 2018-introduction
Up2U Workshop at TNC 2018-introductionUp2U Workshop at TNC 2018-introduction
Up2U Workshop at TNC 2018-introductionUp2Universe
 
FUTEBOL - Federated Union of Telecommunications Research Facilities for an EU...
FUTEBOL - Federated Union of Telecommunications Research Facilities for an EU...FUTEBOL - Federated Union of Telecommunications Research Facilities for an EU...
FUTEBOL - Federated Union of Telecommunications Research Facilities for an EU...EUBrasilCloudFORUM .
 
Benefits of collaborative EU digitization projects
Benefits of collaborative EU digitization projectsBenefits of collaborative EU digitization projects
Benefits of collaborative EU digitization projectsTrilce Navarrete
 
SEMANCO poster at ESWC 2014
SEMANCO poster at ESWC 2014SEMANCO poster at ESWC 2014
SEMANCO poster at ESWC 2014Álvaro Sicilia
 
16,40 16,55 h. open aire eblida-naple conference
16,40 16,55 h. open aire eblida-naple conference16,40 16,55 h. open aire eblida-naple conference
16,40 16,55 h. open aire eblida-naple conferenceFESABID
 
Progetto EOSC-Pillar (Fulvio Galeazzi, GARR)
Progetto EOSC-Pillar (Fulvio Galeazzi, GARR)Progetto EOSC-Pillar (Fulvio Galeazzi, GARR)
Progetto EOSC-Pillar (Fulvio Galeazzi, GARR)Data Driven Innovation
 
ENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project OverviewENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project OverviewEuropeana Newspapers
 

Was ist angesagt? (20)

Metadata
MetadataMetadata
Metadata
 
agINFRA 5BOAC Presentation
agINFRA 5BOAC PresentationagINFRA 5BOAC Presentation
agINFRA 5BOAC Presentation
 
Europeana Newspapers Project
Europeana Newspapers ProjectEuropeana Newspapers Project
Europeana Newspapers Project
 
Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage
 
Up2U Worskshop at the TNC18 conference
Up2U Worskshop at the TNC18 conferenceUp2U Worskshop at the TNC18 conference
Up2U Worskshop at the TNC18 conference
 
17. kb.nederlab.20150324
17. kb.nederlab.2015032417. kb.nederlab.20150324
17. kb.nederlab.20150324
 
European Open Science Cloud: History and Status
European Open Science Cloud: History and StatusEuropean Open Science Cloud: History and Status
European Open Science Cloud: History and Status
 
Gergely Sipos (EGI): Exploiting scientific data in the international context ...
Gergely Sipos (EGI): Exploiting scientific data in the international context ...Gergely Sipos (EGI): Exploiting scientific data in the international context ...
Gergely Sipos (EGI): Exploiting scientific data in the international context ...
 
The value of EOSC from a user perspective: Key themes and actions from Day 1
The value of EOSC from a user perspective: Key themes and actions from Day 1The value of EOSC from a user perspective: Key themes and actions from Day 1
The value of EOSC from a user perspective: Key themes and actions from Day 1
 
RNP Cloud Infrastructure model, services and challenges
RNP Cloud Infrastructure model, services and challengesRNP Cloud Infrastructure model, services and challenges
RNP Cloud Infrastructure model, services and challenges
 
Up2U Workshop at TNC 2018-introduction
Up2U Workshop at TNC 2018-introductionUp2U Workshop at TNC 2018-introduction
Up2U Workshop at TNC 2018-introduction
 
FUTEBOL - Federated Union of Telecommunications Research Facilities for an EU...
FUTEBOL - Federated Union of Telecommunications Research Facilities for an EU...FUTEBOL - Federated Union of Telecommunications Research Facilities for an EU...
FUTEBOL - Federated Union of Telecommunications Research Facilities for an EU...
 
Benefits of collaborative EU digitization projects
Benefits of collaborative EU digitization projectsBenefits of collaborative EU digitization projects
Benefits of collaborative EU digitization projects
 
SEMANCO poster at ESWC 2014
SEMANCO poster at ESWC 2014SEMANCO poster at ESWC 2014
SEMANCO poster at ESWC 2014
 
16,40 16,55 h. open aire eblida-naple conference
16,40 16,55 h. open aire eblida-naple conference16,40 16,55 h. open aire eblida-naple conference
16,40 16,55 h. open aire eblida-naple conference
 
ExPaNDS
ExPaNDSExPaNDS
ExPaNDS
 
Opengovinteligence Leaflet
Opengovinteligence Leaflet  Opengovinteligence Leaflet
Opengovinteligence Leaflet
 
Progetto EOSC-Pillar (Fulvio Galeazzi, GARR)
Progetto EOSC-Pillar (Fulvio Galeazzi, GARR)Progetto EOSC-Pillar (Fulvio Galeazzi, GARR)
Progetto EOSC-Pillar (Fulvio Galeazzi, GARR)
 
EOSC-Pillar
EOSC-PillarEOSC-Pillar
EOSC-Pillar
 
ENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project OverviewENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project Overview
 

Andere mochten auch

INSERTAR ELEMENTOS DE FORMULARIO
INSERTAR ELEMENTOS DE FORMULARIOINSERTAR ELEMENTOS DE FORMULARIO
INSERTAR ELEMENTOS DE FORMULARIOinformatica97
 
типы химических связей
типы химических связейтипы химических связей
типы химических связейOlga Pishchik
 
0. ex physi
0. ex physi0. ex physi
0. ex physidelwong
 
Digitalisierte Zeitungen und Digital Humanities - Probleme und Chancen
Digitalisierte Zeitungen und Digital Humanities - Probleme und ChancenDigitalisierte Zeitungen und Digital Humanities - Probleme und Chancen
Digitalisierte Zeitungen und Digital Humanities - Probleme und Chancencneudecker
 
Search Technologies for Digital Libraries
Search Technologies for Digital LibrariesSearch Technologies for Digital Libraries
Search Technologies for Digital Librariescneudecker
 
Formularios access 2010
Formularios access 2010Formularios access 2010
Formularios access 2010informatica97
 
VideoBoard Digital Signage
VideoBoard Digital SignageVideoBoard Digital Signage
VideoBoard Digital SignageNino Torres
 
User experience presentation
User experience presentationUser experience presentation
User experience presentationbluebottlebiz
 
Team+2 energyt+storage+system final_2013 spring
Team+2 energyt+storage+system final_2013 springTeam+2 energyt+storage+system final_2013 spring
Team+2 energyt+storage+system final_2013 springJaeho Jung
 
Mamiferos por karen burbano
Mamiferos por karen burbanoMamiferos por karen burbano
Mamiferos por karen burbanoKarEn Bl
 
MAKALAH MEKASNIME DAN KONFLIK DALAM APBN
MAKALAH MEKASNIME DAN KONFLIK DALAM APBNMAKALAH MEKASNIME DAN KONFLIK DALAM APBN
MAKALAH MEKASNIME DAN KONFLIK DALAM APBNSolala Halawa
 
MAKALAH TEORI EKOLOGI ADMINISTRASI
MAKALAH TEORI EKOLOGI ADMINISTRASIMAKALAH TEORI EKOLOGI ADMINISTRASI
MAKALAH TEORI EKOLOGI ADMINISTRASISolala Halawa
 

Andere mochten auch (16)

INSERTAR ELEMENTOS DE FORMULARIO
INSERTAR ELEMENTOS DE FORMULARIOINSERTAR ELEMENTOS DE FORMULARIO
INSERTAR ELEMENTOS DE FORMULARIO
 
Teaching powerpoint
Teaching powerpointTeaching powerpoint
Teaching powerpoint
 
типы химических связей
типы химических связейтипы химических связей
типы химических связей
 
Deportes.
Deportes.Deportes.
Deportes.
 
0. ex physi
0. ex physi0. ex physi
0. ex physi
 
Digitalisierte Zeitungen und Digital Humanities - Probleme und Chancen
Digitalisierte Zeitungen und Digital Humanities - Probleme und ChancenDigitalisierte Zeitungen und Digital Humanities - Probleme und Chancen
Digitalisierte Zeitungen und Digital Humanities - Probleme und Chancen
 
Search Technologies for Digital Libraries
Search Technologies for Digital LibrariesSearch Technologies for Digital Libraries
Search Technologies for Digital Libraries
 
Formularios access 2010
Formularios access 2010Formularios access 2010
Formularios access 2010
 
VideoBoard Digital Signage
VideoBoard Digital SignageVideoBoard Digital Signage
VideoBoard Digital Signage
 
User experience presentation
User experience presentationUser experience presentation
User experience presentation
 
User experience
User experienceUser experience
User experience
 
Team+2 energyt+storage+system final_2013 spring
Team+2 energyt+storage+system final_2013 springTeam+2 energyt+storage+system final_2013 spring
Team+2 energyt+storage+system final_2013 spring
 
Mamiferos por karen burbano
Mamiferos por karen burbanoMamiferos por karen burbano
Mamiferos por karen burbano
 
MAKALAH MEKASNIME DAN KONFLIK DALAM APBN
MAKALAH MEKASNIME DAN KONFLIK DALAM APBNMAKALAH MEKASNIME DAN KONFLIK DALAM APBN
MAKALAH MEKASNIME DAN KONFLIK DALAM APBN
 
MAKALAH TEORI EKOLOGI ADMINISTRASI
MAKALAH TEORI EKOLOGI ADMINISTRASIMAKALAH TEORI EKOLOGI ADMINISTRASI
MAKALAH TEORI EKOLOGI ADMINISTRASI
 
Construction claims
Construction claimsConstruction claims
Construction claims
 

Ähnlich wie An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis

IMPACT HPC Cloud Day
IMPACT HPC Cloud DayIMPACT HPC Cloud Day
IMPACT HPC Cloud Daycneudecker
 
Centre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerCentre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerBiblioteca Nacional de España
 
Workflow Development for OCR (and beyond)
Workflow Development for OCR (and beyond)Workflow Development for OCR (and beyond)
Workflow Development for OCR (and beyond)cneudecker
 
IMPACT Interoperability and Evaluation Framework. Clemens Neudecker
IMPACT Interoperability and Evaluation Framework. Clemens NeudeckerIMPACT Interoperability and Evaluation Framework. Clemens Neudecker
IMPACT Interoperability and Evaluation Framework. Clemens NeudeckerBiblioteca Nacional de España
 
The Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesThe Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesMichael Day
 
IMPACT: Building a Centre of Competence for Digitisation
IMPACT: Building a Centre of Competence for DigitisationIMPACT: Building a Centre of Competence for Digitisation
IMPACT: Building a Centre of Competence for DigitisationIMPACT Centre of Competence
 
ECLAP White paper, social network for Cultural Heritage on Peforming arts
ECLAP White paper, social network for Cultural Heritage on Peforming artsECLAP White paper, social network for Cultural Heritage on Peforming arts
ECLAP White paper, social network for Cultural Heritage on Peforming artsPaolo Nesi
 
Science Demonstrator Session: Social and Earth Sciences
Science Demonstrator Session: Social and Earth SciencesScience Demonstrator Session: Social and Earth Sciences
Science Demonstrator Session: Social and Earth SciencesEOSCpilot .eu
 
Ecloud copenhagen-130625074823-phpapp01
Ecloud copenhagen-130625074823-phpapp01Ecloud copenhagen-130625074823-phpapp01
Ecloud copenhagen-130625074823-phpapp01The European Library
 
Europeana Cloud Work Package 1: Assessing Researchers' Needs in the Cloud
Europeana Cloud Work Package 1: Assessing Researchers' Needs in the CloudEuropeana Cloud Work Package 1: Assessing Researchers' Needs in the Cloud
Europeana Cloud Work Package 1: Assessing Researchers' Needs in the CloudTU Delft, Netherlands
 
Europeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introductionEuropeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introductionEuropeana Newspapers
 
IMPACT Final Conference - Centre of Competence introduction - Hildelies Balk
IMPACT Final Conference - Centre of Competence introduction - Hildelies BalkIMPACT Final Conference - Centre of Competence introduction - Hildelies Balk
IMPACT Final Conference - Centre of Competence introduction - Hildelies BalkIMPACT Centre of Competence
 
Europeana Newspapers LIBER2013 Workshop intro
Europeana Newspapers LIBER2013 Workshop introEuropeana Newspapers LIBER2013 Workshop intro
Europeana Newspapers LIBER2013 Workshop introEuropeana Newspapers
 
OCR challenges in historic documents and the contribution of IMPACT
OCR challenges in historic documents and the contribution of IMPACTOCR challenges in historic documents and the contribution of IMPACT
OCR challenges in historic documents and the contribution of IMPACTcneudecker
 
Science Demonstrator Session: Physics and Astrophysics
Science Demonstrator Session: Physics and AstrophysicsScience Demonstrator Session: Physics and Astrophysics
Science Demonstrator Session: Physics and AstrophysicsEOSCpilot .eu
 
Overview of the Europeana Newspapers Project
Overview of the Europeana Newspapers ProjectOverview of the Europeana Newspapers Project
Overview of the Europeana Newspapers ProjectEuropeana Newspapers
 
PaNOSC: EOSC for Photon and Neutron Facilities Users
PaNOSC: EOSC for Photon and Neutron Facilities Users PaNOSC: EOSC for Photon and Neutron Facilities Users
PaNOSC: EOSC for Photon and Neutron Facilities Users EOSC-hub project
 
eROSA Policy WS2: European Open Science Cloud (EOSC) - The Perspective of e-I...
eROSA Policy WS2: European Open Science Cloud (EOSC) - The Perspective of e-I...eROSA Policy WS2: European Open Science Cloud (EOSC) - The Perspective of e-I...
eROSA Policy WS2: European Open Science Cloud (EOSC) - The Perspective of e-I...e-ROSA
 

Ähnlich wie An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis (20)

IMPACT HPC Cloud Day
IMPACT HPC Cloud DayIMPACT HPC Cloud Day
IMPACT HPC Cloud Day
 
Centre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerCentre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens Neudecker
 
Workflow Development for OCR (and beyond)
Workflow Development for OCR (and beyond)Workflow Development for OCR (and beyond)
Workflow Development for OCR (and beyond)
 
IMPACT Interoperability and Evaluation Framework. Clemens Neudecker
IMPACT Interoperability and Evaluation Framework. Clemens NeudeckerIMPACT Interoperability and Evaluation Framework. Clemens Neudecker
IMPACT Interoperability and Evaluation Framework. Clemens Neudecker
 
The Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesThe Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiatives
 
IMPACT: Building a Centre of Competence for Digitisation
IMPACT: Building a Centre of Competence for DigitisationIMPACT: Building a Centre of Competence for Digitisation
IMPACT: Building a Centre of Competence for Digitisation
 
ECLAP White paper, social network for Cultural Heritage on Peforming arts
ECLAP White paper, social network for Cultural Heritage on Peforming artsECLAP White paper, social network for Cultural Heritage on Peforming arts
ECLAP White paper, social network for Cultural Heritage on Peforming arts
 
Bne impact co_c
Bne impact co_cBne impact co_c
Bne impact co_c
 
Science Demonstrator Session: Social and Earth Sciences
Science Demonstrator Session: Social and Earth SciencesScience Demonstrator Session: Social and Earth Sciences
Science Demonstrator Session: Social and Earth Sciences
 
Ecloud copenhagen-130625074823-phpapp01
Ecloud copenhagen-130625074823-phpapp01Ecloud copenhagen-130625074823-phpapp01
Ecloud copenhagen-130625074823-phpapp01
 
Europeana Cloud Work Package 1: Assessing Researchers' Needs in the Cloud
Europeana Cloud Work Package 1: Assessing Researchers' Needs in the CloudEuropeana Cloud Work Package 1: Assessing Researchers' Needs in the Cloud
Europeana Cloud Work Package 1: Assessing Researchers' Needs in the Cloud
 
Europeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introductionEuropeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introduction
 
IMPACT Final Conference - Muehlberger - FEP
IMPACT Final Conference - Muehlberger - FEPIMPACT Final Conference - Muehlberger - FEP
IMPACT Final Conference - Muehlberger - FEP
 
IMPACT Final Conference - Centre of Competence introduction - Hildelies Balk
IMPACT Final Conference - Centre of Competence introduction - Hildelies BalkIMPACT Final Conference - Centre of Competence introduction - Hildelies Balk
IMPACT Final Conference - Centre of Competence introduction - Hildelies Balk
 
Europeana Newspapers LIBER2013 Workshop intro
Europeana Newspapers LIBER2013 Workshop introEuropeana Newspapers LIBER2013 Workshop intro
Europeana Newspapers LIBER2013 Workshop intro
 
OCR challenges in historic documents and the contribution of IMPACT
OCR challenges in historic documents and the contribution of IMPACTOCR challenges in historic documents and the contribution of IMPACT
OCR challenges in historic documents and the contribution of IMPACT
 
Science Demonstrator Session: Physics and Astrophysics
Science Demonstrator Session: Physics and AstrophysicsScience Demonstrator Session: Physics and Astrophysics
Science Demonstrator Session: Physics and Astrophysics
 
Overview of the Europeana Newspapers Project
Overview of the Europeana Newspapers ProjectOverview of the Europeana Newspapers Project
Overview of the Europeana Newspapers Project
 
PaNOSC: EOSC for Photon and Neutron Facilities Users
PaNOSC: EOSC for Photon and Neutron Facilities Users PaNOSC: EOSC for Photon and Neutron Facilities Users
PaNOSC: EOSC for Photon and Neutron Facilities Users
 
eROSA Policy WS2: European Open Science Cloud (EOSC) - The Perspective of e-I...
eROSA Policy WS2: European Open Science Cloud (EOSC) - The Perspective of e-I...eROSA Policy WS2: European Open Science Cloud (EOSC) - The Perspective of e-I...
eROSA Policy WS2: European Open Science Cloud (EOSC) - The Perspective of e-I...
 

Mehr von cneudecker

EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Librarycneudecker
 
ALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltextecneudecker
 
OCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungencneudecker
 
Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?cneudecker
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspaperscneudecker
 
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...cneudecker
 
AI for digitized cultural heritage
AI for digitized cultural heritageAI for digitized cultural heritage
AI for digitized cultural heritagecneudecker
 
Kuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher IntelligenzKuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher Intelligenzcneudecker
 
Überblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-DÜberblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-Dcneudecker
 
The many uses of digitized newspapers
The many uses of digitized newspapersThe many uses of digitized newspapers
The many uses of digitized newspaperscneudecker
 
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...cneudecker
 
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...cneudecker
 
OCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documentsOCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documentscneudecker
 
Text and Data Mining
Text and Data MiningText and Data Mining
Text and Data Miningcneudecker
 
Formate für Volltexte
Formate für VolltexteFormate für Volltexte
Formate für Volltextecneudecker
 
Extrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in EuropeExtrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in Europecneudecker
 
Reise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 MinutenReise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 Minutencneudecker
 
Europeana Newspapers in a Nutshell
Europeana Newspapers in a NutshellEuropeana Newspapers in a Nutshell
Europeana Newspapers in a Nutshellcneudecker
 
lab.sbb.berlin
lab.sbb.berlinlab.sbb.berlin
lab.sbb.berlincneudecker
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspaperscneudecker
 

Mehr von cneudecker (20)

EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Library
 
ALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltexte
 
OCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungen
 
Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspapers
 
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis

  • 1. An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
      Clemens Neudecker, KB National Library of the Netherlands
      International Workshop on Historical Document Imaging and Processing, Beijing, 17 September 2011
      IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  • 2. Background
      - IMPACT: Improving Access to Text (2008–2011)
        Large-scale integrating research project, funded by the EC
        Main objectives:
          - Innovate OCR technology
          - Capacity building in mass-digitisation
      - From a technical perspective:
        > 20 software toolkits for solving specific issues
        Prototyping new algorithms
        “One ring to rule them all…”
      - IMPACT Interoperability Framework (IIF)
  • 3. Main requirements
      Behavioural:
      - Minimize integration effort
      - Minimize deployment effort
      - Maximize usability
      - Maximize scalability
      Functional:
      - Modular
      - Transparent
      - Expandable
      - Open source
      - Platform independent
  • 4. Architecture
      - IMPACT Interoperability Framework: Technologies
          - Java 6
          - Generic Web Service Wrapper
          - Apache Ant/Maven
          - Apache Tomcat/httpd
          - Apache Axis2
          - Apache Synapse
          - Taverna Workflow Engine
      - IMPACT Interoperability Framework: Dataset
          - more than 500,000 images from digital libraries
          - more than 25,000 ground truth transcriptions
  • 5. So how does it work?
      1. Digitisation/OCR challenges are registered and tagged in a database
      2. The database contains the 99.99% correct result: the “ground truth”
      3. A researcher develops a new method to tackle a problem
      4. The research prototype is wrapped as a web service
      5. The web service is integrated as a workflow module
      6. The workflow module can be evaluated, combined, etc.
  • 6. Framework integration
      - Easy to use generic command line wrapper (open source)
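
The idea of wrapping a command-line prototype so it can be called as a service can be sketched in a few lines. The slides list a Java-based generic web service wrapper, so this Python version is only an illustration of the principle, not the IIF wrapper itself. The names `ToolResult`, `run_tool` and `demo_upper_tool` are hypothetical, and the "tool" is a stand-in that uppercases its input.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ToolResult:
    """Outcome of one wrapped command-line call."""
    returncode: int
    stdout: str
    stderr: str


def run_tool(command: Sequence[str], input_text: str = "",
             timeout: float = 60.0) -> ToolResult:
    """Run a command-line tool, feed it text on stdin and capture its output."""
    completed = subprocess.run(
        list(command),
        input=input_text,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # exchange str rather than bytes
        timeout=timeout,
        check=False,          # report failures through returncode, do not raise
    )
    return ToolResult(completed.returncode, completed.stdout, completed.stderr)


def demo_upper_tool() -> List[str]:
    """Hypothetical stand-in for a research prototype: uppercases stdin."""
    script = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    return [sys.executable, "-c", script]
```

A service layer would then only need to map an incoming request to a `run_tool` call and return the captured output, which is what keeps the integration effort per tool small.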
  • 7. Workflow development
      - OCR workflow = data pipeline
      - Building blocks = processing steps (nodes)
      - Integration = interaction between nodes (mashup)
      - Collaboration with
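
The "workflow = data pipeline" view can be illustrated with a minimal sketch in which each node is a function and the workflow is their composition. This is not how Taverna represents workflows; the step functions `normalise_whitespace` and `replace_long_s` and the helper `build_pipeline` are hypothetical examples of post-OCR text processing steps.

```python
from typing import Callable, Iterable, List

# A processing step (node) takes text and returns transformed text.
Step = Callable[[str], str]


def normalise_whitespace(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def replace_long_s(text: str) -> str:
    """Map the historical long s to a modern s."""
    return text.replace("ſ", "s")


def build_pipeline(steps: Iterable[Step]) -> Step:
    """Chain steps so that the output of one node feeds the next."""
    step_list: List[Step] = list(steps)

    def run(data: str) -> str:
        for step in step_list:
            data = step(data)
        return data

    return run
```

Because every node has the same signature, steps can be reordered, swapped or evaluated in isolation, for example `build_pipeline([normalise_whitespace, replace_long_s])`.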
  • 8. Workflow management
      - Web 2.0 style registry: myExperiment
      - Local client: Taverna Workbench
      - Web client: project website
  • 9. Compute cluster
      - An Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes
      - Main effects: process parallelization, load distribution, failover
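
Load distribution with failover can be sketched as a round-robin dispatcher that skips a worker when its call fails. This is a simplified single-process illustration, assuming workers are plain callables. It does not model Apache Synapse or parallel execution, and the `Dispatcher` class is hypothetical.

```python
from itertools import cycle
from typing import Callable, Dict, List, Optional

# A worker node is modelled as a callable that processes one request.
Worker = Callable[[str], str]


class Dispatcher:
    """Round-robin request distribution with failover to the next worker."""

    def __init__(self, workers: Dict[str, Worker]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._workers = workers
        self._order = cycle(list(workers))  # endless rotation over worker names
        self.log: List[str] = []            # which worker served each request

    def submit(self, request: str) -> str:
        last_error: Optional[Exception] = None
        # Try each worker at most once per request.
        for _ in range(len(self._workers)):
            name = next(self._order)
            try:
                result = self._workers[name](request)
            except Exception as exc:  # worker is down: fail over to the next one
                last_error = exc
                continue
            self.log.append(name)
            return result
        raise RuntimeError("all workers failed") from last_error
```

Requests rotate over the healthy workers, and a failing node costs one retry rather than a failed request.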
  • 10. Dataset
      - Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities
  • 11. Evaluation features
      - Text based comparison of result with ground truth, using the Levenshtein distance method
      - Layout based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements Framework
      - Example:
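
The text based comparison can be illustrated with a standard dynamic-programming implementation of the Levenshtein distance. The slide does not say how the distance is turned into a score, so dividing by the ground-truth length to get a character error rate is an assumption here (a common convention), and `character_error_rate` is a hypothetical helper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance normalised by ground-truth length (assumed convention)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

For instance, an OCR result `"Tbe cat"` compared with the ground truth `"The cat"` differs by one substitution over seven characters.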
  • 12. Community
      - Web 2.0 style workflow registry
      - Community of experts
      - Sharing of resources
      - Knowledge exchange
      - A central meeting point for users and researchers
  • 13. Summary
      Benefits:
      - Availability of resources (images, ground truth and tools) to the international research community
      - A common baseline for transparent evaluation and comparison
      - Sharing of results and know-how
      - Enable new research through scalable computing
      - Consolidation of support and maintenance
      Thank you! Questions?
