IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




Structural analysis of documents
Functional Extension Parser (FEP)

Günter Mühlberger
University Innsbruck Library (UIBK)




Agenda
      Introduction
      Features
        – What do we recognise with the structural analysis?
      Benefits
        – Why is structural analysis useful?
      Architecture
        – How does it work?
      Results
        – How good are we?
      Roadmap
        – When will it come into being?
      Business
        – Which offers will be available?




Introduction
      Document understanding platform
      Aims to enhance and exploit the logical structure of documents for
        – Display
        – Navigation
        – Retrieval
      Enhance OCR output with structural metadata
        – Fully automated processing
        – Interactive correction




IMPACT EVA/MINERVA 12th Nov. 2008




Features
      General
        – We can recognise all structural elements that have some layout
          representation, e.g. region, size, typeface, distance to other elements
        – Focus in IMPACT: basic features that are typical for all documents
        – The rule set can be extended or specialised for other datasets
                      E.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
        – The better the OCR, the better our structural analysis
      Basic features for books
        –     Page numbers
        –     Running titles (headers)
        –     Print space
        –     Footnotes
        –     Signature marks
        –     Headings (within the running text)
        –     Table of contents entries (additional to headings)
        –     Front/Body/Back
        –     Paragraphs




      Print space
      Headings
      Footnotes








      Running title (header)
      Page number
      Signature mark








      Table of contents
        – (linked with headings in
          the running text or with
          page numbers)








Benefits (1)
      Display
        – A correctly detected print space allows page images to be displayed
          centred (no jumping of the text block when turning pages)
      Search & retrieval
        – Scoring of results
                      Could take into account structural data (headings, footnotes)
        – Noise reduction
                      Front, body, back are separated, text from the front is often misleading
                      Running titles repeat the same words
                      Footnotes can be included or excluded
        – Faceted search
                      Results can be displayed for running text, footnotes, headings
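
The noise reduction and faceting ideas above can be sketched in a few lines of Python. This is only an illustration, not FEP code: the `Line` record and the label names (`running_title`, `footnote`, and so on) are hypothetical stand-ins for whatever structural annotations the analysis produces.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    label: str  # hypothetical structural label assigned by the analysis


# Labels treated as search noise in this sketch (an assumption, not an FEP list).
NOISE_LABELS = {"running_title", "page_number", "signature_mark"}


def indexable_text(lines, include_footnotes=True):
    """Return the line texts that should go into the full-text index."""
    skip = set(NOISE_LABELS)
    if not include_footnotes:
        skip.add("footnote")
    return [ln.text for ln in lines if ln.label not in skip]


def facet(lines, label):
    """Return only the lines carrying one structural label, e.g. 'heading'."""
    return [ln.text for ln in lines if ln.label == label]


page = [
    Line("HISTORY OF TYROL", "running_title"),
    Line("Chapter 1", "heading"),
    Line("The valley lies north of the pass.", "running_text"),
    Line("1 See the earlier volume.", "footnote"),
    Line("12", "page_number"),
]

print(indexable_text(page, include_footnotes=False))
print(facet(page, "heading"))
```

Filtering before indexing keeps repeated running titles out of the term statistics, while the facet function shows how hits could be restricted to one type of text.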







Benefits (2)
      Navigation
        – Page numbers allow usage of original table of contents
        – Original table of contents can be linked with headings/page numbers in
          the book
      Document editing
        –     Further mark-up (e.g. TEI) is supported
        –     Manual preparation for Print-on-Demand is eased (print space)
        –     Selective OCR correction can be applied, e.g. only headings,
              running text or footnotes could be fed to CONCERT
      Document matching
        – Contributions or footnotes can be matched with existing bibliographical
          databases
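
Linking original table of contents entries to headings in the book can be illustrated with a small Python sketch. It is an assumption-laden example, not the FEP method: the tuple layouts, the 0.8 threshold and the 0.1 page-number bonus are invented for illustration, and string similarity comes from the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse whitespace so OCR layout differences matter less."""
    return " ".join(text.lower().split())


def link_toc_entries(toc_entries, headings, threshold=0.8):
    """Map each TOC title to the image index of the best matching heading.

    toc_entries: list of (title, printed_page)
    headings:    list of (text, printed_page, image_index)
    Entries whose best match scores below the threshold get None (no link).
    """
    links = {}
    for title, toc_page in toc_entries:
        best_score, best_image = 0.0, None
        for text, page, image_index in headings:
            score = SequenceMatcher(None, normalise(title), normalise(text)).ratio()
            if page == toc_page:
                # Illustrative bonus when the printed page numbers agree.
                score = min(1.0, score + 0.1)
            if score > best_score:
                best_score, best_image = score, image_index
        links[title] = best_image if best_score >= threshold else None
    return links


toc = [("Introduction", 1), ("The Alps", 15), ("Appendix", 200)]
found_headings = [("INTRODUCTION", 1, 9), ("The Alps.", 15, 23)]
print(link_toc_entries(toc, found_headings))
```

Entries that stay unlinked (value `None`) correspond to the "no link" cases that would be candidates for interactive correction.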






      Improved display on the
      web and in PDF








      Refinement of full-text
      search
      Facets for e.g.
        – Running text
        – Footnotes
        – Headings
      Less noise
        – Running titles,
          signature marks
          excluded from search








      Clickable table of contents
      entries
        – Google style
      Selective OCR correction
        – Correct only ToC,
          headings, footnotes, etc.








      Matching of documents
      with external sources
        – Match footnotes with
          library catalogues
          (bibliographies)
        – Match table of contents
          entries and headings with
          bibliographies








      Improved editing
        – Alternating print spaces
          for Print on Demand
        – Further processing for
          TEI editions etc.








Architecture
      Input
        – Results from OCR processing on word level (coordinates)
        – E.g. ALTO file, ABBYY XML file or Google HTML
      Output
        – Structural annotations for recognized text features, e.g. page numbers,
          running titles, headings, etc.
        – E.g. XML, ALTO, METS, TEI, etc.
      General workflow
        –     OCR result files are parsed into the FEP general XML format
        –     The rule set is applied to the dataset (rules are managed by a rules engine)
        –     Results are stored in a database
        –     Export on various levels is provided
      Optional
        – Online or offline correction (GUI)
        – Adaptation of the rule set
        – Quality assurance on basis of ground truth
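
The first workflow step, reading word-level OCR results with coordinates, can be sketched for the ALTO case as follows. This is a minimal Python illustration, not the FEP parser: the `Word` record is hypothetical, and the code only relies on the ALTO `String` element with its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. Namespaces are stripped so that different ALTO versions are handled alike.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float
    y: float
    width: float
    height: float


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def parse_alto_words(alto_xml):
    """Collect every ALTO String element as a Word with its coordinates."""
    root = ET.fromstring(alto_xml)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) == "String":
            words.append(Word(
                text=elem.get("CONTENT", ""),
                x=float(elem.get("HPOS", "0")),
                y=float(elem.get("VPOS", "0")),
                width=float(elem.get("WIDTH", "0")),
                height=float(elem.get("HEIGHT", "0")),
            ))
    return words


# Tiny hand-written sample, not taken from a real scan.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Chapter" HPOS="100" VPOS="50" WIDTH="80" HEIGHT="20"/>
    <String CONTENT="One" HPOS="190" VPOS="50" WIDTH="40" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

for word in parse_alto_words(SAMPLE):
    print(word)
```

Word records of this kind would then be the facts on which layout rules (position, size, distance to other elements) operate.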





The FEP Core
      Based on Jess, an expert-system-like rule engine for Java
      Combines manually crafted rules with rules obtained by machine learning
      Uses fuzzy logic to deal with uncertainty

Typical rules:
      IF there is a numeral in the first line of the page AND this numeral is centred
      THEN this numeral may be the page number
      IF there is a numeral in the first line of the page AND this numeral is at the
      right hand side of the page AND this numeral is an odd number THEN this
      numeral may be the page number
      IF there is a numeral in the first line of the page AND this numeral is at the
      left hand side of the page AND this numeral is an even number THEN this
      numeral may be the page number.
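
The three page-number rules above can be approximated in plain Python to show how such rules produce a degree of belief rather than a yes/no answer. This is a sketch under stated assumptions: it does not use Jess, the `Token` record and the left/centre/right thresholds are hypothetical, and the confidence values 0.7 and 0.8 are invented for illustration. Rules that fire are combined with `max`, a common fuzzy OR.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    line_index: int   # 0 means the first line of the page
    x_center: float   # horizontal centre as a fraction of page width (0..1)


def position(token):
    """Classify the horizontal position using illustrative thresholds."""
    if token.x_center < 0.33:
        return "left"
    if token.x_center > 0.67:
        return "right"
    return "centre"


def page_number_confidence(token):
    """Return a confidence in [0, 1] that the token is the page number."""
    is_numeral = token.text.isascii() and token.text.isdigit()
    if token.line_index != 0 or not is_numeral:
        return 0.0
    value = int(token.text)
    pos = position(token)
    scores = [0.0]
    if pos == "centre":                     # rule 1: centred numeral
        scores.append(0.7)
    if pos == "right" and value % 2 == 1:   # rule 2: odd number on the right
        scores.append(0.8)
    if pos == "left" and value % 2 == 0:    # rule 3: even number on the left
        scores.append(0.8)
    return max(scores)


print(page_number_confidence(Token("17", 0, 0.9)))
print(page_number_confidence(Token("17", 0, 0.1)))
```

An odd numeral on the left-hand side matches none of the three rules, so it receives no support here; a fuller rule set would add further evidence, such as the sequence of numbers on neighbouring pages.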





Results
      Basic rule set
        – General features for books from 1700 to 2000
        – Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
        – All books were manually annotated (ground truth)
      Recall, Precision, F-Measure
        – E.g. a book contains 10 heading lines. We find 12 lines; 8 of them are
          correct, 4 are false.
        – Recall                = 8 of 10                   = 0.80
        – Precision             = 8 of 12                   = 0.67
        – F-Measure             = 2*0.80*0.67/(0.80+0.67)   = 0.73
      More explanations
        – Important: We are counting lines, not structural items!
                      E.g. a heading may consist of two lines, often with different typeface sizes;
                      we have to find both to succeed
        – Differences between training and evaluation sets are marginal
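
The heading example (10 true heading lines, 12 lines found, 8 of them correct) can be checked with a few lines of Python. The function name is an illustrative choice. Computed from the unrounded precision of 8/12, the F-measure comes out at about 0.73.

```python
def precision_recall_f(true_lines, found_lines, correct_lines):
    """Line-based recall, precision and F-measure (harmonic mean of the two)."""
    recall = correct_lines / true_lines
    precision = correct_lines / found_lines
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


recall, precision, f_measure = precision_recall_f(
    true_lines=10, found_lines=12, correct_lines=8
)
print(f"recall={recall:.2f} precision={precision:.2f} f={f_measure:.2f}")
# prints: recall=0.80 precision=0.67 f=0.73
```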

                                                                                                                                                         18
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




  Results on Evaluation Set

                     Recall    Precision    F-measure
Running text         0.99      0.98         0.98
Footnotes            0.83      0.89         0.86
Page numbers         0.97      1.00         0.98
Running titles       0.97      1.00         0.98
Headings             0.85      0.80         0.82
Signature marks      0.68      0.89         0.77




Roadmap
      Summer 2011: Beta version
        – Integration into IMPACT Interoperability Platform
        – Basic rule set: books from 1700 to 1900
      End of the year: Version 1.0
        – Full featured version
        – Enhanced online correction interface
        – FEP as a service, not as a product for local installation








Business offers
      Web service for processing and correcting single volumes
        – Will be integrated into the eBooks-on-Demand (EOD) network
        – 30 libraries already upload their images to the OCR server in
          Innsbruck
        – FEP will be an additional service for general material
        – Similar offers can be made to other libraries or networks as well
      Adaptation of the rule set
        – For specific datasets much more can be detected than just the basic
          features
        – E.g. journals with a fixed structure over many years, parliamentary
          papers, dissertations, research papers, etc.
      On-site installations
        – Not our focus, but could be done for very large datasets or due to legal
          requirements (e.g. Google images)





Results: TOC

      25 TOC entries in total
      22 TOC entries are completely correct
      1 TOC entry was missed
      2 TOC entries are grouped incorrectly
      1 TOC entry has no link
      1 TOC entry has a wrong link








             Thank you for your attention!




 
La Biblioteca Nacional de España como centro de apoyo a la investigación. Ana...
La Biblioteca Nacional de España como centro de apoyo a la investigación. Ana...La Biblioteca Nacional de España como centro de apoyo a la investigación. Ana...
La Biblioteca Nacional de España como centro de apoyo a la investigación. Ana...
 
Data privacy in library authority files: a survey
Data privacy in library authority files: a surveyData privacy in library authority files: a survey
Data privacy in library authority files: a survey
 
Perfil de RDA de la BNE. Resumen de cambios
Perfil de RDA de la BNE. Resumen de cambiosPerfil de RDA de la BNE. Resumen de cambios
Perfil de RDA de la BNE. Resumen de cambios
 
RDA. Autoridades. Fundamentos. Identificación de entidades. Relaciones
RDA. Autoridades. Fundamentos. Identificación de entidades. RelacionesRDA. Autoridades. Fundamentos. Identificación de entidades. Relaciones
RDA. Autoridades. Fundamentos. Identificación de entidades. Relaciones
 
RDA: el nuevo texto
RDA: el nuevo textoRDA: el nuevo texto
RDA: el nuevo texto
 
Pleno del Real Patronato. Biblioteca Nacional de España
Pleno del Real Patronato. Biblioteca Nacional de EspañaPleno del Real Patronato. Biblioteca Nacional de España
Pleno del Real Patronato. Biblioteca Nacional de España
 
Objetivos 2019. Pleno del Real Patronato. Biblioteca Nacional de España
Objetivos 2019. Pleno del Real Patronato. Biblioteca Nacional de EspañaObjetivos 2019. Pleno del Real Patronato. Biblioteca Nacional de España
Objetivos 2019. Pleno del Real Patronato. Biblioteca Nacional de España
 
Pleno del Real Patronato. Biblioteca Nacional de España. Evaluación actuacion...
Pleno del Real Patronato. Biblioteca Nacional de España. Evaluación actuacion...Pleno del Real Patronato. Biblioteca Nacional de España. Evaluación actuacion...
Pleno del Real Patronato. Biblioteca Nacional de España. Evaluación actuacion...
 
Evaluación actuaciones 2018. Planificación actuaciones 2019
Evaluación actuaciones 2018. Planificación actuaciones 2019Evaluación actuaciones 2018. Planificación actuaciones 2019
Evaluación actuaciones 2018. Planificación actuaciones 2019
 
Dirección Técnica. Objetivos 2019
Dirección Técnica. Objetivos 2019Dirección Técnica. Objetivos 2019
Dirección Técnica. Objetivos 2019
 
Evaluación 2018. Objetivos 2019
Evaluación 2018. Objetivos 2019Evaluación 2018. Objetivos 2019
Evaluación 2018. Objetivos 2019
 
Evaluación actuaciones 2018. Dirección Cultural
Evaluación actuaciones 2018. Dirección CulturalEvaluación actuaciones 2018. Dirección Cultural
Evaluación actuaciones 2018. Dirección Cultural
 
Pleno CCB. Consejo de Cooperación Bibliotecaria. Ana Santos Aramburo
Pleno CCB. Consejo de Cooperación Bibliotecaria. Ana Santos AramburoPleno CCB. Consejo de Cooperación Bibliotecaria. Ana Santos Aramburo
Pleno CCB. Consejo de Cooperación Bibliotecaria. Ana Santos Aramburo
 
Descubrir, aprender, disfrutar en la Biblioteca Nacional de España. Ana Santo...
Descubrir, aprender, disfrutar en la Biblioteca Nacional de España. Ana Santo...Descubrir, aprender, disfrutar en la Biblioteca Nacional de España. Ana Santo...
Descubrir, aprender, disfrutar en la Biblioteca Nacional de España. Ana Santo...
 
VIAF GDPR
VIAF GDPRVIAF GDPR
VIAF GDPR
 
Renacer prensa historica
Renacer prensa historicaRenacer prensa historica
Renacer prensa historica
 
RDA y Linked data (Ricardo Santos Muñoz)
RDA y Linked data (Ricardo Santos Muñoz)RDA y Linked data (Ricardo Santos Muñoz)
RDA y Linked data (Ricardo Santos Muñoz)
 
Desarrollo actual de RDA (Pilar Tejero López)
Desarrollo actual de RDA (Pilar Tejero López)Desarrollo actual de RDA (Pilar Tejero López)
Desarrollo actual de RDA (Pilar Tejero López)
 

Kürzlich hochgeladen

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 

Kürzlich hochgeladen (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

  • 1. Structural analysis of documents: Functional Extension Parser (FEP)
      Günter Mühlberger, University Innsbruck Library (UIBK)
      IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  • 2. Agenda
      Introduction
      Features
        – What do we recognise with the structural analysis?
      Benefits
        – Why is structural analysis useful?
      Architecture
        – How does it work?
      Results
        – How good are we?
      Roadmap
        – When will it come into being?
      Business
        – Which offers will be available?
  • 3. Introduction
      Document understanding platform
      Try to enhance and exploit the logical structure of documents for
        – Display
        – Navigation
        – Retrieval
      Enhance OCR output with structural metadata
        – Fully automated processing
        – Interactive correction
      IMPACT EVA/MINERVA 12th Nov. 2008
  • 4. Features
      General
        – We are able to recognise all structural elements which have some layout
          representation: e.g. region, size, typeface, distance to other elements, etc.
        – Focus in IMPACT: basic features which are typical for all documents
        – Rules set can be extended or specified according to other datasets,
          e.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
        – The better the OCR, the better our structural analysis
      Basic features for books
        – Page numbers
        – Running titles (headers)
        – Print space
        – Footnotes
        – Signature marks
        – Headings (within the running text)
        – Table of contents entries (additional to headings)
        – Front/Body/Back
        – Paragraphs
  • 5. Print space, headings, footnotes
  • 6. Running title (header), page number, signature mark
  • 7. Table of contents
        – Linked with headings in the running text and with page numbers
  • 8. Benefits (1)
      Display
        – A correct print space allows images to be displayed centred (no flipping between pages)
      Search & retrieval
        – Scoring of results
          Could take into account structural data (headings, footnotes)
        – Noise reduction
          Front, body and back are separated; text from the front is often misleading
          Running titles repeat the same words
          Footnotes can be included or excluded
        – Faceted search
          Results can be displayed for running text, footnotes, headings
  • 9. Benefits (2)
      Navigation
        – Page numbers allow usage of the original table of contents
        – The original table of contents can be linked with headings/page numbers in the book
      Document editing
        – Further markup (e.g. TEI) is supported
        – Manual preparation for Print-on-Demand is eased (print space)
        – Selective OCR correction can be applied,
          e.g. only headings, running text or footnotes could be fed to CONCERT
      Document matching
        – Contributions or footnotes can be matched with existing bibliographical databases
  • 10. Improved display on the Internet and in PDF
  • 11. Refinement of full-text search
      Facets for e.g.
        – Running text
        – Footnotes
        – Headings
      Less noise
        – Running titles, signature marks excluded from search
  • 12. Clickable table of contents entries
        – Google style
      Selective OCR correction
        – Correct only ToC, headings, footnotes, etc.
  • 13. Matching of documents with external sources
        – Match footnotes with library catalogues (bibliographies)
        – Match table of contents entries and headings with bibliographies
      Clickable table of contents
  • 14. Improved editing
        – Alternating print spaces for Print on Demand
        – Further processing for TEI editions etc.
  • 15. Architecture
      Input
        – Results from OCR processing on word level (coordinates)
        – E.g. ALTO file, ABBYY XML file or Google HTML
      Output
        – Structural annotations for recognized text features, e.g. page numbers, running titles, headings, etc.
        – E.g. XML, ALTO, METS, TEI, etc.
      General workflow
        – OCR result files are parsed (FEP general XML format)
        – Rules set is applied to the dataset (rules are managed by rules engine)
        – Results are stored in a database
        – Export on various levels is provided
      Optional
        – Online or offline correction (GUI)
        – Adaptation of rules set
        – Quality assurance on basis of ground truth
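The input step (word-level OCR results with coordinates) can be illustrated with a minimal Python sketch that reads words from an ALTO file. This is not FEP's parser or its general XML format: the function name `alto_words`, the output dictionary keys and the sample document are assumptions made for illustration. Only the ALTO `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes come from the ALTO schema.

```python
import xml.etree.ElementTree as ET


def alto_words(alto_xml):
    """Return one dict per ALTO <String> element (word text plus bounding box)."""
    root = ET.fromstring(alto_xml)
    words = []
    for elem in root.iter():
        # Drop the "{namespace}" prefix so any ALTO schema version is accepted.
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append({
            "text": elem.get("CONTENT", ""),
            "x": float(elem.get("HPOS", 0)),
            "y": float(elem.get("VPOS", 0)),
            "width": float(elem.get("WIDTH", 0)),
            "height": float(elem.get("HEIGHT", 0)),
        })
    return words


# Hypothetical one-word page, used only to exercise the function.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="17" HPOS="900" VPOS="50" WIDTH="40" HEIGHT="30"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

words = alto_words(SAMPLE)
```

A rule set would then work on such word records (text plus position) rather than on the raw OCR file.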
  • 17. The FEP Core
      Based on Jess, an expert-system-like rule engine for Java
      Both manually crafted rules and rules obtained by machine learning
      Uses fuzzy logic to deal with uncertainty
      Typical rules:
        – IF there is a numeral in the first line of the page
          AND this numeral is centred
          THEN this numeral may be the page number
        – IF there is a numeral in the first line of the page
          AND this numeral is at the right hand side of the page
          AND this numeral is an odd number
          THEN this numeral may be the page number
        – IF there is a numeral in the first line of the page
          AND this numeral is at the left hand side of the page
          AND this numeral is an even number
          THEN this numeral may be the page number
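The three page-number rules can be sketched in plain Python to show how they combine. This is an illustration only, not the Jess rule base: the `Word` record, the split of the page width into thirds and the rule labels are assumptions, and the fuzzy weighting mentioned on the slide is left out, so each rule simply yields a candidate ("may be the page number").

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x_center: float  # horizontal centre, 0.0 = left page edge, 1.0 = right page edge
    line_index: int  # 0 = first line of the page


def zone(x_center):
    """Assumed split of the page width into left, centre and right thirds."""
    if x_center < 1 / 3:
        return "left"
    if x_center > 2 / 3:
        return "right"
    return "centre"


def page_number_candidates(words):
    """Apply the three IF/THEN rules and return (value, rule label) pairs."""
    candidates = []
    for word in words:
        is_numeral = word.text.isascii() and word.text.isdigit()
        if word.line_index != 0 or not is_numeral:
            continue  # every rule requires a numeral in the first line
        value = int(word.text)
        position = zone(word.x_center)
        if position == "centre":
            candidates.append((value, "centred"))
        elif position == "right" and value % 2 == 1:
            candidates.append((value, "right-odd"))
        elif position == "left" and value % 2 == 0:
            candidates.append((value, "left-even"))
    return candidates
```

For example, a first-line "17" near the right edge becomes a candidate, while a first-line "18" in the same position does not, because the right-hand rule requires an odd number.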
  • 18. Results
      Basic rules set
        – General features for books from 1700 to 2000
        – Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
        – All books were manually annotated (ground truth)
      Recall, Precision, F-Measure
        – E.g. 10 lines with headings in a book. We find 12 lines; 8 of them are correct, 4 are false.
        – Recall = 8 of 10 = 0.80
        – Precision = 8 of 12 = 0.67
        – F-Measure = 2 * Precision * Recall / (Precision + Recall) = 16/22 = 0.73
      More explanations
        – Important: we are counting lines, not structural items! E.g. a heading may consist of two lines
          (often with different typeface sizes); we have to find both to succeed
        – Differences between training and evaluation sets are marginal
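The line-based recall, precision and F-measure from the heading example can be recomputed with a short Python sketch. The function name `line_metrics` and its parameter names are illustrative assumptions and not part of FEP.

```python
def line_metrics(relevant_lines, found_lines, correct_lines):
    """Return (recall, precision, f_measure) computed from line counts."""
    recall = correct_lines / relevant_lines if relevant_lines else 0.0
    precision = correct_lines / found_lines if found_lines else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Heading example: 10 heading lines in the book, 12 lines found, 8 of them correct.
recall, precision, f_measure = line_metrics(10, 12, 8)
```

With these counts recall is 8/10, precision is 8/12 and the F-measure simplifies to 16/22, roughly 0.727.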
  • 19. Results on Evaluation Set
                          Recall   Precision   F-measure
      Running text         0.99      0.98        0.98
      Footnotes            0.83      0.89        0.86
      Page numbers         0.97      1.00        0.98
      Running titles       0.97      1.00        0.98
      Heading              0.85      0.80        0.82
      Signature marks      0.68      0.89        0.77
  • 20. Roadmap
      Summer 2011: Beta version
        – Integration into IMPACT Interoperability Platform
        – Basic rules set: books from 1700 to 1900
      End of the year: Version 1.0
        – Full featured version
        – Enhanced online correction interface
        – FEP as a service, not as a product for local installation
  • 21. Business offers
      Web service for processing single volumes and correction
        – Will be integrated into the eBooks-on-Demand (EOD) network
        – 30 libraries already upload their images to the OCR server in Innsbruck
        – FEP will be an additional service for general material
        – Similar offers can be made to other libraries or networks as well
      Adaptation of rules set
        – For specific datasets much more can be detected than just the basic features
        – E.g. journals with a fixed structure over many years or parliamentary papers, dissertations, research papers, etc.
      Onsite installations
        – Not our focus, but could be done for very large datasets or due to legal requirements (e.g. Google images)
  • 26. Results: TOC
        – 25 TOC entries in total
        – 22 TOC entries are completely correct
        – 1 TOC entry was missed
        – 2 TOC entries are grouped incorrectly
        – 1 TOC entry has no link
        – 1 TOC entry has a wrong link
  • 27. Thank you for your attention!