IMPACT final event – The Hague – 26 June 2012

Metadata extraction from title pages
Evaluation of the FEP pilot
at the German National Library

Christa Schöning-Walter

Outline

1. Institutional background
2. IMPACT test case
3. Strategic goals
4. Preliminary work
5. Results
6. Perspective

| IMPACT event | June 26, 2012 |




The German National Library (DNB) – some facts and figures (I)

− Legal deposit: collecting, cataloguing, archiving and making available to the general public all German and German-language publications, publications about Germany etc. from 1913
− Bibliographic services:
   − National Bibliography
   − Authority files
   − Bibliographic standards
− 2 sites: Leipzig, Frankfurt am Main

The German National Library (DNB) – some facts and figures (II)

− Collection size (January 2012): 27 million media units
− Daily input: 1,500 physical units (each with 2 copies)
− Since 2006: collection mandate includes non-physical media (online publications)
   − DNBG = Law regarding the German National Library
   − PflAV = Legal Deposit Regulation
− Since 2009: considerations on and implementation of automated cataloguing processes




Target of the IMPACT scenario

Opening questions (summer 2011):
− Can metadata extraction from title pages be done successfully by a rule engine for simply structured monographic publications?
− Does this help to accelerate the cataloguing processes when no machine-readable metadata is available from other sources?

Test case: Theses
− 14,000 print units annually
− Simple structure (!?)

Starting point

Since January 2012:
− Experimental application studies in collaboration with the University of Innsbruck
− Using the rule-based exploitation features of FEP (Functional Extension Parser)

What is FEP?
− Software platform for analysing the logical structure of documents
− Developed within IMPACT work package EE4 (goal: enrichment of OCR output with structure information)




Strategic goals

In particular:
− Making descriptive cataloguing less time-consuming and the processing of printed media faster by
   − Partial digitisation
   − Automated metadata extraction
   − Result transfer into the bibliographic record
   − Quality check and completion of cataloguing by the staff

Generally:
− Gaining experience in the area of automated metadata extraction / automated cataloguing

Conceptual design of the workflow

Example: http://d-nb.info/1017138931

[Workflow diagram not reproduced. Components shown: Accession (printed media units); Bibliographic record; Service partner: Scan service (title page + ToC); OCR output / Indexing; Repository; OAI-Harvester; Data Provider; FEP results; Cataloguing; Quality check; Statistics; Stack]




The objective: Automated exploitation of descriptive bibliographic data

The idea: Taking bibliographic data over from metadata mining tools.

− Specification, implementation, evaluation and gradual improvement of
   − Appropriate structure types
   − Dictionaries (controlled vocabulary, indicating keywords, abbreviations etc.)
   − Expert rules
   − Etc.

Illustration: University of Innsbruck




Preliminary work (I)

− Specification of the bibliographic statements to be mined from the title page

   Attribute          Value
   Publication year   2010
   Language code      /1ger
                      /1eng
   Creator            <last name>,<first name>
   Title              <full title>:<additional title information>/<author statement>
   Size               30 cm
                      21 cm
   Theses statement   <city name>, <corporate body name>, <type of publication>,<year of graduation>

Preliminary work (II)

− Going over some hundreds of title pages of theses (scans from 2009-2011 + documents from daily business)
− Exploring typical structural patterns / regularities etc. as a basis for the expert rules, such as
   − Prefixes
   − Phrases
   − Notation
   − Position

Examples of indicating phrases to find out the creator (German as printed on the title pages, English gloss in brackets):
   von                                        [by]
   von <Verfasser> vorgelegte Dissertation    [dissertation submitted by <author>]
   von Herrn/Frau:                            [by Mr/Ms:]
   vorgelegt von(:)                           [submitted by(:)]
   vorgelegt JJJJ von                         [submitted YYYY by]
   vorgelegt dem Fachbereich ... von          [submitted to the department ... by]
   Name:                                      [name:]
   Name des Verfassers:                       [name of the author (male):]
   Name der Verfasserin:                      [name of the author (female):]
   verfasst von(:)                            [written by(:)]
   eingereicht von                            [handed in by]
   ...
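The indicating phrases listed above lend themselves to pattern matching. The following is a minimal sketch in Python, not the FEP implementation: the regular expressions, the function names `find_creator` and `to_creator_statement`, and the rule "last token is the last name" are all assumptions made for illustration. The bare prefix "von" is left out because it would match far too often on its own.

```python
import re
from typing import Optional

# Indicating phrases from the slide, written as regular expressions.
# The regex form is an assumption; FEP's own rule syntax is not shown in the slides.
CREATOR_PHRASES = [
    r"vorgelegt\s+dem\s+Fachbereich\b[^\n]*?\bvon",
    r"vorgelegt\s+(?:\d{4}\s+)?von:?",
    r"verfasst\s+von:?",
    r"eingereicht\s+von",
    r"von\s+(?:Herrn|Frau):?",
    r"Name\s+de[sr]\s+Verfasser(?:s|in):",
    r"Name:",
]


def to_creator_statement(name: str) -> str:
    """Format a name as '<last name>,<first name>'.

    Simplification: the last token is taken as the last name, which is
    wrong for multi-part surnames such as 'von Humboldt'.
    """
    parts = name.split()
    if len(parts) < 2:
        return name.strip()
    return f"{parts[-1]},{' '.join(parts[:-1])}"


def find_creator(ocr_text: str) -> Optional[str]:
    """Return the creator found after the first matching phrase, else None."""
    for phrase in CREATOR_PHRASES:
        # The name is assumed to run to the end of the line or the next comma.
        match = re.search(phrase + r"\s*(?P<name>[^\n,]+)", ocr_text)
        if match:
            name = match.group("name").strip()
            if name:
                return to_creator_statement(name)
    return None


page = "Dissertation\nvorgelegt von\nAnna Maria Muster\nBerlin 2010"
print(find_creator(page))
```

Trailing text on the same line (for example "aus Berlin") would end up in the name, which is where position rules and the Name Authority File would have to help.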




Preliminary work (III)

Choosing / preparing dictionaries for tagging, matching and mapping purposes:
− List of universities which have the right to award degrees (identifying the corporate bodies)
− Name Authority File subset (identifying personal names)
− List of academic degrees

Theses statement items (examples):
   …
   Berlin, ESCP Europe Wirtschaftshochschule
   Berlin, Freie Univ.
   Berlin, Humboldt-Univ.
   Berlin, Steinbeis-Hochsch.
   Berlin, Techn. Univ.
   Berlin, Univ. der Künste
   …

Academic degrees (examples):
   …
   M.A.    Master of Arts / Magister Artium
   M.Sc.   Master of Science
   M.Eng.  Master of Engineering
   LL.M.   Master of Laws / Legum Magister
   M.F.A.  Master of Fine Arts
   M.Mus.  Master of Music
   M.Ed.   Master of Education
   …

Preliminary work (IV)

− Setting up a sample of documents for evaluation purposes:
   − 1,000 theses from several universities
   − Publication year: 2010 – 2011
   − Different dimensions (A- and B-size)
   − Scans: 300 dpi, bitonal
   − Transfer format: PDF (in future: XML files)
− Ground truth determination:
   − Manual region tagging on image files (done in Vietnam with the Aletheia tool)
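How such dictionaries might be used for tagging and mapping can be sketched in a few lines of Python. This is an illustration only: the degree list and the normalised university forms come from the slide, while the spelled-out name variants on the left of `UNIVERSITY_VARIANTS` and the helper names are assumptions, not the DNB's actual dictionary content.

```python
# Academic degrees as listed on the slide (abbreviation -> full form).
ACADEMIC_DEGREES = {
    "M.A.": "Master of Arts / Magister Artium",
    "M.Sc.": "Master of Science",
    "M.Eng.": "Master of Engineering",
    "LL.M.": "Master of Laws / Legum Magister",
    "M.F.A.": "Master of Fine Arts",
    "M.Mus.": "Master of Music",
    "M.Ed.": "Master of Education",
}

# Hypothetical mapping from a name as printed on a title page to the
# normalised theses statement item shown on the slide.
UNIVERSITY_VARIANTS = {
    "Freie Universität Berlin": "Berlin, Freie Univ.",
    "Humboldt-Universität zu Berlin": "Berlin, Humboldt-Univ.",
    "Technische Universität Berlin": "Berlin, Techn. Univ.",
    "Universität der Künste Berlin": "Berlin, Univ. der Künste",
}


def normalise(text: str) -> str:
    """Collapse whitespace (including line breaks) and ignore case."""
    return " ".join(text.split()).casefold()


def map_university(ocr_text: str):
    """Return the normalised corporate body for the first known variant, else None."""
    haystack = normalise(ocr_text)
    for variant, heading in UNIVERSITY_VARIANTS.items():
        if normalise(variant) in haystack:
            return heading
    return None


def tag_degrees(ocr_text: str):
    """Return the degree abbreviations from the dictionary that occur in the text."""
    return [abbr for abbr in ACADEMIC_DEGREES if abbr in ocr_text]


print(map_university("Dissertation,  Freie Universität\nBerlin, 2010"))
print(tag_degrees("zur Erlangung des Grades M.Sc."))
```

Plain substring matching is the simplest choice here; OCR errors on real scans would probably call for fuzzy matching instead.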




Document processing in brief

− Database: storage of all available information (OCR output, automatically or manually produced annotations, dictionaries, facts etc.)
− Input of expert rules
− Rule engine: stepwise processing that takes intermediate results into account

Illustration: University of Innsbruck

Results

Second test phase with a revised list of universities (June 2012):

[Chart not reproduced. Categories shown: (1) total conformity; (2) complete title + noise (just to be deleted by the staff)]
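The stepwise behaviour of the rule engine described under "Document processing in brief" can be illustrated with a small forward-chaining loop. This is a sketch under stated assumptions, not FEP's design: the `Rule` class, the fact names, the three example rules and the "Diss." publication type with its separators are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    needs: tuple                              # facts that must exist before the rule can fire
    produces: str                             # fact the rule adds
    apply: Callable[[dict], Optional[str]]    # returns the new fact value or None


def run_rules(initial_facts: dict, rules: list) -> dict:
    """Apply rules repeatedly until no rule adds a new fact."""
    facts = dict(initial_facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.produces in facts:
                continue
            if not all(key in facts for key in rule.needs):
                continue
            value = rule.apply(facts)
            if value is not None:
                facts[rule.produces] = value
                changed = True
    return facts


def find_year(facts: dict) -> Optional[str]:
    match = re.search(r"\b(?:19|20)\d{2}\b", facts["ocr_text"])
    return match.group(0) if match else None


def find_body(facts: dict) -> Optional[str]:
    # Stand-in for a dictionary lookup against the list of universities.
    if "Freie Universität Berlin" in facts["ocr_text"]:
        return "Berlin, Freie Univ."
    return None


RULES = [
    # Listed first on purpose: it cannot fire until the other two have run.
    Rule("theses statement", ("corporate_body", "year"), "theses_statement",
         lambda f: f"{f['corporate_body']}, Diss., {f['year']}"),
    Rule("year", ("ocr_text",), "year", find_year),
    Rule("corporate body", ("ocr_text",), "corporate_body", find_body),
]

result = run_rules({"ocr_text": "Dissertation\nFreie Universität Berlin\n2010"}, RULES)
print(result["theses_statement"])
```

The theses statement rule is skipped in the first pass and fires in the second, once the intermediate results for year and corporate body are available.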
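The two result categories named on the Results slide suggest a simple comparison of each extracted title with its ground truth. The sketch below shows one way this could be done; the function name, the whitespace normalisation and the fallback label "other" are assumptions, and the slides do not describe how the evaluation was actually computed.

```python
def normalise(text: str) -> str:
    """Collapse whitespace so that line breaks in the scan do not matter."""
    return " ".join(text.split())


def classify_title(extracted: str, ground_truth: str) -> str:
    """Assign an extracted title to one of the result categories."""
    found, expected = normalise(extracted), normalise(ground_truth)
    if found == expected:
        return "total conformity"
    if expected and expected in found:
        # The whole title is present; staff only need to delete the surplus text.
        return "complete title + noise"
    return "other"  # hypothetical catch-all, not a category from the slide


print(classify_title("Zur Theorie der Beispiele", "Zur Theorie der  Beispiele"))
print(classify_title("Dissertation Zur Theorie der Beispiele", "Zur Theorie der Beispiele"))
```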




Forecast: Feasibility study

− Technical and organisational requirements: operational aspects, technical workflow, interfaces etc.
− Further functional enhancements needed:
   − Dictionary maintenance: expanding controlled vocabulary, sorting out unsuitable items etc.
   − Taking additional facts into account: ground truth etc.
   − Additional expert rules (?)
   − Additional functions: language guesser, document size etc.
   − Customising FEP (?)

New ideas

− Extraction of defined structures from the body of monographic publications, such as table of contents, abstracts, pure text (without any introductory remarks, footers, references etc.)

Target:
− Improvement of the results of current automated subject cataloguing projects, such as
   − Thematic classification by machine learning techniques
   − Obtaining subject headings by text analysis techniques
− Reducing the noise via preceding structure analysis processes




       Thank you for your attention.

       Christa Schöning-Walter
       Staff position 'Automated Cataloguing'
       c.schoening@dnb.de

       Sandra Hamm
       Project leader
       s.hamm@dnb.de


       German National Library
       Digital Services
       Frankfurt am Main, Germany








IMPACT Final Event 26-06-2012 - Automated metadata extraction from title pages. Report on the FEP pilot at the German National Library by Christa Schöning-Walter (DNB)

  • 1. Outline IMPACT final event – The Hague – 26 June 2012 1. Institutional background 2. IMPACT test case Metadata extraction from title pages 3. Strategic goals 4. Preliminary work Evaluation of the FEP pilot 5. Results at the German National Library 6. Perspective Christa Schöning-Walter 1 2 | IMPACT event | June 26, 2012 | The German National Library (DNB) The German National Library (DNB) – some facts and figures (I) – some facts and figures (II) − Legal deposit: − Collection size (January 2012): 27 million media units Collecting, cataloguing, archiving and making available to the − Daily input: 1.500 physical units (each with 2 copies) general public all German and − Since 2006: German-language publications, Collection mandate includes non-physical media publications about Germany etc (online publications) from 1913 − DNBG = Law regarding the German National Library − Bibliographic services: − PflAV = Legal Deposit Regulation − National Bibliography − Authority files − Since 2009: Considerations on and implementation of automated − Bibliographic standards cataloguing processes − 2 sites: Leipzig, Frankfurt am Main 3 | IMPACT event | June 26, 2012 | 4 | IMPACT event | June 26, 2012 | 1
  • 2. Slides 5–6: Target of the IMPACT scenario / Starting point
       Opening questions (summer 2011):
       − Can metadata extraction from title pages successfully be done by a rule engine in the case of simply structured monographic publications?
       − Is this useful in order to accelerate the cataloguing processes if no machine-readable metadata from other sources is available?
       Test case: Theses
       − 14,000 print units annually
       − simple structure !?
       Since January 2012:
       − Experimental application studies in collaboration with the University of Innsbruck
       − Using the rule-based exploitation features of FEP (Functional Extension Parser)
       What is FEP?
       − Software platform for the purpose of analysing the logical structure of documents
       − Developed within IMPACT work package EE4 (Goal: enrichment of OCR output with structure information)

       Slide 7: Strategic goals
       In particular:
       − Making descriptive cataloguing less time-consuming and processing of printed media faster by
          − Partial digitisation
          − Automated metadata extraction
          − Result transfer into the bibliographic record
          − Quality check and completion of cataloguing by the staff
       Generally:
       − Gaining experience in the area of automated metadata extraction / automated cataloguing

       Slide 8: Conceptual design of the workflow
       Example: http://d-nb.info/1017138931
       [Workflow diagram; recoverable labels: Accession (printed media units), Repository, FEP results, OAI-Harvester, Cataloguing, Bibliographic record, Data Provider, Quality check, Statistics, Service partner, Scan service (title page + ToC), OCR output / Indexing, Stack]
  • 3. Slides 9–10: The idea / The Objective
       The idea: Taking bibliographic data over from metadata mining tools.
       The Objective: Automated exploitation of descriptive bibliographic data
       − Specification, implementation, evaluation and gradual improvement of
          − Appropriate structure types
          − Dictionaries (controlled vocabulary, indicating keywords, abbreviations etc)
          − Expert rules
          − Etc
       Illustration: University of Innsbruck

       Slides 11–12: Preliminary work (I) and (II)
       − Specification of the bibliographic statements to be mined from the title page

          Attribute          Value
          Publication year   2010
          Language code      /1ger, /1eng
          Creator            <last name>,<first name>
          Title              <full title>:<additional title information>/<author statement>
          Size               30 cm, 21 cm
          Theses statement   <city name>, <corporate body name>, <type of publication>,<year of graduation>

       − Going over some hundreds of title pages of theses (scans from 2009-2011 + documents from daily business)
       − Exploring typical structural patterns / regularities etc, such as
          − Prefixes
          − Phrases
          − Notation
          − Position
       Examples of indicating phrases to find out the creator (labelled "Expert rules" on the slide):
          von                                       (by)
          von <Verfasser> vorgelegte Dissertation   (dissertation submitted by <author>)
          von Herrn/Frau:                           (by Mr/Ms:)
          vorgelegt von(:)                          (submitted by)
          vorgelegt JJJJ von                        (submitted YYYY by)
          vorgelegt dem Fachbereich ... von         (submitted to the department ... by)
          Name:                                     (name:)
          Name des Verfassers:                      (name of the author, male form)
          Name der Verfasserin:                     (name of the author, female form)
          verfasst von(:)                           (written by)
          eingereicht von                           (handed in by)
          ...
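The indicating phrases listed above lend themselves to a small illustration. The slides do not show FEP's rule syntax, so the following is only a plain Python sketch of the idea, not the pilot's implementation: it scans OCR text lines for a subset of the phrases, takes the text that follows (on the same line or the next non-empty one) and writes it in the `<last name>,<first name>` form of the Creator attribute. The function names, the name-shape check and the sample lines are assumptions made for this example; the pilot itself used a Name Authority File subset to identify personal names.

```python
import re

# Subset of the indicating phrases from the slide. Longer phrases come first
# so that "vorgelegt von" is tried before the bare prefix "von".
INDICATING_PHRASES = [
    r"Name der Verfasserin:",
    r"Name des Verfassers:",
    r"vorgelegt \d{4} von",  # "vorgelegt JJJJ von", JJJJ = four-digit year
    r"von Herrn/Frau:",
    r"eingereicht von",
    r"vorgelegt von:?",
    r"verfasst von:?",
    r"Name:",
    r"von",
]

# A line that starts with an indicating phrase, optionally followed by text.
PHRASE_RE = re.compile(
    r"^\s*(?:" + "|".join(INDICATING_PHRASES) + r")(?:\s+(?P<rest>.*\S))?\s*$",
    re.IGNORECASE,
)

# Assumption for this sketch: a personal name is two or more capitalised
# tokens. A dictionary lookup would be the more robust check.
NAME_RE = re.compile(r"^(?:[A-ZÄÖÜ][\w.\-]*\s+)+[A-ZÄÖÜ][\w\-]+$")


def to_heading(name):
    """Turn 'First Last' into the '<last name>,<first name>' form."""
    first, last = name.rsplit(None, 1)
    return f"{last},{first}"


def extract_creator(lines):
    """Return the creator found after an indicating phrase, or None."""
    for i, line in enumerate(lines):
        match = PHRASE_RE.match(line)
        if not match:
            continue
        candidate = match.group("rest") or ""
        if not candidate:
            # Phrase stands alone: look at the next non-empty line.
            candidate = next((l.strip() for l in lines[i + 1:] if l.strip()), "")
        if NAME_RE.match(candidate):
            return to_heading(candidate)
    return None


if __name__ == "__main__":
    # Hypothetical OCR lines of a title page.
    page = ["Titel der Arbeit", "Dissertation", "vorgelegt von", "Anna Beispiel"]
    print(extract_creator(page))
```

The name-shape check is what keeps a line such as "von der Fakultät genehmigte Dissertation" from being read as a creator statement; it also rejects legitimate names with lowercase particles, which is one reason a dictionary of known names is the better filter.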
  • 4. Slides 13–14: Preliminary work (III) and (IV)
       Choosing / preparing dictionaries for tagging, matching and mapping purposes:
       − List of universities that have the right to award degrees (identifying the corporate bodies)
       − Name Authority File subset (identifying personal names)
       − List of academic degrees
       Theses statement items (examples):
          …
          Berlin, ESCP Europe Wirtschaftshochschule
          Berlin, Freie Univ.
          Berlin, Humboldt-Univ.
          Berlin, Steinbeis-Hochsch.
          Berlin, Techn. Univ.
          Berlin, Univ. der Künste
          …
       Academic degrees (examples):
          …
          M.A.     Master of Arts / Magister Artium
          M.Sc.    Master of Science
          M.Eng.   Master of Engineering
          LL.M.    Master of Laws / Legum Magister
          M.F.A.   Master of Fine Arts
          M.Mus.   Master of Music
          M.Ed.    Master of Education
          …
       Setting up a sample of documents for evaluation purposes:
       − 1,000 theses from several universities
       − Publication year: 2010 – 2011
       − Different dimensions (A- and B-size)
       − Scans: 300 dpi, bitonal
       − Transfer format: PDF (in future: XML files)
       Ground truth determination:
       − Manual region tagging on image files (done in Vietnam with the Aletheia tool)

       Slide 15: Document processing in brief
       − Database: Storage of all available information (OCR output, automatically or manually produced annotations, dictionaries, facts etc)
       − Input of expert rules
       − Rule engine: Stepwise processing that takes intermediary results into account
       Illustration: University of Innsbruck

       Slide 16: Results
       Second test phase with a revised list of universities (June 2012):
       (1) total conformity
       (2) complete title + noise (just to be deleted by the staff)
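The two result categories named on the Results slide, (1) total conformity and (2) complete title + noise, suggest a simple comparison of each extracted title against the ground truth. The sketch below shows one way such a tally could be computed; the slides do not say how the comparison was actually done. The whitespace and case normalisation, the third bucket "other" and all function names are assumptions made for this illustration.

```python
from collections import Counter


def normalise(text):
    """Collapse whitespace and ignore case (an assumption of this sketch)."""
    return " ".join(text.split()).casefold()


def classify(extracted, ground_truth):
    """Assign an extracted title to one of the result categories."""
    e, g = normalise(extracted), normalise(ground_truth)
    if e == g:
        return "total conformity"
    if g and g in e:
        # The full title is present, surrounded by text the staff would delete.
        return "complete title + noise"
    return "other"  # hypothetical bucket for everything else


def tally(pairs):
    """Count categories over (extracted, ground_truth) pairs."""
    return Counter(classify(e, g) for e, g in pairs)


if __name__ == "__main__":
    # Made-up sample pairs, not data from the pilot.
    sample = [
        ("A Title", "A Title"),
        ("Dissertation A Title 2010", "A Title"),
        ("A", "A Title"),
    ]
    for category, count in tally(sample).items():
        print(category, count)
```

A substring test is the simplest reading of "complete title + noise"; OCR errors inside the title would push a record into the "other" bucket, so a real evaluation would probably need a tolerance for character-level differences.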
  • 5. Slide 17: Forecast: Feasibility study
       − Technical and organisational requirements: Operational aspects, technical workflow, interfaces etc
       − Further functional enhancements needed:
          − Dictionary maintenance: Expanding controlled vocabulary, sorting out unsuitable items etc
          − Taking additional facts into account: Ground truth etc
          − Additional expert rules (?)
          − Additional functions: Language guesser, document size etc
          − Customising FEP (?)

       Slide 18: New ideas
       − Extraction of defined structures from the body of monographic publications, such as table of contents, abstracts, pure text (without any introductory remarks, footers, references etc)
       Target:
       − Improvement of the results of current automated subject cataloguing projects, such as
          − Thematic classification by machine learning techniques
          − Subject headings obtainment by text analysis techniques
       Reducing the noise via preceding structure analysis processes

       Slide 19: Thank you for your attention.
       Christa Schöning-Walter, Staff position ’Automated Cataloguing’, c.schoening@dnb.de
       Sandra Hamm, Project leader, s.hamm@dnb.de
       German National Library, Digital Services, Frankfurt am Main, Germany