Presentation of Claus Gravenhorst, BnF Information Day
1. Optical Layout Recognition (OLR)
From unstructured to structured newspaper data
Claus Gravenhorst, CCS Content Conversion Specialists GmbH
ENP information day, Paris, November 27, 2014
2. Agenda
• About CCS
• General OLR workflow for mass digitization
• Layout and structure analysis
• ENP OLR workflow
• Quality assurance
• Output – METS/ALTO package
• Use of structural data – Access and presentation
3. About CCS
• CCS Content Conversion Specialists GmbH (Hamburg), as the technical project
partner, will provide its expertise and docWorks technology to set up and operate
a mass digitization workflow for creating high-quality structured content from 2
million scanned newspaper pages provided by 5 library partners
• Page volume:
BNF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k
• The distributed OLR workflow enables the contribution of project partners
(content providers) to the integrated quality assurance process
• CCS is also contributing to the specification of the ENMAP metadata model
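As a quick arithmetic check, the partner volumes listed above add up to slightly more than the 2 million pages quoted. The sketch below assumes that "k" means thousand pages and that "1.000 k" uses a European thousands separator (one thousand k); the variable names are illustrative only.

```python
# Page volumes per library partner, in thousands of pages (from the slide).
# Assumption: "1.000 k" is read as 1,000 k, i.e. one million pages.
page_volumes_k = {
    "BNF": 1000,
    "NLE": 500,
    "SUB HH": 480,
    "NLF": 90,
    "SBB": 10,
}

# Total number of pages across all five partners.
total_pages = sum(page_volumes_k.values()) * 1000

# Share of the total contributed by each partner, in percent.
shares = {
    partner: 100 * volume_k * 1000 / total_pages
    for partner, volume_k in page_volumes_k.items()
}

print(f"Total pages: {total_pages:,}")
for partner, share in shares.items():
    print(f"{partner}: {share:.1f} %")
```

The total comes to 2,080,000 pages, which matches the rounded figure of 2 million.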
4. General workflow for mass digitization
[Workflow diagram. Labels shown:]
• Main stages: Scanning, Imaging, Conversion (Layout Analysis, OCR, ISR), QA + Correction, Final Output, Delivery QA (random)
• Scanner types: robot, book, document, microfilm
• Reject Condition leading to Re-Scan
• Automated QA
• Manual QA: in-house, near-shore, off-shore, multiple locations
• Image / Metadata Database and Repository
• Document UID, Barcode, Item Tracking, Check in / Check out
• Z39.50 Metadata
5. Layout and structure analysis
• Layout analysis based on a "bottom-up" approach
• General rule system enables recognition of words,
text lines, text blocks, columns and classification of
text blocks, illustrations, advertisements, tables and
the following page types:
- title page (the title page of an issue)
- content page (a page that consists of content/text only)
- illustration page (a page that has at least one illustration)
- advertisement page (a page that contains adverts only)
• Structure analysis through classification of headlines
and grouping of zones into articles
(incl. article continuation)
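The bottom-up idea and the page-type rules above can be illustrated with a minimal sketch. This is not the docWorks rule system: the `Box` type, the tolerance value, the zone labels, and the rule precedence (title page first, then advertisement, then illustration) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box of a recognized word."""
    x: float
    y: float
    w: float
    h: float
    text: str = ""


def group_words_into_lines(words, y_tolerance=0.5):
    """Bottom-up step: merge words with similar vertical centres into lines.

    y_tolerance is a fraction of the first word's height (assumed value).
    """
    lines = []
    for word in sorted(words, key=lambda b: (b.y + b.h / 2, b.x)):
        centre = word.y + word.h / 2
        for line in lines:
            ref = line[0]
            if abs(centre - (ref.y + ref.h / 2)) <= y_tolerance * ref.h:
                line.append(word)
                break
        else:
            # No existing line is close enough, so start a new one.
            lines.append([word])
    # Order the words of each line from left to right.
    return [sorted(line, key=lambda b: b.x) for line in lines]


def classify_page(zone_types, is_title_page=False):
    """Assign one of the four page types from the slide.

    zone_types: labels of the zones on the page, e.g. "text",
    "illustration", "advertisement" (label names are assumptions).
    """
    kinds = set(zone_types)
    if is_title_page:
        return "title page"
    if kinds and kinds <= {"advertisement"}:
        return "advertisement page"
    if "illustration" in kinds:
        return "illustration page"
    # Fallback: pages without illustrations are treated as content pages.
    return "content page"


words = [
    Box(50, 0, 40, 10, "world"),
    Box(0, 0, 40, 10, "Hello"),
    Box(0, 20, 50, 10, "second"),
    Box(60, 20, 30, 10, "line"),
]
lines = group_words_into_lines(words)
line_texts = [" ".join(w.text for w in line) for line in lines]
print(line_texts)
print(classify_page(["text", "illustration"]))
```

The same grouping step could be repeated one level up (lines into text blocks, blocks into columns), which is what makes the approach "bottom-up".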
6. ENP OLR workflow | Conversion without scanning
[Workflow diagram between "Material location" and "Conversion facility". Labels shown:]
• Digital Image and Metadata Delivery
• Doc Delivery
• Inspection / Automatic QA
• Reject
• MD Recording
• Conversion
• Digital Object Return
7. Possible conversion scenarios
A) Conversion at library (on-site)
B) Conversion off-shore at CCS data center,
final QA at the library via internet transfer (remote QA solution)
C) Conversion off-shore at CCS,
final QA at the library by backup shipment
8. Scenario B | Remote QA at library
[Diagram. Labels shown:]
• Offshore Processing @ CCS: Storage, dW Share, Master, IN, POOL OUT
• Internet
• QA on-site @ Library: Storage, dW Share, POOL, RQA, INPUT
• OUTPUT: METS ALTO
9. Quality assurance
• @ CCS | Automated markup and basic manual correction:
- Headlines, illustrations, tables, captions, advertisements, etc.
- Article segmentation and grouping of zones into articles (incl. continuation)
• @ Content Provider (Library)
Recommended:
- Zoning: correct classification of blocks as "text" or "illustration"
- Article segmentation: correct identification of headlines/text blocks/captions
- Grouping: correct grouping of blocks (text, illustration) into articles
- Metadata: correct title, issue date and issue number
Optional:
- Page types: correct page types
- Page numbers: correct page sequence
- OCR: perform text correction of specific zones (e.g. headlines, captions)
10. Output | METS/ALTO package
• METS/ALTO metadata schemas to describe the structured digital output object
• A newspaper issue processed in docWorks is converted into one METS XML file. It
reflects the whole physical and logical structure, manages all links to the image files and
the related ALTO XML files. ALTO is based on a standardized page description schema
and contains all information of a page (print space, margins, coordinates, OCR results).
• Benefits of structural markup:
- better browsing and more precise text search
- better access and display on tablet and mobile devices
- automated article classification and clustering through data/text mining and
linguistic technologies
- user engagement for manual online text correction, article classification,
annotation, building personal collections, etc.
- sharing articles via social media platforms like Facebook, Twitter, etc.
_______________
METS = Metadata Encoding and Transmission Standard
ALTO = Analyzed Layout and Text Object
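To make the ALTO page description more concrete, the following sketch reads word text and coordinates from a small hand-written ALTO fragment using Python's standard library. The fragment is an invented sample, not docWorks output; it assumes the ALTO version 2 namespace, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample: one text block with one line of two words.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="2800">
        <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="800" HEIGHT="60">
          <TextLine HPOS="100" VPOS="100" WIDTH="800" HEIGHT="60">
            <String CONTENT="Hamburg" HPOS="100" VPOS="100" WIDTH="300" HEIGHT="60"/>
            <SP HPOS="400" VPOS="100" WIDTH="20"/>
            <String CONTENT="News" HPOS="420" VPOS="100" WIDTH="200" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
""".strip()


def local_name(tag):
    """Drop the '{namespace}' prefix so any ALTO version can be matched."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return the OCR words of a page with their coordinates."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT"),
                "hpos": float(element.get("HPOS")),
                "vpos": float(element.get("VPOS")),
                "width": float(element.get("WIDTH")),
                "height": float(element.get("HEIGHT")),
            })
    return words


alto_words = extract_words(SAMPLE_ALTO)
print(" ".join(word["text"] for word in alto_words))
```

Word coordinates of this kind are what a presentation system would need for features such as word highlighting in search results.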
11. Access and Presentation (I)
• Sample presentation
system (Veridian)
• Browse by date, title
• Text search
• Article hit list
• Word highlighting
13. Access and Presentation (III)
• Text & image view
• User text correction
• Article clipping
• Print article
• Distribute via email and
social media platforms
14. Thank you for your attention!
c.gravenhorst@content-conversion.com
www.europeana-newspapers.eu