Optical Layout Recognition (OLR) 
From unstructured to structured newspaper data 
Claus Gravenhorst, CCS Content Conversion Specialists GmbH 
ENP information day, Paris, November 27, 2014
Agenda 
• About CCS 
• General OLR-workflow for mass digitization 
• Layout and structure analysis 
• ENP OLR workflow 
• Quality assurance 
• Output – METS/ALTO package 
• Use of structural data – Access and presentation
About CCS 
• CCS Content Conversion Specialists GmbH (Hamburg), as technical project 
partner, will provide its expertise and docWorks technology to set up and operate 
a mass digitization workflow for creating high quality structured content from 2 
million scanned newspaper pages provided by 5 library partners 
• Page volume: 
BNF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k 
• The distributed OLR workflow enables the contribution of project partners 
(content providers) to the integrated quality assurance process 
• CCS is also contributing to the specification of the ENMAP metadata model
General workflow for mass digitization 
[Workflow diagram; labels recovered from the slide:] 
• Stages: Scanning, Imaging, Conversion (Layout Analysis, OCR, ISR), QA + Correction, Final Output, Delivery QA (random) 
• Decision and return labels: Re-Scan, Reject, Condition 
• Scanner types: robot, book, document, microfilm 
• Item tracking: document UID, barcode, check in, check out 
• Storage: image and metadata database, repository 
• Metadata: Z 39.50 
• QA: automated QA; manual QA in-house, near-shore, off-shore, multiple locations
Layout and structure analysis 
• Layout analysis based on a "bottom-up" approach 
• General rule system enables recognition of words, 
text lines, text blocks, columns and classification of 
text blocks, illustrations, advertisements, tables and 
the following page types: 
- title page (the title page of an issue) 
- content page (a page that consists of content/text only) 
- illustration page (a page that has at least one illustration) 
- advertisement page (a page that contains adverts only) 
• Structure analysis through classification of headlines 
and grouping of zones into articles 
(incl. article continuation)
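
The page-type rules listed above can be sketched in a few lines of Python. This is only an illustration, not the docWorks rule system: the `Block` class, the `classify_page` function and the `is_issue_title_page` flag are hypothetical names, and the order in which the rules are checked is an assumption, since the slide does not say how overlapping cases are resolved.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Block:
    """A classified zone on a page (hypothetical structure for this sketch)."""
    kind: str  # e.g. "text", "illustration", "advertisement", "table"


def classify_page(blocks: Iterable[Block], is_issue_title_page: bool = False) -> str:
    """Assign one of the four page types named on the slide.

    Assumption: the caller already knows whether the page is the title
    page of the issue. The precedence of the checks is also an assumption.
    """
    kinds = {block.kind for block in blocks}
    if is_issue_title_page:
        return "title page"
    # Advertisement page: the page contains adverts only.
    if kinds and kinds <= {"advertisement"}:
        return "advertisement page"
    # Illustration page: the page has at least one illustration.
    if "illustration" in kinds:
        return "illustration page"
    # Everything else is treated as a content page in this sketch.
    return "content page"


if __name__ == "__main__":
    page = [Block("text"), Block("illustration")]
    print(classify_page(page))  # prints "illustration page"
```

A real rule system would work on zone geometry and OCR results; the sketch only shows how block classes can drive the page type.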
ENP OLR workflow | Conversion without scanning 
[Workflow diagram; labels recovered from the slide:] 
• Locations: Material location, Conversion facility 
• Delivery: Digital Image, Metadata 
• Return: Digital Object 
• Steps: Inspection / Automatic QA, MD Recording, Conversion 
• Other labels: Doc Delivery, Reject
Possible conversion scenarios 
A) Conversion at library (on-site) 
B) Conversion off-shore at CCS data center, 
final QA at the library via internet transfer (remote QA solution) 
C) Conversion off-shore at CCS, 
final QA at the library by backup shipment
Scenario B | Remote QA at library 
[Architecture diagram; labels recovered from the slide:] 
• Sites: Offshore Processing @ CCS, QA on-site @ Library, connected via Internet 
• Components: Storage, dW Share Master, dW Share, POOL, RQA 
• Data flow labels: INPUT, IN, OUT, OUTPUT (METS ALTO)
Quality assurance 
• @ CCS | Automated markup and basic manual correction: 
- Headlines, illustrations, tables, captions, advertisements, etc. 
- Article segmentation and grouping of zones into articles (incl. continuation) 
• @ Content Provider (Library) 
Recommended: 
- Zoning: correct classification of blocks as "text" or "illustration" 
- Article segmentation: correct identification of headlines/text blocks/captions 
- Grouping: correct grouping of blocks (text, illustration) to articles 
- Metadata: correct title, issue date and issue number 
Optional: 
- Page types: correct page types 
- Page numbers: correct page sequence 
- OCR: perform text correction of specific zones (e.g. headlines, captions)
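
Some of the checks above (metadata and page sequence) lend themselves to simple automated pre-checks before manual QA. The sketch below is an assumption about how such a check could look, not part of the ENP workflow: the function name, its parameters and the ISO date format are all hypothetical.

```python
from datetime import date
from typing import List


def check_issue(title: str, issue_date: str, issue_number: int,
                page_numbers: List[int]) -> List[str]:
    """Return a list of problems found in the issue metadata (empty if none).

    Assumptions: the issue date is stored as an ISO string (YYYY-MM-DD)
    and page numbers are expected to increase by one from the first page.
    """
    problems = []
    if not title.strip():
        problems.append("missing title")
    try:
        date.fromisoformat(issue_date)
    except ValueError:
        problems.append("issue date is not a valid ISO date")
    if issue_number < 1:
        problems.append("issue number must be positive")
    if page_numbers:
        first = page_numbers[0]
        expected = list(range(first, first + len(page_numbers)))
        if page_numbers != expected:
            problems.append("page sequence has gaps or is out of order")
    return problems


if __name__ == "__main__":
    print(check_issue("Example Gazette", "1914-08-01", 12, [1, 2, 3, 4]))
```

Issues that pass such checks would still need the manual review of zoning, segmentation and grouping described above.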
Output | METS/ALTO package 
• METS/ALTO metadata schemas describe the structured digital output object 
• A newspaper issue processed in docWorks is converted into one METS XML file. It 
reflects the whole physical and logical structure, manages all links to the image files and 
the related ALTO XML files. ALTO is based on a standardized page description schema 
and contains all information of a page (print space, margins, coordinates, OCR results). 
• Benefits of structural markup: 
- better browsing and more precise text search 
- better access and display on tablet and mobile devices 
- automated article classification and clustering through data/text mining and 
linguistic technologies 
- user engagement for manual online text correction, article classification, 
annotation, building personal collections, etc. 
- sharing articles via social media platforms like Facebook, Twitter, etc. 
_______________ 
METS = Metadata Encoding and Transmission Standard 
ALTO = Analyzed Layout and Text Object
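
As a rough illustration of what "coordinates and OCR results" in ALTO look like to a consumer, the following Python sketch reads the words of one page with their positions. The sample document is a hypothetical, heavily shortened ALTO file (ALTO v2 namespace assumed), and `extract_words` is an illustrative helper, not part of docWorks.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened ALTO page; real files carry far more detail.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="2800">
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Optical" HPOS="120" VPOS="150" WIDTH="200" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Layout" HPOS="340" VPOS="150" WIDTH="180" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml: str) -> list:
    """Collect every ALTO String element with its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT", ""),
                "hpos": float(element.get("HPOS", 0)),
                "vpos": float(element.get("VPOS", 0)),
                "width": float(element.get("WIDTH", 0)),
                "height": float(element.get("HEIGHT", 0)),
            })
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_ALTO):
        print(word["text"], word["hpos"], word["vpos"])
```

Word coordinates of this kind are what make features such as word highlighting in a presentation system possible.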
Access and Presentation (I) 
• Sample presentation 
system (Veridian) 
• Browse by date, title 
• Text search 
• Article hit list 
• Word highlighting
Access and Presentation (II) 
• Issue 
• Table of contents
Access and Presentation (III) 
• Text & image view 
• User text correction 
• Article clipping 
• Print article 
• Distribute via email and 
social media platforms
Thank you for your attention! 
c.gravenhorst@content-conversion.com 
www.europeana-newspapers.eu

Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 

Presentation of Claus Gravenhorst, BnF Information Day

  • 1. Optical Layout Recognition (OLR) From unstructured to structured newspaper data Claus Gravenhorst, CCS Content Conversion Specialists GmbH ENP information day, Paris, November 27, 2014
  • 2. Agenda • About CCS • General OLR-workflow for mass digitization • Layout and structure analysis • ENP OLR workflow • Quality assurance • Output – METS/ALTO package • Use of structural data – Access and presentation
  • 3. About CCS • CCS Content Conversion Specialists GmbH (Hamburg), as technical project partner, will provide its expertise and docWorks technology to set up and operate a mass digitization workflow for creating high-quality structured content from 2 million scanned newspaper pages provided by 5 library partners • Page volume: BNF=1,000 k, NLE=500 k, SUB HH=480 k, NLF=90 k, SBB=10 k • The distributed OLR workflow lets project partners (content providers) take part in the integrated quality assurance process • CCS is also contributing to the specification of the ENMAP metadata model
  • 4. General workflow for mass digitization Re-Scan Conversion Imaging Layout Analysis OCR ISR QA + Correction Reject Condition Final Output Delivery QA random Scanning Image Metadata Database / Repository • Automated QA Document UID Barcode Item Tracking Manual QA • in-house • near-shore • off-shore • multiple locations Manual QA • in-house • near-shore Check in Check out Scanner • Robot- • Book- • Document- • Microfilm- QA+Correction Z 39.50 Metadata
  • 5. Layout and structure analysis • Layout analysis based on „bottom up“ approach • General rule system enables recognition of words, text lines, text blocks, columns and classification of text blocks, illustrations, advertisements, tables and the following page types: - title page (the title page of an issue) - content page (a page that consists of content/text only) - illustration page (a page that has at least one illustration) - advertisement page (a page that contains adverts only) • Structure analysis through classification of headlines and grouping of zones into articles (incl. article continuation)
  • 6. ENP OLR workflow | Conversion without scanning • Digital Image • Metadata Delivery • Digital Object Return • Inspection / Automatic QA • Doc Delivery Reject Material location Conversion facility Conversion MD Recording
  • 7. Possible conversion scenarios A) Conversion at library (on-site) B) Conversion off-shore at CCS data center, final QA at the library via internet transfer (remote QA solution) C) Conversion off-shore at CCS, final QA at the library by backup shipment
  • 8. Scenario B | Remote QA at library Internet Storage dW Share Master IN dW Share POOL OUT Offshore Processing @ CCS OUTPUT METS ALTO Storage POOL RQA QA on-site @ Library INPUT
  • 9. Quality assurance • @ CCS | Automated markup and basic manual correction: - Headlines, illustrations, tables, captions, advertisements, etc. - Article segmentation and grouping of zones into articles (incl. continuation) • @ Content Provider (Library) Recommended: - Zoning: correct classification of blocks as „text“ or „illustration“ - Article segmentation: correct identification of headlines/text blocks/captions - Grouping: correct grouping of blocks (text, illustration) to articles - Metadata: correct title, issue date and issue number Optional: - Page types: correct page types - Page numbers: correct page sequence - OCR: perform text correction of specific zones (e.g. headlines, captions)
  • 10. Output | METS/ALTO package • METS/ALTO metadata schemas describe the structured digital output object • A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure and manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information about a page (print space, margins, coordinates, OCR results). • Benefits of structural markup: - better browsing and more precise text search - better access and display on tablet and mobile devices - automated article classification and clustering through data/text mining and linguistic technologies - user engagement for manual online text correction, article classification, annotation, building personal collections, etc. - sharing articles via social media platforms like Facebook, Twitter, etc. METS = Metadata Encoding and Transmission Standard; ALTO = Analyzed Layout and Text Object
  • 11. Access and Presentation (I) • Sample presentation system (Veridian) • Browse by date, title • Text search • Article hit list • Word highlighting
  • 12. Access and Presentation (II) • Issue • Table of contents
  • 13. Access and Presentation (III) • Text & image view • User text correction • Article clipping • Print article • Distribute via email and social media platforms
  • 14. Thank you for your attention! c.gravenhorst@content-conversion.com www.europeana-newspapers.eu
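
Slide 10 notes that each ALTO file carries the print space, coordinates and OCR results of one page. As a rough illustration of what that means for downstream use, the sketch below reads text lines and their positions from an ALTO-style document using Python's standard library. The sample XML is hand-written for this example and is not ENP or docWorks output; the helper names (`local_name`, `read_text_lines`, `count_blocks`) are hypothetical. Element and attribute names (`TextBlock`, `TextLine`, `String`, `Illustration`, `CONTENT`, `HPOS`, `VPOS`) follow the ALTO schema, and the code ignores the namespace so it is not tied to one ALTO version.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written ALTO-style sample (hypothetical content, not project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="2800">
        <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="900" HEIGHT="60">
          <TextLine HPOS="100" VPOS="100" WIDTH="900" HEIGHT="60">
            <String CONTENT="Hamburg" HPOS="100" VPOS="100" WIDTH="300" HEIGHT="60"/>
            <SP/>
            <String CONTENT="News" HPOS="420" VPOS="100" WIDTH="200" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
        <Illustration ID="IL1" HPOS="100" VPOS="300" WIDTH="900" HEIGHT="600"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_text_lines(alto_xml):
    """Return one dict per TextLine: joined OCR words plus the line position."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Each String child holds one OCR word in its CONTENT attribute.
        words = [child.get("CONTENT", "") for child in elem
                 if local_name(child.tag) == "String"]
        lines.append({
            "text": " ".join(words),
            "hpos": float(elem.get("HPOS", 0)),
            "vpos": float(elem.get("VPOS", 0)),
        })
    return lines


def count_blocks(alto_xml):
    """Count text blocks and illustrations found on the page."""
    root = ET.fromstring(alto_xml)
    names = (local_name(e.tag) for e in root.iter())
    return Counter(n for n in names if n in ("TextBlock", "Illustration"))


if __name__ == "__main__":
    for line in read_text_lines(SAMPLE_ALTO):
        print(line)
    print(dict(count_blocks(SAMPLE_ALTO)))
```

A presentation system could use positions like these for word highlighting on the page image. Grouping blocks into articles relies on the logical structure in the METS file, which this sketch does not cover.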