What's up, Europeana Newspapers?
Clemens Neudecker (@cneudecker)
Staatsbibliothek zu Berlin –
Preußischer Kulturbesitz
A little bit of history
2012 – 2015: Europeana Newspapers
ICT-PSP Project (2012-2015)
31 Dec 2016: The European Library (TEL) closed
2017: DSI-2/3: Migration;
Newspapers Collection Plan
July 2018: Planned Re-Launch of Europeana
Newspapers as thematic collection
Main outcomes
– TEL Historic Newspapers Portal:
http://www.theeuropeanlibrary.org/tel4/newspapers
– Deliverables:
http://www.europeana-newspapers.eu/public-materials/deliverables/
– Tools:
http://www.europeana-newspapers.eu/public-materials/tools/
– Final Report:
http://europeananewspapers.github.io/
Data
• 1618 – 2016
• 12 countries
• 40 languages
• 120 TB
• Ca. 1,000 titles
• 3.3M issues
Data
• Metadata for more than 20 million pages
• 12 million pages processed with OCR
• 2 million pages processed with OLR
• Most content licensed as Public Domain
• All metadata licensed under CC0
• Copyright cut-off date
("copyright cliff of death")
Data
• JPEG 2000 images for use with an IIPImage server
• METS container with embedded MODS
for structural and bibliographic metadata
• ALTO for OCRed text
• EDM for Europeana
Europeana Newspapers METS/ALTO Profile
(ENMAP)
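As a rough illustration of how the OCR text can be read back out of an ALTO file, here is a minimal Python sketch. It relies only on the general ALTO convention that each word is a `String` element with a `CONTENT` attribute inside a `TextLine`. The sample document, the helper names and the namespace-agnostic matching are assumptions made for this example, not part of ENMAP.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml: str) -> list[str]:
    """Return the text of each ALTO TextLine as one string."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each recognised word is a String element; its text is in CONTENT.
        words = [
            child.attrib["CONTENT"]
            for child in element
            if local_name(child.tag) == "String" and "CONTENT" in child.attrib
        ]
        lines.append(" ".join(words))
    return lines


# Hypothetical, heavily shortened ALTO document for demonstration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Berliner"/><SP/><String CONTENT="Tageblatt"/></TextLine>
    <TextLine><String CONTENT="Morgen-Ausgabe"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_lines(SAMPLE))
```

The sketch ignores hyphenation markers and word coordinates, which a real viewer or indexer would probably want to keep.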
OCR/OLR
• OCR: ABBYY FineReader Engine 11
– Gothic license per page (A4!)
– 4 servers with 8 cores = 32 processing cores
– Average processing time of 5s per newspaper page
• OLR: CCS docWorks
– Article separation & page classification
– Possibility for post-correction/validation of results
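The OCR figures above allow a back-of-the-envelope estimate of total processing time. The sketch below assumes that the 5 s average applies to one page on one core and that throughput scales linearly across all 32 cores. Both are assumptions for illustration; the slides do not state them, and I/O, licensing limits and re-runs are ignored.

```python
PAGES = 12_000_000        # pages processed with OCR (from the slides)
SECONDS_PER_PAGE = 5      # average processing time per page (from the slides)
CORES = 4 * 8             # 4 servers with 8 cores each (from the slides)

SECONDS_PER_DAY = 24 * 60 * 60


def processing_days(pages: int, seconds_per_page: float, parallel_workers: int) -> float:
    """Estimated wall-clock days, assuming perfectly linear scaling."""
    total_seconds = pages * seconds_per_page / parallel_workers
    return total_seconds / SECONDS_PER_DAY


sequential = processing_days(PAGES, SECONDS_PER_PAGE, 1)
parallel = processing_days(PAGES, SECONDS_PER_PAGE, CORES)
print(f"one worker: {sequential:.0f} days, {CORES} workers: {parallel:.1f} days")
```

Under these assumptions the full OCR run takes roughly three weeks on the whole cluster, against almost two years on a single core.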
Evaluation
• Scenario-based performance evaluation of
OCR/OLR using PAGE ground truth
• Ground truth dataset:
http://primaresearch.org/datasets/ENP
• Performance Evaluation Report:
http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
Evaluation
Bag of Words OCR Evaluation per Language
• Bar chart: success rate (0% to 100%) by language setting
• Values as extracted, language labels lost: 82.4%, 85.3%, 80.9%, 75.9%, 67.5%, 83.4%, 84.1%, 68.1%, 93.1%, 57.6%, 87.0%, 68.3%, 76.1%, 82.6%, 54.1%, 32.7%
Bag of Words OCR Evaluation per Font
• Bar chart: success rate (0% to 100%) by font
• Gothic: 67.3%
• Normal: 81.4%
• Mixed: 64.0%
Layout Analysis Performance per Evaluation Profile
• Bar chart: success rate (harmonic, area-based; 0% to 100%) by evaluation profile
• Keyword search: 79.1%
• Phrase search: 62.2%
• Access via content structure: 55.9%
• Print/ebook on demand: 58.8%
• Content-based image retrieval: 94.7%
Bag of Words OCR Evaluation: Binarised Image vs. Original Image
• Bar chart: success rate (axis from 70% to 77%) by image source
• NCSR Binarisation: 74.35%
• Original Image: 75.31%
Bag of Words OCR Evaluation: FineReader vs. Tesseract
• Bar chart: success rate (count-based, 0% to 100%) by OCR engine
• FineReader: 75.3%
• Tesseract: 53.78%
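To give a feel for what a count-based bag-of-words comparison measures, here is a simplified Python sketch. It treats ground truth and OCR output as unordered multisets of words and reports the share of ground-truth words that the OCR result also contains. The tokenisation, the function names and the exact formula are assumptions for illustration. The metric in the Performance Evaluation Report may well be defined differently.

```python
from collections import Counter
import re


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count each word, ignoring word order."""
    return Counter(re.findall(r"\w+", text.lower()))


def bag_of_words_success_rate(ground_truth: str, ocr_result: str) -> float:
    """Share of ground-truth word occurrences also found in the OCR result."""
    truth = bag_of_words(ground_truth)
    result = bag_of_words(ocr_result)
    total = sum(truth.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((truth & result).values())
    return matched / total


# Hypothetical example sentence with two OCR errors ("Zeitnng", "dle").
truth = "die Zeitung von heute ist die Geschichte von morgen"
ocr = "die Zeitnng von heute ist dle Geschichte von morgen"
rate = bag_of_words_success_rate(truth, ocr)
print(f"{rate:.1%}")
```

Because word order and position are ignored, a measure of this kind is closer to what keyword search needs than to what a faithful full-text transcription needs.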
Use in Research
• Digital Humanities
– DHd AG Newspapers initiated at DHd 2018
– #HacktheNews workshop at DHNord 2018
– Roundtable on newspapers at DHBenelux 2018
• At the Berlin State Library:
– University of Regensburg
– Technical University of Dortmund
– Berlin-Brandenburg Academy of Sciences
Other Activities
• Rise of Literacy Generic Services Project
• IIIF Newspaper Interest Group
– http://iiif.io/community/groups/newspapers/
– https://github.com/IIIF/awesome-iiif#newspapers
• TEI SIG Newspapers & Periodicals
– https://wiki.tei-c.org/index.php/SIG:Newspapers%26Periodicals
Berliner Schlagzeilen
• Created as part of Coding da Vinci Berlin 2017
• Twitter bot that tweets daily about the
news from 100 years ago
• Source code available:
https://github.com/shoutrlabs/berliner-schlagzeilen
Altpapier App
• Created as part of Coding da Vinci Berlin 2017
• Android (and soon also iOS) app that shows the
user newspaper articles with the possibility to
correct errors
• Available as source code
https://github.com/mariabecker/OldNews
and on the Play Store
https://play.google.com/store/apps/details?id=oldnews.de.oldnews
Visualizing European Newspapers
• Visualization prototype with large touch
interface composed of multiple screens
made by Sven Charleer of KU Leuven
The Situation in Germany
2012 – 2015: DFG Pilot Project
"Digitisation of historical newspapers"
Master Plan, Guidelines, etc.
2017: Relaunch of ZDB union catalog of serials
http://zdb-katalog.de/
2017: DFG Proposal (SBB, DDB involved)
"A national portal for digitised historical
newspapers at the German Digital Library"
2018: DFG Call for proposals
"Digitisation of historical newspapers"