What's up, Europeana Newspapers?
Clemens Neudecker (@cneudecker)
Staatsbibliothek zu Berlin –
Preußischer Kulturbesitz
A little bit of history
2012 – 2015: Europeana Newspapers
ICT-PSP Project (2012-2015)
31 Dec 2016: The European Library (TEL) closed
2017: DSI-2/3: Migration;
Newspapers Collection Plan
July 2018: Planned Re-Launch of Europeana
Newspapers as thematic collection
Main outcomes
– TEL Historic Newspapers Portal:
http://www.theeuropeanlibrary.org/tel4/newspapers
– Deliverables:
http://www.europeana-newspapers.eu/public-materials/deliverables/
– Tools:
http://www.europeana-newspapers.eu/public-materials/tools/
– Final Report:
http://europeananewspapers.github.io/
Data
• 1618 – 2016
• 12 countries
• 40 languages
• 120 TB
• Ca. 1,000 titles
• 3.3M issues
Data
• Metadata for more than 20 million pages
• 12 million pages processed with OCR
• 2 million pages processed with OLR
• Most content licensed as Public Domain
• All metadata licensed under CC0
• Copyright cut-off date
("copyright cliff of death")
Data
• JPEG 2000 images for use with an IIPImage server
• METS container with embedded MODS
for structural and bibliographic metadata
• ALTO for OCRed text
• EDM for Europeana
Europeana Newspapers METS/ALTO Profile (ENMAP)
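As a minimal sketch of how the ALTO full text can be read back, the Python below collects the words of each TextLine from an ALTO document. The sample XML is hypothetical and heavily reduced, it is not taken from the Europeana Newspapers data, and the helper names are assumptions made for illustration. Matching on local element names avoids depending on one particular ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced ALTO sample (not taken from the project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner"/>
            <SP/>
            <String CONTENT="Tageblatt"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgen-Ausgabe"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list[str]:
    """Return one string per ALTO TextLine, joining its String/@CONTENT values."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Hyphenation (HYP) elements and word coordinates are ignored here; a real consumer would probably want both.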
OCR/OLR
• OCR: ABBYY FineReader Engine 11
– Gothic license per page (A4!)
– 4 servers with 8 cores = 32 processing cores
– Average processing time of 5s per newspaper page
• OLR: CCS docWorks
– Article separation & page classification
– Possibility for post-correction/validation of results
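The OCR figures above allow a rough back-of-the-envelope estimate of total processing time. The sketch below assumes that the 5 s average applies per page on a single core, that all 32 cores run continuously with perfect scaling, and that the 12 million OCR pages from the data overview are the workload. None of these assumptions is confirmed by the slides.

```python
# Figures from the slides.
SERVERS = 4
CORES_PER_SERVER = 8
SECONDS_PER_PAGE = 5          # assumed to be per page on one core
OCR_PAGES = 12_000_000        # pages processed with OCR

SECONDS_PER_DAY = 86_400


def ocr_wall_clock_days(pages: int, seconds_per_page: float, cores: int) -> float:
    """Estimate wall-clock days, assuming perfect parallel scaling across cores."""
    total_core_seconds = pages * seconds_per_page
    return total_core_seconds / cores / SECONDS_PER_DAY


cores = SERVERS * CORES_PER_SERVER
days = ocr_wall_clock_days(OCR_PAGES, SECONDS_PER_PAGE, cores)

if __name__ == "__main__":
    print(f"{cores} cores, roughly {days:.1f} days of continuous processing")
```

Under these assumptions the estimate comes to roughly 21.7 days. If the 5 s were instead the throughput of the whole cluster, the same workload would take 32 times longer.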
Evaluation
• Scenario-based performance evaluation of
OCR/OLR using PAGE ground truth
• Ground truth dataset:
http://primaresearch.org/datasets/ENP
• Performance Evaluation Report:
http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
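The evaluation charts in this deck report bag-of-words success rates. As a simplified sketch of what a count-based bag-of-words comparison against ground truth can look like, the Python below counts how many ground-truth word occurrences are also found in the OCR output, ignoring word order. The tokenisation and the exact formula are assumptions made for illustration; the metric actually used in the project is defined in the Performance Evaluation Report and may differ.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count each word, ignoring word order."""
    return Counter(re.findall(r"\w+", text.lower()))


def bow_success_rate(ground_truth: str, ocr_output: str) -> float:
    """Share of ground-truth word occurrences that also occur in the OCR output.

    Illustrative definition only: each ground-truth word counts as matched
    at most as often as it appears in the OCR output.
    """
    gt = bag_of_words(ground_truth)
    ocr = bag_of_words(ocr_output)
    total = sum(gt.values())
    if total == 0:
        return 1.0 if not ocr else 0.0
    matched = sum((gt & ocr).values())  # multiset intersection: min of counts
    return matched / total


if __name__ == "__main__":
    truth = "Die Zeitung von gestern ist die Zeitung"
    ocr = "Die Zeitnng von gestern ist die Zeitung"
    print(f"{bow_success_rate(truth, ocr):.1%}")
```

Because word order is ignored, a measure like this is insensitive to reading-order errors, which is useful for pages with complex newspaper layouts.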
Evaluation
[Chart: Bag of Words OCR Evaluation, per language. X-axis: language setting; y-axis: success rate (0% to 100%). Bar values in order of appearance: 82.4%, 85.3%, 80.9%, 75.9%, 67.5%, 83.4%, 84.1%, 68.1%, 93.1%, 57.6%, 87.0%, 68.3%, 76.1%, 82.6%, 54.1%, 32.7%.]
[Chart: Bag of Words OCR Evaluation, per font. X-axis: font; y-axis: success rate (0% to 100%). Gothic: 67.3%; Normal: 81.4%; Mixed: 64.0%.]
[Chart: Layout Analysis Performance, per evaluation profile. X-axis: evaluation profile; y-axis: success rate (harmonic, area-based; 0% to 100%). Keyword search: 79.1%; Phrase search: 62.2%; Access via content structure: 55.9%; Print/ebook on demand: 58.8%; Content-based image retrieval: 94.7%.]
[Chart: Bag of Words OCR Evaluation, binarised image vs. original image. X-axis: image source; y-axis: success rate (70% to 77%). NCSR Binarisation: 74.35%; Original Image: 75.31%.]
[Chart: Bag of Words OCR Evaluation, FineReader vs. Tesseract. X-axis: OCR engine; y-axis: success rate (count-based; 0% to 100%). FineReader: 75.3%; Tesseract: 53.78%.]
Use in Research
• Oceanic Exchanges (Digging Into Data, 2017-2019)
• impresso (Swiss National Fund, 2017 – 2020)
• NewsEye (EU H2020, 2018 – 2020)
• CLARIN (EU ERIC)
• Europeana Research, Interviews with Researchers
• At Scientific Conferences
– DAS, ICDAR: Europeana Newspapers Ground Truth
– LREC, ACL: Europeana Newspapers NER Corpora
Oceanic Exchanges
(Digging Into Data, 2017-2019)
impresso
(Swiss National Fund, 2017 – 2020)
Use in Research
• Digital Humanities
– DHd AG Newspapers initiated at DHd 2018
– #HacktheNews workshop at DHNord 2018
– Roundtable on newspapers at DHBenelux 2018
• At the Berlin State Library:
– University Regensburg
– Technical University Dortmund
– Berlin-Brandenburg Academy of Sciences
Other Activities
• Rise of Literacy Generic Services project
• IIIF Newspaper Interest Group
– http://iiif.io/community/groups/newspapers/
– https://github.com/IIIF/awesome-iiif#newspapers
• TEI SIG Newspapers & Periodicals
– https://wiki.tei-c.org/index.php/SIG:Newspapers%26Periodicals
Creative Reuse
Berliner Schlagzeilen
• Created as part of Coding da Vinci Berlin 2017
• Twitter bot that tweets daily about the
news from 100 years ago
• Source code available:
https://github.com/shoutrlabs/berliner-schlagzeilen
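The core lookup behind a "news from 100 years ago" bot is a small piece of date arithmetic. The sketch below is a hypothetical illustration, not code from the Berliner Schlagzeilen repository: it computes the issue date to look up for a given day and falls back to 28 February when the target year has no 29 February.

```python
from datetime import date


def hundred_years_before(today: date) -> date:
    """Return the same calendar day 100 years earlier.

    Hypothetical helper: 29 February maps to 28 February when the
    target year is not a leap year (e.g. 2000-02-29 -> 1900-02-28).
    """
    target_year = today.year - 100
    try:
        return today.replace(year=target_year)
    except ValueError:
        # Only reachable for 29 February in a non-leap target year.
        return today.replace(year=target_year, day=28)


if __name__ == "__main__":
    print(hundred_years_before(date.today()).isoformat())
```

A bot would then query the newspaper collection for issues published on the returned date and post a headline from one of them.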
Altpapier App
• Created as part of Coding da Vinci Berlin 2017
• Android (and soon also iOS) app that shows
users newspaper articles and lets them
correct errors
• Available as source code
https://github.com/mariabecker/OldNews
and on the Play Store
https://play.google.com/store/apps/details?id=oldnews.de.oldnews
Visualizing European Newspapers
• Visualization prototype with large touch
interface composed of multiple screens
made by Sven Charleer of KU Leuven
Future Plans
Europeana Newspapers
Thematic Collection
The Situation in Germany
2012 – 2015: DFG Pilot Project
"Digitisation of historical newspapers"
Master Plan, Guidelines, etc.
2017: Relaunch of ZDB union catalog of serials
http://zdb-katalog.de/
2017: DFG Proposal (SBB, DDB involved)
"A national portal for digitised historical
newspapers at the German Digital Library"
2018: DFG Call for proposals
"Digitisation of historical newspapers"
Europeana Newspapers Thematic Collection Re-Launch

 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Europeana Newspapers Thematic Collection Re-Launch

  • 1. What’s up, Europeana Newspapers? Clemens Neudecker (@cneudecker), Staatsbibliothek zu Berlin – Preußischer Kulturbesitz
  • 2. A little bit of history • 2012 – 2015: Europeana Newspapers ICT-PSP Project • 31 Dec 2016: The European Library (TEL) closed • 2017: DSI-2/3: Migration; Newspapers Collection Plan • July 2018: Planned Re-Launch of Europeana Newspapers as thematic collection
  • 3. Main outcomes – TEL Historic Newspapers Portal: http://www.theeuropeanlibrary.org/tel4/newspapers – Deliverables: http://www.europeana-newspapers.eu/public-materials/deliverables/ – Tools: http://www.europeana-newspapers.eu/public-materials/tools/ – Final Report: http://europeananewspapers.github.io/
  • 4. Data • 1618 – 2016 • 12 countries • 40 languages • 120 TB • Approx. 1,000 titles • 3.3M issues
  • 5. Data • Metadata for more than 20 million pages • 12 million pages processed with OCR • 2 million pages processed with OLR • Most content licensed as Public Domain • All metadata licensed under CC0 • Copyright cut-off date ("copyright cliff of death")
  • 6. Data • JPEG 2000 images for use with IIPImage server • METS container with embedded MODS for structural and bibliographic metadata • ALTO for OCRed text • EDM for Europeana → Europeana Newspapers METS/ALTO Profile (ENMAP)
  • 7. OCR/OLR • OCR: ABBYY FineReader Engine 11 – Gothic license per page (A4!) – 4 servers with 8 cores = 32 processing cores – Average processing time of 5s per newspaper page • OLR: CCS docWorks – Article separation & page classification – Possibility for post-correction/validation of results
  • 8. Evaluation • Scenario-based performance evaluation of OCR/OLR using PAGE ground truth • Ground truth dataset: http://primaresearch.org/datasets/ENP • Performance Evaluation Report: http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D3.5_Performance_Evaluation_Report_1.0.pdf
  • 9. Evaluation (charts) • Bag of Words OCR evaluation per language (success rate by language setting, in chart order): 82.4%, 85.3%, 80.9%, 75.9%, 67.5%, 83.4%, 84.1%, 68.1%, 93.1%, 57.6%, 87.0%, 68.3%, 76.1%, 82.6%, 54.1%, 32.7% • Bag of Words OCR evaluation per font (success rate): Gothic 67.3%, Normal 81.4%, Mixed 64.0% • Layout analysis performance per evaluation profile (success rate, harmonic, area-based): Keyword search 79.1%, Phrase search 62.2%, Access via content structure 55.9%, Print/ebook on demand 58.8%, Content based image retrieval 94.7% • Bag of Words OCR evaluation, binarised image vs. original image (success rate): NCSR Binarisation 74.35%, Original Image 75.31% • Bag of Words OCR evaluation, FineReader vs. Tesseract (success rate, count-based): FineReader 75.3%, Tesseract 53.78%
  • 11. Use in Research • Oceanic Exchanges (Digging Into Data, 2017-2019) • impresso (Swiss National Fund, 2017 – 2020) • NewsEye (EU H2020, 2018 – 2020) • CLARIN (EU ERIC) • Europeana Research, Interviews with Researchers • At Scientific Conferences – DAS, ICDAR: Europeana Newspapers Ground Truth – LREC, ACL: Europeana Newspapers NER Corpora
  • 14. Use in Research • Digital Humanities – DHd AG Newspapers initiated at DHd 2018 – #HacktheNews workshop at DHNord 2018 – Roundtable on newspapers at DHBenelux 2018 • At the Berlin State Library: – University of Regensburg – Technical University of Dortmund – Berlin-Brandenburg Academy of Sciences
  • 15. Other Activities • Rise of Literacy Generic Services project • IIIF Newspaper Interest Group – http://iiif.io/community/groups/newspapers/ – https://github.com/IIIF/awesome-iiif#newspapers • TEI SIG Newspapers & Periodicals – https://wiki.tei-c.org/index.php/SIG:Newspapers%26Periodicals
  • 17. Berliner Schlagzeilen • Created as part of Coding da Vinci Berlin 2017 • Twitter bot that tweets daily about the news from 100 years ago • Source code available: https://github.com/shoutrlabs/berliner-schlagzeilen
  • 20. Altpapier App • Created as part of Coding da Vinci Berlin 2017 • Android (and soon also iOS) app that shows the user newspaper articles and lets them correct errors • Source code available at https://github.com/mariabecker/OldNews and app available on the Play Store at https://play.google.com/store/apps/details?id=oldnews.de.oldnews
  • 22. Visualizing European Newspapers • Visualization prototype with large touch interface composed of multiple screens made by Sven Charleer of KU Leuven
  • 26. The Situation in Germany • 2012 – 2015: DFG Pilot Project "Digitisation of historical newspapers": Master Plan, Guidelines, etc. • 2017: Relaunch of ZDB union catalog of serials http://zdb-katalog.de/ • 2017: DFG Proposal (SBB, DDB involved) "A national portal for digitised historical newspapers at the German Digital Library" • 2018: DFG Call for proposals "Digitisation of historical newspapers"