Europeana Newspapers German Infoday - Processing Digital Newspapers
1. Digital Newspapers: Processing in Europeana Newspapers
Information Day SBB
Berlin, 27 February 2014
Clemens Neudecker, KB, Twitter: @cneudecker
2. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Overview
• Goals & Challenges
• Newspapers in the Project
• Workflow & Technologies
• Questions & Answers
3. Goals
• Process 8 million newspaper pages with OCR (UIBK)
• Process 2 million newspaper pages with OLR (CCS)
• Build NER software for 3 languages (KB)
• Develop tools that automate the workflow
• Produce guidelines and recommendations (“best practices”)
4. Challenges
• Quality vs. throughput
• Complexity of newspaper layouts (columns, advertisements, illustrations)
• Widely varying quality of the digitised images (microfilm, bitonal)
• Different file formats, languages and alphabets
• Historical spelling variants
• A clearly structured and largely automated workflow
5. The Newspapers
6. Europeana Newspaper Dataset (1)
7. Europeana Newspaper Dataset (2)
8. Europeana Newspapers Dataset (3)
9. Europeana Newspapers Dataset (4)
10. Workflow
11. OCR @ UIBK
• OCR = Optical Character Recognition
• Technology: ABBYY FineReader SDK
• State-of-the-art OCR software; supports Fraktur, Latin and Cyrillic scripts out of the box
• Export as a METS/ALTO package consisting of images, metadata and full text
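To illustrate what the full-text part of a METS/ALTO package looks like to a consumer, the following is a minimal sketch that reads the recognised words out of an ALTO file. It assumes only the standard ALTO element names `TextLine` and `String` with its `CONTENT` attribute; the sample document and the helper names (`local_name`, `alto_text_lines`) are invented for illustration and are not part of the project's tooling.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}String' becomes 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml):
    """Return the recognised text of an ALTO document, one string per TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each word is a String element; its text lives in the CONTENT attribute.
        words = [child.get("CONTENT", "") for child in element
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines


# Invented sample; real files carry coordinates, styles and confidence values too.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Berliner"/><SP/><String CONTENT="Zeitung"/></TextLine>
    <TextLine><String CONTENT="Montag"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_text_lines(SAMPLE))
```

Matching on the local element name rather than a fixed namespace keeps the sketch usable across ALTO schema versions, which use different namespace URIs.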
12. Tools (BCT)
• BCT = Binarisation and Colour Reduction Tool
• Goal: convert colour/greyscale scans to 1-bit using a method optimised for OCR (GPP) + JPEG 2000 (JP2k)
• Background: reduce the file size of the images to keep the data volume manageable (hundreds of TB)
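As a rough illustration of what binarisation means, the sketch below converts grey values to 1-bit with Otsu's global threshold. This is a swapped-in textbook technique chosen for brevity; it is not the GPP method the BCT uses, and the pixel values are an invented sample.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises the between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # all pixels are already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels, threshold):
    """Map grey values to 1-bit: 0 (black) at or below the threshold, 1 (white) above."""
    return [0 if value <= threshold else 1 for value in pixels]


grey = [10, 12, 11, 200, 210, 205]  # invented sample: three dark, three light pixels
threshold = otsu_threshold(grey)
print(threshold, binarise(grey, threshold))
```

A single global threshold tends to struggle with uneven illumination and bleed-through, which is one reason OCR-oriented tools usually rely on more elaborate methods.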
13. Tools (FRT)
• FRT = File Rename Tool
• Goal: support libraries with data delivery by renaming files and folders
• Background: bring the data into the structure required for automated processing
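The kind of renaming described here can be sketched as a dry-run plan that maps existing file names to a uniform pattern. The target scheme (zero-padded sequence numbers with a lower-case extension) and the function names are hypothetical; the slide does not state the naming convention the FRT actually applies.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that orders 'page10' after 'page2' by comparing digit runs as numbers."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def plan_renames(filenames, width=4):
    """Map each original name to a zero-padded sequential name, keeping the extension."""
    ordered = sorted(filenames, key=natural_key)
    return {old: f"{index:0{width}d}{Path(old).suffix.lower()}"
            for index, old in enumerate(ordered, start=1)}


# Invented file names for illustration.
plan = plan_renames(["page10.TIF", "page2.tif", "page1.tif"])
for old, new in plan.items():
    print(old, "->", new)
```

Producing the plan separately from applying it lets a library review the mapping before any file on disk is touched.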
14. Tools (FAT)
• FAT = File Analyzer Tool
• Goal: check and validate the data structure before delivery for processing
• Background: assurance for all parties that the data is in a suitable form for further processing
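A pre-delivery check of this kind might look like the sketch below. The expected layout (one folder per issue containing only image files), the list of allowed extensions and the function name `check_delivery` are assumptions made for illustration; the actual rules enforced by the FAT are not given on the slide.

```python
from pathlib import Path

# Assumption: the set of accepted image formats, not the project's real list.
ALLOWED_SUFFIXES = {".tif", ".tiff", ".jp2", ".jpg"}


def check_delivery(root):
    """Return a list of human-readable problems found under root (empty list = OK)."""
    problems = []
    issue_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    if not issue_dirs:
        problems.append("no issue folders found")
    for issue in issue_dirs:
        files = sorted(p for p in issue.iterdir() if p.is_file())
        images = [p for p in files if p.suffix.lower() in ALLOWED_SUFFIXES]
        if not images:
            problems.append(f"{issue.name}: no image files")
        for path in files:
            if path.suffix.lower() not in ALLOWED_SUFFIXES:
                problems.append(f"{issue.name}/{path.name}: unexpected file type")
            elif path.stat().st_size == 0:
                problems.append(f"{issue.name}/{path.name}: empty file")
    return problems
```

Calling `check_delivery("/path/to/delivery")` returns an empty list when nothing is wrong, so the result can be used directly as a pass/fail gate before data is shipped.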
15. OLR @ CCS
• OLR = Optical Layout Recognition
• Technology: docWorks
• Segmentation of the page into columns, articles, headings and “page types” (advertisements)
• Export as a METS/ALTO package consisting of images, metadata and full text
16. OLR Article Recognition
17. NER @ KB
• NER = Named Entity Recognition
• Technology: Stanford CRF-NER
• 3 languages: German, Dutch, French
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Recognition of 3 classes: person, location, organisation
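A sequence tagger such as a CRF assigns one label per token, so multi-word names have to be reassembled afterwards. The sketch below shows one simple way to do that. The label strings (`PER`, `LOC`, `ORG`, `O`) and the sample sentence are assumptions for illustration and may differ from what the project's models emit.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside="O"):
    """Merge runs of consecutive tokens that share a label into (text, label) entities."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((" ".join(token for token, _ in run), label))
    return entities


# Invented example of per-token tagger output.
tagged = [("Clemens", "PER"), ("Neudecker", "PER"), ("works", "O"), ("at", "O"),
          ("the", "O"), ("KB", "ORG"), ("in", "O"), ("Den", "LOC"), ("Haag", "LOC")]

print(group_entities(tagged))
```

One limitation of this simple grouping is that two distinct entities of the same class that directly follow each other are merged into one.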
18. Results for Dutch (NL)
Model trained on manually tagged newspaper pages from 1618 to 1900:
100 pages with a total of 183,421 tokens (“words”).

            Persons   Locations   Organisations
Precision   0.940     0.950       0.942
Recall      0.588     0.760       0.559
F-measure   0.689     0.838       0.671

* K-fold cross-validation: 1/4 of the training data is held out for evaluation only.
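For readers unfamiliar with the metrics, the sketch below shows the usual balanced F-measure (harmonic mean of precision and recall) and a simple k-fold split with k = 4, matching the "1/4 held out" description. The function names and the interleaved fold assignment are illustrative choices, not the project's evaluation code. Recomputing F from the table's precision and recall gives values slightly higher than the reported ones (roughly 0.72, 0.84 and 0.70); one possible explanation, which the slide does not confirm, is that the reported scores were averaged per fold.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def k_fold_splits(items, k=4):
    """Yield (training, evaluation) pairs; each item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield training, held_out


# Precision and recall taken from the table.
for name, p, r in [("Persons", 0.940, 0.588),
                   ("Locations", 0.950, 0.760),
                   ("Organisations", 0.942, 0.559)]:
    print(name, round(f_measure(p, r), 3))

# With 8 hypothetical pages and k = 4, each round trains on 6 and evaluates on 2.
for training, evaluation in k_fold_splits(list(range(8)), k=4):
    print(len(training), len(evaluation))
```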
19. NER vs. OCR
[Chart: NER vs. OCR; labels NER and OCR, scale from 0.25 to 0.95]
20. Thank you for your attention!
Any questions?
clemens.neudecker@kb.nl