Digital Newspapers:
Processing in Europeana Newspapers
Information Day SBB
Berlin, 27 February 2014
Clemens Neudecker, KB, Twitter: @cneudecker
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Overview
• Goals & challenges
• Newspapers in the project
• Workflow & technologies
• Questions & answers
Goals
• Process 8 million newspaper pages with OCR (UIBK)
• Process 2 million newspaper pages with OLR (CCS)
• Build NER software for 3 languages (KB)
• Develop tools that automate the workflow
• Produce guidelines and recommendations (“best practices”)
Challenges
• Quality vs. throughput
• Complexity of newspaper layouts (columns, advertisements, illustrations)
• Widely varying quality of the digitised material (microfilm, bitonal)
• Different file formats, languages, alphabets
• Historical spelling variants
• Clearly structured and largely automated workflow
The Newspapers
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
Workflow
OCR @ UIBK
• OCR = Optical Character Recognition
• Technology: ABBYY FineReader SDK
• State-of-the-art OCR software; supports Fraktur/Latin/Cyrillic out of the box
• Export as a METS/ALTO package consisting of images, metadata & full text
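
To make the METS/ALTO export more concrete, here is a minimal sketch of how the full text might be read back out of a single ALTO page file. It relies only on the ALTO convention that each TextLine element holds String elements whose CONTENT attribute carries one word. The sample document, the function names, and the namespace version are illustrative assumptions, not taken from the project's actual output.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml: str) -> List[str]:
    """Return one string per ALTO TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


# Made-up sample page; real ALTO files also carry coordinates, styles and confidences.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Berliner"/><SP/><String CONTENT="Tageblatt"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Morgenausgabe"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_text_lines(SAMPLE))
```

Matching on the local tag name rather than a fixed namespace keeps the sketch usable across ALTO schema versions, which differ in their namespace URI.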
Tools (BCT)
• BCT = Binarisation and Colour Reduction Tool
• Goal: convert colour/greyscale scans to 1-bit using a method optimised for OCR (GPP) + JP2k
• Background: reduce the file size of the images to keep the data volume manageable (hundreds of TB)
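
The idea of reducing greyscale scans to 1-bit can be illustrated with a small sketch. Note that this is not the GPP method named on the slide: it swaps in a plain global Otsu threshold, chosen only because it is compact and needs no libraries. The function names and pixel values are made up for illustration.

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that maximises the between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels):
    """Map every pixel to 0 (dark) or 1 (light) using the Otsu threshold."""
    threshold = otsu_threshold(pixels)
    return [1 if value > threshold else 0 for value in pixels]


def pack_bits(bits):
    """Pack 0/1 values into bytes, eight pixels per byte, most significant bit first."""
    packed = bytearray()
    for start in range(0, len(bits), 8):
        chunk = bits[start:start + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        byte <<= 8 - len(chunk)  # pad a short final chunk with zero bits
        packed.append(byte)
    return bytes(packed)


grey = [10, 12, 11, 200, 210, 205, 13, 198]  # eight 8-bit pixels (made-up values)
bits = binarise(grey)
print(bits, len(pack_bits(bits)), "byte(s) instead of", len(grey))
```

Even before any image compression is applied, going from 8 bits to 1 bit per pixel cuts the raw size by a factor of eight, which is the kind of reduction the slide's "hundreds of TB" remark is concerned with.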
Tools (FRT)
• FRT = File Rename Tool
• Goal: support libraries with data delivery by renaming files and folders
• Background: prepare the data in the structure required for automated processing
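
A rename step of this kind could look roughly like the sketch below. The slides do not specify the naming scheme, so both the source pattern (`YYYYMMDD_pN.tif`) and the target layout (`<title>/<YYYY-MM-DD>/<page>.tif`) are hypothetical, as are all function and variable names. The sketch only computes a rename plan and touches no files.

```python
import re
from pathlib import PurePosixPath

# Hypothetical source naming: date, then page number, e.g. "18990312_p3.tif".
SOURCE_PATTERN = re.compile(r"^(?P<date>\d{8})_p(?P<page>\d+)\.(?P<ext>tif|jp2)$")


def plan_renames(file_names, title_id):
    """Map each recognised source name to a target path; collect the rest as rejected."""
    plan, rejected = {}, []
    for name in file_names:
        match = SOURCE_PATTERN.match(name)
        if match is None:
            rejected.append(name)
            continue
        date = match["date"]
        issue_dir = f"{date[:4]}-{date[4:6]}-{date[6:]}"
        page = int(match["page"])
        plan[name] = PurePosixPath(title_id, issue_dir, f"{page:04d}.{match['ext']}")
    return plan, rejected


plan, rejected = plan_renames(["18990312_p3.tif", "scan-final.tif"], "example-title")
for old, new in plan.items():
    print(old, "->", new)
print("needs manual attention:", rejected)
```

Separating the plan from the actual renaming lets a library review the mapping, and the list of unrecognised files, before anything on disk changes.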
Tools (FAT)
• FAT = File Analyzer Tool
• Goal: check and validate the data structure before delivery for processing
• Background: assurance for everyone involved that the data is in a suitable form for further processing
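
A pre-delivery check of this sort can be sketched as a function that takes a description of the delivery and returns a list of problems. The rules used here (issue folders named `YYYY-MM-DD`, page images named `0001.tif`, `0002.tif`, ... without gaps) are assumptions made for illustration; the actual rules applied by the FAT are not given on the slide.

```python
import re

# Hypothetical delivery rules, for illustration only.
ISSUE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")
PAGE_FILE = re.compile(r"^(\d{4})\.(tif|jp2)$")


def analyse_delivery(delivery):
    """Check a {folder name: [file names]} mapping and return a list of problems."""
    problems = []
    for folder, files in sorted(delivery.items()):
        if not ISSUE_DIR.match(folder):
            problems.append(f"{folder}: folder name is not YYYY-MM-DD")
        pages = []
        for name in files:
            match = PAGE_FILE.match(name)
            if match:
                pages.append(int(match.group(1)))
            else:
                problems.append(f"{folder}/{name}: unexpected file name")
        if not pages:
            problems.append(f"{folder}: no page images")
        elif sorted(pages) != list(range(1, len(pages) + 1)):
            problems.append(f"{folder}: page numbers are not 1..{len(pages)}")
    return problems


delivery = {
    "1899-03-12": ["0001.tif", "0002.tif"],
    "1899-03-13": ["0001.tif", "0003.tif"],  # page 2 is missing
    "misc": ["notes.txt"],
}
for problem in analyse_delivery(delivery):
    print(problem)
```

An empty result list is the "guarantee" in miniature: the delivery can be handed to the automated workflow without manual inspection.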
OLR @ CCS
• OLR = Optical Layout Recognition
• Technology: docWorks
• Segmentation of the page into columns, articles, headings, “page types” (advertisements)
• Export as a METS/ALTO package consisting of images, metadata & full text
OLR article recognition
NER @ KB
• NER = Named Entity Recognition
• Technology: Stanford CRF-NER
• 3 languages: German, Dutch, French
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Recognises 3 classes: person, location, organisation
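
A sequence tagger such as a CRF assigns one label per token, so a small post-processing step is needed to turn runs of labelled tokens into whole entities. The sketch below shows that step in isolation. The label names (`PER`, `LOC`, `ORG`, `O`) and the sample sentence are assumptions for illustration and do not reflect the actual output format of the europeananp-ner tool.

```python
def group_entities(tagged_tokens, outside="O"):
    """Merge consecutive tokens that share a label into (text, label) entities."""
    entities = []
    current_words, current_label = [], None
    for word, label in tagged_tokens:
        if label == current_label and label != outside:
            current_words.append(word)  # same entity continues
            continue
        if current_words:
            entities.append((" ".join(current_words), current_label))
        if label == outside:
            current_words, current_label = [], None
        else:
            current_words, current_label = [word], label
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


# Hypothetical tagger output for a short phrase.
tagged = [
    ("Die", "O"),
    ("Koninklijke", "ORG"),
    ("Bibliotheek", "ORG"),
    ("in", "O"),
    ("Den", "LOC"),
    ("Haag", "LOC"),
]
print(group_entities(tagged))
```

One limitation of this simple grouping: two different entities of the same class that stand directly next to each other are merged into one, since plain class labels carry no boundary information.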
Results for NL
Model trained on manually tagged newspaper pages from 1618 to 1900.
100 pages with a total of 183,421 tokens (“words”)

            Persons   Locations   Organisations
Precision   0.940     0.950       0.942
Recall      0.588     0.760       0.559
F-measure   0.689     0.838       0.671

* K-fold cross-validation: 1/4 of the training data is held out for evaluation only
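
The evaluation set-up can be sketched as follows, assuming k = 4 (inferred from the "1/4" in the footnote) and the usual definition of the F-measure as the harmonic mean of precision and recall. The function names and the use of page indices instead of real pages are illustrative.

```python
def k_fold_splits(items, k=4):
    """Yield (train, held_out) pairs; every item is held out exactly once."""
    for fold in range(k):
        held_out = items[fold::k]
        train = [item for index, item in enumerate(items) if index % k != fold]
        yield train, held_out


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure, F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pages = list(range(100))  # stand-ins for the 100 manually tagged pages
for train, held_out in k_fold_splits(pages):
    print(len(train), "pages for training,", len(held_out), "for evaluation")

print(round(f_measure(0.940, 0.588), 3))
```

Applied to the precision and recall listed for persons, the formula gives roughly 0.72 rather than the 0.689 on the slide. The slide does not say how the scores were aggregated across folds; averaging the F-measure per fold, rather than computing it from averaged precision and recall, would be one possible explanation for the lower value.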
NER vs. OCR
[Figure: chart with the series NER and OCR; axis range 0.25 to 0.95]
Thank you for your attention!
Any questions?
clemens.neudecker@kb.nl

Europeana Newspapers German infoday - Verarbeitung Digitale Zeitungen

  • 1. Digitale Zeitungen – Verarbeitung in Europeana Newspapers Information Day SBB Berlin, 27 Februar 2014 Clemens Neudecker, KB, Twitter: @cneudecker
  • 2. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Übersicht • Ziele & Herausforderungen • Zeitungen im Projekt • Workflow & Technologien • Fragen & Antworten 2
  • 3. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Ziele • Verarbeitung von 8 Mio. Zeitungsseiten mit OCR (UIBK) • Verarbeitung von 2 Mio. Zeitungsseiten mit OLR (CCS) • Erstellen von Software für NER in 3 Sprachen (KB) • Entwicklung von Tools die den Workflow automatisieren • Erstellen von Richtlinien und Empfehlungen (“best practices”) 3
  • 4. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Herausforderungen • Qualität vs. Durchsatz • Komplexität von Zeitungslayouts (Spalten, Anzeigen, Abbildungen) • Stark schwankende Qualität der Digitalisate (Microfilm, Bitonal) • Unterschiedliche Dateiformate, Sprachen, Alphabete • Historische Schreibvarianten • Klar strukturierter und weitgehend automatisierter Workflow 4
  • 5. The Newspapers
  • 6. Europeana Newspapers Dataset (1)
  • 7. Europeana Newspapers Dataset (2)
  • 8. Europeana Newspapers Dataset (3)
  • 9. Europeana Newspapers Dataset (4)
  • 10. Workflow
  • 11. OCR @ UIBK • OCR = Optical Character Recognition • Technology: ABBYY FineReader SDK • State-of-the-art OCR software that supports Fraktur, Latin and Cyrillic scripts out of the box • Export as a METS/ALTO package consisting of images, metadata and full text
  • 12. Tools (BCT) • BCT = Binarisation and Colour Reduction Tool • Goal: convert colour and greyscale scans to 1-bit using a method optimised for OCR (GPP) + JP2k • Background: reduce the file size of the images to keep the data volume manageable (hundreds of TB)
  • 13. Tools (FRT) • FRT = File Rename Tool • Goal: support the libraries with data delivery by renaming files and folders • Background: prepare the data in the structure required for automated processing
  • 14. Tools (FAT) • FAT = File Analyzer Tool • Goal: check and validate the data structure before delivery for processing • Background: assure all parties that the data is in a suitable form for further processing
  • 15. OLR @ CCS • OLR = Optical Layout Recognition • Technology: docWorks • Segmentation of the page into columns, articles, headings and "page types" (advertisements) • Export as a METS/ALTO package consisting of images, metadata and full text
  • 16. OLR article recognition
  • 17. NER @ KB • NER = Named Entity Recognition • Technology: Stanford CRF-NER • 3 languages: German, Dutch, French • Open source: https://github.com/KBNLresearch/europeananp-ner • Recognition of 3 classes: person, location, organisation
  • 18. Results for NL • Model trained on manually tagged newspaper pages from 1618 to 1900: 100 pages with a total of 183,421 tokens ("words") * • * K-fold cross-validation: 1/4 of the training data is used only for evaluation
                   Persons   Locations   Organisations
      Precision    0.940     0.950       0.942
      Recall       0.588     0.760       0.559
      F-measure    0.689     0.838       0.671
  • 19. NER vs. OCR (chart with the series NER and OCR; axis ticks from 0.25 to 0.95)
  • 20. Thank you for your attention! Any questions? clemens.neudecker@kb.nl
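
Slides 11 and 15 both name METS/ALTO as the export format for the full text. As a rough illustration of what downstream tools can read from the ALTO part, the sketch below pulls the recognised words and their word-confidence values out of a small ALTO fragment. The fragment is hand-written for this example and is not project data, and the function names and the 0.8 threshold are assumptions. It relies only on the standard ALTO elements `TextLine` and `String` and the `CONTENT` and `WC` attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment; not taken from the project data.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner" WC="0.93"/>
            <SP/>
            <String CONTENT="Tageblatt" WC="0.88"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgenausgabe" WC="0.71"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the recognised text, one string per ALTO TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [child.get("CONTENT", "") for child in element
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines


def low_confidence_words(xml_text, threshold=0.8):
    """Return words whose WC (word confidence) attribute is below the threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        confidence = element.get("WC")
        # WC is optional in ALTO; words without it are skipped here.
        if confidence is not None and float(confidence) < threshold:
            flagged.append(element.get("CONTENT", ""))
    return flagged


if __name__ == "__main__":
    for line in alto_lines(ALTO_SAMPLE):
        print(line)
    print(low_confidence_words(ALTO_SAMPLE))
```

Matching on the local element name rather than a fixed namespace is a deliberate simplification, since different ALTO versions use different namespace URIs.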
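
Slide 18 reports precision, recall and F-measure from a cross-validation run. The sketch below shows how these three scores are conventionally computed from true-positive, false-positive and false-negative counts, and how per-fold scores can be averaged. The counts are invented for illustration and are not the project's data, and the slide does not say how its figures were aggregated. One point the example makes concrete: when scores are averaged per fold, the averaged F-measure is generally not the harmonic mean of the averaged precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts for one entity class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_over_folds(fold_counts):
    """Average precision, recall and F1 over folds (each fold is a (tp, fp, fn) tuple)."""
    scores = [precision_recall_f1(*counts) for counts in fold_counts]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


if __name__ == "__main__":
    # Hypothetical counts for two folds; not taken from the slides.
    folds = [(9, 1, 1), (3, 0, 7)]
    mean_p, mean_r, mean_f = average_over_folds(folds)
    harmonic_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)
    print(f"mean precision={mean_p:.3f} recall={mean_r:.3f} F1={mean_f:.3f}")
    print(f"harmonic mean of the averaged P and R={harmonic_of_means:.3f}")
```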