NER for Europeana Newspapers
Clemens Neudecker (@cneudecker)
Staatsbibliothek zu Berlin –
Preußischer Kulturbesitz
Background
Why Named Entity Recognition?
• Analysis* of query log files from the National Library of Wales
newspaper website: the vast majority of search queries contain
either person or place names
* Paul Gooding, Exploring Usage of Digital Newspaper Archives through Web Log Analysis:
A Case Study of Welsh Newspapers Online (presented at DH2014, Lausanne)
• Improving Information
Retrieval
• Linking to authority files
(Linked Data)
• Historical Social Network
Analysis (HNA/SNA)
Languages
• Dutch (1614 – 1900)
• French (1814 – 1944)
• German (1721 – 1949)
• Together approx. 50% of the total collection
Many challenges
• Historical data (language)
• Noisy data (OCR)
• Multilingual data
• Lack of extensive metadata
• Lack of open resources
(tagged corpora, gazetteers)
• Lack of common annotation guidelines
• Limitations of annotation tools
Technology
Reuse of existing NER tools
• Simple evaluation of
– Apache OpenNLP
– Stanford CoreNLP
– GATE
• Stanford CoreNLP chosen because:
– Java-based (thread safe, scalable)
– Good performance (F-measure)
– Strong and active community
– Rather robust against noisy input (CRF)
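
The F-measure used to compare the tools is the harmonic mean of precision and recall. A minimal sketch, assuming entities are scored as exact (start, end, type) matches; the function name and the sample spans are illustrative and not taken from the project:

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold entities by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: (start token, end token, entity type).
gold = {(0, 1, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
predicted = {(0, 1, "PER"), (5, 6, "ORG"), (9, 11, "ORG")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is the strictest choice: a span with the right boundaries but the wrong type counts as both a false positive and a false negative.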
Approach
• Adaptation of Stanford CoreNLP by the
KB National Library of the Netherlands
to directly consume ENMAP (= Europeana
Newspapers METS/ALTO profile) objects
Approach
• Export option ALTO v3 with tags added
<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5">
</String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10">
</String>
…
<Tags>
<NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
<NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
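
A consumer of this export can resolve each TAGREFS value against the Tags section to recover the tagged words. The following is a rough sketch using only the Python standard library. It assumes a simplified, namespace-free fragment (real ALTO files declare an XML namespace that would need to be handled), and the wrapper elements, the untagged word and the function name are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free ALTO fragment modelled on the export above.
ALTO_SAMPLE = """
<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <Layout>
    <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
    <String CONTENT="visited" WC="0.9"/>
    <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  </Layout>
</alto>
"""


def tagged_words(alto_xml):
    """Return (word, entity type, OCR word confidence) for each tagged String."""
    root = ET.fromstring(alto_xml)
    # Map each tag ID to its entity type.
    tag_types = {
        tag.get("ID"): tag.get("TYPE") for tag in root.iter("NamedEntityTag")
    }
    results = []
    for string in root.iter("String"):
        # TAGREFS may hold several space-separated IDs; untagged words have none.
        for ref in string.get("TAGREFS", "").split():
            if ref in tag_types:
                results.append(
                    (string.get("CONTENT"), tag_types[ref], float(string.get("WC")))
                )
    return results


entities = tagged_words(ALTO_SAMPLE)
for word, entity_type, confidence in entities:
    print(word, entity_type, confidence)
```

Keeping the WC value next to each entity makes it possible to filter out entities recognised on words with low OCR confidence.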
Annotation
• Quick evaluation of annotation tools:
– brat
– WebAnno
– INL Attestation Tool
• Choice of INL Attestation Tool since:
– Optimized for tagging speed
– Supported by consortium partner (INL/IVDNT)
Corpus creation
• Selection of 100 pages per language
• Processing of the OCRed texts with
Stanford NER to get initial tagging results
• Manual verification and annotation
Corpus statistics
Language   # tokens   # PER   # LOC   # ORG
French      207,000   5,672   5,614   2,574
Dutch       182,483   4,492   4,448   1,160
German       96,735   7,914   6,143   2,784

Language   # tokens   # PER   # LOC   # ORG
French         100%   2.75%   2.71%   1.24%
Dutch          100%   2.46%   2.44%   0.64%
German         100%   8.18%   6.35%   2.88%

Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
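
The percentages in the second table appear to be the entity counts divided by the token counts. The sketch below recomputes them under that assumption; the dictionary layout and function name are illustrative. Results may differ from the slide in the last digit, for instance if a token count was rounded.

```python
# Token and entity counts copied from the corpus statistics table.
CORPUS = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch": {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735, "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def entity_shares(counts):
    """Return each entity count as a percentage of the token count."""
    tokens = counts["tokens"]
    return {
        label: 100 * count / tokens
        for label, count in counts.items()
        if label != "tokens"
    }


shares = {language: entity_shares(counts) for language, counts in CORPUS.items()}
for language, by_type in shares.items():
    row = "  ".join(f"{label} {value:.2f}%" for label, value in by_type.items())
    print(f"{language:<7} {row}")
```

The German sample has less than half the tokens of the Dutch one but more entities of every type, which is worth keeping in mind when comparing results across languages.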
ner-app
https://github.com/EuropeanaNewspapers/ner-app
ner-corpora
https://github.com/EuropeanaNewspapers/ner-corpora
Evaluation: NL
Evaluation: FR
Evaluation: DE
• M. Riedl and S. Padó, A Named Entity Recognition Shootout for
German, Proceedings of ACL, Melbourne, Australia (2018), to appear.
NER vs OCR success rate
[Chart: NER and OCR success rates, axis range 0.25 to 0.95]
Future Plans
Improving performance
• Possible additional features
– Distributional similarity (Clark 2003)
– Semantic generalization (Faruqui & Padó 2010)
– Word embeddings (Braune 2017)
• Gazetteers
– Person names, historical place names
• Data cleanup and improvement
– https://github.com/EuropeanaNewspapers/ner-corpora/wiki
Trias NER
• Combination and voting of different NER
classifiers, e.g.
– Stanford CoreNLP
– spaCy
– NLTK
• Inspiration:
https://github.com/KBNLresearch/Trias_NER
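
Combination and voting can be illustrated with a simple per-token majority vote. This is only a sketch of the general idea, not a description of how Trias_NER works. It assumes all classifiers label the same token sequence, and the label lists are invented for the example:

```python
from collections import Counter


def vote(labels_per_classifier, fallback="O"):
    """Combine per-token labels from several classifiers by majority vote."""
    combined = []
    # zip(*...) groups the labels that all classifiers gave to the same token.
    for token_labels in zip(*labels_per_classifier):
        label, count = Counter(token_labels).most_common(1)[0]
        # Accept a label only if more than half of the classifiers agree.
        combined.append(label if count > len(token_labels) / 2 else fallback)
    return combined


# Hypothetical outputs of three classifiers for the same five tokens.
tokens = ["Reynolds", "travelled", "to", "Baltimore", "yesterday"]
classifier_a = ["PER", "O", "O", "LOC", "O"]
classifier_b = ["PER", "O", "O", "ORG", "O"]
classifier_c = ["ORG", "O", "O", "LOC", "LOC"]

combined = vote([classifier_a, classifier_b, classifier_c])
print(list(zip(tokens, combined)))
```

Falling back to "O" when there is no majority favours precision over recall; with noisy OCR input that is one defensible default, but other tie-breaking rules are possible.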
Disambiguation
• Disambiguation of person and place names
• Inspiration:
https://github.com/KBNLresearch/europeananp-dbpedia-disambiguation
Linking
• Linking of recognised and disambiguated NEs
to authority files (e.g. Wikidata, GND)
• Inspiration:
https://github.com/KBNLresearch/dac
