Named Entity Recognition for Europeana Newspapers

  1. NER for Europeana Newspapers
     Clemens Neudecker (@cneudecker), Staatsbibliothek zu Berlin – Preußischer Kulturbesitz
  2. Background
  3. Why Named Entity Recognition?
     • Analysis* of query log files from the National Library of Wales newspaper website: a vast majority of search queries contain either person or place names
       * Paul Gooding, "Exploring Usage of Digital Newspaper Archives through Web Log Analysis: A Case Study of Welsh Newspapers Online", presented at DH2014, Lausanne
     • Improving information retrieval
     • Linking to authority files (Linked Data)
     • Historical social network analysis (HNA/SNA)
  4. Languages
     • Dutch (1614–1900)
     • French (1814–1944)
     • German (1721–1949)
     • Together approx. 50% of the total collection
  5. Many challenges
     • Historical data (language)
     • Noisy data (OCR)
     • Multilingual data
     • Lack of extensive metadata
     • Lack of open resources (tagged corpora, gazetteers)
     • Lack of common annotation guidelines
     • Limitations of annotation tools
  6. Technology
  7. Reuse of existing NER tools
     • Simple evaluation of
       – Apache OpenNLP
       – Stanford CoreNLP
       – GATE
     • Choice of Stanford CoreNLP since
       – Java-based (thread safe, scalable)
       – Good performance (F-measure)
       – Strong and active community
       – Rather robust against noisy input (CRF)
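
The F-measure used as a selection criterion above can be illustrated with a minimal sketch. It assumes entity-level scoring, where an entity counts as correct only if its span and type both match the gold annotation; the slides do not say which scoring scheme the evaluation used. The function name `precision_recall_f1` and the example spans are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 over (start, end, type) tuples.

    Assumption: exact-match scoring; the slides do not specify the scheme.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: token offsets plus entity type.
gold_entities = {(0, 1, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
predicted_entities = {(0, 1, "PER"), (5, 6, "ORG")}

precision, recall, f1 = precision_recall_f1(gold_entities, predicted_entities)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here one of two predictions is correct (precision 0.5) and one of three gold entities is found (recall 1/3), so F1, the harmonic mean of the two, is 0.4.
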
  8. Approach
     • Adaptation of Stanford CoreNLP by the KB National Library of the Netherlands to directly consume ENMAP (= Europeana Newspapers METS/ALTO profile) objects
  9. Approach
     • Export option: ALTO v3 with tags added

       <String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0" VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"> </String>
       <String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0" VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"> </String>
       …
       <Tags>
         <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
         <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
       </Tags>
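
A consumer of this export has to join each `String` element to its `NamedEntityTag` through the `TAGREFS` attribute. The sketch below shows one way to do that with Python's standard library. It makes several assumptions: the fragment is wrapped in a hypothetical `<alto>` root without an XML namespace (real ALTO files declare one, so element names would need to be qualified), `TAGREFS` may hold several whitespace-separated IDs, and the function name `tagged_strings` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two tagged words from the slide, wrapped in a hypothetical root element.
SNIPPET = """<alto>
  <String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
          VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
  <String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
          VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
</alto>"""


def tagged_strings(xml_text):
    """Return (content, entity type, word confidence) for every tagged word."""
    root = ET.fromstring(xml_text)
    # Index the tag definitions by ID so each reference is a dict lookup.
    tags = {tag.get("ID"): tag for tag in root.iter("NamedEntityTag")}
    results = []
    for string in root.iter("String"):
        confidence = string.get("WC")
        for ref in string.get("TAGREFS", "").split():
            tag = tags.get(ref)
            if tag is None:
                continue  # reference to a tag that is not a named entity
            results.append((
                string.get("CONTENT"),
                tag.get("TYPE"),
                float(confidence) if confidence is not None else None,
            ))
    return results


for content, entity_type, confidence in tagged_strings(SNIPPET):
    print(content, entity_type, confidence)
```

Keeping the word confidence (`WC`) next to the entity type makes it possible to filter out entities recognised on poorly OCRed words later on.
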
  10. Annotation
      • Quick evaluation of annotation tools:
        – BRAT
        – WebANNO
        – INL Attestation Tool
      • Choice of INL Attestation Tool since:
        – Optimized for tagging speed
        – Supported by consortium partner (INL/IVDNT)
  11. Corpus creation
      • Selection of 100 pages each per language
      • Processing of the OCRed texts with StanfordNER to get initial tagging results
      • Manual verification and annotation
  12. Corpus statistics

      Language   # tokens   # PER   # LOC   # ORG
      French      207,000   5,672   5,614   2,574
      Dutch       182,483   4,492   4,448   1,160
      German       96,735   7,914   6,143   2,784

      Language   # tokens   # PER   # LOC   # ORG
      French         100%   2.75%   2.71%   1.24%
      Dutch          100%   2.46%   2.44%   0.64%
      German         100%   8.18%   6.35%   2.88%

      Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
      French     16.6%                            19.9%
      Dutch      17.6%                            23.2%
      German     15.9% / 21.9%                    13.6%
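
The percentages in the corpus statistics appear to be the entity counts divided by the token counts. The short sketch below recomputes them from the counts on the slide; the dictionary layout and the function name `entity_shares` are illustrative choices, not part of the project's tooling.

```python
# Token and entity counts copied from the corpus statistics slide.
CORPUS = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch": {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735, "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def entity_shares(counts):
    """Share of tokens per entity type, in percent, rounded to two decimals."""
    tokens = counts["tokens"]
    return {
        label: round(100 * n / tokens, 2)
        for label, n in counts.items()
        if label != "tokens"
    }


for language, counts in CORPUS.items():
    print(language, entity_shares(counts))
```

The recomputed values match the slide except for French PER, which comes out at 2.74 rather than 2.75; one possible explanation is that the French token count on the slide is rounded.
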
  13. ner-app
      https://github.com/EuropeanaNewspapers/ner-app
  14. ner-corpora
      https://github.com/EuropeanaNewspapers/ner-corpora
  15. Evaluation: NL
  16. Evaluation: FR
  17. Evaluation: DE
      • M. Riedl and S. Padó, "A Named Entity Recognition Shootout for German", Proceedings of ACL, Melbourne, Australia (2018). To appear.
  18. NER vs OCR success rate
      [Chart: NER and OCR success rates; axis ticks from 0.25 to 0.95]
  19. Future Plans
  20. Improving performance
      • Possible additional features
        – Distributional similarity (Clark 2003)
        – Semantic generalization (Faruqui & Padó 2010)
        – Word embeddings (Braune 2017)
      • Gazetteers
        – Person names, historical place names
      • Data cleanup and improvement
        – https://github.com/EuropeanaNewspapers/ner-corpora/wiki
  21. Trias NER
      • Combination and voting of different NER classifiers, e.g.
        – Stanford CoreNLP
        – spaCy
        – NLTK
      • Inspiration: https://github.com/KBNLresearch/Trias_NER
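
The combination and voting idea can be sketched as a per-token majority vote. This is only an illustration under stated assumptions: all classifiers label the same tokenisation, a label wins only with a strict majority, and ties fall back to `"O"` (no entity). It is not taken from the Trias_NER repository, whose actual voting scheme may differ, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(predictions, fallback="O"):
    """Merge label sequences from several classifiers, token by token.

    predictions: one list of labels per classifier, all of equal length.
    A label is kept only if more than half of the classifiers agree on it.
    """
    if len({len(labels) for labels in predictions}) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    merged = []
    for labels in zip(*predictions):
        label, votes = Counter(labels).most_common(1)[0]
        merged.append(label if votes > len(labels) / 2 else fallback)
    return merged


# Hypothetical output of three classifiers for "Reynolds visited Baltimore".
classifier_a = ["PER", "O", "LOC"]
classifier_b = ["PER", "O", "ORG"]
classifier_c = ["O", "O", "LOC"]

print(majority_vote([classifier_a, classifier_b, classifier_c]))
```

Requiring a strict majority favours precision: a token on which all three classifiers disagree is left untagged rather than given an arbitrary label.
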
  22. Disambiguation
      • Disambiguation of person and place names
      • Inspiration: https://github.com/KBNLresearch/europeananp-dbpedia-disambiguation
  23. Linking
      • Linking of recognised and disambiguated named entities to authority files (e.g. Wikidata, GND)
      • Inspiration: https://github.com/KBNLresearch/dac
