3. Why Named Entity Recognition?
• Analysis* of query log files from the National Library of Wales
newspaper website: the vast majority of search queries contain
either person or place names
* Paul Gooding, "Exploring Usage of Digital Newspaper Archives through Web Log Analysis:
A Case Study of Welsh Newspapers Online", presented at DH2014, Lausanne
• Improving Information Retrieval
• Linking to authority files (Linked Data)
• Historical Social Network Analysis (HNA/SNA)
4. Languages
• Dutch (1614 – 1900)
• French (1814 – 1944)
• German (1721 – 1949)
• Together approx. 50% of the total collection
5. Many challenges
• Historical data (language)
• Noisy data (OCR)
• Multilingual data
• Lack of extensive metadata
• Lack of open resources
(tagged corpora, gazetteers)
• Lack of common annotation guidelines
• Limitations of annotation tools
7. Reuse of existing NER tools
• Simple evaluation of
– Apache OpenNLP
– Stanford CoreNLP
– GATE
• Stanford CoreNLP chosen because:
– Java-based (thread safe, scalable)
– Good performance (F-measure)
– Strong and active community
– Rather robust against noisy input (CRF)
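To make the F-measure criterion above concrete, here is a minimal sketch of entity-level precision, recall and F1. It assumes entities are represented as (start, end, type) tuples and that only exact matches count; the function name and the sample spans are hypothetical and are not taken from the evaluation itself.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match entity scores; gold and predicted are sets of (start, end, type)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical token spans, for illustration only
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
predicted = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")}
p, r, f = precision_recall_f1(gold, predicted)
```

In this example the second entity has the right boundaries but the wrong type, so it counts as both a false positive and a false negative.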
8. Approach
• Adaptation of Stanford CoreNLP by the
KB National Library of the Netherlands
to directly consume ENMAP (= Europeana
Newspapers METS/ALTO profile) objects
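As an illustration of what consuming ALTO involves, the sketch below pulls the word tokens out of an ALTO document by reading the CONTENT attribute of each String element. It is written in Python for brevity and is not the KB's Java adaptation; the sample fragment and the function name are hypothetical, and a real ENMAP object also carries METS structure that is ignored here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened ALTO fragment (not a real ENMAP object)
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock ID="TB1">
    <TextLine>
      <String CONTENT="Koningin"/><SP/><String CONTENT="Wilhelmina"/>
    </TextLine>
    <TextLine>
      <String CONTENT="te"/><SP/><String CONTENT="Amsterdam"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def alto_tokens(xml_text):
    """Return the CONTENT of every ALTO String element, in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for element in root.iter():
        # Strip the namespace so the sketch works across ALTO versions
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String":
            tokens.append(element.get("CONTENT", ""))
    return tokens


tokens = alto_tokens(ALTO_SAMPLE)
```

The resulting token list is the kind of input a sequence tagger would then label.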
10. Annotation
• Quick evaluation of annotation tools:
– BRAT
– WebAnno
– INL Attestation Tool
• INL Attestation Tool chosen because:
– Optimized for tagging speed
– Supported by consortium partner (INL/IVDNT)
11. Corpus creation
• Selection of 100 pages per language
• Processing of the OCRed texts with
Stanford NER to get initial tagging results
• Manual verification and annotation
12. Corpus statistics
Language   # tokens   # PER   # LOC   # ORG
French      207,000   5,672   5,614   2,574
Dutch       182,483   4,492   4,448   1,160
German       96,735   7,914   6,143   2,784
Language   % tokens   % PER   % LOC   % ORG
French         100%   2.75%   2.71%   1.24%
Dutch          100%   2.46%   2.44%   0.64%
German         100%   8.18%   6.35%   2.88%
Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
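For readers unfamiliar with the bag-of-words measure in the table above, here is a minimal sketch of one way to compute it. It assumes the error rate is the share of ground-truth words that have no counterpart in the OCR output once word order is ignored; the tool actually used for the figures above may define the measure differently, and the function name and sample sentences are hypothetical.

```python
from collections import Counter


def bag_of_words_error_rate(ground_truth, ocr_output):
    """Share of ground-truth words with no counterpart in the OCR output, ignoring order."""
    truth_counts = Counter(ground_truth.split())
    ocr_counts = Counter(ocr_output.split())
    # Counter subtraction keeps only positive counts: the unmatched ground-truth words
    missed = sum((truth_counts - ocr_counts).values())
    total = sum(truth_counts.values())
    return missed / total if total else 0.0


# Hypothetical example: two of five words are misrecognised
truth = "de koning bezocht gisteren Amsterdam"
ocr = "de koniug bezocht gisteren Amfterdam"
rate = bag_of_words_error_rate(truth, ocr)
```

Because order is ignored, this measure is not affected by layout or reading-order mistakes, only by misrecognised words.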
20. Improving performance
• Possible additional features
– Distributional similarity (Clark 2003)
– Semantic generalization (Faruqui & Padó 2010)
– Word embeddings (Braune 2017)
• Gazetteers
– Person names, historical place names
• Data cleanup and improvement
– https://github.com/EuropeanaNewspapers/ner-corpora/wiki
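The gazetteer idea above can be sketched as a simple lookup that marks token spans found in a name list; such matches could then be fed to the classifier as an extra feature. This is a minimal sketch using greedy longest-match lookup; the function name, the entries and the max_len parameter are hypothetical and not part of the project's tooling.

```python
def gazetteer_matches(tokens, gazetteer, max_len=3):
    """Greedy longest-match lookup; returns (start, end, label) spans, end exclusive."""
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length]).lower()
            if candidate in gazetteer:
                match = (i, i + length, gazetteer[candidate])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans


# Hypothetical entries; a real gazetteer would hold historical spelling variants
gazetteer = {"'s gravenhage": "LOC", "den haag": "LOC", "batavia": "LOC"}
tokens = ["Uit", "Den", "Haag", "naar", "Batavia"]
spans = gazetteer_matches(tokens, gazetteer)
```

Exact lookup like this is brittle on noisy OCR text, which is one reason gazetteer hits are usually treated as a feature rather than a final decision.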
21. Trias NER
• Combination and voting of different NER
classifiers, e.g.
– Stanford CoreNLP
– spaCy
– NLTK
• Inspiration:
https://github.com/KBNLresearch/Trias_NER
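The voting idea can be sketched as a per-token majority vote over the label sequences of several classifiers. This is a minimal sketch, not the Trias_NER implementation: it assumes all classifiers share the same tokenisation, and the label lists below are hypothetical rather than real tool output.

```python
from collections import Counter


def majority_vote(label_sequences, fallback="O"):
    """Per-token vote over parallel label sequences; ties resolve to `fallback`."""
    voted = []
    for labels in zip(*label_sequences):
        (top_label, top_count), *rest = Counter(labels).most_common()
        tie = bool(rest) and rest[0][1] == top_count
        voted.append(fallback if tie else top_label)
    return voted


# Hypothetical outputs of three classifiers for the same five tokens
stanford_labels = ["PER", "PER", "O", "LOC", "ORG"]
spacy_labels = ["PER", "O", "O", "LOC", "LOC"]
nltk_labels = ["PER", "PER", "O", "ORG", "O"]
voted = majority_vote([stanford_labels, spacy_labels, nltk_labels])
```

Falling back to "O" on a tie favours precision over recall; weighting each classifier by its measured accuracy would be an alternative design.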
22. Disambiguation
• Disambiguation of person and place names
• Inspiration:
https://github.com/KBNLresearch/europeananp-dbpedia-disambiguation
23. Linking
• Linking of recognised and disambiguated named entities
to authority files (e.g. Wikidata, GND)
• Inspiration:
https://github.com/KBNLresearch/dac
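As a starting point for linking, candidate Wikidata items for a recognised name can be retrieved through the wbsearchentities action of the Wikidata API. The sketch below only builds the request URL and does not send it; it is not how the dac repository works, and the function name, the default language and the sample name are assumptions made for illustration.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_search_url(name, language="nl", limit=5):
    """Build a wbsearchentities request URL returning candidate items for a name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


# Hypothetical entity string taken from a tagged newspaper page
url = wikidata_search_url("Wilhelmina der Nederlanden")
```

The response lists candidates only; choosing the right one still requires the disambiguation step described on the previous slide.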