An Open Corpus for Named Entity Recognition in Historic Newspapers
1. An Open Corpus for Named Entity Recognition in Historic Newspapers
Clemens Neudecker
Berlin State Library
@cneudecker
LREC2016, 23-28 May 2016, Portorož, Slovenia
2. Background
• Europeana Newspapers EU-project:
www.europeana-newspapers.eu
• OCRed 12 million pages of historic newspapers
from Europe (an estimated 25 billion words!)
• Newspaper content from 23 libraries, in 40
languages, covering 4 centuries (1618-1990)
• Public domain full-text available for download
per language/content provider
3. Formats & Standards
• Full-text produced in ALTO
• Metadata (structural) in METS
• Metadata (bibliographic) in EDM
• Not a fan of XML?
Good ol' plain text (UTF-8) is also available…
research.europeana.eu/itemtype/newspapers
• Currently working on:
– API for text/search
– API for images (IIIF)
4. Approach
• 3 languages selected for NER:
Dutch, German, French – in collab. with
• Content in these languages constitutes about
50% of the overall full-text in the collection
5. Methodology
• Select 100 representative pages per language
– If a classifier already exists for given language –
run it on the selected 100 pages
– Ingest tagged/untagged pages to annotation tool
– Manually add/correct annotations
(>=2 librarians per language)
– Export and convert tagged data to BIO format
– Train classifier from BIO & gazetteers (if available)
– Evaluate derived classifier using 4-fold cross-validation
– Repeat until classification performance converges
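The export and evaluation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the function names `to_bio` and `k_fold_splits`, the sample sentence, and the placeholder page IDs are all assumptions made for the example. It also assumes that adjacent tokens of the same type belong to one entity.

```python
from typing import List, Optional, Tuple

# A token paired with its entity type ("PER", "LOC", "ORG"), or None for non-entities.
Token = Tuple[str, Optional[str]]


def to_bio(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Convert per-token entity types to BIO tags.

    Simplification (assumption): adjacent tokens with the same type are
    treated as one entity, so two directly neighbouring entities of the
    same type would be merged.
    """
    tagged = []
    previous = None
    for word, entity in tokens:
        if entity is None:
            tag = "O"
        elif entity == previous:
            tag = "I-" + entity   # continues the current entity
        else:
            tag = "B-" + entity   # begins a new entity
        tagged.append((word, tag))
        previous = entity
    return tagged


def k_fold_splits(items: list, k: int = 4):
    """Yield (train, test) pairs; fold i tests on every k-th item starting at i."""
    for i in range(k):
        test = items[i::k]
        train = [item for j, item in enumerate(items) if j % k != i]
        yield train, test


# Hypothetical sample sentence, not taken from the corpus.
sentence = [("Reynolds", "PER"), ("visited", None),
            ("Berlin", "ORG"), ("State", "ORG"), ("Library", "ORG")]
for word, tag in to_bio(sentence):
    print(f"{word}\t{tag}")

# Placeholder IDs standing in for the 100 selected pages of one language.
pages = [f"page_{n:03d}" for n in range(100)]
for train, test in k_fold_splits(pages):
    print(len(train), "training pages,", len(test), "test pages")
```

With 100 pages and 4 folds, each round trains on 75 pages and tests on the remaining 25.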
6. NER software
• Tested Stanford NER, OpenNLP, NLTK, Gate
• Adaptation of Stanford NER package (CRF)
– Mature, well-documented, widely used
– Open source (GPL)
– Thread-safe & platform-independent (JVM)
– Machine learning scales out more easily
to multiple languages
– Prior experience working with CRF
7. NER encoding in ALTO
• In ALTO versions >2.1, this is possible:
<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5">
</String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10">
</String>
…
<Tags>
<NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
<NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
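Reading such annotations back out is straightforward. The sketch below uses Python's standard `xml.etree.ElementTree` to resolve each `TAGREFS` value against the `NamedEntityTag` entries. The wrapping `<alto>` and `<Layout>` elements, the untagged filler word, and the helper names are assumptions for the example. Real ALTO files carry more structure and an XML namespace, which `local_name` strips.

```python
import xml.etree.ElementTree as ET

# Simplified, ALTO-like fragment (assumption: real files are namespaced and
# nest String elements inside Page/TextBlock/TextLine).
ALTO_FRAGMENT = """
<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <Layout>
    <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
    <String CONTENT="of"/>
    <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def extract_entities(xml_text: str):
    """Return (word, entity type) pairs for every String that references a tag."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())
    # Map tag IDs to entity types, e.g. "Tag5" -> "Person".
    tag_types = {el.get("ID"): el.get("TYPE")
                 for el in elements if local_name(el.tag) == "NamedEntityTag"}
    entities = []
    for el in elements:
        if local_name(el.tag) != "String":
            continue
        # TAGREFS may hold several space-separated IDs.
        for ref in el.get("TAGREFS", "").split():
            if ref in tag_types:
                entities.append((el.get("CONTENT"), tag_types[ref]))
    return entities


print(extract_entities(ALTO_FRAGMENT))
```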
8. Annotation
• Evaluated BRAT, WebAnno, INL Attestation
• Reasons for selection of INL Attestation:
– Speed
– Support of ALTO format
– Support from INL available
9. Annotation stats
Entity counts:
Language   # tokens   # PER   # LOC   # ORG
French     207,000    5,672   5,614   2,574
Dutch      182,483    4,492   4,448   1,160
German      96,735    7,914   6,143   2,784

Entities as a share of all tokens:
Language   Tokens   PER     LOC     ORG
French     100%     2.75%   2.71%   1.24%
Dutch      100%     2.46%   2.44%   0.64%
German     100%     8.18%   6.35%   2.88%

OCR quality of the selected pages:
Language   Word error rate (bag of words)   Reading order success rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
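The percentage figures follow directly from the counts, as the short calculation below shows. The dictionary layout and the function name `entity_share` are choices made for this sketch. The French PER share computes to 2.74% rather than the 2.75% reported, presumably because the 207,000 token count is rounded.

```python
# Counts copied from the annotation statistics above.
counts = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch":  {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735,  "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def entity_share(language_counts: dict) -> dict:
    """Return each entity type's count as a percentage of all tokens."""
    tokens = language_counts["tokens"]
    return {label: round(100 * n / tokens, 2)
            for label, n in language_counts.items() if label != "tokens"}


for language, language_counts in counts.items():
    print(language, entity_share(language_counts))
```

The German pages stand out: they contain roughly three times as many entity tokens per running word as the French or Dutch pages.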
10. Challenges
• Clear, comprehensive & common guidelines
for manual annotation
• OCR quality – on average 80% word accuracy
• Wide variation in historical spelling
• Mix of languages on a single page
• Lack/loss of metadata on page/word level
• Some data corruption occurred when ingesting
pre-tagged data into the annotation tool
11. Attempted workarounds
• Introduce OCR error patterns into training
data
– Actually yields lower precision/recall
• Introduce a spelling variation module in the
NER classifier
– Rewrite rules (e.g. „frorn“ → „from“)
– High integration effort
– Requires a reasonable number of rules
– Abandoned due to high complexity
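To make the rewrite-rule idea concrete, here is a minimal sketch of how such a module might work. It is not the abandoned implementation: the rule list, the tiny lexicon, and the function names are all hypothetical. A token is only rewritten when the result is a known word, which is what keeps a valid word like "modern" from becoming "modem".

```python
import re

# Hypothetical rules: (pattern of a typical OCR confusion, replacement).
REWRITE_RULES = [
    (re.compile(r"rn"), "m"),   # "rn" misread for "m", as in "frorn" -> "from"
    (re.compile(r"vv"), "w"),   # "vv" misread for "w"
]


def candidates(token: str, rules=REWRITE_RULES) -> set:
    """Return the token plus every variant made by applying one rule at one position."""
    variants = {token}
    for pattern, replacement in rules:
        for match in pattern.finditer(token):
            variants.add(token[:match.start()] + replacement + token[match.end():])
    return variants


def normalise(token: str, lexicon: set) -> str:
    """Rewrite a token only if it is unknown and a rewritten variant is known."""
    if token in lexicon:
        return token
    for variant in sorted(candidates(token)):
        if variant in lexicon:
            return variant
    return token


# Tiny stand-in lexicon (assumption); a real one would hold a full word list.
lexicon = {"from", "modern", "corner"}
for word in ["frorn", "modern", "Reynolds"]:
    print(word, "->", normalise(word, lexicon))
```

Even this toy version hints at the complexity noted above: every rule needs a lexicon check, and the rule set and lexicon would have to be built per language and period.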
14. Use cases
• Improving search, information retrieval
– Within digital newspapers, a vast majority of
user queries are person and place names
• Linking of named entities to authority files
to create linked data
– The classification and disambiguation of named
entities allows the assignment of unique
identifiers from authoritative sources – thus
enabling cross-language/cross-collection linking
15. Next steps
• Volunteers wanted!
Help correct corpus and collaboratively create a
free dataset – instructions on GitHub wiki:
– github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
• Plans to improve performance:
– Add distributional similarity as feature (Clark 2003)
– Semantic generalisation (Faruqui & Padó 2010)
– Specialised gazetteers (e.g. list of historic place names)
– Data, data, data
16. Open resources
• Europeana Newspapers NER dataset (CC0):
– github.com/EuropeanaNewspapers/ner-corpora
• Europeana Newspapers NER software (EUPL):
– github.com/EuropeanaNewspapers/europeananp-ner
– github.com/EuropeanaNewspapers/europeananp-dbpedia-disambiguation
• Annotated ALTO files:
– lab.kbresearch.nl/static/html/eunews.html
17. References
• C. Neudecker, W.J. Faber, L. Wilms, T. van Veen:
Large scale refinement of digital historical
newspapers with named entity recognition
Proceedings of the IFLA Newspaper Section
Satellite Meeting, 2014, Geneva, Switzerland.
• Y. Mosallam, A. Abi-Haidar, J.G. Ganascia:
Unsupervised named entity recognition and
disambiguation: An application to old French
journals
Advances in Data Mining. Applications and
Theoretical Aspects, Springer LNCS, 2014.
18. Thank you for your attention!
Questions?