1. An Open Corpus for Named Entity Recognition in Historic Newspapers

Clemens Neudecker
Berlin State Library
@cneudecker
LREC2016, 23-28 May 2016, Portorož, Slovenia

2. Background

• Europeana Newspapers EU project: www.europeana-newspapers.eu
• OCRed 12 million pages of historic newspapers from Europe (an estimated 25 billion words!)
• Newspaper content from 23 libraries, in 40 languages, covering 4 centuries (1618-1990)
• Public-domain full text available for download per language/content provider

3. Formats & Standards

• Full text produced in ALTO
• Metadata (structural) in METS
• Metadata (bibliographic) in EDM
• Not a fan of XML? Good ol' plain text (UTF-8) is also available: research.europeana.eu/itemtype/newspapers
• Currently working on:
– API for text/search
– API for images (IIIF)

4. Approach

• 3 languages selected for NER: Dutch, German, French, in collaboration with project partners
• Content in these languages constitutes about 50% of the overall full text in the collection

5. Methodology

• Select 100 representative pages per language
– If a classifier already exists for the given language, run it on the selected 100 pages
– Ingest tagged/untagged pages into the annotation tool
– Manually add/correct annotations (≥2 librarians per language)
– Export and convert tagged data to BIO format (see the example below)
– Train classifier from BIO & gazetteers (if available)
– Evaluate the derived classifier using 4-fold cross-validation
– Repeat until classification performance converges

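For reference, BIO encoding marks each token as beginning (B-) or inside (I-) an entity of a given type, or as outside (O) any entity, one token per line. A minimal illustration with a made-up Dutch fragment (not taken from the corpus):

In             O
Amsterdam      B-LOC
sprak          O
burgemeester   O
Willem         B-PER
Röell          I-PER
over           O
de             O
Nederlandsche  B-ORG
Bank           I-ORG
.              O

This one-token-per-line layout can be fed directly to a CRF trainer such as Stanford NER (see the next slide).
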
6. NER software

• Tested Stanford NER, OpenNLP, NLTK, GATE
• Adaptation of the Stanford NER package (CRF); see the training sketch below
– Mature, well-documented, widely used
– Open source (GPL)
– Thread-safe & platform-independent (JVM)
– Machine learning scales out more easily to multiple languages
– Prior experience working with CRF

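As a rough sketch of how such a CRF model is trained with Stanford NER: the trainer reads a properties file naming the BIO training file and the features to extract. The file names below are placeholders, and the feature flags are standard Stanford NER options from its documentation, not necessarily the project's exact configuration:

# train.prop - illustrative Stanford NER training configuration
trainFile = dutch_training.bio
serializeTo = dutch-newspapers.ser.gz
# column 0 holds the token, column 1 the BIO tag
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
wordShape = chris2useLC
useDisjunctive = true

Training and tagging then run on the JVM:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
     -loadClassifier dutch-newspapers.ser.gz -textFile page.txt
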
7. NER encoding in ALTO

• In ALTO versions >2.1, this is possible:

<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
        VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096"
        TAGREFS="Tag5"></String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
        VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684"
        TAGREFS="Tag10"></String>
…
<Tags>
  <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
  <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>

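To make the round trip concrete, here is a minimal Python sketch that reads such annotations back out of an ALTO file with the standard library. The ALTO v2 namespace URI and the file name are assumptions of this sketch:

# Sketch: extract named-entity annotations from an ALTO 2.1+ file.
import xml.etree.ElementTree as ET

ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v2#}"  # assumed v2 namespace

def extract_entities(path):
    root = ET.parse(path).getroot()
    # Map tag IDs ("Tag5") to entity types ("Person", "Location", ...)
    tag_types = {t.get("ID"): t.get("TYPE")
                 for t in root.iter(ALTO_NS + "NamedEntityTag")}
    entities = []
    for word in root.iter(ALTO_NS + "String"):
        # TAGREFS can hold several space-separated tag IDs
        for ref in (word.get("TAGREFS") or "").split():
            if ref in tag_types:
                entities.append((word.get("CONTENT"), tag_types[ref]))
    return entities

print(extract_entities("page.alto.xml"))
# e.g. [('Reynolds', 'Person'), ('Baltimore', 'Location')]
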
8. Annotation

• Evaluated BRAT, WebAnno, INL Attestation
• Reasons for selecting INL Attestation:
– Speed
– Support of the ALTO format
– Support from INL available

9. Annotation stats

Entity counts:

Language  # tokens  # PER  # LOC  # ORG
French    207,000   5,672  5,614  2,574
Dutch     182,483   4,492  4,448  1,160
German     96,735   7,914  6,143  2,784

Entities as a share of all tokens:

Language  # PER   # LOC   # ORG
French    2.75%   2.71%   1.24%
Dutch     2.46%   2.44%   0.64%
German    8.18%   6.35%   2.88%

OCR and layout quality of the source pages:

Language  Word error rate (bag of words)  Reading-order success rate
French    16.6%                           19.9%
Dutch     17.6%                           23.2%
German    15.9% / 21.9%                   13.6%

10. Challenges

• Clear, comprehensive & common guidelines for manual annotation
• OCR quality: on average 80% word accuracy
• Wide variation in historical spelling
• Mix of languages on a single page
• Lack/loss of metadata at the page/word level
• Some data corruption occurred when ingesting pre-tagged data into the annotation tool

11. Attempted workarounds

• Introduce OCR error patterns into the training data
→ actually lowers precision/recall
• Introduce a spelling-variation module in the NER classifier (sketched below)
→ rewrite rules (e.g. "frorn" → "from")
→ high integration effort
→ requires a sizeable set of rules
→ abandoned due to high complexity

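The rewrite-rule idea can be pictured as pattern substitutions for common OCR confusions, guarded by a lexicon so that valid tokens are not corrupted (a blind "rn" → "m" rule would turn "learn" into "leam"). The rules and lexicon below are purely illustrative, not the project's actual rule set:

# Sketch: lexicon-guarded rewrite rules for common OCR confusions.
# Illustrative only; a usable rule set would be far larger and
# corpus-specific, which is why the approach proved too complex.
import re

REWRITE_RULES = [
    (re.compile(r"rn"), "m"),  # "frorn" -> "from"
    (re.compile(r"ſ"), "s"),   # long s in historic typefaces
    (re.compile(r"vv"), "w"),  # "vvord" -> "word"
]

def normalise(token, lexicon):
    # Only accept a rewrite if it produces a known word.
    if token in lexicon:
        return token
    for pattern, replacement in REWRITE_RULES:
        candidate = pattern.sub(replacement, token)
        if candidate in lexicon:
            return candidate
    return token

lexicon = {"from", "word", "learn"}
print(normalise("frorn", lexicon))  # from
print(normalise("learn", lexicon))  # learn (left intact)
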
12. Evaluation NL

[Results chart] Derived via 4-fold cross-validation (25 out of 100 annotated pages)

13. Evaluation FR

[Results chart] Derived via 4-fold cross-validation (25 out of 100 annotated pages)

14. Use cases

• Improving search, information retrieval
– Within digital newspapers, the vast majority of user queries are person and place names
• Linking of named entities to authority files to create linked data (see the sketch below)
– The classification and disambiguation of named entities allows the assignment of unique identifiers from authoritative sources, thus enabling cross-language/cross-collection linking

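As a toy illustration of authority linking, the sketch below queries the public DBpedia SPARQL endpoint for a resource whose English label matches a recognised entity. It stands in for, and is far simpler than, the project's own disambiguation tool (europeananp-dbpedia-disambiguation); the SPARQLWrapper library and the first-match strategy are assumptions of this sketch:

# Sketch: link a classified entity to a DBpedia URI.
# Toy stand-in for real disambiguation: takes the first match and
# ignores ambiguity, historic name forms, language variants, etc.
from SPARQLWrapper import SPARQLWrapper, JSON

def link_entity(label, dbo_class="Place"):
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        SELECT ?uri WHERE {{
            ?uri rdfs:label "{label}"@en ;
                 a dbo:{dbo_class} .
        }} LIMIT 1
    """)
    rows = sparql.query().convert()["results"]["bindings"]
    return rows[0]["uri"]["value"] if rows else None

print(link_entity("Baltimore"))  # e.g. http://dbpedia.org/resource/Baltimore
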
15. Next steps

• Volunteers wanted! Help correct the corpus and collaboratively create a free dataset; instructions on the GitHub wiki:
– github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
• Plans to improve performance:
– Add distributional similarity as a feature (Clark 2003); see the configuration note below
– Semantic generalisation (Faruqui & Padó 2010)
– Specialised gazetteers (e.g. lists of historic place names)
– Data, data, data

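For the distributional-similarity feature, Stanford NER can load pre-built word clusters via two further properties; the lexicon path is a placeholder, and producing the clusters (e.g. with Clark's clustering tool) is a separate preprocessing step:

# Additions to the training properties file (path is a placeholder)
useDistSim = true
distSimLexicon = clusters/dutch-distsim.txt
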
16. Open resources

• Europeana Newspapers NER dataset (CC0):
– github.com/EuropeanaNewspapers/ner-corpora
• Europeana Newspapers NER software (EUPL):
– github.com/EuropeanaNewspapers/europeananp-ner
– github.com/EuropeanaNewspapers/europeananp-dbpedia-disambiguation
• Annotated ALTO files:
– lab.kbresearch.nl/static/html/eunews.html

17. References

• C. Neudecker, W.J. Faber, L. Wilms, T. van Veen: Large scale refinement of digital historical newspapers with named entity recognition. Proceedings of the IFLA Newspaper Section Satellite Meeting, 2014, Geneva, Switzerland.
• Y. Mossalam, A. Abi-Haidar, J.G. Ganascia: Unsupervised named entity recognition and disambiguation: an application to old French journals. Advances in Data Mining: Applications and Theoretical Aspects, Springer LNCS, 2014.

18. Thank you for your attention!

Questions?

Clemens Neudecker
Berlin State Library
@cneudecker
