Information Extraction on Noisy Texts for Historical Research

Mike Bryant
Kepa Joseba Rodriguez
Tobias Blanke
Reto Speck

19th July 2012
http://www.ehri-project.eu

Why EHRI?

Fragmentation and dispersal of archival sources

• Geographical scope of Holocaust
• Attempts to destroy the evidence
• Migration of Holocaust survivors
• Multiplicity of documentation projects after the war

The Adler Case

1 - Jewish Museum Prague
2 - ITS International Tracing Service
3 - Yad Vashem
4 - NIOD
5 - King’s College

Connecting Collections

Collection-level metadata

Enhance existing services
• Build a virtual observatory
– A digital infrastructure to unlock sources

Develop new services
• Build a virtual research environment
– Problem-driven
– User-driven

Integrate multiple layers of metadata

• Archival metadata (finding aids, thesaurus)
• Machine-generated metadata (extracted entities)
• User-generated metadata (annotations)

Services for partner archives

• OCR
– Provide a general-purpose OCR service tailored to the needs of
historical material
– Allow attaching scanned paper finding aids to “bare-bones” collection
descriptions and automatically storing/indexing OCR output

• Named Entity Extraction
– Integrate NEE services to bootstrap the process of tagging collection
descriptions
– Integrate NEE with the EHRI thesaurus, to filter and validate NEE
output
– Build “candidate” search indexes, with crowd-sourced validation
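
The thesaurus-based filtering step could look roughly like the sketch below. It is an illustration only: the thesaurus terms, the candidate list and the helper names `normalise` and `split_candidates` are hypothetical, and matching is a simple accent- and case-insensitive string comparison rather than anything the EHRI thesaurus itself provides.

```python
import unicodedata

def normalise(name: str) -> str:
    """Lower-case, strip accents and collapse whitespace for loose matching."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())

def split_candidates(candidates, thesaurus_terms):
    """Partition NEE output into thesaurus-validated and unvalidated entities."""
    known = {normalise(term) for term in thesaurus_terms}
    validated, unvalidated = [], []
    for text, entity_type in candidates:
        target = validated if normalise(text) in known else unvalidated
        target.append((text, entity_type))
    return validated, unvalidated

# Hypothetical thesaurus terms and extractor output, for illustration only.
thesaurus = ["Terezín", "Yad Vashem", "International Tracing Service"]
candidates = [("Terezin", "LOC"), ("yad  vashem", "ORG"), ("Ku Klux Klan", "ORG")]
validated, unvalidated = split_candidates(candidates, thesaurus)
```

In a setup like this, the unvalidated entities would be the natural input for the crowd-sourced “candidate” index rather than being discarded.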

Workflow Tools – the Ocropodium Project

1. Workflow development
2. Batch process
3. Transcript correction

NEE Experiment – Corpus data

• Wiener Library: Holocaust
survivor testimonies
• 17 pages
• ~93% OCR word accuracy

• King’s College London:
H.M.S. Kelly Newsletters
• 33 pages
• ~92.5% OCR word accuracy
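
The slides do not say how OCR word accuracy was measured. One common definition, assumed here, is one minus the word-level edit distance between the OCR output and a corrected transcript, divided by the number of words in the transcript. The sample sentences are invented.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from OCR
                            curr[j - 1] + 1,      # extra word in OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_accuracy(reference_text, ocr_text):
    """Word accuracy of OCR output against a corrected transcript."""
    ref = reference_text.split()
    hyp = ocr_text.split()
    if not ref:
        raise ValueError("reference text is empty")
    return 1.0 - word_edit_distance(ref, hyp) / len(ref)

# Invented example: two of seven words are misrecognised.
truth = "the ship left the harbour at dawn"
ocr = "the sbip left tbe harbour at dawn"
accuracy = word_accuracy(truth, ocr)
```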

NEE Experiment - Tools

• Extracted entities
– Person
– Location
– Organisation

• Tools
– Alchemy API
– OpenCalais
– Apache OpenNLP
– Stanford NER

• Example research queries
– “Find all information about prisoners arriving in Terezín from the Netherlands in 1944”
– “Find all documentation from Hans Gunther Adler on SS guards in Terezín”

• Manually annotated source data
– Tokenized and POS tagged using TreeTagger
– Imported into MMAX2 for manual entity tagging

NEE Experiment - Results

Low performance of the tools in corrected and raw text

             Raw                 Corrected
             P     R     F1      P     R     F1
Alchemy      0.61  0.38  0.47    0.63  0.38  0.48
OpenCalais   0.75  0.29  0.41    0.69  0.30  0.42
OpenNLP      0.42  0.12  0.19    0.53  0.13  0.21
Stanford     0.57  0.52  0.54    0.60  0.61  0.60
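
For reference, P, R and F1 are precision, recall and their harmonic mean. The sketch below shows how such scores can be derived from a manually annotated gold standard and a tool's output. It assumes exact matching on span and type, which the slides do not specify, and the entity tuples are invented.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold annotations by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Entities as (start_token, end_token, type); invented for illustration.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG"), (14, 15, "LOC")}
predicted = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "PER")}
p, r, f1 = precision_recall_f1(gold, predicted)
```

Here the third prediction has the right span but the wrong type, so it counts against precision without contributing to recall, much like a warship tagged as PER.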

LOC extraction most accurate, ORG least

[Figures: F1-score charts for the Wiener Library (WL) and King’s College London (KCL) corpora]

NEE Experiment – Personal names

• Person names: commonly written in non-standard forms

• Person and location names are used for other kinds of
entities, e.g. warships
• Warships frequently annotated as PER

NEE Experiment - Organisations

Performance of ORG-type extraction is very low

• Names of organizations appear in non-standard forms
• Jargon and abbreviations abound, particularly in Kelly newsletters

• Many organizations no longer exist
• SS and other relevant Nazi organizations were not detected

• Spelling errors and typos in the original files:
• OpenCalais used general knowledge to resolve this problem
• Use of general knowledge may be problematic.
• “Klan, Walter” → “Ku Klux Klan”
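
One simple heuristic against this kind of mis-resolution, sketched under the assumption that “Klan, Walter” is a personal name written in “Surname, Forename” order, is to reorder such names before the text reaches the extraction tool. The pattern and the sample sentence are illustrative; the pattern handles only unaccented single-word names and would also rewrite comma-separated lists of capitalised words, so it would need tuning on real material.

```python
import re

# "Surname, Forename": two capitalised words separated by a comma and a space.
INVERTED_NAME = re.compile(r"\b([A-Z][a-z]+), ([A-Z][a-z]+)\b")

def reorder_inverted_names(text: str) -> str:
    """Rewrite 'Surname, Forename' as 'Forename Surname'."""
    return INVERTED_NAME.sub(r"\2 \1", text)

# Invented sample sentence.
sample = "Witness: Klan, Walter. Deported in 1944."
reordered = reorder_inverted_names(sample)
```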

Relative performance

• Stanford NER best performance across both datasets
– Most effective on PER and LOC types

• Alchemy API best results on ORG type
– Biggest difference between raw OCR and manually corrected text
– Not massively ahead of OpenCalais/Stanford

• Apache OpenNLP worst performance on our data
– But: most open of the tools and theoretically trainable

Conclusions

• Manual correction of OCR output does not significantly
improve the performance (on our material)
– Raw output is enough to obtain provisional candidates for N-gram
indexing
• Best results likely to come from combinations of tools
– Specific workflows for specific material, no silver bullet
• Focus in near term:
– Identify most significant patterns of error
– Implement pre-processing pipeline using simple heuristics and
pattern matching tools
• Focus in longer term:
– Integrate EHRI thesaurus and other forms of knowledge to validate
and correct the output of NE extraction tools
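
As one possible reading of “combinations of tools”, the sketch below keeps only entities proposed by at least two tools. Majority voting is an assumption here, not a method taken from the slides. The tool outputs are invented, and no real tool API is called: each tool's result is assumed to have already been converted to (text, type) pairs.

```python
from collections import Counter

def vote(tool_outputs, min_votes=2):
    """Keep (text, type) entities proposed by at least `min_votes` tools."""
    counts = Counter()
    for entities in tool_outputs.values():
        counts.update(set(entities))  # each tool votes at most once per entity
    return sorted(entity for entity, n in counts.items() if n >= min_votes)

# Invented outputs, already normalised to (text, type) pairs.
outputs = {
    "stanford": [("Prague", "LOC"), ("Adler", "PER"), ("Kelly", "PER")],
    "alchemy": [("Prague", "LOC"), ("Adler", "PER")],
    "opencalais": [("Prague", "LOC"), ("Kelly", "ORG")],
}
agreed = vote(outputs)
```

Raising `min_votes` trades recall for precision. Given the low recall in the results table, a per-type threshold or weighting by tool may suit the material better than a flat majority.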

Thanks

Any questions?

Publications:

• Tobias Blanke, Mike Bryant, Mark Hedges: Ocropodium: open source OCR
for small-scale historical archives. Journal of Information Science, Vol. 38,
No. 1.

• Tobias Blanke, Michael Bryant, Mark Hedges: Open source OCR for
Scientific Workflows in History. Journal of Documentation, Forthcoming.
