Information Extraction on Noisy Texts for Historical Research
1. Information Extraction on Noisy Texts for Historical Research
Mike Bryant
Kepa Joseba Rodriguez
Tobias Blanke
Reto Speck
19th July 2012
http://www.ehri-project.eu
2. Why EHRI?
Fragmentation and dispersal of archival sources
• Geographical scope of Holocaust
• Attempts to destroy the evidence
• Migration of Holocaust survivors
• Multiplicity of documentation projects after the war
4. The Adler Case
1 – Jewish Museum Prague
2 – ITS International Tracing Service
3 – Yad Vashem
4 – NIOD
5 – King’s College
5. Connecting Collections
Collection-level metadata
Enhance existing services
• Build a virtual observatory
– A digital infrastructure to unlock sources
Develop new services
• Build a virtual research environment
– Problem-driven
– User-driven
6. Integrate multiple layers of Metadata
• Archival metadata (finding aids, thesaurus)
• Machine-generated metadata (extracted entities)
• User-generated metadata (annotations)
7. Services for partner archives
• OCR
– Provide a general-purpose OCR service tailored to the needs of
historical material
– Allow attaching scanned paper finding aids to “bare-bones” collection
descriptions and automatically storing/indexing OCR output
• Named Entity Extraction
– Integrate NEE services to bootstrap the process of tagging collection
descriptions
– Integrate NEE with the EHRI thesaurus, to filter and validate NEE
output
– Build “candidate” search indexes, with crowd-sourced validation
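The thesaurus filtering idea above can be sketched roughly as follows. This is a minimal illustration only, not the EHRI implementation: the `Entity` class, the `split_by_thesaurus` helper and the sample terms are all hypothetical, and the thesaurus is assumed to be available as a plain list of terms. Entities that match a thesaurus term are treated as validated; the rest become "candidates" that could be passed on for crowd-sourced review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    type: str  # "PER", "LOC" or "ORG"


def normalise(term: str) -> str:
    """Lower-case and collapse whitespace so lookups tolerate minor spacing noise."""
    return " ".join(term.lower().split())


def split_by_thesaurus(entities, thesaurus_terms):
    """Return (validated, candidates): entities found in the thesaurus vs. the rest."""
    known = {normalise(t) for t in thesaurus_terms}
    validated, candidates = [], []
    for ent in entities:
        if normalise(ent.text) in known:
            validated.append(ent)
        else:
            candidates.append(ent)
    return validated, candidates


# Hypothetical data, for illustration only
thesaurus = ["Theresienstadt", "International Tracing Service"]
extracted = [
    Entity("International  Tracing Service", "ORG"),  # extra space, as OCR might produce
    Entity("Ku Klux Klan", "ORG"),
]
validated, candidates = split_by_thesaurus(extracted, thesaurus)
```

Exact matching after normalisation is the simplest possible choice; noisy OCR output would probably need fuzzy matching as well, which this sketch does not attempt.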
8. Workflow Tools – the Ocropodium Project
1. Workflow development
2. Batch process
3. Transcript correction
10. NEE Experiment – Corpus data
• Wiener Library: Holocaust
survivor testimonies
• 17 pages
• ~93% OCR word accuracy
• King’s College London:
H.M.S. Kelly Newsletters
• 33 pages
• ~92.5% OCR word accuracy
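The slides do not say how OCR word accuracy was measured. One common definition, assumed here, is one minus the word-level edit distance between the OCR output and a corrected transcript, divided by the number of words in the transcript. A minimal sketch under that assumption (function names and the sample sentence are illustrative):

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance over word lists (insertions, deletions, substitutions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the OCR output
                            curr[j - 1] + 1,      # extra word in the OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference_text, ocr_text):
    """Word accuracy = 1 - (word errors / number of reference words)."""
    reference = reference_text.split()
    hypothesis = ocr_text.split()
    if not reference:
        raise ValueError("reference text is empty")
    return 1.0 - word_edit_distance(reference, hypothesis) / len(reference)


# One misrecognised word out of seven
accuracy = word_accuracy("the ship left the harbour at dawn",
                         "the sh1p left the harbour at dawn")
```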
11. NEE Experiment - Tools
• Extracted entities
– Person
– Location
– Organisation
• Tools
– Alchemy API
– OpenCalais
– Apache OpenNLP
– Stanford NER
• Manually annotated source data
– Tokenized and POS tagged using TreeTagger
– Imported into MMAX2 for manual entity tagging
• Example research questions
– “Find all information about prisoners arriving in Therezin from the Netherlands in 1944”
– “Find all documentation from Hans Gunther Adler on SS guards in Therezin”
12. NEE Experiment - Results
Low performance of the tools in corrected and raw text
             Raw                  Corrected
             P     R     F1       P     R     F1
Alchemy      0.61  0.38  0.47     0.63  0.38  0.48
OpenCalais   0.75  0.29  0.41     0.69  0.30  0.42
OpenNLP      0.42  0.12  0.19     0.53  0.13  0.21
Stanford     0.57  0.52  0.54     0.60  0.61  0.60
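For readers unfamiliar with the columns: P is precision, R is recall, and F1 is their harmonic mean. The sketch below shows the standard definitions; the function names and the counts in the example are illustrative and not taken from the experiment.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute P, R and F1 from entity-level counts."""
    predicted = true_positives + false_positives  # entities the tool returned
    actual = true_positives + false_negatives     # entities in the manual annotation
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall, f1_score(precision, recall)


# Recomputing F1 from the rounded Stanford NER values on raw OCR text
stanford_raw_f1 = round(f1_score(0.57, 0.52), 2)

# Hypothetical counts: 6 correct entities, 4 spurious, 6 missed
p, r, f1 = precision_recall_f1(6, 4, 6)
```

Recomputing F1 from the rounded P and R values in the table can differ from the reported F1 by about 0.01 in some rows, presumably because the reported figures were derived from unrounded values.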
14. NEE Experiment – Personal names
• Person names: commonly written in non-standard forms
• Person and location names are used for other kinds of
entities, e.g. warships
• Warships frequently annotated as PER
15. NEE Experiment - Organisations
Performance of type ORG extraction is very low
• Names of organizations appear in non-standard forms
• Jargon and abbreviations abound, particularly in Kelly newsletters
• Many organizations no longer exist
• SS and other relevant Nazi organizations have not been detected
• Spelling errors and typos in the original files:
– OpenCalais used general knowledge to resolve this problem
– Use of general knowledge may be problematic, e.g. “Klan, Walter” → “Ku Klux Klan”
16. Relative performance
• Stanford NER best performance across both datasets
– Most effective on PER and LOC types
• Alchemy API best results on ORG type
– Biggest difference between raw OCR and manually corrected text
– Not massively ahead of OpenCalais/Stanford
• Apache OpenNLP worst performance on our data
– But: most open of the tools and theoretically trainable
17. Conclusions
• Manual correction of OCR output does not significantly
improve the performance (on our material)
– Raw output is enough to obtain provisional candidates for N-gram
indexing
• Best results likely to come from combinations of tools
– Specific workflows for specific material, no silver bullet
• Focus in near term:
– Identify most significant patterns of error
– Implement pre-processing pipeline using simple heuristics and
pattern matching tools
• Focus in longer term:
– Integrate EHRI thesaurus and other forms of knowledge to validate
and correct the output of NE extraction tools
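A pre-processing pipeline of the kind mentioned above might look roughly like this. It is a sketch only: the three heuristics (rejoining words hyphenated across line breaks, collapsing whitespace, and reordering "Surname, Forename" pairs) are assumptions chosen for illustration, not the error patterns actually identified in the project, and all function names are hypothetical.

```python
import re


def join_hyphenated_line_breaks(text):
    """Rejoin words split across lines, e.g. 'news-\\nletter' -> 'newsletter'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_whitespace(text):
    """Replace any run of whitespace (including newlines) with a single space."""
    return re.sub(r"\s+", " ", text).strip()


def reorder_inverted_names(text):
    """Turn 'Surname, Forename' into 'Forename Surname' for capitalised word pairs."""
    return re.sub(r"\b([A-Z][a-z]+), ([A-Z][a-z]+)\b", r"\2 \1", text)


# Steps run in order; each takes and returns a string
PIPELINE = [join_hyphenated_line_breaks, collapse_whitespace, reorder_inverted_names]


def preprocess(text):
    for step in PIPELINE:
        text = step(text)
    return text


cleaned = preprocess("Klan, Walter wrote in the news-\nletter   that week")
```

Heuristics this simple will misfire on real text: the name rule would also rewrite "Prague, Adler" in running prose, and the hyphen rule would merge genuinely hyphenated words that happen to break at a line end. They would need tuning against the error patterns actually observed in the material.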
18. Thanks
Any questions?
Publications:
• Tobias Blanke, Mike Bryant, Mark Hedges: Ocropodium: open source OCR
for small-scale historical archives. Journal of Information Science, Vol. 38,
No. 1.
• Tobias Blanke, Michael Bryant, Mark Hedges: Open source OCR for
Scientific Workflows in History. Journal of Documentation, Forthcoming.