Information Extraction on Noisy Texts for Historical Research

Mike Bryant
Kepa Joseba Rodriguez
Tobias Blanke
Reto Speck

19th July 2012
http://www.ehri-project.eu
Why EHRI?



Fragmentation and dispersal of archival sources

  •   Geographical scope of Holocaust
  •   Attempts to destroy the evidence
  •   Migration of Holocaust survivors
  •   Multiplicity of documentation projects after the war
The Adler Case

  1 - Jewish Museum Prague
  2 - ITS International Tracing Service
  3 - Yad Vashem
  4 - NIOD
  5 - King’s College

Connecting Collections


Collection-level metadata




Enhance existing services
• Build a virtual observatory
   – A digital infrastructure to unlock sources

Develop new services
• Build a virtual research environment
   – Problem-driven
   – User-driven
Integrate multiple layers of Metadata

• Archival (finding aids, thesaurus)
• Machine generated (extracted entities)
• User generated (annotations)
Services for partner archives


• OCR
   – Provide a general-purpose OCR service tailored to the needs of
     historical material
   – Allow attaching scanned paper finding aids to “bare-bones” collection
     descriptions and automatically storing/indexing OCR output


• Named Entity Extraction
   – Integrate NEE services to bootstrap the process of tagging collection
     descriptions
   – Integrate NEE with the EHRI thesaurus, to filter and validate NEE
     output
   – Build “candidate” search indexes, with crowd-sourced validation
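
The thesaurus-based filtering described above could be sketched roughly as follows. This is a minimal illustration only: `THESAURUS_TERMS`, `normalise` and `split_candidates` are hypothetical names, and the small in-memory set stands in for the EHRI thesaurus, whose actual interface is not described in these slides.

```python
import unicodedata

# Hypothetical thesaurus terms; a stand-in for the EHRI thesaurus.
THESAURUS_TERMS = {"Theresienstadt", "Yad Vashem", "Wiener Library"}


def normalise(term: str) -> str:
    """Lower-case, strip accents and collapse whitespace for matching."""
    decomposed = unicodedata.normalize("NFKD", term)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())


def split_candidates(entities, thesaurus=THESAURUS_TERMS):
    """Split extracted entities into thesaurus-validated and unvalidated lists."""
    known = {normalise(t) for t in thesaurus}
    validated, candidates = [], []
    for entity in entities:
        (validated if normalise(entity) in known else candidates).append(entity)
    return validated, candidates


validated, candidates = split_candidates(
    ["yad  vashem", "Klan, Walter", "Theresienstadt"]
)
print(validated)
print(candidates)
```

In this sketch the unvalidated entities are kept rather than discarded, so they could feed a “candidate” index awaiting crowd-sourced validation.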
Workflow Tools – the Ocropodium Project

1. Workflow development
2. Batch process
3. Transcript correction
NEE Experiment – Corpus data


• Wiener Library: Holocaust survivor testimonies
   • 17 pages
   • ~93% OCR word accuracy

• King’s College London: H.M.S. Kelly Newsletters
   • 33 pages
   • ~92.5% OCR word accuracy
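
The slides do not say how OCR word accuracy was measured. One common definition is one minus the word-level edit distance divided by the number of words in the ground-truth transcript; the sketch below assumes that definition. The function names and the sample sentences are made up for illustration.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def word_accuracy(reference_text, ocr_text):
    """Assumed definition: 1 - (word edit distance / reference word count)."""
    reference = reference_text.split()
    hypothesis = ocr_text.split()
    if not reference:
        raise ValueError("reference text is empty")
    return 1.0 - word_edit_distance(reference, hypothesis) / len(reference)


# Made-up example: one misrecognised word out of six.
truth = "the ship left harbour at dawn"
ocr = "the sh1p left harbour at dawn"
print(round(word_accuracy(truth, ocr), 3))
```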
NEE Experiment - Tools


• Extracted entities
   – Person
   – Location
   – Organisation

• Example research queries
   – “Find all information about prisoners arriving in Terezín from
     the Netherlands in 1944”
   – “Find all documentation from Hans Gunther Adler on SS guards in
     Terezín”

• Tools
   – Alchemy API
   – OpenCalais
   – Apache OpenNLP
   – Stanford NER


• Manually annotated source data
   – Tokenized and POS tagged using TreeTagger
   – Imported into MMAX2 for manual entity tagging
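
Scoring a tool against the manual annotation might look like the sketch below. It assumes exact matching on (surface form, entity type) pairs, which the slides do not confirm, and the annotations shown are invented purely for illustration.

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for exact-match sets of (text, type) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up annotations: (surface form, entity type).
gold = {("Adler", "PER"), ("Prague", "LOC"), ("SS", "ORG"), ("NIOD", "ORG")}
predicted = {("Adler", "PER"), ("Prague", "LOC"), ("Prague", "ORG")}

precision, recall = precision_recall(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f}")  # prints "P=0.67 R=0.50"
```

Under exact matching, an entity found with the wrong type counts both as a false positive and as a missed gold entity, which is one reason mis-typed names depress both scores.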
NEE Experiment - Results


Tool performance is low on both raw and corrected text

                  Raw OCR text              Manually corrected text

   Tool           P      R      F1          P      R      F1

   Alchemy      0.61   0.38   0.47        0.63   0.38   0.48
   OpenCalais   0.75   0.29   0.41        0.69   0.30   0.42
   OpenNLP      0.42   0.12   0.19        0.53   0.13   0.21
   Stanford     0.57   0.52   0.54        0.60   0.61   0.60

   P = precision, R = recall, F1 = F1-score
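
As a worked check, assuming F1 is the usual harmonic mean of precision and recall (the slides do not spell out the definition), the Stanford NER scores on raw text give:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.57 \times 0.52}{0.57 + 0.52}
    = \frac{0.5928}{1.09}
    \approx 0.54
```

Recomputing from the rounded P and R values can differ from the reported F1 by 0.01 in some rows (for example OpenCalais on raw text), presumably because P and R are themselves rounded in the table.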
LOC extraction most accurate, ORG least

[Figure: two charts of F1-score, one for the Wiener Library (WL) corpus and one for the King’s College London (KCL) corpus]
NEE Experiment – Personal names



• Person names: commonly written in non-standard forms




• Person and location names are used for other kinds of
  entities, e.g. warships
   • Warships frequently annotated as PER
NEE Experiment - Organisations


Extraction performance for type ORG is very low

• Names of organizations appear in non-standard forms
   • Jargon and abbreviations abound, particularly in Kelly newsletters


• Many organizations no longer exist
   • SS and other relevant Nazi organizations were not detected


• Spelling errors and typos in the original files:
   • OpenCalais used general knowledge to resolve this problem
   • Use of general knowledge may be problematic,
     e.g. “Klan, Walter” → “Ku Klux Klan”
Relative performance


• Stanford NER best performance across both datasets
   – Most effective on PER and LOC types


• Alchemy API best results on ORG type
   – Biggest difference between raw OCR and manually corrected text
   – Not massively ahead of OpenCalais/Stanford


• Apache OpenNLP worst performance on our data
   – But: most open of the tools and theoretically trainable
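
Because different tools do best on different entity types, one simple way to combine them is to trust each tool only for the types it handled best. The sketch below assumes that routing rule; `PREFERRED_TOOL`, `combine` and the tool outputs are hypothetical and do not reflect the real APIs of these tools.

```python
# Which tool to trust for each entity type, following the ranking above.
PREFERRED_TOOL = {"PER": "stanford", "LOC": "stanford", "ORG": "alchemy"}


def combine(outputs, preferred=PREFERRED_TOOL):
    """Keep a (text, type) pair only if it came from the preferred tool for its type.

    `outputs` maps a tool name to a list of (text, entity_type) pairs.
    """
    combined = []
    for tool, entities in outputs.items():
        for text, entity_type in entities:
            is_preferred = preferred.get(entity_type) == tool
            if is_preferred and (text, entity_type) not in combined:
                combined.append((text, entity_type))
    return combined


# Made-up tool outputs for illustration only.
outputs = {
    "stanford": [("Adler", "PER"), ("Prague", "LOC"), ("Kelly", "ORG")],
    "alchemy": [("Prague", "PER"), ("Wiener Library", "ORG")],
}
print(combine(outputs))
```

Other combination schemes, such as majority voting across tools, would be equally plausible; the slides do not commit to one.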
Conclusions


• Manual correction of OCR output does not significantly
  improve the performance (on our material)
   – Raw output is enough to obtain provisional candidates for N-gram
     indexing
• Best results likely to come from combinations of tools
   – Specific workflows for specific material, no silver bullet
• Focus in near term:
   – Identify most significant patterns of error
   – Implement pre-processing pipeline using simple heuristics and
     pattern matching tools
• Focus in longer term:
   – Integrate EHRI thesaurus and other forms of knowledge to validate
     and correct the output of NE extraction tools
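
A pre-processing step built from simple heuristics and pattern matching might start out like the sketch below. The specific rules (rejoining words hyphenated across line breaks, dropping symbol-only tokens) are assumptions chosen for illustration, not the error patterns the project actually identified, and the sample text is invented.

```python
import re


def preprocess(raw_text: str) -> str:
    """Apply a few simple clean-up heuristics to raw OCR text."""
    # Rejoin words hyphenated across a line break: "news-\nletter" -> "newsletter".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw_text)
    # Drop tokens made up only of punctuation or symbols (likely scanning specks).
    tokens = [t for t in text.split() if re.search(r"\w", t)]
    # Re-join with single spaces, which also collapses runs of whitespace.
    return " ".join(tokens)


sample = "The news-\nletter   was  ~ ^ printed in Lon-\ndon."
print(preprocess(sample))  # prints "The newsletter was printed in London."
```

Note that the hyphen rule also merges genuine hyphenated compounds that happen to break at a line end, so rules like this would need checking against the observed error patterns.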
Thanks


Any questions?



Publications:

•   Tobias Blanke, Mike Bryant, Mark Hedges: Ocropodium: open source OCR
    for small-scale historical archives. Journal of Information Science, Vol. 38,
    No. 1.

•   Tobias Blanke, Michael Bryant, Mark Hedges: Open source OCR for
    Scientific Workflows in History. Journal of Documentation, Forthcoming.
