Exploring Challenges in Mining Historical Text
Beatrice Alex, Claire Grover, Richard Tobin and Ewan Klein

      Working with text: Tools, techniques and approaches for text mining
                           Edinburgh - 07/07/2012
Overview
‣ Project
‣ Data
‣ Preprocessing historical text
      ‣ Improvements to OCR
      ‣ Language identification
      ‣ Text mining tables
‣    Text-mining
      ‣ Improved commodity identification
      ‣ Ports-based geo-grounding
      ‣ Relation extraction


Project (01/2012-12/2014)
‣ Funded by Digging into Data (round 2)
‣ Partners
                    Ewan Klein, Claire Grover, Bea Alex (text mining)


                    Colin Coates, Jim Clifford (historical analysis)


                    James Reid (data integration)


                    Aaron Quigley, Uta Hinrichs (information
                    visualisation)
Trading Consequences
‣ What does archival text say about the
     economic and environmental consequences of
     global commodity trading during the
     nineteenth century?
‣    Help historians to discover novel patterns and
     explore new hypotheses.
‣    Example questions:
      ‣ What were the routes and volumes of international
            trade in resource commodities 1850-1914?
      ‣     What were the local environmental consequences of
            this demand for these resources?

Geolocating Cinchona




Trading Consequences
‣ Scope: global but with focus on Canadian
     natural resource flows to test reliability and
     efficacy of our methods
‣    Methods:
      ‣ Text mining and geo-parsing to transform the text
            into structured data, e.g. relational database
      ‣     Query interface targeted at historians
      ‣     Information visualisation for interactive exploration




Historical Data
‣ Digitised sources from the 19th century
   British Empire, currently processing
    ‣ Early Canadiana Online: 83,038 files
    ‣ JSTOR data: 1,000 XML files
    ‣ House of Commons Parliamentary Papers: 4,135
          files
    ‣     Books: selected books on nineteenth century trade

‣ Further sources:
    ‣ ProQuest data
    ‣ Encyclopaedia Britannica, JSTOR Plants, Forestry
          Journals?, The Botanist?

Processing Historical Data
‣ Challenges so far:
    ‣ Different formats
    ‣ Low-quality OCRed text
      ‣ Old/low-quality prints, quality of OCR
             technology
         ‣ Historical English: historical word variants,
             ſ (long s) characters mixed up with f by OCR
         ‣ Artefacts in original documents: headers/footers,
             page numbers, notes in margins, end-of-line
             hyphenation
    ‣     Text in different languages
    ‣     Information in tables


Improvements to OCR
‣ Normalisation and post-correction
‣ Fixed end-of-line hyphenation
    ‣ Dehyphenate all token-splitting hyphens using a
          dictionary-based approach (dictionary is the system
          dictionary + the text of the current document)
‣ Added f-to-s conversion
    ‣ Convert all false f characters to s using a corpus-based
          approach (corpus is a collection of historical
          documents from Project Gutenberg)
‣ Example: reduced the number of words
  unrecognised by the spell checker from 61 to 21,
  i.e. approx. 66% fewer
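
A minimal sketch of the two corrections, not the project's actual tools. It assumes a small in-memory word list in place of the system dictionary and a toy word-frequency table in place of the Gutenberg-derived corpus. The names `build_vocabulary`, `dehyphenate` and `fix_long_s` are hypothetical.

```python
import re
from collections import Counter
from itertools import product

WORD = r"[^\W\d_]+"  # a run of letters (no digits or underscores)

def build_vocabulary(system_words, document_text):
    """System dictionary plus every word seen in the current document."""
    vocab = {w.lower() for w in system_words}
    vocab.update(re.findall(WORD, document_text.lower()))
    return vocab

def dehyphenate(text, vocabulary):
    """Join 'com-<newline>modity' when the joined form is a known word."""
    def join(match):
        joined = match.group(1) + match.group(2)
        # Unknown joined forms are left alone: they may be real compounds.
        return joined if joined.lower() in vocabulary else match.group(0)
    return re.sub(rf"({WORD})-\n({WORD})", join, text)

def fix_long_s(word, corpus_counts):
    """Replace false 'f' characters with 's' if the corpus prefers that form."""
    if word.lower() in corpus_counts or "f" not in word:
        return word  # already attested, or nothing to correct
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    best, best_count = word, 0
    # Try every f/s assignment (exponential, but words contain few f's).
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        count = corpus_counts.get(candidate.lower(), 0)
        if count > best_count:
            best, best_count = candidate, count
    return best

# Toy corpus standing in for the historical reference corpus (assumption).
corpus_counts = Counter("the french colonists first settled near the fort".split())
page = "the com-\nmodity trade and well-\nknown ports"
vocab = build_vocabulary(["commodity"], page)
print(dehyphenate(page, vocab))
print(fix_long_s("colonifts", corpus_counts))
```

Words that are already attested in the corpus, such as "fort", are returned unchanged, so genuine f's survive.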
Improvements to OCR
‣ Extensive evaluation of both tools against
    human corrected/normalised gold standard
‣   Reduced the word error rate by 12.5% (relative) in a random
    Canadiana sample (word accuracy: 0.776 -> 0.804)
‣   Improvements have an effect on later text
    mining steps and would also be beneficial for
    searching text in any IR system (e.g. JSTOR
    database search for “French colonifts”)
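
The 12.5% figure can be checked from the two word accuracies, assuming word error rate is defined as 1 minus word accuracy (the slides do not state the definition). The function name is hypothetical.

```python
def relative_wer_reduction(acc_before, acc_after):
    """Relative drop in word error rate, with WER taken as 1 - accuracy."""
    wer_before = 1.0 - acc_before
    wer_after = 1.0 - acc_after
    return (wer_before - wer_after) / wer_before

# Word accuracies reported for the Canadiana sample.
reduction = relative_wer_reduction(0.776, 0.804)
print(f"{reduction:.1%}")
```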



Language Identification
‣    Most sources do not contain language
     information like Canadiana does
‣    The table displays the number of text
     elements in Canadiana per language, ignoring
     notes and titles

     ISO Code   Language           Frequency
     eng        English            2,677,498
     fra        French             1,208,811
     deu        German                 2,886
     chn        Chinook jargon         2,488
     moh        Mohawk                 1,547
     oji        Ojibwa                 1,395
     emg        Eastern Meohang          835
     enb        Markweeta                666
     cre        Cree                     501
     iro        Iroquoian                324
     alg        Algonquian               210
     nge        Ngemba                   157
     nld        Dutch                    131
     lat        Latin                    119
     mic        Micmac                    61
     gla        Scottish Gaelic           22

Language Identification
‣ Use automatic language identification (LID)
     with TextCat, especially for the JSTOR data,
     which are also multilingual.
‣    LID is run on each paragraph; the language of
     the entire document is the most frequent
     paragraph tag.
‣    Processing can then be limited to English
     (and French) documents.
‣    Result: 740 of the 1,000 JSTOR documents are English.
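
A sketch of the document-level vote over paragraph tags. The `identify` argument stands in for a call to TextCat; `toy_identify` below is a crude stop-word counter invented for illustration and is not how TextCat works.

```python
from collections import Counter

def document_language(paragraphs, identify):
    """Tag each non-empty paragraph, then take the most frequent tag."""
    tags = [identify(p) for p in paragraphs if p.strip()]
    if not tags:
        return None, []
    top_tag, _ = Counter(tags).most_common(1)[0]
    return top_tag, tags

def toy_identify(paragraph):
    """Hypothetical stand-in for a real identifier: count stop words."""
    words = set(paragraph.lower().split())
    scores = {
        "eng": len(words & {"the", "and", "of"}),
        "fra": len(words & {"le", "la", "et", "les"}),
    }
    return max(scores, key=scores.get)

paragraphs = [
    "The trade of timber and fur",
    "Le commerce et la peche",
    "The port of Halifax",
]
language, tags = document_language(paragraphs, toy_identify)
print(language, tags)
```

Keeping the per-paragraph tags alongside the document tag makes it possible to skip only the non-English paragraphs of a mixed document.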


Text Mining Tables
‣ Tables contain a lot of relevant information
     but are difficult to mine.
‣    HCPP documents contain coordinates for
     each table entry.
                <w p="961,1777,1026,1807" v="d">Rio</w>
                <w p="1026,1777,1170,1807" v="d">Janeiro</w>
                ...
                <w p="961,1892,1087,1921" v="n">Culcutta</w>
                <w p="1496,1530,1565,1555" v="o">141</w>
                <w p="1565,1525,1631,1555" v="d">bags</w>
                <w p="1227,1774,1336,1804" v="d">Wood</w>
                <w p="1353,1791,1366,1799" v="o">-</w>
                <w p="1494,1776,1565,1804" v="o">338</w>
                <w p="1565,1783,1676,1803" v="d">planks</w>
                <w p="1704,1791,1718,1799" v="o">-</w>

‣ Planning to do a feasibility study for a table
     mining algorithm.
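
One possible starting point for such a study is to group words into table rows by their vertical position. The sketch assumes the `p` attribute holds "left,top,right,bottom" pixel coordinates (inferred from the values, not documented here). The `group_rows` function and its `tolerance` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# Word elements copied from the HCPP sample.
SAMPLE = """
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1353,1791,1366,1799" v="o">-</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
<w p="1704,1791,1718,1799" v="o">-</w>
"""

def group_rows(fragment, tolerance=15):
    """Cluster <w> elements into rows by vertical centre, left to right."""
    root = ET.fromstring(f"<page>{fragment}</page>")  # wrap to get one root
    words = []
    for w in root.iter("w"):
        x1, y1, x2, y2 = (int(n) for n in w.get("p").split(","))
        words.append(((y1 + y2) / 2, x1, w.text or ""))
    words.sort()  # top to bottom, then left to right
    rows, row_y = [], None
    for y, x, text in words:
        # Start a new row when the word is too far below the row's first word.
        if row_y is None or abs(y - row_y) > tolerance:
            rows.append([])
            row_y = y
        rows[-1].append((x, text))
    return [[text for _, text in sorted(row)] for row in rows]

for row in group_rows(SAMPLE):
    print(row)
```

Splitting each row into columns would need a similar clustering on the horizontal coordinates, which is left out here.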
Text Mining Pipeline
‣ Steps after the OCR improvements and LID:
   ‣     Tokenisation
   ‣     Part-of-speech tagging
   ‣     Lemmatisation
   ‣     Wordnet lookup to find commodities
   ‣     Named-entity recognition including commodity
         lexicon lookup
   ‣     Port-based Geo-grounding
   ‣     Chunking




Commodities Identification
‣ WordNet lookup using an approximation of
  commodity named entities:
   ‣ Noun phrases with hypernyms such as substance,
         physical matter, plant or animal in WordNet.
   ‣     Each NP which leads to a match is assigned a
         wn="true" attribute.
‣ Commodities gazetteer lookup using a list of
  commodities derived by historians.
   ‣ Strings matching the entries in the gazetteer are
         assigned a commlex="true" attribute.
‣ Words/phrases with wn="true" and
  commlex="true" are good candidates.
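
A sketch of how the two lookups combine. The hypernym table is a hand-made toy standing in for WordNet, the gazetteer entries are invented, and matching on the last word of the phrase is an assumption. `annotate` and `good_candidates` are hypothetical names.

```python
# Hypernyms that mark a noun phrase as a possible commodity.
COMMODITY_HYPERNYMS = {"substance", "physical matter", "plant", "animal"}

# Toy stand-in for WordNet hypernym chains (assumption, not real WordNet data).
TOY_HYPERNYMS = {
    "cotton": {"plant", "substance"},
    "whisky": {"substance"},
    "notice": {"communication"},
}

# Toy stand-in for the historians' commodity list.
TOY_GAZETTEER = {"cotton", "cinchona"}

def annotate(noun_phrases, hypernyms=TOY_HYPERNYMS, gazetteer=TOY_GAZETTEER):
    """Attach wn and commlex flags to each noun phrase."""
    annotated = []
    for phrase in noun_phrases:
        head = phrase.lower().split()[-1]  # assume the last word is the head
        wn = bool(hypernyms.get(head, set()) & COMMODITY_HYPERNYMS)
        commlex = phrase.lower() in gazetteer
        annotated.append({"text": phrase, "wn": wn, "commlex": commlex})
    return annotated

def good_candidates(annotated):
    """Keep only phrases flagged by both lookups."""
    return [a["text"] for a in annotated if a["wn"] and a["commlex"]]

result = annotate(["cotton", "whisky", "notice"])
print(good_candidates(result))
```

Here "whisky" passes the hypernym test but is missing from the toy gazetteer, so only "cotton" survives the intersection.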
Ports-based Geo-grounding
‣ Started with non-optimised geo-resolution.
‣ Incorporated the list of ports.
      ‣ Locations are assigned an is_port="1" or an
         is_port="0" attribute.
      ‣ Grounding now ignores non-port candidates in case
         of ambiguous location mentions.
      ‣ is_port locations are also given a higher weight in
         the scoring.
‣ Hypothesis: ports are more likely to be
     significant locations in historic documents
     about trade.
‣    Not yet tested: gold standard data is needed.
Ports-based Geo-grounding
‣ Example:
Dalhousie is in the list of ports as:
DALHOUSIE                -66.4   48.1

Geo-grounding in non-optimised resolver:
<ent id="rb3" type="location" lat="32.5333300" long="75.9833300" in-country="IN"
gazref="geonames:1273648" feat-type="ppl" pop-size="7601">
  <parts>
   <part ew="w136" sw="w136">Dalhousie</part>
  </parts>
 </ent>

Geo-grounding in ports-dependent resolver:
 <ent id="rb2" type="location" lat="48.0550200" long="-66.3847200" in-country="CA"
gazref="geonames:6943599" feat-type="ppl" pop-size="0">
  <parts>
   <part ew="w97" sw="w97">Dalhousie</part>
  </parts>
 </ent>
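
The Dalhousie example can be reproduced with a small sketch. It assumes the baseline resolver simply prefers the most populous candidate (the real scoring is not described here) and that a candidate counts as a port when its name and coordinates are close to a port-list entry. Only the filtering step is shown; the extra scoring weight for ports is omitted. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    lat: float
    lon: float
    country: str
    population: int

# Port list entry: name, longitude, latitude.
PORTS = [("DALHOUSIE", -66.4, 48.1)]

# The two gazetteer candidates for "Dalhousie".
CANDIDATES = [
    Candidate("Dalhousie", 32.53333, 75.98333, "IN", 7601),
    Candidate("Dalhousie", 48.05502, -66.38472, "CA", 0),
]

def is_port(candidate, ports=PORTS, max_deg=0.5):
    """True if a port with the same name lies within max_deg degrees."""
    return any(
        candidate.name.upper() == name
        and abs(candidate.lon - lon) <= max_deg
        and abs(candidate.lat - lat) <= max_deg
        for name, lon, lat in ports
    )

def resolve(candidates, use_ports):
    """Pick one candidate; with use_ports, drop non-ports when a port exists."""
    pool = candidates
    if use_ports:
        port_candidates = [c for c in candidates if is_port(c)]
        pool = port_candidates or candidates  # fall back if no port matches
    return max(pool, key=lambda c: c.population)

print(resolve(CANDIDATES, use_ports=False).country)
print(resolve(CANDIDATES, use_ports=True).country)
```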
Ports-based Geo-grounding
‣ Geo-grounding assumes that each text is a
     coherent whole: all locations contribute to
     the resolution of all others. This assumption
     may need to change.
‣    Segmentation (e.g. of books) into smaller
     units might improve the resolution.
‣    Need to consider old spellings of place
     names.



Relation Extraction
‣ Crude way to identify commodity-location
   relations:
    ‣ Sentences (s) containing a word (w) with both
          commlex="true" and wn="true", plus a location.

 Good: The quantity of raw cotton imported annually into the United Kingdom—take for
 example, the year 1854—amounted to, at least, 887,335,9041bs., of which the United States
 supplied 722,154,101 lbs.

 Of interest: Another kind of quinine-yieldmg bark has been discovered on the western side of the
 Cordillera, which produces more sulphate than the common cinchona; and as the cinchona
 grows on both sides of the Cordillera, it may be inferred that the new plant will be found also in
 the lands of Gualaquiza and Canelos.

 Bad: The first-class refreshment room, Central Station, Leeds, has a notice that only five-year old
 whisky is sold there. OR
 This paper was concealed in the handle of a spear, carried from Omdurman to Gedarif in that
 way.
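
 The crude co-occurrence rule can be sketched as follows, assuming each sentence is a list of token dictionaries carrying the commlex, wn and location flags (a simplification of the XML markup). The function name is hypothetical.

```python
def cooccurrence_relations(sentences):
    """Pair every flagged commodity with every location in the same sentence."""
    relations = []
    for tokens in sentences:
        commodities = [t["text"] for t in tokens
                       if t.get("commlex") and t.get("wn")]
        locations = [t["text"] for t in tokens if t.get("location")]
        relations.extend((c, loc) for c in commodities for loc in locations)
    return relations

# Simplified tokens for the "Good" and first "Bad" example sentences.
sentences = [
    [
        {"text": "cotton", "commlex": True, "wn": True},
        {"text": "United Kingdom", "location": True},
        {"text": "United States", "location": True},
    ],
    [
        {"text": "Leeds", "location": True},
        {"text": "whisky", "commlex": True, "wn": True},
    ],
]
for pair in cooccurrence_relations(sentences):
    print(pair)
```

 The rule has no way to tell the cotton pairs from the whisky pair, which is why the last example counts as bad.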


Relation Extraction
‣ Need to improve the relation extraction.
‣ Will look at pattern-based relation extraction
     exploiting vocabulary like "import", "export",
     "ship", "shipment", "trade", "manufacture",
     "grow" etc.
‣    Will annotate a small test corpus for
     evaluation.
‣    Need to distinguish between irrelevant or
     false commodity-location relations and
     commodity-location relations referring to
     trade.
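
A sketch of what a first pattern-based filter might look like: keep a commodity-location pair only if its sentence contains one of the trade words listed above. The suffix handling is a rough approximation of English inflection, and the names are hypothetical.

```python
import re

# Trade vocabulary from the list above, with a few common inflections.
TRADE_RE = re.compile(
    r"\b(?:import|export|ship|shipment|trade|manufacture|grow)"
    r"(?:s|ed|d|ped|ping|ing|n)?\b",
    re.IGNORECASE,
)

def is_trade_sentence(sentence):
    """True if the sentence contains a trade-related word."""
    return TRADE_RE.search(sentence) is not None

def filter_relations(relations):
    """Keep (commodity, location, sentence) triples from trade sentences."""
    return [r for r in relations if is_trade_sentence(r[2])]

relations = [
    ("cotton", "United Kingdom",
     "The quantity of raw cotton imported annually into the United Kingdom"),
    ("whisky", "Leeds",
     "The first-class refreshment room, Central Station, Leeds, has a notice "
     "that only five-year old whisky is sold there."),
]
for commodity, location, _ in filter_relations(relations):
    print(commodity, location)
```

A filter like this still accepts any sentence that merely mentions growing or shipping, so it would need checking against the annotated test corpus.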
Thank You
‣ Questions?




Example Input
‣ Different sources converted into common
  XML format




Example Output





Top 10 Hubspot Development Companies in 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 

Exploring Challenges in Mining Historical Text

  • 1. Exploring Challenges in Mining Historical Text
      Beatrice Alex, Claire Grover, Richard Tobin and Ewan Klein
      Working with text: Tools, techniques and approaches for text mining - Edinburgh - 07/07/2012
  • 2. Overview
      ‣ Project
      ‣ Data
      ‣ Preprocessing historical text
          ‣ Improvements to OCR
          ‣ Language identification
          ‣ Text mining tables
      ‣ Text-mining
          ‣ Improved commodity identification
          ‣ Ports-based geo-grounding
          ‣ Relation extraction
  • 3. Project (01/2012-12/2014)
      ‣ Funded by Digging into Data (round 2)
      ‣ Partners
          ‣ Ewan Klein, Claire Grover, Bea Alex (text mining)
          ‣ Colin Coates, Jim Clifford (historical analysis)
          ‣ James Reid (data integration)
          ‣ Aaron Quigley, Uta Hinrichs (information visualisation)
  • 4. Trading Consequences
      ‣ What does archival text say about the economic and environmental consequences of global commodity trading during the nineteenth century?
      ‣ Help historians to discover novel patterns and explore new hypotheses.
      ‣ Example questions:
          ‣ What were the routes and volumes of international trade in resource commodities 1850-1914?
          ‣ What were the local environmental consequences of this demand for these resources?
  • 5. Geolocating Cinchona
  • 6. Trading Consequences
      ‣ Scope: global but with focus on Canadian natural resource flows to test reliability and efficacy of our methods
      ‣ Methods:
          ‣ Text mining and geo-parsing to transform the text into structured data, e.g. relational database
          ‣ Query interface targeted at historians
          ‣ Information visualisation for interactive exploration
  • 7. Historical Data
      ‣ Digitised sources from the 19th century British Empire, currently processing:
          ‣ Early Canadiana Online: 83,038 files
          ‣ JSTOR data: 1,000 XML files
          ‣ House of Commons Parliamentary Papers: 4,135 files
          ‣ Books: selected books on nineteenth century trade
      ‣ Further sources:
          ‣ ProQuest data
          ‣ Encyclopaedia Britannica, JSTOR Plants, Forestry Journals?, The Botanist?
  • 8. Processing Historical Data
      ‣ Challenges so far:
          ‣ Different formats
          ‣ Low-quality OCRed text
              ‣ Old/low-quality prints, quality of OCR technology
              ‣ Historical English: historical word variants, ſ (long s) characters mixed up with f by OCR
              ‣ Artefacts in original documents: headers/footers, page numbers, notes in margins, end-of-line hyphenation
          ‣ Text in different languages
          ‣ Information in tables
  • 10. Improvements to OCR
      ‣ Normalisation and post-correction
      ‣ Fixed end-of-line hyphenation
          ‣ Dehyphenate all token-splitting hyphens using a dictionary-based approach (dictionary is the system dictionary + the text of the current document)
      ‣ Added f-to-s conversion
          ‣ Convert all false f characters to s using a corpus-based approach (corpus is a collection of historical documents from Project Gutenberg)
      ‣ Example: reduced number of words unrecognised by spell checker from 61 to 21 -> approx. 67% improvement
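The two post-correction steps on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the regular expression, and the use of plain vocabulary membership (rather than corpus frequencies) are hypothetical and not taken from the project's tools.

```python
import re
from itertools import product


def dehyphenate(text, dictionary):
    """Join tokens split by an end-of-line hyphen when the joined form is a known word."""
    def join(match):
        first, second = match.group(1), match.group(2)
        joined = first + second
        # Drop the hyphen only if the joined token is in the dictionary;
        # otherwise keep it as a genuine hyphenated compound.
        if joined.lower() in dictionary:
            return joined
        return first + "-" + second

    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)


def fix_long_s(token, vocab):
    """Replace falsely recognised f characters with s if that yields a known word."""
    if token.lower() in vocab:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of keeping f or switching to s at each f position.
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate.lower() in vocab:
            return candidate
    return token  # no known variant found, leave the token unchanged


if __name__ == "__main__":
    print(dehyphenate("the com-\nmodity trade", {"commodity"}))
    print(fix_long_s("colonifts", {"colonists"}))
```

In the setup described on the slide, the dictionary would be the system dictionary plus the words of the current document, and the vocabulary for the f-to-s step would come from the historical Project Gutenberg texts.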
  • 11. Improvements to OCR
  • 12. Improvements to OCR
  • 13. Improvements to OCR
      ‣ Extensive evaluation of both tools against human corrected/normalised gold standard
      ‣ Reduced word error rate by 12.5% (relative) in a random Canadiana sample (word acc: 0.776 -> 0.804)
      ‣ Improvements have an effect on later text mining steps and would also be beneficial for searching text in any IR system (e.g. JSTOR database search for “French colonifts”)
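The 12.5% figure can be checked from the two accuracy values, assuming that word error rate is taken as 1 minus word accuracy and that the reduction is relative to the original error rate. The function name below is illustrative.

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative reduction in word error rate, with error rate = 1 - accuracy."""
    err_before = 1.0 - acc_before   # 0.224 for the Canadiana sample
    err_after = 1.0 - acc_after     # 0.196 after post-correction
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(0.776, 0.804)
print(f"{reduction:.1%}")
```

Under those assumptions the error rate drops from 0.224 to 0.196, an absolute change of 0.028, which is one eighth of the original error.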
  • 14. Language Identification
      ‣ Most sources do not contain language information like Canadiana does
      ‣ The table displays the number of text elements in Canadiana per language, ignoring notes and titles

      ISO Code   Language           Frequency
      eng        English            2,677,498
      fra        French             1,208,811
      deu        German                 2,886
      chn        Chinook jargon         2,488
      moh        Mohawk                 1,547
      oji        Ojibwa                 1,395
      emg        Eastern Meohang          835
      enb        Markweeta                666
      cre        Cree                     501
      iro        Iroquoian                324
      alg        Algonquian               210
      nge        Ngemba                   157
      nld        Dutch                    131
      lat        Latin                    119
      mic        Micmac                    61
      gla        Scottish Gaelic           22
  • 15. Language Identification
      ‣ Make use of automatic language identification using TextCat, especially for the JSTOR data which is also multi-lingual.
      ‣ LID is done for each paragraph and for the entire document by taking the most frequent language tag assigned.
      ‣ Can limit processing to English (and French) documents only.
      ‣ 740 English documents (out of 1,000)
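The document-level decision described above amounts to a majority vote over paragraph-level tags. A minimal sketch, assuming the paragraph tags have already been produced by a language identifier and are expressed as ISO codes (both assumptions; the function names are hypothetical and no TextCat API is used here):

```python
from collections import Counter


def document_language(paragraph_langs):
    """Return the most frequent paragraph-level language tag, or None if there are none."""
    if not paragraph_langs:
        return None
    return Counter(paragraph_langs).most_common(1)[0][0]


def keep_document(paragraph_langs, allowed=("eng", "fra")):
    """Keep only documents whose majority language is in the allowed set."""
    return document_language(paragraph_langs) in allowed


if __name__ == "__main__":
    print(document_language(["eng", "fra", "eng"]))
    print(keep_document(["deu", "deu", "eng"]))
```

Filtering on the document-level tag keeps the later pipeline steps restricted to English (and optionally French) material.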
  • 16. Text Mining Tables
  • 17. Text Mining Tables
      ‣ Tables contain a lot of relevant information but are difficult to mine.
      ‣ HCPP documents contain coordinates for each table entry.
        <w p="961,1777,1026,1807" v="d">Rio</w>
        <w p="1026,1777,1170,1807" v="d">Janeiro</w>
        ...
        <w p="961,1892,1087,1921" v="n">Culcutta</w>
        <w p="1496,1530,1565,1555" v="o">141</w>
        <w p="1565,1525,1631,1555" v="d">bags</w>
        <w p="1227,1774,1336,1804" v="d">Wood</w>
        <w p="1353,1791,1366,1799" v="o">-</w>
        <w p="1494,1776,1565,1804" v="o">338</w>
        <w p="1565,1783,1676,1803" v="d">planks</w>
        <w p="1704,1791,1718,1799" v="o">-</w>
      ‣ Planning to do a feasibility study for a table mining algorithm.
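One possible starting point for such a feasibility study is to group words into table rows by their vertical position. The sketch below is speculative, since the slide does not describe an algorithm: it assumes the p attribute holds "x1,y1,x2,y2" pixel coordinates, wraps the sample words in a hypothetical <table> root element, and uses an arbitrary tolerance value.

```python
import xml.etree.ElementTree as ET

# Words copied from the slide (the elided "..." entries are left out);
# the <table> wrapper is an assumption added to make the fragment well-formed.
SAMPLE = """<table>
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
<w p="961,1892,1087,1921" v="n">Culcutta</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1353,1791,1366,1799" v="o">-</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
<w p="1704,1791,1718,1799" v="o">-</w>
</table>"""


def group_rows(xml_text, tolerance=15):
    """Cluster words into rows by vertical centre, then order each row left to right."""
    words = []
    for w in ET.fromstring(xml_text).iter("w"):
        x1, y1, x2, y2 = (int(n) for n in w.get("p").split(","))
        words.append(((y1 + y2) / 2, x1, w.text))
    words.sort()  # top to bottom by vertical centre

    rows, last_centre = [], None
    for centre, x1, text in words:
        # Start a new row when the vertical gap to the previous word is too large.
        if last_centre is None or centre - last_centre > tolerance:
            rows.append([])
        rows[-1].append((x1, text))
        last_centre = centre
    return [[text for _, text in sorted(row)] for row in rows]


if __name__ == "__main__":
    for row in group_rows(SAMPLE):
        print(row)
```

Using the vertical centre rather than the top edge keeps small glyphs such as the hyphens on the same row as the taller words around them.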
  • 18. Text Mining Pipeline
      ‣ Steps after the OCR improvements and LID:
          ‣ Tokenisation
          ‣ Part-of-speech tagging
          ‣ Lemmatisation
          ‣ WordNet lookup to find commodities
          ‣ Named-entity recognition including commodity lexicon lookup
          ‣ Port-based geo-grounding
          ‣ Chunking
  • 20. Commodities Identification
      ‣ WordNet lookup using an approximation of commodity named entities:
          ‣ Noun phrases with hypernyms such as substance, physical matter, plant or animal in WordNet.
          ‣ Each NP which leads to a match is assigned a wn="true" attribute.
      ‣ Commodities gazetteer lookup using a list of commodities derived by historians.
          ‣ Strings matching the entries in the gazetteer are assigned a commlex="true" attribute.
      ‣ Words/phrases with wn="true" and commlex="true" are good candidates.
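The way the two lookups combine can be illustrated with a small, self-contained sketch. The hypernym table below is a hand-written stand-in for WordNet and the gazetteer entries are invented for illustration; neither reflects the project's actual resources, and the function names are hypothetical.

```python
# Hypothetical stand-in for WordNet: each noun maps to its chain of hypernyms.
TOY_HYPERNYMS = {
    "cotton": ["fibre", "material", "substance"],
    "cinchona": ["tree", "plant"],
    "whisky": ["liquor", "beverage", "substance"],
    "station": ["facility", "artifact"],
}

# Hypernyms that approximate "commodity" (taken from the slide).
COMMODITY_ROOTS = {"substance", "physical matter", "plant", "animal"}

# Illustrative gazetteer; the real list was derived by historians.
GAZETTEER = {"cotton", "cinchona", "timber"}


def annotate(noun_phrases):
    """Attach wn and commlex flags to each noun phrase."""
    annotated = []
    for phrase in noun_phrases:
        key = phrase.lower()
        wn = any(h in COMMODITY_ROOTS for h in TOY_HYPERNYMS.get(key, []))
        commlex = key in GAZETTEER
        annotated.append({"text": phrase, "wn": wn, "commlex": commlex})
    return annotated


def good_candidates(noun_phrases):
    """Keep phrases that pass both the WordNet and the gazetteer lookup."""
    return [a["text"] for a in annotate(noun_phrases) if a["wn"] and a["commlex"]]


if __name__ == "__main__":
    print(good_candidates(["cotton", "whisky", "station", "timber"]))
```

Requiring both flags trades recall for precision: a phrase found by only one lookup is kept as annotated but not treated as a strong candidate.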
  • 21. Ports-based Geo-grounding
      ‣ Started with non-optimised geo-resolution.
      ‣ Incorporated the list of ports. Locations are assigned an is_port="1" or is_port="0" attribute.
      ‣ Grounding now ignores non-port candidates in case of ambiguous location mentions.
      ‣ is_port locations are also given a higher weight in the scoring.
      ‣ Hypothesis: ports are more likely to be significant locations in historic documents about trade.
      ‣ Not tested yet, as gold standard data is needed.
  • 22. Ports-based Geo-grounding
      ‣ Example: Dalhousie is in the list of ports as:
        DALHOUSIE    -66.4   48.1
      Geo-grounding in non-optimised resolver:
        <ent id="rb3" type="location" lat="32.5333300" long="75.9833300" in-country="IN" gazref="geonames:1273648" feat-type="ppl" pop-size="7601">
          <parts>
            <part ew="w136" sw="w136">Dalhousie</part>
          </parts>
        </ent>
      Geo-grounding in ports-dependent resolver:
        <ent id="rb2" type="location" lat="48.0550200" long="-66.3847200" in-country="CA" gazref="geonames:6943599" feat-type="ppl" pop-size="0">
          <parts>
            <part ew="w97" sw="w97">Dalhousie</part>
          </parts>
        </ent>
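The effect shown in the Dalhousie example can be sketched as a candidate filter followed by weighted scoring. Everything beyond the two gazetteer references is an assumption: the base_score values, the PORT_WEIGHT constant and the function names are hypothetical and only illustrate the idea, not the resolver's actual scoring.

```python
PORT_WEIGHT = 2.0  # hypothetical boost applied to port candidates


def score(candidate):
    """Base score, boosted when the candidate is a known port."""
    base = candidate["base_score"]
    return base * PORT_WEIGHT if candidate["is_port"] else base


def ground(candidates):
    """Pick a gazetteer entry, ignoring non-port candidates for ambiguous mentions."""
    if not candidates:
        return None
    ports = [c for c in candidates if c["is_port"]]
    # A mention is ambiguous when it has more than one candidate.
    pool = ports if len(candidates) > 1 and ports else candidates
    return max(pool, key=score)


# Two candidates for "Dalhousie", using the gazetteer references from the slide;
# the base scores are invented for illustration.
DALHOUSIE = [
    {"gazref": "geonames:1273648", "country": "IN", "is_port": False, "base_score": 0.9},
    {"gazref": "geonames:6943599", "country": "CA", "is_port": True, "base_score": 0.1},
]

if __name__ == "__main__":
    print(ground(DALHOUSIE)["country"])
```

Without the port flags the higher-scoring Indian town would win, which mirrors the output of the non-optimised resolver.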
  • 23. Ports-based Geo-grounding
      ‣ Geo-grounding assumes that each text is a coherent whole. All locations contribute to the resolution of all others. May have to change that.
      ‣ Segmentation (e.g. of books) into smaller units might improve the resolution.
      ‣ Need to consider old spellings of place names.
  • 24. Relation Extraction
      ‣ Crude way to identify commodity-location relations:
          ‣ Sentences (s) containing a location and words (w) with the attributes commlex="true" and wn="true".
      Good: The quantity of raw cotton imported annually into the United Kingdom—take for example, the year 1854—amounted to, at least, 887,335,9041bs., of which the United States supplied 722,154,101 lbs.
      Of interest: Another kind of quinine-yieldmg bark has been discovered on the western side of the Cordillera, which produces more sulphate than the common cinchona; and as the cinchona grows on both sides of the Cordillera, it may be inferred that the new plant will be found also in the lands of Gualaquiza and Canelos.
      Bad: The first-class refreshment room, Central Station, Leeds, has a notice that only five-year old whisky is sold there. OR This paper was concealed in the handle of a spear, carried from Omdurman to Gedarif in that way.
  • 25. Relation Extraction
  • 26. Relation Extraction
      ‣ Need to improve the relation extraction.
      ‣ Will look at pattern-based relation extraction exploiting vocabulary like "import", "export", "ship", "shipment", "trade", "manufacture", "grow" etc.
      ‣ Will annotate a small test corpus for evaluation.
      ‣ Need to distinguish between irrelevant or false commodity-location relations and commodity-location relations referring to trade.
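A first step towards the pattern-based approach could be to keep a co-occurring commodity-location pair only when the sentence also contains trade vocabulary. This is a rough sketch under assumptions: the regular expression, the function name, and the idea of passing in pre-identified commodity and location mentions are illustrative, not the planned implementation.

```python
import re

# Trade vocabulary from the slide, matched as word prefixes so that
# inflected forms such as "imported" or "grows" are also found.
TRADE_PATTERN = re.compile(
    r"\b(import|export|ship|shipment|trade|manufacture|grow)\w*\b",
    re.IGNORECASE,
)


def extract_relations(sentence, commodities, locations):
    """Return (commodity, location) pairs only if the sentence uses trade vocabulary."""
    if not TRADE_PATTERN.search(sentence):
        return []
    return [(c, loc) for c in commodities for loc in locations]


if __name__ == "__main__":
    good = "The quantity of raw cotton imported annually into the United Kingdom"
    bad = ("The first-class refreshment room, Central Station, Leeds, has a notice "
           "that only five-year old whisky is sold there.")
    print(extract_relations(good, ["cotton"], ["United Kingdom"]))
    print(extract_relations(bad, ["whisky"], ["Leeds"]))
```

Prefix matching is deliberately loose and will also accept unrelated words that begin with a trade term, so an annotated test corpus would still be needed to measure how well the filter separates trade relations from false ones.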
  • 27. Thank You
      ‣ Questions?
  • 28. Example Input
      ‣ Different sources converted into common XML format
  • 29. Example Output
