Exploring Challenges in Mining Historical Text

Beatrice Alex, Claire Grover, Richard Tobin and Ewan Klein

Working with text: Tools, techniques and approaches for text mining
Edinburgh - 07/07/2012

Overview
‣ Project
‣ Data
‣ Preprocessing historical text
‣ Improvements to OCR
‣ Language identification
‣ Text mining tables
‣ Text-mining
‣ Improved commodity identification
‣ Ports-based geo-grounding
‣ Relation extraction


Project (01/2012-12/2014)
‣ Funded by Digging into Data (round 2)
‣ Partners
Ewan Klein, Claire Grover, Bea Alex (text mining)

Colin Coates, Jim Clifford (historical analysis)

James Reid (data integration)

Aaron Quigley, Uta Hinrichs (information
visualisation)

Trading Consequences
‣ What does archival text say about the
economic and environmental consequences of
global commodity trading during the
nineteenth century?
‣ Help historians to discover novel patterns and
explore new hypotheses.
‣ Example questions:
‣ What were the routes and volumes of international
trade in resource commodities 1850-1914?
‣ What were the local environmental consequences of
this demand for these resources?


Geolocating Cinchona


Trading Consequences
‣ Scope: global but with focus on Canadian
natural resource flows to test reliability and
efficacy of our methods
‣ Methods:
‣ Text mining and geo-parsing to transform the text
into structured data, e.g. relational database
‣ Query interface targeted at historians
‣ Information visualisation for interactive exploration


Historical Data
‣ Digitised sources from the 19th century
British Empire, currently processing
‣ Early Canadiana Online: 83,038 files
‣ JSTOR data: 1,000 XML files
‣ House of Commons Parliamentary Papers: 4,135
files
‣ Books: selected books on nineteenth century trade

‣ Further sources:
‣ ProQuest data
‣ Encyclopaedia Britannica, JSTOR Plants, Forestry
Journals?, The Botanist?


Processing Historical Data
‣ Challenges so far:
‣ Different formats
‣ Low-quality OCRed text
‣ Old/low-quality prints, quality of OCR
technology
‣ Historical English: historical word variants,
ſ (long s) characters mixed up with f by OCR
‣ Artefacts in original documents: headers/footers,
page numbers, notes in margins, end-of-line
hyphenation
‣ Text in different languages
‣ Information in tables


Improvements to OCR
‣ Normalisation and post-correction
‣ Fixed end-of-line hyphenation
‣ Dehyphenate all token-splitting hyphens using a
dictionary-based approach (dictionary is the system
dictionary + the text of the current document)
‣ Added f-to-s conversion
‣ Convert all false f characters to s using a
corpus-based approach (corpus is a collection of historical
documents from Project Gutenberg)
‣ Example: reduced number of words
unrecognised by spell checker from 61 to 21
(approx. 66% fewer)
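
The two corrections above can be sketched as follows. This is a minimal illustration and not the project's actual tools: the function names, the regular expression and the toy vocabulary and corpus counts are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import product

# A word fragment, a hyphen at the end of a line, and the next fragment.
HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)")

def dehyphenate(text, vocabulary):
    """Join end-of-line hyphenations when the joined form is a known word.

    `vocabulary` stands in for "system dictionary + current document".
    Unknown joins keep their hyphen (e.g. "well-known").
    """
    def join(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in vocabulary:
            return left + right
        return left + "-" + right
    return HYPHEN_BREAK.sub(join, text)

def fix_long_s(token, corpus_counts):
    """Replace false 'f' characters with 's' using corpus frequencies.

    `corpus_counts` is a Counter of word forms from a reference corpus.
    Tokens already attested in the corpus are left untouched.
    """
    if corpus_counts[token.lower()] > 0 or "f" not in token:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    best, best_count = token, 0
    # Try every combination of keeping 'f' or switching to 's'.
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        count = corpus_counts[candidate.lower()]
        if count > best_count:
            best, best_count = candidate, count
    return best
```

For example, `fix_long_s("colonifts", Counter({"colonists": 5}))` yields `"colonists"`, while a token that the corpus already contains (such as "of") is returned unchanged.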


Improvements to OCR
‣ Extensive evaluation of both tools against
human corrected/normalised gold standard
‣ Reduced word error rate by 12.5% (relative) in a random
Canadiana sample (word accuracy: 0.776 -> 0.804)
‣ Improvements have an effect on later text
mining steps and would also be beneficial for
searching text in any IR system (e.g. JSTOR
database search for “French colonifts”)
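
The 12.5% figure is a relative reduction in word error rate. A small worked check, assuming that word error rate is simply 1 minus word accuracy (a simplification made for this sketch; the function name is illustrative):

```python
def relative_wer_reduction(acc_before, acc_after):
    """Relative drop in word error rate, taking WER = 1 - word accuracy."""
    wer_before = 1.0 - acc_before
    wer_after = 1.0 - acc_after
    return (wer_before - wer_after) / wer_before

# Word accuracy 0.776 -> 0.804 means WER 0.224 -> 0.196.
print(round(relative_wer_reduction(0.776, 0.804), 3))  # 0.125
```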


Language Identification
‣ Most sources do not contain language
information like Canadiana does
‣ The table displays the number of text
elements in Canadiana per language,
ignoring notes and titles

ISO Code  Language         Frequency
eng       English          2,677,498
fra       French           1,208,811
deu       German               2,886
chn       Chinook jargon       2,488
moh       Mohawk               1,547
oji       Ojibwa               1,395
emg       Eastern Meohang        835
enb       Markweeta              666
cre       Cree                   501
iro       Iroquoian              324
alg       Algonquian             210
nge       Ngemba                 157
nld       Dutch                  131
lat       Latin                  119
mic       Micmac                  61
gla       Scottish Gaelic         22


Language Identification
‣ Make use of automatic language
identification using TextCat, especially for the
JSTOR data which is also multi-lingual.
‣ LID is done for each paragraph; the language of the
entire document is the most frequent
paragraph-level language tag.
‣ Can limit processing to English (and French)
documents only.
‣ JSTOR: 740 English documents (out of 1,000)
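
The document-level decision can be sketched as a majority vote over paragraph tags. The paragraph tags are taken as given here (TextCat itself is not called), and the function names and the `allowed` default are assumptions for illustration.

```python
from collections import Counter

def document_language(paragraph_tags):
    """Return the most frequent paragraph-level language tag, or None."""
    if not paragraph_tags:
        return None
    return Counter(paragraph_tags).most_common(1)[0][0]

def keep_document(paragraph_tags, allowed=("eng", "fra")):
    """Decide whether a document goes on to further processing."""
    return document_language(paragraph_tags) in allowed
```

A document tagged `["eng", "eng", "fra"]` would be labelled `eng` and kept, while one tagged `["deu", "deu", "eng"]` would be skipped.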



Text Mining Tables
‣ Tables contain a lot of relevant information
but are difficult to mine.
‣ HCPP documents contain coordinates for
each table entry.
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
...
<w p="961,1892,1087,1921" v="n">Culcutta</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1353,1791,1366,1799" v="o">-</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
<w p="1704,1791,1718,1799" v="o">-</w>

‣ Planning to do a feasibility study for a table
mining algorithm.
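
One possible starting point for such a study, sketched under assumptions: the `p` attribute is read as `x1,y1,x2,y2`, words whose vertical centres lie within a tolerance are treated as one table row, and the wrapper element, function names and tolerance value are all hypothetical. It is not the project's algorithm.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
"""

def words_with_boxes(xml_fragment):
    """Parse <w> elements and split their p attribute into coordinates."""
    root = ET.fromstring("<page>" + xml_fragment + "</page>")
    words = []
    for w in root.iter("w"):
        x1, y1, x2, y2 = (int(v) for v in w.get("p").split(","))
        words.append({"text": w.text, "x1": x1, "y1": y1, "x2": x2, "y2": y2})
    return words

def group_rows(words, tolerance=15):
    """Group words into rows by vertical centre, then order by x position."""
    rows = []
    for word in sorted(words, key=lambda w: (w["y1"] + w["y2"]) / 2):
        centre = (word["y1"] + word["y2"]) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "words": [word]})
    return [[w["text"] for w in sorted(r["words"], key=lambda w: w["x1"])]
            for r in rows]

rows = group_rows(words_with_boxes(SAMPLE))
```

On the sample, this groups "141 bags" into one row and "Rio Janeiro Wood 338 planks" into another. Column detection (for example, clustering on `x1`) would be a further step.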

Text Mining Pipeline
‣ Steps after the OCR improvements and LID:
‣ Tokenisation
‣ Part-of-speech tagging
‣ Lemmatisation
‣ Wordnet lookup to find commodities
‣ Named-entity recognition including commodity
lexicon lookup
‣ Port-based Geo-grounding
‣ Chunking


Commodities Identification
‣ WordNet lookup using an approximation of
commodity named entities:
‣ Noun phrases with hypernyms such as substance,
physical matter, plant or animal in WordNet.
‣ Each NP which leads to a match is assigned a
wn="true" attribute.
‣ Commodities gazetteer lookup using a list of
commodities derived by historians.
‣ Strings matching the entries in the gazetteer are
assigned a commlex="true" attribute.
‣ Words/phrases with wn="true" and
commlex="true" are good candidates.
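
How the two lookups combine can be sketched as below. The hypernym chains and the gazetteer are toy, hand-written stand-ins (not real WordNet output and not the historians' list), and the function names are hypothetical.

```python
# Hypernyms that mark a noun phrase as a possible commodity.
COMMODITY_HYPERNYMS = {"substance", "physical matter", "plant", "animal"}

# Toy stand-in for a WordNet lookup: head noun -> hypernym chain (invented).
TOY_HYPERNYMS = {
    "cotton": ["fibre", "substance"],
    "cinchona": ["tree", "plant"],
    "whisky": ["liquor", "substance"],
    "room": ["area", "structure"],
}

# Toy stand-in for the commodities gazetteer.
TOY_GAZETTEER = {"cotton", "raw cotton", "cinchona"}

def annotate(noun_phrase):
    """Assign wn and commlex flags to a noun phrase."""
    head = noun_phrase.lower().split()[-1]
    wn = any(h in COMMODITY_HYPERNYMS for h in TOY_HYPERNYMS.get(head, []))
    commlex = noun_phrase.lower() in TOY_GAZETTEER
    return {"text": noun_phrase, "wn": wn, "commlex": commlex}

def good_candidates(noun_phrases):
    """Keep only phrases flagged by both lookups."""
    return [a["text"] for a in map(annotate, noun_phrases)
            if a["wn"] and a["commlex"]]
```

With these toy resources, `good_candidates(["raw cotton", "whisky", "refreshment room", "cinchona"])` keeps "raw cotton" and "cinchona": "whisky" passes only the hypernym test, and "refreshment room" passes neither.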

Ports-based Geo-grounding
‣ Started with non-optimised geo-resolution.
‣ Incorporated the list of ports.
‣ Locations are assigned an is_port="1" or an
is_port="0" attribute.
‣ Grounding now ignores non-port candidates in case
of ambiguous location mentions.
‣ is_port locations are also given a higher weight in
the scoring.
‣ Hypothesis: ports are more likely to be
significant locations in historic documents
about trade.
‣ Not tested yet, as gold standard data is needed.

‣ Example:
Dalhousie is in the list of ports as:
DALHOUSIE -66.4 48.1

Geo-grounding in non-optimised resolver:
<ent id="rb3" type="location" lat="32.5333300" long="75.9833300" in-country="IN"
gazref="geonames:1273648" feat-type="ppl" pop-size="7601">
<parts>
<part ew="w136" sw="w136">Dalhousie</part>
</parts>
</ent>

Geo-grounding in ports-dependent resolver:
<ent id="rb2" type="location" lat="48.0550200" long="-66.3847200" in-country="CA"
gazref="geonames:6943599" feat-type="ppl" pop-size="0">
<parts>
<part ew="w97" sw="w97">Dalhousie</part>
</parts>
</ent>
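
The port preference in the Dalhousie example can be sketched as follows. The `score` values and `port_weight` are invented for illustration, since the resolver's real scoring is not described here; only the `gazref` and `is_port` values come from the example.

```python
def choose_candidate(candidates, port_weight=2.0):
    """Pick a gazetteer candidate for an ambiguous place name.

    Non-port candidates are ignored whenever a port candidate exists,
    and port candidates get a higher weight in the scoring.
    """
    ports = [c for c in candidates if c["is_port"] == "1"]
    pool = ports if ports else candidates

    def weighted(c):
        return c["score"] * (port_weight if c["is_port"] == "1" else 1.0)

    return max(pool, key=weighted)

# The two readings of "Dalhousie"; the scores are hypothetical.
dalhousie = [
    {"gazref": "geonames:1273648", "country": "IN", "is_port": "0", "score": 0.6},
    {"gazref": "geonames:6943599", "country": "CA", "is_port": "1", "score": 0.4},
]
best = choose_candidate(dalhousie)
```

Here `best` is the Canadian entry, even though its unweighted score is lower.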

‣ Geo-grounding assumes that each text is a
coherent whole. All locations contribute to
the resolution of all others. May have to
change that.
‣ Segmentation (e.g. of books) into smaller
units might improve the resolution.
‣ Need to consider old spellings of place
names.


Relation Extraction
‣ Crude way to identify commodity-location
relations:
‣ Sentences (s) containing a word (w) marked with both
commlex="true" and wn="true", together with a location.

Good: The quantity of raw cotton imported annually into the United Kingdom—take for
example, the year 1854—amounted to, at least, 887,335,9041bs., of which the United States
supplied 722,154,101 lbs.

Of interest: Another kind of quinine-yieldmg bark has been discovered on the western side of the
Cordillera, which produces more sulphate than the common cinchona; and as the cinchona
grows on both sides of the Cordillera, it may be inferred that the new plant will be found also in
the lands of Gualaquiza and Canelos.

Bad: The first-class refreshment room, Central Station, Leeds, has a notice that only five-year old
whisky is sold there. OR
This paper was concealed in the handle of a spear, carried from Omdurman to Gedarif in that
way.
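
The crude co-occurrence rule can be sketched like this. Sentences are represented as plain dictionaries rather than the pipeline's XML, and the field names and flag values in the sample are assumptions chosen to mirror the examples above.

```python
def commodity_location_pairs(sentences):
    """Pair every flagged commodity with every location in the same sentence."""
    pairs = []
    for sent in sentences:
        commodities = [
            w["text"] for w in sent["words"]
            if w.get("commlex") == "true" and w.get("wn") == "true"
        ]
        for commodity in commodities:
            for location in sent["locations"]:
                pairs.append((commodity, location))
    return pairs

sentences = [
    {"words": [{"text": "cotton", "commlex": "true", "wn": "true"},
               {"text": "year", "commlex": "false", "wn": "false"}],
     "locations": ["United Kingdom", "United States"]},
    {"words": [{"text": "whisky", "commlex": "true", "wn": "true"}],
     "locations": ["Leeds"]},
]
pairs = commodity_location_pairs(sentences)
```

The whisky and Leeds pair is returned alongside the cotton pairs, which is exactly the kind of false positive shown in the "Bad" example: co-occurrence alone says nothing about trade.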



Relation Extraction
‣ Need to improve the relation extraction.
‣ Will look at pattern-based relation extraction
exploiting vocabulary like "import", "export",
"ship", "shipment", "trade", "manufacture",
"grow" etc.
‣ Will annotate a small test corpus for
evaluation.
‣ Need to distinguish between irrelevant or
false commodity-location relations and
commodity-location relations referring to
trade.
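
A first approximation of such a pattern-based filter is sketched below. It is only an illustration of the idea, not the planned method: the regular expression covers a handful of inflections of the listed words (and misses forms such as "trading"), and the function name is hypothetical.

```python
import re

# Trade vocabulary with a few simple inflections (imported, shipped, grows ...).
TRADE_TERMS = re.compile(
    r"\b(?:import|export|shipment|ship|trade|manufacture|grow)"
    r"(?:s|d|ed|ing|ped|ping|n)?\b",
    re.IGNORECASE,
)

def is_trade_relation(sentence, commodity, location):
    """Accept a commodity-location pair only if a trade term is present."""
    text = sentence.lower()
    return (
        commodity.lower() in text
        and location.lower() in text
        and TRADE_TERMS.search(sentence) is not None
    )

good = ("The quantity of raw cotton imported annually "
        "into the United Kingdom")
bad = ("The first-class refreshment room, Central Station, Leeds, has a "
       "notice that only five-year old whisky is sold there.")
```

With this filter, `is_trade_relation(good, "cotton", "United Kingdom")` is accepted and `is_trade_relation(bad, "whisky", "Leeds")` is rejected. A real system would still need the annotated test corpus to measure how often such patterns are right.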

Thank You
‣ Questions?


Example Input
‣ Different sources converted into common
XML format


Example Output

