1. Exploring Challenges in Mining Historical Text
Beatrice Alex, Claire Grover, Richard Tobin and Ewan Klein
Working with text: Tools, techniques and approaches for text mining
Edinburgh - 07/07/2012
2. Overview
‣ Project
‣ Data
‣ Preprocessing historical text
‣ Improvements to OCR
‣ Language identification
‣ Text mining tables
‣ Text-mining
‣ Improved commodity identification
‣ Ports-based geo-grounding
‣ Relation extraction
Working with text: Tools, techniques and approaches for text mining - Edinburgh - 07/07/2012
3. Project (01/2012-12/2014)
‣ Funded by Digging into Data (round 2)
‣ Partners
Ewan Klein, Claire Grover, Bea Alex (text mining)
Colin Coates, Jim Clifford (historical analysis)
James Reid (data integration)
Aaron Quigley, Uta Hinrichs (information
visualisation)
4. Trading Consequences
‣ What does archival text say about the
economic and environmental consequences of
global commodity trading during the
nineteenth century?
‣ Help historians to discover novel patterns and
explore new hypotheses.
‣ Example questions:
‣ What were the routes and volumes of international
trade in resource commodities 1850-1914?
‣ What were the local environmental consequences of
this demand for these resources?
5. Geolocating Cinchona
6. Trading Consequences
‣ Scope: global but with focus on Canadian
natural resource flows to test reliability and
efficacy of our methods
‣ Methods:
‣ Text mining and geo-parsing to transform the text
into structured data, e.g. relational database
‣ Query interface targeted at historians
‣ Information visualisation for interactive exploration
7. Historical Data
‣ Digitised sources from the 19th century
British Empire, currently processing
‣ Early Canadiana Online: 83,038 files
‣ JSTOR data: 1,000 XML files
‣ House of Commons Parliamentary Papers: 4,135
files
‣ Books: selected books on nineteenth century trade
‣ Further sources:
‣ ProQuest data
‣ Encyclopaedia Britannica, JSTOR Plants, Forestry
Journals?, The Botanist?
8. Processing Historical Data
‣ Challenges so far:
‣ Different formats
‣ Low-quality OCRed text
‣ Old/low-quality prints, quality of OCR
technology
‣ Historical English: historical word variants,
ſ (long s) characters mixed up with f by OCR
‣ Artefacts in original documents: headers/footers,
page numbers, notes in margins, end-of-line
hyphenation
‣ Text in different languages
‣ Information in tables
10. Improvements to OCR
‣ Normalisation and post-correction
‣ Fixed end-of-line hyphenation
‣ Remove all token-splitting hyphens using a
dictionary-based approach (dictionary is the system
dictionary + the text of the current document)
‣ Added f-to-s conversion
‣ Convert all false f characters to s using a corpus-
based approach (corpus is a collection of historical
documents from Project Gutenberg)
‣ Example: reduced number of words
unrecognised by spell checker from 61 to 21
-> approx. 66% improvement
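The two post-correction steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the small `VOCAB` set is a hypothetical stand-in for the system dictionary, the document text and the Project Gutenberg corpus, and the function names `dehyphenate` and `f_to_s` are invented for the example.

```python
import re
from itertools import product

# Hypothetical stand-in vocabulary (assumption): the slides use a system
# dictionary plus the current document, and a Project Gutenberg corpus.
VOCAB = {"commodity", "colonists", "first", "most", "fish", "trade"}


def dehyphenate(text, vocab):
    """Join words split by an end-of-line hyphen when the joined form is known."""
    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        # Unknown joined form: keep the hyphen, only drop the line break.
        return joined if joined.lower() in vocab else left + "-" + right
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)


def f_to_s(token, vocab):
    """Replace false 'f' characters with 's' when that yields a known word."""
    if token.lower() in vocab or "f" not in token:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of f -> s substitutions (tokens contain few f's).
    for choice in product([False, True], repeat=len(positions)):
        chars = list(token)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate.lower() in vocab:
            return candidate
    return token


print(dehyphenate("com-\nmodity trade", VOCAB))
print(f_to_s("colonifts", VOCAB))
```

Words that are already in the vocabulary (such as "fish") are left untouched, so genuine f characters are not converted.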
11. Improvements to OCR
12. Improvements to OCR
13. Improvements to OCR
‣ Extensive evaluation of both tools against
human corrected/normalised gold standard
‣ Reduced word error rate by 12.5% in a random
Canadiana sample (word acc: 0.776 -> 0.804)
‣ Improvements have an effect on later text
mining steps and would also be beneficial for
searching text in any IR system (e.g. JSTOR
database search for “French colonifts”)
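The 12.5% figure is a relative reduction in word error rate, which follows from the two word accuracies on the slide. A short check, assuming word error rate is taken as 1 minus word accuracy (the function name is invented for the example):

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative drop in error rate, with error rate taken as 1 - accuracy."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


# Word accuracies reported for the random Canadiana sample.
reduction = relative_error_reduction(0.776, 0.804)
print(f"{reduction:.1%}")  # prints 12.5%
```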
14. Language Identification
‣ Most sources do not contain language
information like Canadiana does
‣ The table displays the number of text
elements in Canadiana per language, ignoring
notes and titles

ISO Code   Language          Frequency
eng        English           2,677,498
fra        French            1,208,811
deu        German                2,886
chn        Chinook jargon        2,488
moh        Mohawk                1,547
oji        Ojibwa                1,395
emg        Eastern Meohang         835
enb        Markweeta               666
cre        Cree                    501
iro        Iroquoian               324
alg        Algonquian              210
nge        Ngemba                  157
nld        Dutch                   131
lat        Latin                   119
mic        Micmac                   61
gla        Scottish Gaelic          22
15. Language Identification
‣ Make use of automatic language
identification using TextCat, especially for the
JSTOR data which is also multi-lingual.
‣ LID is done for each paragraph; the language
of the entire document is the most frequent
paragraph-level language tag.
‣ Can limit processing to English (and French)
documents only.
‣ 740 English documents (out of 1,000)
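The document-level decision described above amounts to a majority vote over paragraph tags. A minimal sketch, assuming the paragraph tags have already been produced by a language identifier such as TextCat and are given as ISO codes; the function names are invented for the example:

```python
from collections import Counter


def document_language(paragraph_tags):
    """Return the most frequent paragraph-level tag, or None for no paragraphs."""
    if not paragraph_tags:
        return None
    # On a tie, Counter.most_common keeps the tag that was seen first.
    return Counter(paragraph_tags).most_common(1)[0][0]


def keep_for_processing(paragraph_tags, allowed=("eng", "fra")):
    """Limit later processing to documents in the allowed languages."""
    return document_language(paragraph_tags) in allowed


tags = ["eng", "fra", "eng", "eng", "deu"]
print(document_language(tags), keep_for_processing(tags))
```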
16. Text Mining Tables
17. Text Mining Tables
‣ Tables contain a lot of relevant information
but are difficult to mine.
‣ HCPP documents contain coordinates for
each table entry.
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
...
<w p="961,1892,1087,1921" v="n">Culcutta</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1353,1791,1366,1799" v="o">-</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
<w p="1704,1791,1718,1799" v="o">-</w>
‣ Planning to do a feasibility study for a table
mining algorithm.
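One possible starting point for such a study is to use the coordinates to regroup words into table rows. The sketch below is an assumption about how this could be done, not the planned algorithm: it reads the p attribute as "x1,y1,x2,y2", groups words whose vertical midpoints lie within a hypothetical `tolerance`, and ignores the v attribute.

```python
import xml.etree.ElementTree as ET

# A few <w> elements copied from the HCPP example above.
SAMPLE = """
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
"""


def group_rows(xml_fragment, tolerance=15):
    """Group words into rows by vertical midpoint, ordered left to right."""
    root = ET.fromstring(f"<page>{xml_fragment}</page>")
    words = []
    for w in root.iter("w"):
        x1, y1, x2, y2 = (int(v) for v in w.get("p").split(","))
        words.append(((y1 + y2) / 2, x1, w.text))
    words.sort()  # top to bottom

    rows, anchor = [], None
    for y_mid, x1, text in words:
        # Start a new row when the word sits clearly below the current row.
        if anchor is None or y_mid - anchor > tolerance:
            rows.append([])
            anchor = y_mid
        rows[-1].append((x1, text))
    return [[text for _, text in sorted(row)] for row in rows]


for row in group_rows(SAMPLE):
    print(row)
```

Column detection (for example, clustering the x coordinates) would be a separate step and is not attempted here.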
18. Text Mining Pipeline
‣ Steps after the OCR improvements and LID:
‣ Tokenisation
‣ Part-of-speech tagging
‣ Lemmatisation
‣ WordNet lookup to find commodities
‣ Named-entity recognition including commodity
lexicon lookup
‣ Port-based Geo-grounding
‣ Chunking
20. Commodities Identification
‣ WordNet lookup using an approximation of
commodity named entities:
‣ Noun phrases with hypernyms such as substance,
physical matter, plant or animal in WordNet.
‣ Each NP which leads to a match is assigned a
wn="true" attribute.
‣ Commodities gazetteer lookup using a list of
commodities derived by historians.
‣ Strings matching the entries in the gazetteer are
assigned a commlex="true" attribute.
‣ Words/phrases with wn="true" and
commlex="true" are good candidates.
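The way the two attributes combine can be sketched as below. The `HYPERNYMS` table is a tiny hypothetical stand-in for a real WordNet hypernym lookup, and `GAZETTEER` is an invented three-entry stand-in for the historians' commodity list; only the combination logic reflects the slide.

```python
# Hypothetical stand-ins (assumptions): a real system would query WordNet
# and load the commodity list compiled by the historians.
HYPERNYMS = {
    "cotton": {"plant", "substance"},
    "cinchona": {"plant"},
    "whisky": {"substance"},
    "station": {"facility"},
}
COMMODITY_HYPERNYMS = {"substance", "physical matter", "plant", "animal"}
GAZETTEER = {"cotton", "cinchona", "timber"}


def annotate(phrase):
    """Assign the wn and commlex flags to a noun phrase."""
    key = phrase.lower()
    wn = bool(HYPERNYMS.get(key, set()) & COMMODITY_HYPERNYMS)
    commlex = key in GAZETTEER
    return {"text": phrase, "wn": wn, "commlex": commlex}


def is_good_candidate(annotation):
    """A phrase is a good candidate when both lookups agree."""
    return annotation["wn"] and annotation["commlex"]


for phrase in ["Cotton", "whisky", "station"]:
    print(phrase, is_good_candidate(annotate(phrase)))
```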
21. Ports-based Geo-grounding
‣ Started with non-optimised geo-resolution.
‣ Incorporated the list of ports.
Locations are assigned an is_port="1" or an
is_port="0" attribute.
‣ Grounding now ignores non-port candidates in case
of ambiguous location mentions.
‣ is_port locations are also given a higher weight in
the scoring.
‣ Hypothesis: ports are more likely to be
significant locations in historic documents
about trade.
‣ Not tested yet, as gold standard data is needed.
22. Ports-based Geo-grounding
‣ Example:
Dalhousie is in the list of ports as:
DALHOUSIE -66.4 48.1
Geo-grounding in non-optimised resolver:
<ent id="rb3" type="location" lat="32.5333300" long="75.9833300" in-country="IN"
gazref="geonames:1273648" feat-type="ppl" pop-size="7601">
<parts>
<part ew="w136" sw="w136">Dalhousie</part>
</parts>
</ent>
Geo-grounding in ports-dependent resolver:
<ent id="rb2" type="location" lat="48.0550200" long="-66.3847200" in-country="CA"
gazref="geonames:6943599" feat-type="ppl" pop-size="0">
<parts>
<part ew="w97" sw="w97">Dalhousie</part>
</parts>
</ent>
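The effect of the ports list on this example can be reproduced with a small ranking sketch. The population-based baseline score and the `port_bonus` value are assumptions made for illustration; the actual resolver's scoring is not described on the slides. Only the two rules (ignore non-port candidates for ambiguous mentions, weight ports higher) come from the deck.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    lat: float
    lon: float
    country: str
    population: int
    is_port: bool


# The two gazetteer candidates from the Dalhousie example above.
DALHOUSIE = [
    Candidate("Dalhousie", 32.53333, 75.98333, "IN", 7601, False),
    Candidate("Dalhousie", 48.05502, -66.38472, "CA", 0, True),
]


def resolve(candidates, use_ports=True, port_bonus=5.0):
    """Pick one candidate; with use_ports, prefer and up-weight ports."""
    pool = list(candidates)
    # Ambiguous mention with at least one port: ignore non-port candidates.
    if use_ports and len(pool) > 1 and any(c.is_port for c in pool):
        pool = [c for c in pool if c.is_port]

    def score(c):
        # Assumed baseline: more populous places score higher.
        bonus = port_bonus if (use_ports and c.is_port) else 0.0
        return math.log1p(c.population) + bonus

    return max(pool, key=score, default=None)


print(resolve(DALHOUSIE, use_ports=False).country)
print(resolve(DALHOUSIE).country)
```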
23. Ports-based Geo-grounding
‣ Geo-grounding assumes that each text is a
coherent whole. All locations contribute to
the resolution of all others. May have to
change that.
‣ Segmentation (e.g. of books) into smaller
units might improve the resolution.
‣ Need to consider old spellings of place
names.
24. Relation Extraction
‣ Crude way to identify commodity-location
relations:
‣ Sentences (s) containing a word (w) with both
commlex="true" and wn="true", as well as a location.
Good: The quantity of raw cotton imported annually into the United Kingdom—take for
example, the year 1854—amounted to, at least, 887,335,9041bs., of which the United States
supplied 722,154,101 lbs.
Of interest: Another kind of quinine-yieldmg bark has been discovered on the western side of the
Cordillera, which produces more sulphate than the common cinchona; and as the cinchona
grows on both sides of the Cordillera, it may be inferred that the new plant will be found also in
the lands of Gualaquiza and Canelos.
Bad: The first-class refreshment room, Central Station, Leeds, has a notice that only five-year old
whisky is sold there. OR
This paper was concealed in the handle of a spear, carried from Omdurman to Gedarif in that
way.
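The crude co-occurrence rule can be sketched as follows. The dictionary layout for a sentence is a hypothetical simplification of the XML annotations, and the sentence content is abridged from the "Good" example above.

```python
def candidate_relations(sentence):
    """Pair every flagged commodity with every location in the same sentence."""
    commodities = [
        w["text"] for w in sentence["words"]
        if w.get("commlex") and w.get("wn")
    ]
    return [(c, loc) for c in commodities for loc in sentence["locations"]]


# Hypothetical, simplified view of an annotated sentence (assumption).
good = {
    "words": [
        {"text": "quantity"},
        {"text": "cotton", "commlex": True, "wn": True},
        {"text": "imported"},
    ],
    "locations": ["United Kingdom", "United States"],
}

print(candidate_relations(good))
```

Because the rule only checks co-occurrence, it produces a pair for the "Bad" sentences as well, which is the weakness the examples illustrate.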
25. Relation Extraction
26. Relation Extraction
‣ Need to improve the relation extraction.
‣ Will look at pattern-based relation extraction
exploiting vocabulary like "import", "export",
"ship", "shipment", "trade", "manufacture",
"grow" etc.
‣ Will annotate a small test corpus for
evaluation.
‣ Need to distinguish between irrelevant or
false commodity-location relations and
commodity-location relations referring to
trade.
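A first approximation of the planned vocabulary-based filter might look like the sketch below. The regular expression and its suffix list are assumptions chosen to cover inflected forms of the words named above; real patterns would also need to constrain where the commodity and location occur relative to the trigger word.

```python
import re

# Stems for the trade vocabulary named above, with a few common suffixes
# (assumption: a hand-written pattern, not the project's final rules).
TRADE_PATTERN = re.compile(
    r"\b(?:import|export|shipp?|trad|manufactur|grow)"
    r"(?:e|es|ed|ing|s|n|ment|ments)?\b",
    re.IGNORECASE,
)


def mentions_trade(sentence):
    """True when the sentence contains a trade-related trigger word."""
    return TRADE_PATTERN.search(sentence) is not None


examples = [
    "The quantity of raw cotton imported annually into the United Kingdom",
    "the cinchona grows on both sides of the Cordillera",
    "This paper was concealed in the handle of a spear, carried from Omdurman to Gedarif",
]
for text in examples:
    print(mentions_trade(text), text[:40])
```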
27. Thank You
‣ Questions?
28. Example Input
‣ Different sources converted into common
XML format