Edin pelagios
1. Institute for Language,
Cognition and Computation
The Edinburgh Geoparser
and Chalice
Claire Grover
Kate Byrne, Richard Tobin, Jo Walsh
www.inf.ed.ac.uk
2. Overview of the Edinburgh Geoparser
• System to automatically recognise place names in text and
disambiguate them with respect to a gazetteer (e.g. Athens, Springfield).
• Patchy development over the past few years, funded by a variety of
projects and applied to a range of data sets:
– GeoCrossWalk
– BOPCRIS
– GeoDigRef (Histpop, BOPCRIS, BL)
– Embedding GeoCrossWalk (Stormont Papers)
– SYNC3 (online news)
– Chalice (EPNS)
– Unlock
• Main concern has been to keep it generally usable while applying it to
specific data sets.
3. Overview of the Edinburgh Geoparser
[Pipeline diagram]
• Geotagging: Format conversion → Tokenisation → POS tagging →
Lemmatisation → Named Entity Recognition → .geotagged.xml
• Georesolution: .geotagged.xml → Gazetteer lookup → Resolution
• Other file types shown: .txt, .html and .xml at the input end of geotagging;
.gaz.xml in the georesolution stage
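The two stages named on this slide, geotagging and georesolution, can be illustrated with a toy sketch. It is not the Edinburgh Geoparser's implementation: the gazetteer entries and their values, the function names `geotag` and `georesolve`, the plain dictionary match standing in for named entity recognition, and the "most populous candidate wins" ranking are all assumptions made for illustration.

```python
import re

# Toy gazetteer (hypothetical entries, illustrative values): name -> candidates.
GAZETTEER = {
    "Athens": [
        {"country": "GR", "lat": 37.98, "lon": 23.73, "population": 600000},
        {"country": "US", "lat": 33.96, "lon": -83.38, "population": 100000},
    ],
    "Springfield": [
        {"country": "US", "lat": 39.80, "lon": -89.64, "population": 110000},
        {"country": "US", "lat": 37.22, "lon": -93.30, "population": 160000},
    ],
}

def geotag(text):
    """Geotagging stand-in: return tokens that match a gazetteer name.

    A real system would use tokenisation, POS tagging and NER here.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t in GAZETTEER]

def georesolve(names):
    """Georesolution stand-in: gazetteer lookup, then pick one candidate."""
    resolved = {}
    for name in names:
        candidates = GAZETTEER[name]  # gazetteer lookup
        # Resolution: assumed heuristic, prefer the most populous candidate.
        resolved[name] = max(candidates, key=lambda c: c["population"])
    return resolved

places = georesolve(geotag("The delegation flew from Athens to Springfield."))
for name, place in places.items():
    print(name, place["country"], place["lat"], place["lon"])
```

Keeping the two stages as separate functions mirrors the hand-off through an intermediate file shown in the diagram: either stage can be replaced without touching the other.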
8. Evaluation (2009)
SpatialML (gold geotagging)        GeoNames   Unlock
No. of place names                 3628       3628
No. for which gaz entries found    3538       3049
Correct within 5km                 2946       2143
As % of total                      81.2%      59.0%

SpatialML (end-to-end)             GeoNames
No. of place names                 3628
No. for which gaz entries found    2923
Correct within 5km                 2504
As % of total                      69.0%
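The "Correct within 5km" measure can be sketched as follows. The slide does not say how distance was computed, so the haversine great-circle formula, the function names and the sample coordinate pairs are assumptions; only the final line uses figures from the slide (GeoNames, end-to-end).

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def accuracy_within(pairs, threshold_km=5.0):
    """Share of place names resolved to within threshold_km of the gold point.

    pairs: list of (gold, predicted) coordinate tuples; predicted is None
    when no gazetteer entry was found, which counts as incorrect.
    """
    correct = sum(
        1
        for gold, pred in pairs
        if pred is not None and haversine_km(*gold, *pred) <= threshold_km
    )
    return correct / len(pairs)

# Hypothetical sample, not taken from SpatialML.
pairs = [
    ((55.9533, -3.1883), (55.9500, -3.1900)),  # under 1 km apart: correct
    ((55.9533, -3.1883), (55.8642, -4.2518)),  # tens of km apart: incorrect
    ((51.5074, -0.1278), None),                # no gazetteer entry: incorrect
]
print(f"{accuracy_within(pairs):.1%}")

# "As % of total" for the GeoNames end-to-end run on the slide.
print(f"{2504 / 3628:.1%}")
```

The denominator is the total number of place names, not the number for which gazetteer entries were found, so missing entries lower the score.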
9. Current Development Issues
• Open source release
• Increased configurability
– Input formats: plain text, HTML, simple XML, ...
– User’s own text analysis: paragraphs, sentences, word tokens,
place name mark-up
– Output formats: map visualisation, text mark-up, …
– User input: constrain by area, bounding box, …
• Choice of gazetteer: GeoNames, Unlock, geonames-local, Pleiades+,
Chalice historical gazetteer, ...
• Performance monitoring/evaluation against test sets
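As an illustration of the "constrain by area, bounding box" option, the sketch below prunes gazetteer candidates to a user-supplied box before resolution. The `BoundingBox` type, the `constrain` function, the fallback to the unfiltered list and the sample coordinates are hypothetical; the slide does not describe how the Geoparser applies such constraints.

```python
from typing import NamedTuple

class BoundingBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float

    def contains(self, lat, lon):
        # Simple containment test; does not handle boxes crossing the antimeridian.
        return self.south <= lat <= self.north and self.west <= lon <= self.east

def constrain(candidates, box):
    """Keep candidates inside the box; fall back to all of them if none match."""
    inside = [c for c in candidates if box.contains(c["lat"], c["lon"])]
    return inside or candidates

# Two places sharing a name (approximate, illustrative coordinates).
candidates = [
    {"name": "Perth", "lat": 56.40, "lon": -3.43},
    {"name": "Perth", "lat": -31.95, "lon": 115.86},
]
british_isles = BoundingBox(west=-11.0, south=49.5, east=2.0, north=61.0)
print(constrain(candidates, british_isles))
```

Falling back to the full candidate list treats the box as a preference rather than a hard filter; a hard filter would instead return an empty list.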
10. GAP project: Pleiades+
• Based on Pleiades set of ancient place names but extended in two ways:
• by matching Pleiades place names against GeoNames place names in the
same location and adding the GeoNames alternative names to the Pleiades+
list:
– adds three alternative names for the single Pleiades entry for
Autricum (Chartrez, Chartres, Shartr), because “Autricum” is present
in both Pleiades and GeoNames, with the same approximate location
• at run-time, looking up place names found in the text against GeoNames (as
well as against Pleiades+) and then using the alternative names from GeoNames
to match against the Pleiades+ list
– Pleiades has no entry for “Egypt”. We look up the name in GeoNames and
use its alternative names (which include Aegyptus) to match back against
Pleiades (which does include Aegyptus). (We don't want to simply take
places directly from GeoNames because, when we tried it, we were
swamped with irrelevant modern places having names corresponding to
ancient toponyms.)
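The two extension steps can be sketched with toy data. Everything below is an assumption made for illustration: the coordinates, the 0.5-degree tolerance used for "same approximate location", and the function names `build_pleiades_plus` and `lookup`. The toy coordinates are deliberately chosen so that the build-time location check does not link Aegyptus, which makes the run-time path visible; the slide does not say why Egypt is only reachable at run-time.

```python
# Illustrative entries only: Pleiades name -> (lat, lon).
PLEIADES = {
    "Autricum": (48.45, 1.48),
    "Aegyptus": (30.0, 31.2),
}
# GeoNames-like records: (set of alternative names, (lat, lon)).
GEONAMES = [
    ({"Autricum", "Chartrez", "Chartres", "Shartr"}, (48.45, 1.49)),
    ({"Egypt", "Aegyptus"}, (27.0, 30.0)),
]

def near(a, b, tol_deg=0.5):
    """Crude 'same approximate location' test (assumed tolerance)."""
    return abs(a[0] - b[0]) <= tol_deg and abs(a[1] - b[1]) <= tol_deg

def build_pleiades_plus(pleiades, geonames):
    """Build time: add GeoNames alternative names for co-located name matches."""
    plus = {name: name for name in pleiades}  # any name -> Pleiades name
    for names, loc in geonames:
        for name in names:
            if name in pleiades and near(pleiades[name], loc):
                for alt in names:
                    plus.setdefault(alt, name)
    return plus

def lookup(name, plus, geonames):
    """Run time: try Pleiades+ directly, else go via GeoNames alternative names."""
    if name in plus:
        return plus[name]
    for names, _ in geonames:
        if name in names:
            for alt in sorted(names):
                if alt in plus:
                    return plus[alt]
    return None  # modern place with no ancient counterpart: ignored

plus = build_pleiades_plus(PLEIADES, GEONAMES)
print(lookup("Chartres", plus, GEONAMES))
print(lookup("Egypt", plus, GEONAMES))
print(lookup("Springfield", plus, GEONAMES))
```

Because every result is a Pleiades name, GeoNames only ever acts as a bridge, which is consistent with the slide's point about not taking places directly from GeoNames.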
11. Chalice
• Connecting Historical Authorities with Linked Data, Contexts, and Entities.
• Funded under the JISC jiscEXPO programme on exposing digital content
for education and research.
• The project is exploring the viability of creating a historical gazetteer from
digitized volumes from the English Place-Name Society (EPNS).
• Partners:
– CDDA, Queen’s University, Belfast
– School of Informatics, Edinburgh
– EDINA, Edinburgh
– CeRch, King’s College London
• Informatics role is to adapt our existing text mining/geoparsing technology
to convert the textual documents that are output from OCR into structured
data.
12. Chalice data
• Cheshire
– Cheshire Part I. EPNS Volume 44, 1970
– Cheshire Part II. EPNS Volume 45, 1970
– Cheshire Part III. EPNS Volume 46, 1971
– Cheshire Part IV. EPNS Volume 47, 1972
– Cheshire Part V (1:i). EPNS Volume 48, 1981
– Cheshire Part V (1:ii). EPNS Volume 54, 1981
• Small samples from:
– Berkshire, Buckinghamshire (Vol. 2), Cambridgeshire (Vol. 19),
Derbyshire (Vols 27-29), Hertfordshire (Vol. 15)
• Shropshire: Pimhill Hundred (born digital)
13. EPNS
• Parishes are usually organised in terms of the hundreds to which they belong.
• Towns and villages are usually referred to as townships and are organised in
terms of the parish to which they belong.
• Township descriptions often contain relatively unstructured information about
smaller associated places such as buildings, bridges, lanes, woods and
farms.
• Township descriptions also frequently contain separately marked sections of
information about field names and street names.
• River names and major road names are described separately from
the inhabited place descriptions.
• Place names are the primary object of interest and descriptions of them
contain information about alternative names and spellings that have been
attested in historical sources and the etymology of names or name parts.
• In Chalice we focus on capturing parishes, townships, sub-townships and
attestations. We don’t deal with hundreds, field names, street names, rivers,
roads etc.
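The structure Chalice aims to capture (parishes containing townships, townships containing smaller places, each with attested historical forms) could be modelled along these lines. This is a minimal sketch, not the project's schema: the class and field names are hypothetical, and the place names, spelling, date and source abbreviation in the sample are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Attestation:
    form: str         # historical spelling as printed
    date: str         # date or date range, kept as text
    source: str = ""  # abbreviation of the historical source, if given

@dataclass
class Place:
    name: str
    kind: str  # "parish", "township" or "sub-township"
    attestations: List[Attestation] = field(default_factory=list)
    children: List["Place"] = field(default_factory=list)

def walk(place, depth=0):
    """Yield (depth, place) for a place and everything nested under it."""
    yield depth, place
    for child in place.children:
        yield from walk(child, depth + 1)

# Placeholder data: every name and form below is invented for illustration.
parish = Place("Exampleton", "parish", children=[
    Place(
        "Exampleton",
        "township",
        attestations=[Attestation("Exampletune", "1300", "SRC")],
        children=[Place("Example Bridge", "sub-township")],
    ),
])

for depth, place in walk(parish):
    count = len(place.attestations)
    print("  " * depth + f"{place.kind}: {place.name} ({count} attestations)")
```

A parish and a township sharing one name are kept as separate records here, so attestations can be attached to whichever level the volume assigns them to.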
22. Issues
• OCR quality needs to be high: not just recognising characters correctly but
getting font and layout information right. Failure to recognise bold and small
caps fonts or the difference between a line break and a paragraph break can
lead to major errors in the recognition process.
• EPNS volumes vary in the use of layout and font to indicate structure, e.g.
Cheshire parishes are signaled by centering combined with numbering in
Roman numerals, while Hertfordshire ones are unnumbered but centered and in
bold font. In some volumes potentially useful information is contained in
footnotes.
• Different volumes reflect different decisions about where place name information
should be put. In most cases the information about the parish name occurs next
to the town in the parish that has the same name. In the Shropshire text some
place name information occurs in an earlier volume and is not subsequently
repeated, e.g. the description of the parish of Baschurch, containing a township
of the same name, has no attestation or etymological information provided
because the name was discussed in Part 1.