Institute for Language, Cognition and Computation
www.inf.ed.ac.uk

The Edinburgh Geoparser and Chalice

Claire Grover
Kate Byrne, Richard Tobin, Jo Walsh

Overview of the Edinburgh Geoparser
                                  	

•  System to automatically recognise place names in text and
   disambiguate them with respect to a gazetteer (e.g. ambiguous names
   such as Athens or Springfield).
•  Intermittent development over the past few years, funded by a variety
   of projects and applied to a range of data sets:
   –  GeoCrossWalk
   –  BOPCRIS
   –  GeoDigRef (Histpop, BOPCRIS, BL)
   –  Embedding GeoCrossWalk (Stormont Papers)
   –  SYNC3 (online news)
   –  Chalice (EPNS)
   –  Unlock
•  Main concern has been to keep it generally usable while applying it to
   specific data sets.

Overview of the Edinburgh Geoparser

[Pipeline diagram]

Geotagging: .txt / .html / .xml → format conversion → tokenisation →
POS tagging → lemmatisation → named entity recognition → .geotagged.xml

Georesolution: .geotagged.xml → gazetteer lookup → resolution → .gaz.xml

                      Evaluation (2009)
                                      	

SpatialML (gold geotagging)        GeoNames       Unlock
No. of place names                 3628           3628
No. for which gaz entries found    3538           3049
Correct within 5km                 2946           2143
As % of total                      81.2%          59.0%

SpatialML (end-to-end)             GeoNames
No. of place names                 3628
No. for which gaz entries found    2923
Correct within 5km                 2504
As % of total                      69.0%
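
The "correct within 5km" measure above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say how distance was computed, so the haversine great-circle formula, the spherical Earth radius, and the function names haversine_km and score are all choices made here, not the evaluation code used for the 2009 figures.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def score(predictions, gold, threshold_km=5.0):
    """Count gazetteer hits and hits within threshold_km of the gold location.

    predictions: list of (lat, lon) tuples, or None where no gazetteer entry was found.
    gold:        list of (lat, lon) tuples, one per place name.
    """
    total = len(gold)
    found = sum(p is not None for p in predictions)
    correct = sum(
        1
        for p, g in zip(predictions, gold)
        if p is not None and haversine_km(p[0], p[1], g[0], g[1]) <= threshold_km
    )
    return {
        "total": total,
        "found": found,
        "correct": correct,
        "pct_of_total": round(100.0 * correct / total, 1),
    }
```

The percentage is taken over all place names rather than only those with a gazetteer entry, matching the "As % of total" row (for example 2946 of 3628 for GeoNames with gold geotagging).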

           Current Development Issues
                                    	

•  Open source release
•  Increased configurability
    –  Input formats: plain text, HTML, simple XML, ...
    –  User’s own text analysis: paragraphs, sentences, word tokens,
       place name mark-up
    –  Output formats: map visualisation, text mark-up, …
    –  User input: constrain by area, bounding box, …
•  Choice of gazetteer: GeoNames, Unlock, geonames-local, Pleiades+,
   Chalice historical gazetteer, ...
•  Performance monitoring/evaluation against test sets
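
As an illustration of the "constrain by bounding box" option, the sketch below filters gazetteer candidates for an ambiguous name by a user-supplied box. The Candidate and BoundingBox types and the constrain function are hypothetical names introduced here, not part of the Geoparser, and the coordinates are rounded approximations used only for the example.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One gazetteer candidate for a place name (hypothetical structure)."""
    name: str
    lat: float
    lon: float


class BoundingBox(NamedTuple):
    """Box in decimal degrees; does not handle boxes crossing the antimeridian."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lat: float, lon: float) -> bool:
        return self.south <= lat <= self.north and self.west <= lon <= self.east


def constrain(candidates: List[Candidate], box: BoundingBox) -> List[Candidate]:
    """Keep only the candidates whose coordinates fall inside the box."""
    return [c for c in candidates if box.contains(c.lat, c.lon)]


# Two candidates for the ambiguous name "Athens" (approximate coordinates).
candidates = [
    Candidate("Athens (Greece)", 37.98, 23.73),
    Candidate("Athens (Georgia, USA)", 33.96, -83.38),
]
greece_box = BoundingBox(west=19.0, south=34.0, east=29.0, north=42.0)
print([c.name for c in constrain(candidates, greece_box)])
```

This treats the box as a hard filter. Treating it as a preference that only re-ranks candidates would be an equally plausible reading of the slide.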

                  GAP project: Pleiades+	

•  Based on the Pleiades set of ancient place names but extended in two ways:
•  By matching Pleiades place names against GeoNames place names in the
   same location and adding the GeoNames alternative names to the Pleiades+
   list:
   –  This adds three alternative names to the single Pleiades entry for
      Autricum (Chartrez, Chartres, Shartr), because "Autricum" is present
      in both Pleiades and GeoNames, with approximately the same location.
•  At run-time, by looking up place names found in the text against GeoNames (as
   well as against Pleiades+) and then using the alternative names from GeoNames
   to match against the Pleiades+ list:
   –  Pleiades has no entry for "Egypt". We look up the name in GeoNames and
      use its alternative names (which include Aegyptus) to match back against
      Pleiades (which does include Aegyptus). (We don't want to simply take
      places directly from GeoNames because, when we tried it, we were
      swamped with irrelevant modern places having names corresponding to
      ancient toponyms.)
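
The two extensions can be sketched as follows. This is a toy illustration, not the GAP project code: the Place type, the functions build_pleiades_plus and lookup, the 0.5 degree tolerance for "same location", and all coordinates are assumptions made for the example. The Aegyptus and Egypt coordinates are deliberately placed far apart so that the name is only reachable through the run-time step.

```python
from dataclasses import dataclass


@dataclass
class Place:
    """A gazetteer entry (hypothetical structure for this sketch)."""
    name: str
    lat: float
    lon: float
    alt_names: frozenset = frozenset()

    def all_names(self):
        return {self.name} | set(self.alt_names)


def close(a, b, tol_deg=0.5):
    """Crude 'same approximate location' test; the tolerance is an assumption."""
    return abs(a.lat - b.lat) <= tol_deg and abs(a.lon - b.lon) <= tol_deg


def build_pleiades_plus(pleiades, geonames):
    """Extension 1: copy GeoNames alternative names onto co-located matches."""
    plus = []
    for p in pleiades:
        extra = set()
        for g in geonames:
            if p.all_names() & g.all_names() and close(p, g):
                extra |= g.all_names()
        alts = frozenset((set(p.alt_names) | extra) - {p.name})
        plus.append(Place(p.name, p.lat, p.lon, alts))
    return plus


def lookup(name, pleiades_plus, geonames):
    """Extension 2: try Pleiades+ directly, else go via GeoNames alternative names.

    Only Pleiades+ entries are ever returned, never GeoNames places themselves.
    """
    def match(n):
        return [p for p in pleiades_plus if n in p.all_names()]

    hits = match(name)
    if hits:
        return hits
    for g in geonames:
        if name in g.all_names():
            for alt in sorted(g.all_names()):
                hits = match(alt)
                if hits:
                    return hits
    return []


# Toy data; coordinates are illustrative only.
PLEIADES = [Place("Autricum", 48.45, 1.49), Place("Aegyptus", 30.0, 31.0)]
GEONAMES = [
    Place("Chartres", 48.45, 1.48, frozenset({"Autricum", "Chartrez", "Shartr"})),
    Place("Egypt", 27.0, 30.0, frozenset({"Aegyptus"})),
]
PLUS = build_pleiades_plus(PLEIADES, GEONAMES)
print([p.name for p in lookup("Egypt", PLUS, GEONAMES)])
```

Returning only Pleiades+ entries reflects the point made on the slide: GeoNames supplies names to match with, but its modern places are not added as results.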

                                Chalice
                                      	

•  Connecting Historical Authorities with Linked Data, Contexts, and Entities.
•  Funded under the JISC jiscEXPO programme on exposing digital content
   for education and research.
•  The project is exploring the viability of creating a historical gazetteer from
   digitized volumes from the English Place-Name Society (EPNS).
•  Partners:
    –  CDDA, Queen’s University, Belfast
    –  School of Informatics, Edinburgh
    –  EDINA, Edinburgh
    –  CeRch, King’s College London
•  The role of Informatics is to adapt our existing text mining/geoparsing
   technology to convert the textual documents that are output from OCR into
   structured data.

                         Chalice data
                                    	

•  Cheshire
   –  Cheshire Part I. EPNS Volume 44, 1970
   –  Cheshire Part II. EPNS Volume 45, 1970
   –  Cheshire Part III. EPNS Volume 46, 1971
   –  Cheshire Part IV. EPNS Volume 47, 1972
   –  Cheshire Part V (1:i). EPNS Volume 48, 1981
   –  Cheshire Part V (1:ii). EPNS Volume 54, 1981
•  Small samples from:
   –  Berkshire, Buckinghamshire (Vol. 2), Cambridgeshire (Vol. 19),
      Derbyshire (Vols. 27-29), Hertfordshire (Vol. 15)
•  Shropshire: Pimhill Hundred (born digital)

                                 EPNS	

•  Parishes are usually organised in terms of the hundreds in which they belong.
•  Towns and villages are usually referred to as townships and are organised in
   terms of the parish in which they belong.
•  Township descriptions often contain relatively unstructured information about
   smaller associated places such as buildings, bridges, lanes, woods and
   farms.
•  Township descriptions also frequently contain separately marked sections of
   information about field names and street names.
•  River names and major road names are described separately from
   the inhabited place descriptions.
•  Place names are the primary object of interest, and their descriptions
   contain information about alternative names and spellings that have been
   attested in historical sources, and about the etymology of names or name parts.
•  In Chalice we focus on capturing parishes, townships, sub-townships and
   attestations. We don’t deal with hundreds, field names, street names, rivers,
   roads etc.
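
One way to picture the structured data that Chalice targets is a nested record per place, with its attestations attached. The sketch below is a guess at a minimal shape, not the project's actual schema: the class names, field names and the walk helper are all hypothetical. Hundreds are left out because the project does not capture them, and no attested spellings are filled in because none are given here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attestation:
    """A historical form of a name (fields are assumptions for illustration)."""
    form: str    # spelling as printed in the volume
    date: str    # date or date range as printed
    source: str  # abbreviation of the historical source


@dataclass
class PlaceEntry:
    name: str
    kind: str  # "parish", "township" or "sub-township"
    attestations: List[Attestation] = field(default_factory=list)
    children: List["PlaceEntry"] = field(default_factory=list)


def walk(entry, path=()):
    """Yield (path, entry) for an entry and every place nested below it."""
    here = path + (entry.name,)
    yield here, entry
    for child in entry.children:
        yield from walk(child, here)


# The township of Willaston within the parish of Neston.
willaston = PlaceEntry("Willaston", "township")
neston = PlaceEntry("Neston", "parish", children=[willaston])
paths = [" > ".join(p) for p, _ in walk(neston)]
print(paths)
```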

Figure: The start of the entry for the township of Willaston in the parish of
Neston in Wirral Hundred.

                                    Issues
                                         	

•  OCR quality needs to be high: not just recognising characters correctly but
   getting font and layout information right. Failure to recognise bold and small
   caps fonts or the difference between a line break and a paragraph break can
   lead to major errors in the recognition process.
•  EPNS volumes vary in the use of layout and font to indicate structure, e.g.
   Cheshire parishes are signaled by centering combined with roman
   numerals, while Hertfordshire ones are unnumbered but centered and in
   bold font. In some volumes potentially useful information is contained in
   footnotes.
•  Different volumes reflect different decisions about where place name information
   should be put. In most cases the information about the parish name is given in
   the entry for the town of the same name within that parish. In the Shropshire text some
   place name information occurs in an earlier volume and is not subsequently
   repeated, e.g. the description of the parish of Baschurch, containing a township
   of the same name, has no attestation or etymological information provided
   because the name was discussed in Part 1.

Thank you!
