Digitising Natural History
Marieke van Erp
marieke@cs.vu.nl




Why Digitise?

New technology offers many new possibilities

 •   improves collection management

 •   opens up new avenues of research

 •   digital collection access




Digitisation at Naturalis
•   goal is to have 7 million objects digitised by mid-2015
    (out of 37 million) + robust infrastructure for
    continuation of digitisation

•   3 million within Naturalis digitisation streets

•   4 million elsewhere

•   other 30 million objects will be digitised at less detailed
    level



• Leposoma Guianense, Sipaliwini, 4 km e. of
  airport, near base camp, forest ground,
  among leaves, 28-VIII-1968, 12.45 u. reg. nr.
  13879




But what you really want...


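========  ===========================================
Genus     Leposoma
Species   Guianense
Region    Sipaliwini
Location  4 km e. of airport
Biotope   near base camp, forest ground, among leaves
Date      28-08-1968
Time      12:45
Reg #     13879
========  ===========================================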
•   Leposoma Guianense, Sipaliwini, 4 km e. of airport,
    near base camp, forest ground, among leaves,
    28-VIII-1968, 12.45 u. reg. nr. 13879

•   ask a computer to learn to segment and classify
    text snippets
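As a rough illustration of what "learning to segment and classify" could look like, the sketch below splits the label text into segments and gives each segment the label of the most similar hand-labelled example (a one-nearest-neighbour lookup over a few yes/no features). The features, the training segments and all function names are assumptions made for this example; the slides do not say which features or learner were used.

```python
import re

ROMAN_DATE = re.compile(r"^\d{1,2}-[IVX]+-\d{4}$")   # e.g. 28-VIII-1968
CLOCK_TIME = re.compile(r"^\d{1,2}[.:]\d{2}")        # e.g. 12.45 u.

def segment(snippet):
    """Split on commas, and before 'reg. nr.', which has no comma in front."""
    parts = re.split(r",\s*|\s+(?=reg\. nr\.)", snippet)
    return [part.strip() for part in parts if part.strip()]

def features(text, position):
    """Describe a segment by a few yes/no properties (illustrative choice)."""
    return (
        position == 0,                  # first segment on the label
        bool(ROMAN_DATE.match(text)),   # looks like a date with a Roman month
        bool(CLOCK_TIME.match(text)),   # starts like a clock time
        text.startswith("reg."),        # registration number marker
        text[0].isdigit(),              # starts with a digit
        text[0].isupper(),              # starts with a capital letter
        len(text.split()) == 1,         # single token
    )

# Hypothetical hand-labelled segments standing in for the annotated snippets.
TRAINING = [
    ("Bufo marinus", 0, "taxon"),
    ("Paramaribo", 1, "region"),
    ("2 km n. of village", 2, "location"),
    ("under log", 3, "biotope"),
    ("03-XI-1971", 4, "date"),
    ("09.30 u.", 5, "time"),
    ("reg. nr. 20114", 6, "reg_nr"),
]
MEMORY = [(features(text, pos), label) for text, pos, label in TRAINING]

def classify(text, position):
    """Return the label of the stored example sharing the most feature values."""
    target = features(text, position)
    best = max(MEMORY, key=lambda ex: sum(a == b for a, b in zip(target, ex[0])))
    return best[1]

def label_snippet(snippet):
    """Segment a snippet and attach a predicted label to every segment."""
    return [(classify(seg, i), seg) for i, seg in enumerate(segment(snippet))]

SNIPPET = ("Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, "
           "forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879")
for label, seg in label_snippet(SNIPPET):
    print(f"{label:9} {seg}")
```

With more annotated snippets in memory, the same lookup would cover label layouts that the handful of examples here does not.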




• Manually annotate 500 text snippets (~3h)
• 300 for training
• 200 for testing
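A minimal sketch of the 300/200 split, assuming the 500 annotated snippets are shuffled once with a fixed seed and that quality is measured as the share of correctly labelled items on the held-out part. The function names and the placeholder snippets are hypothetical.

```python
import random

def split_annotated(snippets, n_train=300, seed=0):
    """Shuffle reproducibly, then cut into a training and a test part."""
    shuffled = list(snippets)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def accuracy(predicted, gold):
    """Share of positions where the predicted label equals the gold label."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

snippets = [f"snippet-{i}" for i in range(500)]   # placeholders for real labels
train, test = split_annotated(snippets)
print(len(train), len(test))
```

Keeping the test part untouched during development is what makes the reported accuracy an estimate for unseen records.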


• 49,688 new database records (547,528
  database cells) at ~84.57% accuracy




The Manually Created Reptiles and Amphibians Database
• 16,870 records describing characteristics and
  history of animal specimens in a natural
  history database
• 39 columns
• Dutch, English, German and Portuguese
• numeric and textual values (both atomic and
  elaborate)

column name      value
===============  ================================================
order            Anura
genus            Megophrys
country          Indonesia
biotope          in rain near road
collection date  01.02.1888
type             holotype
determinator     A. Dubois
defined by       (Linnaeus, 1758)
special remarks  in bad condition, was eaten by Leptodactylus
                 rugosus (3023) at night and thrown up again the
                 next morning when killed, partly digested
===============  ================================================
• a database provides structure
• computers are good at comparing values
• statistical methods can detect
  inconsistencies




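author          determinator    family       genus    country    preservation method
==============  ==============  ===========  =======  =========  ===================
(Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
(Schlegel)      G. vd. Boog     Colubridae   Geophis  Indonesia  -----
Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
(Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol
==============  ==============  ===========  =======  =========  ===================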




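actual value: Geophis

author          determinator    family       genus    country    preservation method
==============  ==============  ===========  =======  =========  ===================
(Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
(Schlegel)      G. vd. Boog     Colubridae   ?        Indonesia  -----
Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
(Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol
==============  ==============  ===========  =======  =========  ===================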




actual value: Geophis
predicted value: Rhabdophis

author          determinator    family       genus    country    preservation method
==============  ==============  ===========  =======  =========  ===================
(Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
(Schlegel)      G. vd. Boog     Colubridae   ?        Indonesia  -----
Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
(Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol
==============  ==============  ===========  =======  =========  ===================




• <100 cells to check for a column instead of
  16,870
• recall (estimate): 90-100%
• one-size-fits-all
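The idea of predicting a cell from the rest of its row and checking only the disagreements can be sketched as follows. This is an illustration under assumptions: the nearest-neighbour vote, the 0.8 agreement threshold, the function name and the toy rows are all made up for the example and are not taken from the actual database or system.

```python
from collections import Counter

def suspicious_cells(rows, target, min_share=0.8):
    """Flag rows whose target value disagrees with their most similar rows."""
    feature_cols = [col for col in rows[0] if col != target]
    flagged = []
    for i, row in enumerate(rows):
        others = [r for j, r in enumerate(rows) if j != i]

        def overlap(other):
            # Number of non-target columns on which the two rows agree.
            return sum(row[col] == other[col] for col in feature_cols)

        best = max(overlap(o) for o in others)
        neighbours = [o for o in others if overlap(o) == best]
        value, votes = Counter(o[target] for o in neighbours).most_common(1)[0]
        # Only report confident disagreements, so few cells need checking.
        if value != row[target] and votes / len(neighbours) >= min_share:
            flagged.append((i, row[target], value))
    return flagged

# Hypothetical toy table; row 3 has a genus that does not fit its family.
ROWS = [
    {"family": "Hylidae", "country": "Australia", "genus": "Litoria"},
    {"family": "Hylidae", "country": "Australia", "genus": "Litoria"},
    {"family": "Hylidae", "country": "Australia", "genus": "Litoria"},
    {"family": "Hylidae", "country": "Australia", "genus": "Bufo"},
    {"family": "Bufonidae", "country": "Suriname", "genus": "Bufo"},
    {"family": "Bufonidae", "country": "Suriname", "genus": "Bufo"},
]
for index, actual, predicted in suspicious_cells(ROWS, "genus"):
    print(f"row {index}: actual {actual}, predicted {predicted}")
```

Because nothing in the function is specific to one column, the same call can be repeated with any column as `target`, which is the one-size-fits-all property.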

• Data-driven cleaning cannot detect
  systematic errors

• Maybe systematics can help?


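subject              relation          object
===================  ================  ===============
specimen collection  occurs before     entry in museum
species              has broader term  genus
city                 falls within      country
===================  ================  ===============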


• detects inconsistencies in database usage
• small scope
• high recall and precision within scope
• needs adapting for each new domain
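Relations such as "occurs before", "has broader term" and "falls within" can be turned into simple checks on a record. The sketch below is one possible reading, assuming hypothetical field names and two tiny lookup tables in place of a real taxonomy and gazetteer.

```python
from datetime import date

# Hypothetical lookup tables; a real system would use curated resources.
CITY_IN_COUNTRY = {"Paramaribo": "Suriname", "Bogor": "Indonesia"}
SPECIES_IN_GENUS = {"Bufo marinus": "Bufo", "Leposoma guianense": "Leposoma"}

def check_record(record):
    """Return a list of rule violations found in one database record."""
    problems = []

    # specimen collection -- occurs before --> entry in museum
    collected = record.get("collection_date")
    entered = record.get("museum_entry_date")
    if collected and entered and collected > entered:
        problems.append("collection date is later than entry in museum")

    # species -- has broader term --> genus
    genus = SPECIES_IN_GENUS.get(record.get("species"))
    if genus and record.get("genus") and genus != record["genus"]:
        problems.append(f"species belongs to genus {genus}")

    # city -- falls within --> country
    country = CITY_IN_COUNTRY.get(record.get("city"))
    if country and record.get("country") and country != record["country"]:
        problems.append(f"city lies in {country}")

    return problems

record = {
    "species": "Bufo marinus", "genus": "Bufo",
    "city": "Paramaribo", "country": "Indonesia",
    "collection_date": date(1968, 8, 28),
    "museum_entry_date": date(1967, 1, 1),
}
print(check_record(record))
```

Each rule only fires when both fields are filled in and the lookup knows the value, which is why the scope stays small while precision within it can be high.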

Disambiguating Locations


Challenge                         Example
================================  ============================================
Ambiguous location name           Amsterdam
Two or more location descriptors  Wakarusa, 24mi WSW of Lawrence
Topological nesting               Moccassin Creek on Hog Island
Complex description               Bupo [?Buso] River, 15 miles [24km] E of Lae
Linear feature measurement        16km (by road) N of Murtoa
Linear ambiguity                  On the road between Sydney and Bathurst
Vague localities                  Southeast Michigan
Changed political borders         Yugoslavia
Historical place names            British North Borneo
================================  ============================================

• Randomly annotated geographical
  information in 200 database records

• 50 records for development, 150 for testing


Knowledge-driven Georeferencing
•   Record retrieval

•   Text parsing

•   Gazetteer lookup

•   Offset calculation

•   Disambiguation Heuristics
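For the offset calculation step, a description such as "24mi WSW of Lawrence" has to be turned into a point relative to the gazetteer coordinates of the named place. The sketch below uses the standard great-circle destination formula on a spherical Earth; the function name, the 16-point compass table and the approximate coordinates for Lawrence are assumptions for illustration, not details from the slides.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def offset(lat, lon, distance_km, direction):
    """Move distance_km from (lat, lon) along a 16-point compass direction."""
    bearing = math.radians(COMPASS.index(direction) * 22.5)
    delta = distance_km / EARTH_RADIUS_KM          # angular distance
    phi1, lam1 = math.radians(lat), math.radians(lon)
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(bearing))
    lam2 = lam1 + math.atan2(
        math.sin(bearing) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)

LAWRENCE = (38.97, -95.24)   # approximate coordinates, assumed for the example
lat, lon = offset(*LAWRENCE, 24 * KM_PER_MILE, "WSW")
print(round(lat, 2), round(lon, 2))
```

Descriptions measured along a road or river ("16km (by road) N of Murtoa") would need the route geometry rather than a straight-line offset, so this covers only the simplest case.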



Offset




Disambiguation Heuristics
•   Spatial Minimality

      •   if Amsterdam and Utrecht are mentioned in the same record,
          then Amsterdam, NL is more likely than Amsterdam, NY, USA

•   Expedition clusters

      •   It is unlikely that a collector was collecting in Europe on
          Monday and in the US on Tuesday

•   Species occurrence data

      •   GBIF can tell us where a certain species does or does not
          occur
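Spatial minimality can be sketched as choosing, for all place names in a record, the combination of candidate locations that lies closest together. The gazetteer entries, their approximate coordinates and the function names below are assumptions for illustration only.

```python
import math
from itertools import combinations, product

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# Hypothetical mini-gazetteer with approximate coordinates.
GAZETTEER = {
    "Amsterdam": {"Amsterdam, NL": (52.37, 4.90),
                  "Amsterdam, NY, USA": (42.94, -74.19)},
    "Utrecht": {"Utrecht, NL": (52.09, 5.12)},
}

def resolve(names):
    """Pick one candidate per name so that the chosen points are closest together."""
    options = [list(GAZETTEER[name].items()) for name in names]

    def spread(choice):
        # Sum of pairwise distances between the chosen candidates.
        return sum(haversine_km(a[1], b[1]) for a, b in combinations(choice, 2))

    best = min(product(*options), key=spread)
    return {name: label for name, (label, _) in zip(names, best)}

print(resolve(["Amsterdam", "Utrecht"]))
```

Trying every combination is fine for the handful of names in one record; expedition clusters and species occurrence data would add further evidence on top of this distance score.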


Species Occurrence Data




Results

                       Correct  Correct  Correct  Mean distance  Not
                       @5km     @25km    @100km   off            found
=====================  =======  =======  =======  =============  =====
Baseline               38.9     47.0     58.4     251.1          26.2
+ Google maps + fuzzy  53.0     65.1     74.5     244.1           8.7
+ Spatial minimality   59.1     71.8     77.2     171.1           7.4
+ Expedition           59.1     71.8     77.2     171.1           7.4
+ GBIF                 61.7     74.5     79.9     114.5           7.4
=====================  =======  =======  =======  =============  =====
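One way the columns of the Results table could be computed from per-record errors is sketched below. It rests on assumptions the slides do not confirm: errors are distances in km between predicted and annotated coordinates, `None` marks a record for which no location was found, percentages are taken over all test records, and the mean distance covers only the records that were found.

```python
def summarise(errors_km, thresholds=(5, 25, 100)):
    """Summarise georeferencing errors as percentages and a mean distance."""
    total = len(errors_km)
    found = [e for e in errors_km if e is not None]
    summary = {
        f"correct@{t}km": round(100 * sum(e <= t for e in found) / total, 1)
        for t in thresholds
    }
    summary["mean_distance_off"] = round(sum(found) / len(found), 1)
    summary["not_found"] = round(100 * (total - len(found)) / total, 1)
    return summary

# Made-up errors for six records, one of which could not be located.
errors = [1.0, 3.0, 20.0, 80.0, 500.0, None]
print(summarise(errors))
```

A single far-off record dominates the mean distance, which may explain why that column can drop sharply while the threshold columns move only a little.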
Confidence




General Conclusions


• data cleaning is essential
• “digitising” a heritage collection is
  complicated
• don’t try to tame text

• Data-driven error correction method is
  being developed further in the CATCHPlus
  programme

   • http://www.catchplus.nl/diensten/deelprojecten/checkers/




Thank you for your attention!


• CATCH: http://www.nwo.nl/catch
• MITCH: http://ilk.uvt.nl/mitch
• Agora: http://agora.cs.vu.nl/


• More information about machine learning
 • Video explaining k-nearest neighbour
    algorithm: http://videolectures.net/aaai07_bosch_knnc/

 • Weka Toolkit: http://www.cs.waikato.ac.nz/ml/weka/




Richness oftheworld2012
