3. Why Digitise?
New technology offers many new possibilities
• improves collection management
• opens up new avenues of research
• enables digital access to the collection
4. Digitisation at Naturalis
• goal is to have 7 million objects (out of 37 million) digitised by mid-2015, plus a robust infrastructure for continuing digitisation
• 3 million within the Naturalis digitisation streets
• 4 million elsewhere
• the other 30 million objects will be digitised at a less detailed level
8. • Leposoma Guianense, Sipaliwini, 4 km e. of
airport, near base camp, forest ground,
among leaves, 28-VIII-1968, 12.45 u. reg. nr.
13879
9. But what you really want...
Genus Leposoma
Species Guianense
Region Sipaliwini
Location 4 km e. of airport
Biotope near base camp, forest ground, among leaves
Date 28-08-1968
Time 12:45
Reg # 13879
10. • Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
• ask a computer to learn to segment and classify
text snippets
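Part of this mapping is regular enough to illustrate without any learning. The sketch below is a hand-written stand-in, not the learned segmenter the slides describe: it pulls the date, time and registration number out of the example entry with regular expressions and leaves the remaining comma-separated segments unlabelled, since telling region, location and biotope apart is the part that needs a trained classifier. The function name `extract_fields` and the field names are assumptions made for illustration.

```python
import re

# Roman-numeral months as used in the register entry ("28-VIII-1968").
ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

DATE_RE = re.compile(r"\b(\d{1,2})-([IVX]+)-(\d{4})\b")
TIME_RE = re.compile(r"\b(\d{1,2})\.(\d{2}) u\.")   # "12.45 u." (Dutch "uur")
REG_RE = re.compile(r"reg\. nr\. (\d+)")


def extract_fields(snippet):
    """Extract the regular fields; return the rest as unlabelled segments."""
    fields = {}

    date = DATE_RE.search(snippet)
    if date and date.group(2) in ROMAN_MONTHS:
        day, month, year = int(date.group(1)), ROMAN_MONTHS[date.group(2)], date.group(3)
        fields["date"] = f"{day:02d}-{month:02d}-{year}"

    time = TIME_RE.search(snippet)
    if time:
        fields["time"] = f"{int(time.group(1)):02d}:{time.group(2)}"

    reg = REG_RE.search(snippet)
    if reg:
        fields["reg_nr"] = reg.group(1)

    # Remove what was recognised; the remaining segments (taxon, region,
    # location, biotope) still need a classifier to be labelled.
    rest = snippet
    for pattern in (DATE_RE, TIME_RE, REG_RE):
        rest = pattern.sub("", rest)
    fields["unclassified"] = [s.strip() for s in rest.split(",") if s.strip()]
    return fields


entry = ("Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, "
         "forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879")
result = extract_fields(entry)
```

The three patterns cover only this entry's notation; other register entries may write dates, times and numbers differently, which is one reason to learn the segmentation rather than hand-code it.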
11. • Manually annotate 500 text snippets (~3h)
• 300 for training
• 200 for testing
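The 300/200 split can be sketched as follows. This is a minimal illustration, assuming the annotated snippets are held in a list and that accuracy is the fraction of predictions matching the manual annotation; the helper names `split_annotations` and `accuracy` are hypothetical and not taken from the project.

```python
import random


def split_annotations(snippets, n_train=300, n_test=200, seed=0):
    """Shuffle annotated snippets and split them into disjoint train/test sets."""
    if len(snippets) < n_train + n_test:
        raise ValueError("not enough annotated snippets for the requested split")
    shuffled = list(snippets)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


def accuracy(gold, predicted):
    """Fraction of predicted labels that equal the manually annotated labels."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Stand-in data: 500 snippet identifiers instead of real annotated snippets.
train, test = split_annotations(list(range(500)))
```

Keeping the test snippets out of training is what makes the reported accuracy an estimate of performance on unseen register entries.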
12. • 49,688 new database records (547,528 database cells) at ~84.57% accuracy
13. The Manually Created Reptiles and Amphibians Database
• 16,870 records describing characteristics and history of animal specimens in a natural history database
• 39 columns
• Dutch, English, German and Portuguese
• numeric and textual values (both atomic and elaborate)
14.
column name      value
order            Anura
genus            Megophrys
country          Indonesia
biotope          in rain near road
collection date  01.02.1888
type             holotype
determinator     A. Dubois
defined by       (Linnaeus, 1758)
special remarks  in bad condition, was eaten by Leptodactylus rugosus (3023) at night and thrown up again the next morning when killed, partly digested
16. • a database provides structure
• computers are good at comparing values
• statistical methods can detect inconsistencies
17.
author          determinator    family       genus    country    preservation method
(Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
(Schlegel)      G. vd. Boog     Colubridae   Geophis  Indonesia  -----
Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
(Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol
19. actual value: Geophis
author          determinator    family       genus    country    preservation method
(Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
(Schlegel)      G. vd. Boog     Colubridae   ?        Indonesia  -----
Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
(Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol
24. actual value: Geophis
predicted value: Rhapdophis
author          determinator    family       genus    country    preservation method
(Daudin, 1802)  ------          Bataguridae  Anolis   Cambodja   (shield, dry)
(Schlegel)      G. vd. Boog     Colubridae   ?        Indonesia  -----
Schneider       M. S. Hoogmoed  ------       Bufo     Suriname   -----
(Horst, 1883)   Tyler, M J      Hylidae      Litoria  ------     alcohol
25. • <100 cells to check for a column instead of 16,870
• recall (estimate): 90-100%
• one-size-fits-all
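The idea of hiding a cell, predicting it from the rest of the record and flagging disagreements can be sketched in a few lines. The sketch below swaps in a deliberately simple predictor, a majority vote among the other records that share the same values in some context columns; the slides do not say which statistical model was actually used, so this is an assumption. The function name `flag_suspicious_cells`, the `min_support` parameter and the toy rows are all hypothetical.

```python
from collections import Counter


def flag_suspicious_cells(records, target, context, min_support=2):
    """Flag cells whose value disagrees with a leave-one-out majority vote.

    For each record, the target cell is hidden and predicted from all other
    records that agree on the context columns. A cell is flagged when the
    prediction differs from the actual value and is backed by at least
    min_support other records.
    """
    flagged = []
    for i, rec in enumerate(records):
        actual = rec.get(target)
        key = tuple(rec.get(c) for c in context)
        if actual is None or None in key:
            continue  # nothing to compare, or not enough context
        votes = Counter(
            other[target]
            for j, other in enumerate(records)
            if j != i
            and other.get(target) is not None
            and tuple(other.get(c) for c in context) == key
        )
        if not votes:
            continue
        predicted, support = votes.most_common(1)[0]
        if predicted != actual and support >= min_support:
            flagged.append((i, actual, predicted))
    return flagged


# Toy rows for illustration only (not taken from the real database).
rows = [
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Geophis"},
]
suspects = flag_suspicious_cells(rows, target="genus", context=["family", "country"])
```

Only the flagged cells go to a human for checking, which is where the reduction from a full column to a short list comes from. A flagged cell is a candidate error, not a confirmed one: a rare but correct value will also be flagged.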
30.
Challenge                         Example
Ambiguous location name           Amsterdam
Two or more location descriptors  Wakarusa, 24mi WSW of Lawrence
Topological nesting               Moccassin Creek on Hog Island
Complex description               Bupo [?Buso] River, 15 miles [24km] E of Lae
Linear feature measurement        16km (by road) N of Murtoa
Linear ambiguity                  On the road between Sydney and Bathurst
Vague localities                  Southeast Michigan
Changed political borders         Yugoslavia
Historical Place Names            British North Borneo
31. • Annotated geographical information in 200 randomly selected database records
• 50 records for development, 150 for testing
32. Knowledge-driven Georeferencing
• Record retrieval
• Text parsing
• Gazetteer lookup
• Offset calculation
• Disambiguation Heuristics
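The parsing, lookup and offset steps can be sketched for the simplest kind of locality string, "distance + compass direction + of + place". This is a minimal illustration, not the project's implementation: the regular expression, the function names and the one-entry gazetteer (a made-up place at latitude 0, longitude 0) are assumptions. The offset uses the standard great-circle destination formula on a spherical Earth and treats the distance as a straight line, which is a simplification for entries such as "16km (by road)".

```python
import math
import re

EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344

# 16-point compass rose, 22.5 degrees apart, clockwise from north.
_POINTS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
BEARINGS = {name: i * 22.5 for i, name in enumerate(_POINTS)}

# Matches e.g. "24mi WSW of Lawrence" or "16km (by road) N of Murtoa".
OFFSET_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(km|mi)\s*(?:\([^)]*\)\s*)?([NSEW]{1,3})\s+of\s+(.+?)\s*$"
)


def parse_offset(text):
    """Text parsing: return (distance_km, bearing_deg, place) or None."""
    m = OFFSET_RE.match(text)
    if not m or m.group(3) not in BEARINGS:
        return None
    distance = float(m.group(1))
    if m.group(2) == "mi":
        distance *= KM_PER_MILE
    return distance, BEARINGS[m.group(3)], m.group(4)


def destination(lat, lon, distance_km, bearing_deg):
    """Offset calculation: point reached from (lat, lon) on a sphere."""
    lat1, lon1, theta = map(math.radians, (lat, lon, bearing_deg))
    d = distance_km / EARTH_RADIUS_KM  # angular distance
    lat2 = math.asin(math.sin(lat1) * math.cos(d)
                     + math.cos(lat1) * math.sin(d) * math.cos(theta))
    lon2 = lon1 + math.atan2(math.sin(theta) * math.sin(d) * math.cos(lat1),
                             math.cos(d) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)


def georeference(text, gazetteer):
    """Parse, look the place up in the gazetteer, then apply the offset."""
    parsed = parse_offset(text)
    if parsed is None:
        return None
    distance_km, bearing, place = parsed
    if place not in gazetteer:
        return None
    lat, lon = gazetteer[place]
    return destination(lat, lon, distance_km, bearing)


# Hypothetical gazetteer with a made-up place, for illustration only.
toy_gazetteer = {"Sampletown": (0.0, 0.0)}
point = georeference("16km (by road) N of Sampletown", toy_gazetteer)
```

Strings that do not fit the pattern, such as "Southeast Michigan", return `None` here; those are the cases that need the richer parsing and the disambiguation heuristics.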
34. Disambiguation Heuristics
• Spatial Minimality
  • if Amsterdam and Utrecht are mentioned in the same record, then Amsterdam, NL is more likely than Amsterdam, NY, USA
• Expedition clusters
  • it is unlikely that a collector was collecting in Europe on Monday and in the US on Tuesday
• Species occurrence data
  • GBIF can tell us where a certain species does or does not occur
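Spatial minimality, the first heuristic above, can be sketched as picking the combination of candidate places with the smallest total pairwise distance. This is one possible reading of the heuristic, not necessarily how the project implements it; the function names are hypothetical and the coordinates are approximate values included only for illustration.

```python
import math
from itertools import product


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def spatially_minimal(candidates):
    """Choose one candidate per place name so the places lie closest together.

    candidates maps each name found in a record to a list of
    (label, lat, lon) tuples from the gazetteer.
    """
    names = list(candidates)
    best, best_cost = None, math.inf
    for combo in product(*(candidates[n] for n in names)):
        # Sum of distances over every pair of chosen candidates.
        cost = sum(haversine_km(p[1:], q[1:])
                   for i, p in enumerate(combo)
                   for q in combo[i + 1:])
        if cost < best_cost:
            best, best_cost = combo, cost
    if best is None:  # some name had no candidates at all
        return {}
    return {name: chosen[0] for name, chosen in zip(names, best)}


# Approximate coordinates, for illustration only.
record_places = {
    "Amsterdam": [("Amsterdam, NL", 52.37, 4.90),
                  ("Amsterdam, NY, USA", 42.94, -74.19)],
    "Utrecht": [("Utrecht, NL", 52.09, 5.12)],
}
resolved = spatially_minimal(record_places)
```

Trying every combination is fine for the handful of place names in a single record, but the number of combinations grows quickly, so a larger context such as a whole expedition would need a cheaper search.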
38. General Conclusions
• data cleaning is essential
• “digitising” a heritage collection is complicated
• don’t try to tame text
39. • Data-driven error correction method is being developed further in the CATCHPlus programme
• http://www.catchplus.nl/diensten/deelprojecten/checkers/