Oboyski cal bug_ecn_2012
1. Digitizing California Arthropod Collections
Peter Oboyski, Gordon Nishida, Kipling Will, Rosemary Gillespie
Essig Museum of Entomology
University of California
Berkeley, California, USA
2. What is CalBug?
Essig Museum of Entomology
California Academy of Sciences
California State Collection of Arthropods
Bohart Museum, UC Davis
Entomology Research Museum, UC Riverside
San Diego Natural History Museum
LA County Museum
Santa Barbara Museum of Natural History
3.
4. Digitization workflow
Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
• Remove labels, add unique identifier
• Take digital image, name and save file
• Replace labels, return to collection
Data Capture
• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing
Data Manipulation
• Error checking
• Geographic referencing
• Aggregate data in online cache
• Temporospatial analyses
5. Why Image Specimens/Labels?
• Data capture can be done remotely
• Magnify difficult-to-read labels
• Verbatim archive of label data
6. Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
– Presorting allows faster databasing
• Remove labels, add unique identifier
– Removing labels is quick
– Adding unique identifiers is slow
• Take digital image, name and save file
– Efficient work station, file-naming conventions and batch processing
• Replace labels, return to collection
– Replacing labels takes time
10. High resolution
- magnify hard-to-read labels
Labels flat, unobscured
- better for OCR
Scale bar, controlled light
Important to add species name to image or file name
Digital camera, tethered to computer
Labels removed
EMEC218958 Paracotalpa ursina.jpg
13. IrfanView software for batch processing of image files
EMEC218958 Paracotalpa ursina.jpg
14. Data capture
• Manually enter data into MySQL database
– Using our own MySQL database (EssigDB)
– Built-in error checking
– Data carry-over from one record to the next
– Taxonomy automatically added
• Online crowd-sourcing of manual data entry
– “Notes from Nature”
– Collaboration with Zooniverse
– Citizen Scientist transcription of labels
• Optical Character Recognition (OCR) & automated data parsing
– Collaboration with UC San Diego
– Improved OCR and “word spotting”
– Automatic data parsing (not yet!!)
– iDigBio “hackathon” in February for OCR
15.
16. Genus and species from file name
Higher taxonomy auto-filled from database authority file
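The file-name step can be sketched in a few lines of Python. This is only an illustration, not the EssigDB code: the regular expression assumes names shaped like the example "EMEC218958 Paracotalpa ursina.jpg" (catalog number, genus, species), and the AUTHORITY dictionary is a hypothetical one-entry stand-in for the database authority file.

```python
import re
from pathlib import Path

# Assumed pattern: catalog number (letters + digits), genus, species epithet,
# separated by single spaces, as in "EMEC218958 Paracotalpa ursina.jpg".
FILENAME_RE = re.compile(
    r"^(?P<catalog>[A-Z]+\d+) (?P<genus>[A-Z][a-z]+) (?P<species>[a-z-]+)$"
)

# Hypothetical stand-in for the authority file: genus -> higher taxonomy.
AUTHORITY = {
    "Paracotalpa": {"family": "Scarabaeidae", "order": "Coleoptera"},
}


def parse_image_name(filename: str) -> dict:
    """Split an image file name into catalog number, genus and species."""
    stem = Path(filename).stem  # drop the ".jpg" extension
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename!r}")
    return match.groupdict()


def fill_taxonomy(record: dict, authority: dict = AUTHORITY) -> dict:
    """Return a copy of the record with higher taxonomy added when the genus is known."""
    filled = dict(record)
    filled.update(authority.get(record["genus"], {}))
    return filled


record = fill_taxonomy(parse_image_name("EMEC218958 Paracotalpa ursina.jpg"))
```

Rejecting names that do not match, rather than guessing, keeps misnamed images out of the database until someone fixes them.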
19. Integrating OCR with crowd sourcing
o Spotting words within images
o Copy-paste, highlight-drag fields
o Auto-detecting repeated “words”, e.g. species, states, counties
o Providing an additional “vote” for transcription consensus
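One way to picture OCR as an extra "vote" is a simple majority count over the volunteer transcriptions of a field. The sketch below is an assumption about how such a consensus could work, not the Notes from Nature algorithm; the function names and the more-than-half threshold are illustrative.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so trivial differences do not split votes."""
    return " ".join(text.split()).casefold()


def consensus(transcriptions, ocr_guess=None, min_share=0.5):
    """Return the value backed by more than `min_share` of the votes, else None.

    The OCR reading, when available, is counted as one more vote.
    """
    votes = [normalize(t) for t in transcriptions if t and t.strip()]
    if ocr_guess and ocr_guess.strip():
        votes.append(normalize(ocr_guess))
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count / len(votes) > min_share else None


# Two volunteers disagree; the OCR reading breaks the tie.
tied = consensus(["Alameda", "Alamedo"])
settled = consensus(["Alameda", "Alamedo"], ocr_guess="ALAMEDA")
```

Here `tied` is None because neither spelling has a majority, while `settled` resolves to the spelling that the OCR reading supports.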
20. The OCR challenge for specimen labels
DETECTION:
Finding text in a complex matrix
Machine-typed vs. hand-written labels
Sliding window classifier creating text bounding boxes
>95% detection and localization using pixel-overlap measures
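The slide does not say which pixel-overlap measure was used. A common choice for scoring text bounding boxes is intersection over union (IoU), sketched here as an assumption; the 0.5 threshold and the function names are illustrative, and boxes are (left, top, right, bottom) in pixels.

```python
def overlap_ratio(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def detection_rate(truth_boxes, predicted_boxes, threshold=0.5):
    """Fraction of hand-marked text boxes matched by at least one predicted box."""
    if not truth_boxes:
        return 0.0
    found = sum(
        1
        for truth in truth_boxes
        if any(overlap_ratio(truth, pred) >= threshold for pred in predicted_boxes)
    )
    return found / len(truth_boxes)
```

A detector would then be scored by running `detection_rate` over labels whose text regions were marked by hand.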
21. Current progress in OCR
RECOGNITION:
Using Tesseract OCR engine
Machine type
– 74% word-level accuracy
– 82% character-level accuracy
Handwriting
– 5.4% word-level accuracy
– 9.2% character-level accuracy
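How the word-level and character-level figures were computed is not stated. One plausible definition, assumed here purely for illustration, is 1 minus the edit distance between the OCR output and a hand transcription, divided by the length of the hand transcription, taken over characters or over whitespace-separated words.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), clipped at zero."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1 - edit_distance(reference, hypothesis) / len(reference))


def character_accuracy(truth: str, ocr: str) -> float:
    return accuracy(truth, ocr)


def word_accuracy(truth: str, ocr: str) -> float:
    return accuracy(truth.split(), ocr.split())
```

Under this definition a single wrong letter costs one character but a whole word, which is consistent with word-level scores being lower than character-level scores for the same text.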
22. Data Manipulation
Just starting this phase
• Error checking
– No report on error rates
• Geographic referencing
– Georeferencing very slow even with semi-automation with GeoLocate and other services
• Aggregate data in online cache
– Following Darwin Core standards
– Merging of data straightforward
• Temporospatial analyses
– Analyses pending
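Merging is simple under Darwin Core because each collection only has to rename its own columns to the shared terms. The sketch below assumes hypothetical local field names (catalog, genus, species, state, county, collected) and takes the institution code "EMEC" from the catalog-number prefix; the right-hand names are standard Darwin Core terms.

```python
import csv
import io

# Hypothetical local column names mapped to Darwin Core terms.
FIELD_TO_DWC = {
    "catalog": "catalogNumber",
    "genus": "genus",
    "species": "specificEpithet",
    "state": "stateProvince",
    "county": "county",
    "collected": "eventDate",
}


def to_darwin_core(record: dict, institution_code: str = "EMEC") -> dict:
    """Rename local fields to Darwin Core terms, skipping empty values."""
    row = {"institutionCode": institution_code}
    for local_name, dwc_term in FIELD_TO_DWC.items():
        if record.get(local_name):
            row[dwc_term] = record[local_name]
    if "genus" in row and "specificEpithet" in row:
        row["scientificName"] = f"{row['genus']} {row['specificEpithet']}"
    return row


def merge_to_csv(rows) -> str:
    """Write rows from any number of collections into one CSV with shared columns."""
    columns = sorted({term for row in rows for term in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Once every museum exports rows this way, aggregation amounts to concatenating the lists before calling `merge_to_csv`.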
23. Progress
• After 2 years ...
• Undergraduate student work force
• Pinned specimens
– Imaging 20-65 specimens per hour (avg. = 40)
• Microscope slides
– Imaging 100-170 specimens per hour (avg. = 140)
• Approximately 40,000 records databased
– Plus 115,000 previously databased insect records
• 150,000+ images waiting to be databased
The tool first prompts the user to highlight where the record text is within the image. This lets us store a spatial annotation (in MongoDB) recording where on the image each piece of data was transcribed from.
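A spatial annotation of this kind could be stored as a small document per highlighted region. The shape below is a hypothetical sketch, not the tool's actual schema: the field names, the (x, y, width, height) box convention and the bounds check are all assumptions, and the result is a plain dictionary of the sort a document store such as MongoDB accepts.

```python
def make_annotation(image_name, field, box, text, image_size):
    """Build an annotation document tying transcribed text to a region of the image.

    box is (x, y, width, height) in pixels; image_size is (width, height).
    """
    x, y, w, h = box
    img_w, img_h = image_size
    if w <= 0 or h <= 0 or x < 0 or y < 0 or x + w > img_w or y + h > img_h:
        raise ValueError("highlight box must lie inside the image")
    return {
        "image": image_name,
        "field": field,  # which database field the text belongs to
        "box": {"x": x, "y": y, "w": w, "h": h},
        "text": text,    # verbatim transcription of the highlighted region
    }
```

Keeping the box with the text means a later reviewer, or an OCR training run, can go straight back to the pixels that a transcription came from.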