Oboyski cal bug_ecn_2012

Digitizing California Arthropod Collections
Peter Oboyski, Gordon Nishida, Kipling Will, Rosemary Gillespie
Essig Museum of Entomology
University of California
Berkeley, California, USA

What is CalBug?
Essig Museum of Entomology
California Academy of Sciences
California State Collection of Arthropods
Bohart Museum, UC Davis
Entomology Research Museum, UC Riverside
San Diego Natural History Museum
LA County Museum
Santa Barbara Museum of Natural History

Digitization workflow
Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
• Remove labels, add unique identifier
• Take digital image, name and save file
• Replace labels, return to collection
Data Capture
• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing
Data Manipulation
• Error checking
• Geographic referencing
• Aggregate data in online cache
• Temporospatial analyses

Why Image Specimens/Labels?
• Data capture can be done remotely
• Magnify difficult-to-read labels
• Verbatim archive of label data

Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
– Presorting allows faster databasing
• Remove labels, add unique identifier
– Removing labels is quick
– Adding unique identifiers is slow
• Take digital image, name and save file
– Efficient work station, file naming conventions and batch processing
• Replace labels, return to collection
– Replacing labels takes time

1st generation - DinoLite digital microscope

2nd generation – Digital Camera (Canon G9)

• High resolution
– magnify hard-to-read labels
• Labels flat, unobscured
– better for OCR
• Scale bar, controlled light
• Important to add species name to image or file name
• Digital camera tethered to computer
• Labels removed
Example file name: EMEC218958 Paracotalpa ursina.jpg

Scanning Slides
Flatbed scanner & Photoshop

IrfanView software for batch processing of image files
EMEC218958 Paracotalpa ursina.jpg

Data capture
• Manually enter data into MySQL database
– Using our own MySQL database (EssigDB)
– Built-in error checking
– Data carry-over one record to next
– Taxonomy automatically added
• Online crowd-sourcing of manual data entry
– “Notes from Nature”
– Collaboration with Zooniverse
– Citizen Scientist transcription of labels
• Optical Character Recognition (OCR) & automated data parsing
– Collaboration with UC San Diego
– Improved OCR and “word spotting”
– Automatic data parsing (not yet!!)
– iDigBio “hackathon” in February for OCR

Genus and species from file name
Higher taxonomy auto-filled
from database authority file
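
As a rough illustration of how genus and species might be read from a file name and higher taxonomy filled in from an authority file, here is a minimal Python sketch. The regular expression, the AUTHORITY table and the function name are assumptions made for illustration; they are not EssigDB code, and the real authority file lives in the database.

```python
import re
from pathlib import Path

# Hypothetical stand-in for the database authority file:
# genus -> higher taxonomy (a single illustrative entry).
AUTHORITY = {
    "Paracotalpa": {"order": "Coleoptera", "family": "Scarabaeidae"},
}

# Assumed naming convention: "<catalog number> <Genus> <species>.<ext>"
FILE_NAME = re.compile(
    r"^(?P<catalog>[A-Z]+\d+) (?P<genus>[A-Z][a-z]+) (?P<species>[a-z]+)$"
)

def record_from_file_name(file_name: str) -> dict:
    """Build a partial specimen record from an image file name."""
    match = FILE_NAME.match(Path(file_name).stem)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    record = match.groupdict()
    # Higher taxonomy is added only when the genus is in the authority table.
    record.update(AUTHORITY.get(record["genus"], {}))
    return record

print(record_from_file_name("EMEC218958 Paracotalpa ursina.jpg"))
```

File names that do not follow the assumed pattern raise an error instead of producing a half-filled record, so naming mistakes surface at import time.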

Notes from Nature
Citizen Science data transcription

Integrating OCR with crowd sourcing
o Spotting words within images
o Copy-paste, highlight-drag fields
o Auto-detecting repeated “words” (e.g. species, states, counties)
o Providing an additional “vote” for transcription consensus
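
One way the OCR output could act as an additional "vote" is a simple majority count over the volunteer transcriptions of a field. The sketch below assumes a plain majority rule with light normalization. The function names, the min_share threshold and the sample county strings are all hypothetical; the slides do not say how consensus is actually computed.

```python
from collections import Counter

def normalize(text: str) -> str:
    """Collapse whitespace and case so trivially different entries match."""
    return " ".join(text.split()).casefold()

def consensus(transcriptions, ocr_guess=None, min_share=0.5):
    """Return the majority value for one field, or None if there is no majority."""
    votes = [normalize(t) for t in transcriptions if t.strip()]
    if ocr_guess and ocr_guess.strip():
        votes.append(normalize(ocr_guess))  # OCR counts as one additional vote
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    # Require strictly more than min_share of all votes to agree.
    return value if count / len(votes) > min_share else None

# Three volunteer entries (one with a typo) plus an OCR reading; values are made up.
print(consensus(["Alameda Co.", "alameda co.", "Alamdea Co."], ocr_guess="Alameda  Co."))
```

Fields without a clear majority come back as None, which would let them be routed to a person for review.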

The OCR challenge for specimen labels
DETECTION:
• Finding text in a complex matrix
• Machine-typed vs. hand-written labels
• Sliding window classifier creating text bounding boxes
• >95% detection and localization using pixel-overlap measures
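
The slide does not say which pixel-overlap measure was used. A common choice for scoring a predicted text bounding box against a hand-marked one is intersection over union, sketched here in Python as an assumption rather than as the project's actual metric. Boxes are (left, top, right, bottom) in pixels, and the function names are illustrative.

```python
def box_area(box):
    """Area in pixels of a box given as (left, top, right, bottom)."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)

def overlap_ratio(predicted, truth):
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    intersection = (
        max(predicted[0], truth[0]),
        max(predicted[1], truth[1]),
        min(predicted[2], truth[2]),
        min(predicted[3], truth[3]),
    )
    shared = box_area(intersection)
    union = box_area(predicted) + box_area(truth) - shared
    return shared / union if union else 0.0

# A predicted box shifted right by half its width against the true box.
print(overlap_ratio((0, 0, 10, 10), (5, 0, 15, 10)))
```

A detection would then count as correct when this ratio exceeds a chosen threshold; the threshold is a design choice the slides do not state.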

RECOGNITION:
Current progress in OCR recognition, using Tesseract OCR engine
• Machine type: 74% word-level accuracy, 82% character-level accuracy
• Handwriting: 5.4% word-level accuracy, 9.2% character-level accuracy
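
To illustrate why word-level accuracy comes out lower than character-level accuracy, the sketch below scores an OCR reading against a reference transcription at both levels using edit distance. The definition used here (1 minus edit distance over reference length) is an assumption, since the slides do not state how accuracy was measured, and the label text is made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1 - edit_distance(reference, hypothesis) / len(reference))

truth = "Berkeley Alameda Co. Calif."  # hypothetical label text
ocr = "Berkeley Alarneda Co. Calif."   # "m" misread as "rn"

print("character level:", accuracy(truth, ocr))
print("word level:", accuracy(truth.split(), ocr.split()))
```

A single misread letter costs only a couple of characters but invalidates the whole word, so the word-level score drops faster.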

Data Manipulation
• Error checking
– Just starting this phase
– No report on error rates
• Geographic referencing
– Georeferencing very slow even with semi-automation with GeoLocate and other services
• Aggregate data in online cache
– Following Darwin Core standards
– Merging of data straightforward
• Temporospatial analyses
– Analyses pending
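
Following Darwin Core is what makes merging straightforward: once each collection exports the same term names, records can be combined by key. The Python sketch below is a minimal illustration under assumed local field names (catalog, genus, species, family, order, locality). Only the Darwin Core term names on the output side are taken from the standard; the functions are hypothetical.

```python
def to_darwin_core(record):
    """Map a local record (hypothetical field names) to Darwin Core terms."""
    dwc = {
        "basisOfRecord": "PreservedSpecimen",
        "catalogNumber": record["catalog"],
        "genus": record["genus"],
        "specificEpithet": record["species"],
        "scientificName": f"{record['genus']} {record['species']}",
    }
    # Optional fields are copied only when present, so gaps stay visible.
    for local, term in (("family", "family"),
                        ("order", "order"),
                        ("locality", "verbatimLocality")):
        if record.get(local):
            dwc[term] = record[local]
    return dwc

def merge(*sources):
    """Combine Darwin Core record lists, keeping the first record per catalog number."""
    merged = {}
    for source in sources:
        for rec in source:
            merged.setdefault(rec["catalogNumber"], rec)
    return list(merged.values())

local = {"catalog": "EMEC218958", "genus": "Paracotalpa", "species": "ursina"}
print(merge([to_darwin_core(local)], [to_darwin_core(local)]))
```

Keeping the first record per catalog number is only one possible policy; a real aggregator would need a rule for conflicting duplicates.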

Progress
• After 2 years ...
• Undergraduate student work force
• Pinned specimens
– imaging 20-65 specimens per hour (ave. = 40)
• Microscope slides
– Imaging 100-170 specimens per hour (ave. = 140)
• Approximately 40,000 records databased
– Plus 115,000 previously databased insect records
• 150,000+ images waiting to be databased

Thank you
http://calbug.berkeley.edu
