Digitizing California Arthropod Collections
Peter Oboyski, Gordon Nishida, Kipling Will, Rosemary Gillespie
Essig Museum of Entomology
University of California
Berkeley, California, USA
What is CalBug?
Essig Museum of Entomology
California Academy of Sciences
California State Collection of Arthropods
Bohart Museum, UC Davis
Entomology Research Museum, UC Riverside
San Diego Natural History Museum
LA County Museum
Santa Barbara Museum of Natural History
Digitization workflow
Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
• Remove labels, add unique identifier
• Take digital image, name and save file
• Replace labels, return to collection
Data Capture
• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing
Data Manipulation
• Error checking
• Geographic referencing
• Aggregate data in online cache
• Temporospatial analyses
Why Image Specimens/Labels?
• Data capture can be done remotely
• Magnify difficult-to-read labels
• Verbatim archive of label data
Handling & Imaging
• (Optional) Sort by locality, date, sex, etc.
  – Presorting allows faster databasing
• Remove labels, add unique identifier
  – Removing labels is quick
  – Adding unique identifiers is slow
• Take digital image, name and save file
  – Efficient workstation, file naming conventions, and batch processing
• Replace labels, return to collection
  – Replacing labels takes time
1st generation – DinoLite digital microscope
2nd generation – Digital camera (Canon G9)
High resolution
– magnify hard-to-read labels
Labels flat, unobscured
– better for OCR
Scale bar, controlled light
Important to add species name to image or file name
Digital camera
Tethered to computer
Labels removed
EMEC218958 Paracotalpa ursina.jpg
Scanning Slides
Flatbed scanner & Photoshop
Save for Web & Devices
IrfanView software for batch processing of image files
EMEC218958 Paracotalpa ursina.jpg
Data capture
• Manually enter data into MySQL database
  – Using our own MySQL database (EssigDB)
  – Built-in error checking
  – Data carry-over from one record to the next
  – Taxonomy automatically added
• Online crowd-sourcing of manual data entry
  – “Notes from Nature”
  – Collaboration with Zooniverse
  – Citizen Scientist transcription of labels
• Optical Character Recognition (OCR) & automated data parsing
  – Collaboration with UC San Diego
  – Improved OCR and “word spotting”
  – Automatic data parsing (not yet!!)
  – iDigBio “hackathon” in February for OCR
Genus and species from file name
Higher taxonomy auto-filled
from database authority file
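The file-name convention shown earlier (EMEC218958 Paracotalpa ursina.jpg) can be parsed mechanically. The following is a minimal sketch, not the EssigDB implementation: the regular expression, the function name parse_image_name, and the one-entry AUTHORITY table are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Hypothetical stand-in for the database authority file: genus -> higher taxonomy.
AUTHORITY = {
    "Paracotalpa": {"order": "Coleoptera", "family": "Scarabaeidae"},
}

# Assumed pattern: catalog number, then genus, then species, separated by spaces.
FILENAME_RE = re.compile(
    r"^(?P<catalog>[A-Z]+\d+)\s+(?P<genus>[A-Z][a-z]+)\s+(?P<species>[a-z-]+)$"
)

def parse_image_name(filename):
    """Split an image file name into catalog number, genus, and species,
    then auto-fill higher taxonomy from the authority table when the genus is known."""
    stem = Path(filename).stem  # drop the .jpg extension
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename!r}")
    record = match.groupdict()
    record.update(AUTHORITY.get(record["genus"], {}))
    return record

record = parse_image_name("EMEC218958 Paracotalpa ursina.jpg")
print(record)
```

Rejecting names that do not match, rather than guessing, keeps misnamed images out of the database until someone renames them.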
Notes from Nature
Citizen Science data transcription
Integrating OCR with crowd sourcing
o Spotting words within images
o Copy-paste, highlight-drag fields
o Auto-detecting repeated “words”, e.g. species, states, counties
o Providing an additional “vote” for transcription consensus
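One way to picture OCR as an additional "vote" is a simple majority count over normalized transcriptions. This is only a sketch under assumptions: the slides do not describe the actual Notes from Nature consensus rule, and the names normalize, consensus, and min_share are hypothetical.

```python
from collections import Counter

def normalize(text):
    """Collapse whitespace and ignore case so trivial differences do not split votes."""
    return " ".join(text.split()).lower()

def consensus(transcriptions, ocr_guess=None, min_share=0.5):
    """Return the normalized value that wins more than min_share of the votes, or None.

    The OCR result, when supplied, counts as one more vote alongside the volunteers.
    """
    votes = [normalize(t) for t in transcriptions]
    if ocr_guess is not None:
        votes.append(normalize(ocr_guess))
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count / len(votes) > min_share else None

# Two volunteers disagree; the OCR vote breaks the tie.
print(consensus(["Marin", "Napa"]))
print(consensus(["Marin", "Napa"], ocr_guess="MARIN"))
```

With two volunteers split evenly no value clears the threshold, so the field stays unresolved until another vote arrives.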
The OCR challenge for specimen labels
DETECTION: Finding text in a complex matrix
• Machine-typed vs. hand-written labels
• Sliding window classifier creating text bounding boxes
• >95% detection and localization using pixel-overlap measures
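The slides do not define the pixel-overlap measure. One common choice for scoring a detected bounding box against a hand-drawn one is intersection over union, sketched here as an assumption rather than as the measure the project used. Boxes are taken to be (left, top, right, bottom) pixel coordinates.

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (left, top, right, bottom) in pixels. Returns a value in [0, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A detected box shifted halfway off the reference box.
print(overlap_ratio((0, 0, 10, 10), (5, 0, 15, 10)))
```

A detection would then count as correct when the ratio exceeds some chosen threshold; the threshold itself is another parameter the slides leave unstated.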
Current progress in OCR recognition
RECOGNITION: Using Tesseract OCR engine
Machine type
• 74% word-level accuracy
• 82% character-level accuracy
Handwriting
• 5.4% word-level accuracy
• 9.2% character-level accuracy
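Word-level and character-level accuracy can diverge because a single wrong character spoils a whole word. The sketch below shows one common way to compute both from an edit distance; the slides do not say how the reported figures were calculated, so the definition here (1 minus edit distance over reference length) is an assumption, and the label text is invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), floored at zero."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1 - edit_distance(reference, hypothesis) / len(reference))

def char_accuracy(reference, hypothesis):
    return accuracy(reference, hypothesis)

def word_accuracy(reference, hypothesis):
    return accuracy(reference.split(), hypothesis.split())

truth = "Berkeley Alameda Co. Calif."   # invented label text
ocr = "Berkeley Alarneda Co. Calif."    # "m" misread as "rn"
print(word_accuracy(truth, ocr), char_accuracy(truth, ocr))
```

In this example one misread letter costs a quarter of the words but only 2 of 27 characters, which matches the pattern in the reported figures where character-level accuracy is the higher of the two.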
Data Manipulation
Just starting this phase
• Error checking
  – No report on error rates
• Geographic referencing
  – Georeferencing very slow even with semi-automation with GeoLocate and other services
• Aggregate data in online cache
  – Following Darwin Core standards
  – Merging of data straightforward
• Temporospatial analyses
  – Analyses pending
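Following Darwin Core is what makes merging straightforward: once every collection exports the same term names, records can be pooled and keyed on institution plus catalog number. The sketch below assumes hypothetical local column names (the keys of FIELD_MAP) and an institution code of EMEC inferred from the example file name; the Darwin Core term names on the right-hand side are real terms from the standard.

```python
# Hypothetical local column names mapped to Darwin Core terms.
FIELD_MAP = {
    "catalog": "catalogNumber",
    "genus": "genus",
    "species": "specificEpithet",
    "state": "stateProvince",
    "county": "county",
    "collected_on": "eventDate",
    "collector": "recordedBy",
}

def to_darwin_core(row, institution_code):
    """Rename known fields to Darwin Core terms and drop empty values."""
    dwc = {FIELD_MAP[k]: v for k, v in row.items()
           if k in FIELD_MAP and v not in (None, "")}
    dwc["institutionCode"] = institution_code
    if "genus" in dwc and "specificEpithet" in dwc:
        dwc["scientificName"] = f'{dwc["genus"]} {dwc["specificEpithet"]}'
    return dwc

def merge(*collections):
    """Pool records from several collections, keyed on institution + catalog number.
    A later record with the same key replaces an earlier one."""
    merged = {}
    for records in collections:
        for rec in records:
            merged[(rec["institutionCode"], rec["catalogNumber"])] = rec
    return list(merged.values())

row = {"catalog": "EMEC218958", "genus": "Paracotalpa", "species": "ursina", "county": ""}
record = to_darwin_core(row, institution_code="EMEC")  # institution code is an assumption
print(merge([record], [record]))
```

Keying on the pair rather than the catalog number alone guards against two museums happening to reuse the same number.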
Progress
• After 2 years ...
• Undergraduate student workforce
• Pinned specimens
– Imaging 20-65 specimens per hour (avg. = 40)
• Microscope slides
– Imaging 100-170 specimens per hour (avg. = 140)
• Approximately 40,000 records databased
– Plus 115,000 previously databased insect records
• 150,000+ images waiting to be databased
Thank you
http://calbug.berkeley.edu

Oboyski cal bug_ecn_2012


Editor's notes

  1. The tool first prompts the user to highlight where the record text is within the image. This allows us to store a spatial annotation (stored in MongoDB) recording where on the image the data was transcribed.