An evaluation of taxonomic name finding & next steps in Biodiversity Heritage Library (BHL) developments. Chris Freeland, Technical Director, BHL; Director of Bioinformatics, Missouri Botanical Garden
Goals of BHL (http://www.biodiversitylibrary.org)
BHL Institutions
Now Online Only 290 million to go! See you in 2048!
Scanning Operations: Locations of BHL/IA Scanning Centers
Complexities of distributed, mass scanning (examples from NYBG and from Smithsonian)
Open Access Data: The snakes of Australia; an illustrated and descriptive catalogue of all the known species. By Gerard Krefft... Publisher: Sydney, T. Richards, Government Printer, 1869. Formats: PDF, OCR, XML, JP2
Name Finding via TaxonFinder
Name Finding in action with Taxonomic Intelligence…: Raw image; Converted to text via OCR; Name finding via TaxonFinder; Extract names; Submit to NameBank; SOAP response
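The name-finding step on this slide can be illustrated with a toy example. The sketch below is not TaxonFinder's algorithm; it swaps in a simple stand-in technique (a capitalised-word-plus-lowercase-word pattern filtered by a tiny genus lexicon). The lexicon `KNOWN_GENERA`, the function `find_candidate_names`, and the sample sentence are all hypothetical, and the NameBank submission step is left out because its interface is not described in the slides.

```python
import re

# Hypothetical mini-lexicon of genus names (assumption for illustration only).
KNOWN_GENERA = {"Quercus", "Hoplocephalus"}

# A capitalised word followed by a lowercase word of 3+ letters:
# a rough shape for a Latin binomial, with many false positives on its own.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")


def find_candidate_names(ocr_text):
    """Return binomial-shaped strings whose first word is a known genus."""
    names = []
    for match in BINOMIAL.finditer(ocr_text):
        genus, epithet = match.groups()
        if genus in KNOWN_GENERA:  # the lexicon check removes "The white" etc.
            names.append(f"{genus} {epithet}")
    return names


page_text = "The white oak, Quercus alba, grows in eastern North America."
print(find_candidate_names(page_text))
```

Because the lexicon check is an exact string match, any OCR error inside the genus would cause the name to be missed, which is the effect the evaluation slides go on to measure.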
Name Finding Stats to date (as of 19 October 2008)
APIs & Data Sharing
Name Finding Evaluation (See Poster in hall)
Characteristics of sample
  Number of Pages: 392
  Average Number of Words per Page: 446.8
  Average Number of Names per Page: 7.7
  Total Number of Names: 3003
  Total Number of Unique Names: 2610 (= 86.91% of all names)
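The derived figures in the sample characteristics follow from the raw counts. A minimal check, assuming the percentage is unique names over total names and the per-page average is total names over pages (variable names are illustrative):

```python
pages = 392
total_names = 3003
unique_names = 2610

# Share of name occurrences that are distinct names.
unique_share_pct = 100 * unique_names / total_names

# Average number of name occurrences per page.
names_per_page = total_names / pages

print(f"unique share: {unique_share_pct:.2f}%")
print(f"names per page: {names_per_page:.1f}")
```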
OCR error rate for names only: 35.16%
Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.
Top OCR errors (with the number shown beside each on the slide):
  e->o 14, h->ii 13, h->l 12, u->ii 11, r->i 10, l->i 9, n->v 8, c->e 7, i->l 6, u->n 5, u->I 4, e->c 3, Omit Space 2, Insert Space 1
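The 35.16% figure is the share of names with at least one transcription error, and a single substitution is enough to break an exact lookup. The sketch below recomputes the rate and applies one substitution from the list; reading `e->o` as "printed e recognised as o" is an assumption, and the helper `apply_substitution` and the one-entry lexicon are hypothetical.

```python
total_names = 3003
names_with_ocr_errors = 1056

error_rate_pct = 100 * names_with_ocr_errors / total_names
print(f"OCR error rate for names: {error_rate_pct:.2f}%")


def apply_substitution(name, printed, recognised):
    """Simulate one OCR confusion by replacing the first occurrence."""
    return name.replace(printed, recognised, 1)


lexicon = {"Quercus alba"}  # hypothetical one-entry name list
garbled = apply_substitution("Quercus alba", "e", "o")

print(garbled)             # the name as OCR might render it
print(garbled in lexicon)  # exact matching no longer finds it
```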
Performances of algorithms
Algorithms compared: TaxonFinder, FAT
Conditions: Excluding names with OCR errors; Including names with OCR errors
First set of results: Precision 28.20%, 40.32%; Recall 23.34%, 36.62%; F-score 25.77%, 38.47%
Second set of results: Precision 32.25%, 43.77%; Recall 17.21%, 25.82%; F-score 24.73%, 34.80%
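For readers comparing these figures with other evaluations, it helps to see how an F-score is usually derived from precision and recall. The sketch below computes the conventional F1 (harmonic mean) and, for comparison, the arithmetic mean for each precision/recall pair on the slide. The slides do not state which definition was used; the printed F-score values agree with the arithmetic mean to two decimals, while the harmonic mean comes out slightly lower. Function names are illustrative.

```python
def f1(precision, recall):
    """Conventional F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average of precision and recall."""
    return (precision + recall) / 2


# (precision %, recall %, F-score % as printed on the slide)
reported = [
    (28.20, 23.34, 25.77),
    (40.32, 36.62, 38.47),
    (32.25, 17.21, 24.73),
    (43.77, 25.82, 34.80),
]

for p, r, f_slide in reported:
    print(
        f"P={p:.2f} R={r:.2f} slide={f_slide:.2f} "
        f"harmonic={f1(p, r):.2f} arithmetic={arithmetic_mean(p, r):.2f}"
    )
```

The gap between the two means grows as precision and recall diverge, so it is largest for the second set of results.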
Considerations
Recommendations
Up next: BHL Article Repository
And if that wasn’t enough…
Contact
