Collection assessment in a collaborative environment: Biodiversity Heritage Library
1. Collection Assessment in a Collaborative
Environment: BHL
Connie Rinaldo, Bianca Crowley, Trish Rose-Sandler & William Ulate
2. The BHL is…
• A consortium of 15 natural history and botanical libraries and
research institutions
• An open access, full-text digital library for legacy
biodiversity literature.
• An open data repository of taxonomic names and
bibliographic information
• An expanding global effort
• Mission: The Biodiversity Heritage Library improves &
makes more efficient the methodology of research in
biodiversity studies by collaboratively making biodiversity
literature openly available to the world as part of a
global biodiversity community.
3. BHL Goals
• Goal 1: Relevant Content: Build & maintain the BHL as the largest reliable,
reputable, & responsive repository of biodiversity literature & archival
materials.
• Goal 2: Tools & Services: Develop services & tools which facilitate
discovery & improve research efficiency of BHL content.
• Goal 3: User Engagement: Increase global awareness about the BHL
through outreach, learning & education, & branding through engagement
& collaboration with existing & new user communities.
• Goal 4: Membership & Partnerships: Grow BHL consortia membership &
partnerships while fostering cross-institutional collaboration that
continues to serve as a model for digital library development
• Goal 5: Financial Sustainability: Ensure sustainability & relevance by being
flexible, adaptable, & financially sound while the content & services
remain openly & freely available.
7. BHL Overview
• New user interface launched in March
• Search by title, author, article, subjects and scientific
names
• Various download options, including high resolution
• Taxonomic name finding algorithm
• Machine-to-machine services
• Full-text search being tested
8. Core Principles
• Open access
• Open data
• Deconstruct the silo and deliver content where users are
already working
– Via other biodiversity websites and taxonomic
resources
– Via social media platforms: blog, Flickr, Facebook,
Twitter, Pinterest, etc.
• Involve users in collection and technical development
activities
10. Beyond the Silo: Open Data
• Stable URLs
• Open Data Policy
• APIs (Application Programming Interfaces)
• Data Exports
• OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
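As a rough illustration of the machine-to-machine access listed above, the sketch below builds OAI-PMH request URLs. The verbs (`Identify`, `ListRecords`) and the `metadataPrefix=oai_dc` argument come from the OAI-PMH protocol itself; the base URL and the helper name `build_oai_request` are assumptions for illustration and should be checked against the BHL developer documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration only; confirm the real base URL before use.
OAI_BASE_URL = "https://www.biodiversitylibrary.org/oai"


def build_oai_request(verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and its arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return OAI_BASE_URL + "?" + urlencode(params)


# Identify describes the repository; ListRecords harvests metadata records.
identify_url = build_oai_request("Identify")
list_records_url = build_oai_request("ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(list_records_url)
```

The code only constructs the URLs; fetching and parsing the XML responses is left out so the sketch stays independent of network access.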
11. User Feedback is Critical
General feedback form
http://biodiversitylibrary.org/contact
Scan request form
12. Impact
• “BHL came to the rescue when a planned trip to work in the Mertz Library at The New
York Botanical Garden had to be cancelled due to Hurricane Sandy. Thanks to the online
resources available through BHL I was able to source most of the key works I needed,
with their supporting bibliographic information. Further use of BHL occurred when
building work at the Linnean Society of London limited access to some of the book I had
been able to use from that collection.”
• “I would like thank you all very much for invaluable work and support you do. I just got a
pdf-file from more than century old (1893) journal paper (regional naturalist society
paper, published in Finland), to get copy I should take 500 mile drive to our university
library. Now I am got it fastly in high-quality pdf-copy. Cordial thanks and all success in
continuing your highly valuable mission.” [conservation biologist from Estonia]
• “You are a wonderful resource. I maintain a Website that describes the plant genus
Opuntia (prickly pear cacti). There is no way I could maintain such a site without access to
literature from 100-200 years ago. Most of the cactus species were discovered long ago; I
find it invaluable to put up PDF files to document each species in the literature as I
document them photographically. I am a botanist, but I work in the pharmaceutical field
(not so many botanical jobs out there). Your library makes it possible for me to continue
working with plants in a meaningful and scientific manner.”
14. Questions about BHL Content
• How many books in BHL are there about....?
• How can we identify areas of weakness in BHL
in order to prioritize what materials to scan
next?
• Rod Page has one suggestion:
http://iphylo.blogspot.com/2013/10/which-taxonomic-journals-should-be.html
16. Questions about BHL Content
• What are scalable solutions to content
analysis?
• Can we provide creative & meaningful
visualizations?
17. Why do we care about taxonomic
names?
• Scientists use taxonomic names to organize
their research
• Biodiversity literature breaks down by
discipline & by specific taxon
19. What is “Taxonomic Intelligence”?
• Global Names Recognition & Discovery tool
– Locate, verify, record scientific names from each
page
– Text is uncorrected OCR
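The locate / verify / record steps can be sketched in a few lines. This is a toy stand-in, not the Global Names Recognition & Discovery tool: the regular expression, the tiny `KNOWN_NAMES` list and the function `find_names` are all assumptions made for illustration. It also shows why uncorrected OCR matters: a misread name is located but fails verification.

```python
import re

# Toy reference list; a real tool verifies against large name indexes.
KNOWN_NAMES = {"Pinus banksiana", "Pinus divaricata", "Ursus arctos"}

# Locate: a capitalized genus followed by a lowercase epithet of 3+ letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_names(page_text):
    """Return (candidates, verified) name lists for one page of OCR text."""
    candidates = [" ".join(match.groups()) for match in BINOMIAL.finditer(page_text)]
    # Verify: keep only candidates present in the reference list.
    verified = [name for name in candidates if name in KNOWN_NAMES]
    return candidates, verified


# The second occurrence carries a simulated OCR error ("banksiaua").
ocr_page = "Stands of Pinus banksiana were noted. Nearby, Pinus banksiaua seedlings grew."
candidates, verified = find_names(ocr_page)

# Record: here we just print; a real pipeline would store names per page.
print(candidates)
print(verified)
```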
20. Overview of available BHL (meta) data
http://biodivlib.wikispaces.com/Data+Exports
• Title metadata: contributed from MARC records of
hundreds of library catalogs (BHL consortium libraries &
non-BHL IA contributors)
• Volume/item metadata: provides information about the
actual objects & pieces digitized
• Subject
• Creator/author data
• Segment/part/“article” metadata (separate table for
segment/part creators?)
• Page metadata which includes our algorithmically
identified scientific name data
• OCR text available at the item/volume level but not overall
for corpus of BHL
25. Sample BHL & Nomenclatural Data
• Google Refine reconciled list of BHL subject keywords
• List of vetted BHL subject targets from collection
development policy
• Taxonomic name data set for trees of North America
(link out)
• http://www.fs.fed.us/database/feis/plants/tree/index.html
• http://www.treesofnorthamerica.net/
• Subject terms associated with BHL titles where Pinus
banksiana occurs
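One way to derive "subject terms associated with titles where a name occurs" is to join page-level name data to title-level subjects. The sketch below uses made-up rows and identifiers; the real export files have their own columns, so `page_names`, `title_subjects` and `subjects_for_name` are hypothetical.

```python
from collections import Counter

# Hypothetical rows: (title_id, page_id, scientific name found on that page).
page_names = [
    (1, 10, "Pinus banksiana"),
    (1, 11, "Pinus banksiana"),
    (2, 20, "Ursus arctos"),
    (3, 30, "Pinus banksiana"),
]

# Hypothetical title-level subject keywords.
title_subjects = {
    1: ["Botany", "Trees"],
    2: ["Mammals"],
    3: ["Botany", "Natural history"],
}


def subjects_for_name(name):
    """Count subject keywords across the distinct titles in which a name occurs."""
    titles = {title_id for title_id, _page_id, found in page_names if found == name}
    counts = Counter()
    for title_id in titles:
        counts.update(title_subjects.get(title_id, []))
    return counts


pinus_subjects = subjects_for_name("Pinus banksiana")
print(pinus_subjects.most_common())
```

Each title is counted once, however many of its pages mention the name, because subjects are assigned at title level.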
26. Other Tools & Process
• Bibliographies (discipline & more)
• Index Animalium: identifies first appearance of 400,000 animals
from 1758-1850
• Researcher supplied specific taxon bibliographies
• Zoological Record: Taxonomic references back to 1864.
• Taxonomic Literature II: a selective guide to botanical publications
with dates, commentaries and scientific types
• Compare universe of biodiversity literature to BHL
• Unknown dataset for full universe
• Compared BHL member collections to BHL content for gap-filling
before content expanded (lists automated but gap identification
manual)
• REST especies: a way to collate species metadata? http://dopaservices.jrc.ec.europa.eu/services/especies/
• DOPA Explorer http://ehabitat-wps.jrc.ec.europa.eu/dopasimple/
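The gap-identification step described as manual above could be partly automated by normalizing titles and taking a set difference between a reference bibliography and current holdings. This is a minimal sketch under that assumption; the normalization rule is deliberately crude, and "Hypothetical Journal of Mosses" is an invented title used only to produce a gap.

```python
import re


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return " ".join(cleaned.split())


def find_gaps(reference_titles, held_titles):
    """Return reference titles with no normalized match among held titles."""
    held = {normalize_title(title) for title in held_titles}
    return [title for title in reference_titles if normalize_title(title) not in held]


reference = [
    "The Canadian Field-Naturalist",
    "Index Animalium",
    "Hypothetical Journal of Mosses",  # invented title for illustration
]
held = ["The Canadian field naturalist.", "Index animalium"]

gaps = find_gaps(reference, held)
print(gaps)
```

Exact matching on normalized strings will miss abbreviated or variant titles, so a real comparison would likely also need identifiers (ISSN, OCLC numbers) or fuzzy matching.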
35. Thank you for your
Help!
http://biodiversitylibrary.org
Connie Rinaldo
crinaldo@oeb.harvard.edu
Editor's notes
GOALS:
A free & open access digital library for biodiversity literature and primary source materials (field books).
A consortium of 15 libraries working together to run a virtual library branch.
A collection of content from the 15-member BHL consortium and other Internet Archive contributors.
Anyone is free to access & download BHL materials.
SEARCH: Subject searching in BHL via the "subjects" tab of the advanced search (http://biodiversitylibrary.org/advsearch) searches the table of subject keywords we have in BHL, derived from the LCSH. It does NOT search titles or scientific names. If you do a basic keyword search via the homepage for a subject term, say "Birds", you will pull hits across all titles, articles, authors, subjects and scientific names, broken out by tabs. Notice that the subjects tab shows all search results where "birds" is a part of the subject keyword string, such as "Birds of prey" or "Cage birds".
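The subjects-tab behaviour described above amounts to a case-insensitive substring match against the subject keyword table. The sketch below assumes exactly that and nothing more; the keyword list and `search_subjects` are illustrative, and the actual BHL search logic may differ.

```python
# Illustrative subject keywords; the real table is derived from LCSH.
subject_keywords = ["Birds", "Birds of prey", "Cage birds", "Botany", "Bird watching"]


def search_subjects(term, keywords):
    """Return keywords that contain the search term, ignoring case."""
    needle = term.lower()
    return [keyword for keyword in keywords if needle in keyword.lower()]


hits = search_subjects("birds", subject_keywords)
print(hits)
```

Note that "Bird watching" is not returned for "birds": a substring match does no stemming, which is one reason subject searching alone can undercount relevant content.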
COLLABORATION!
Add images… Also add DOIs?
User feedback is key; we rely on the many eyes of the crowd to help us direct our curation activities to the content people are actually using. Users can let us know if they find a problem with something in our collection through our general feedback form, and can request that something be scanned through our scanning request form.
The trees of North America, entomology, or bears: metadata, right? BUT LCSH doesn’t adequately describe the biodiversity literature. Scientists organize around scientific names, articles, and parts of articles (species descriptions).
Rod Page did this (in his words): “constructed a table listing all the journals in BioNames that have an ISSN, ordered by the number of articles in BioNames (i.e., mostly articles that publish new names). The full table is here, I've reproduced part of it below (limited to those journals with at least 500 articles in BioNames)”
The Biodiversity Heritage Library uses taxonomic intelligence tools, including Global Names Recognition and Discovery (GNRD) developed by Global Names Architecture, to locate, verify, and record scientific names located within the text of each digitized page. Note: the text used for this identification is uncorrected OCR, so it may not include all results expected or visible on the page. This names-based index is an incredibly valuable tool for organismal research, and is easily incorporated into external web sites through two different methods of access.
Bold = focus for this session: what we have provided on Library Box. Names are extracted from OCR text.
Each dataset has its own complexity:
- Taxonomic names: a. have a hierarchy (the next-to-last one in the list is at an infraspecific taxonomic level: forma); b. change over time (the 4th one in the list, Pinus divaricata, is a synonym); c. have all sorts of exceptions to the rules (the last one, Pinus × murraybanksiana, is a hybrid).
- Common names: a. are subjective and biased towards organisms of well-known groups only; b. are dependent on language, region and time.
- Subjects are: a. language dependent; b. hierarchical; c. at title level.
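The three kinds of name complexity noted above (infraspecific ranks, synonyms, hybrids) can be flagged with simple string checks plus a lookup. This is only a sketch: the marker sets are partial, the synonym table is an assumption (real synonymy would come from a names service), and "Pinus banksiana f. hypothetica" is an invented name used to exercise the forma case.

```python
# Assumed synonym lookup for illustration; not an authoritative source.
SYNONYMS = {"Pinus divaricata": "Pinus banksiana"}
HYBRID_MARKERS = {"×", "x", "X"}
INFRASPECIFIC_RANKS = {"subsp.", "var.", "f.", "forma"}


def describe_name(name):
    """Return a list of notes describing the complexities found in a name string."""
    tokens = name.split()
    notes = []
    if any(token in HYBRID_MARKERS for token in tokens):
        notes.append("hybrid")
    if any(token in INFRASPECIFIC_RANKS for token in tokens):
        notes.append("infraspecific")
    if name in SYNONYMS:
        notes.append("synonym of " + SYNONYMS[name])
    return notes or ["plain binomial or other"]


for example in [
    "Pinus banksiana",
    "Pinus divaricata",
    "Pinus banksiana f. hypothetica",  # invented forma, for illustration only
    "Pinus X murraybanksiana",
]:
    print(example, "->", describe_name(example))
```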
These have all been provided on Library Box, in addition to some more specific sets. We also have MODS, EndNote and BibTeX files for titles, items/volumes and parts.
A visualization of BHL data (for Pinus banksiana).
A picture of BHL data (for Pinus banksiana as it appears on page 140 of v.78 of The Canadian Field-Naturalist).
How do we reconcile all of this to find out what content covers our question? How can we map the more specific terms to LCSH/call numbers when we have limited resources? We need to automate as much as possible. We want consistent language. The BHL uses LC for the volumes but also pulls out scientific names. How do we get them incorporated into the consistent language of LC in an automated way that can scale? We want to know what we have so we can compare it to an (as yet) unidentified universe (bibliographies, Index Animalium, TL-2).
To show that name data come from multiple sources
BOLD means in Library Box. Google Refine: what they are and implications for collection analysis. These are links to
Index Animalium, TL-2. Literature breaks down by discipline and even by specific taxon; scientific names and bibliographic structure are different and we are trying to merge the two: looking at scientific data next to library data, but we have to make sense of the merger in the library world (see coll dev chart). Scientists work at an article/name/article-part level; we work at the level of the volume.
Taxonomic Literature: A selective guide to botanical publications and collections with dates, commentaries and types (Stafleu et al.). TL-2, the premier publication of the International Association for Plant Taxonomy (IAPT), is a 15-volume guide to the literature of systematic botany published between 1753 and 1940. It is organized by author and includes numbered entries for the author's publications.
How can we map back to LCSH/call numbers when we have limited resources? We need to automate as much as possible. We want consistent language. The BHL uses LC for the volumes but also pulls out scientific names. How do we get them incorporated into the consistent language of LC in an automated way that can scale? We want to know what we have so we can compare it to an (as yet) unidentified universe (bibliographies, Index Animalium, TL-2).
Index Animalium is Sherborn’s life’s work: a 9,000-page bibliography identifying the first book in which over 400,000 organisms appeared; it covers 1758-1850.
All of this is a LENGTHY process! It needs more automation.
Zoological Record is the world's oldest continuing database of animal biology. It is considered the world's leading taxonomic reference and, with coverage back to 1864, has long acted as the world's unofficial register of animal names.
Early on we compared the universe of what is in the big libraries to what was in BHL, and that allowed us to fill gaps: https://bhl.wikispaces.com/BHL+Priority+Titles
These are keywords that we use to describe how we collect for BHL. They are adapted from LC but are not necessarily actual subject headings. We modified some terms to make the language clear and bring in some of the scientific naming conventions (Ornithology instead of birds). This was meant to merge appropriate parts of the library and scientific worlds. This is the consistent language against which we want to compare BHL content.
Many irrelevant features; it breaks up phrases (e.g., "united states"). At least it shows that we have lots of BOTANY (but we would want to merge that with plants).
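Both problems noted here (phrases split into words, and related terms such as botany and plants counted separately) can be reduced by counting whole keyword phrases and mapping variants to one preferred term. The sketch below assumes a small hand-made `MERGE` table; the terms in it and the sample keywords are illustrative, not an actual BHL vocabulary.

```python
from collections import Counter

# Illustrative mapping of variant terms to a preferred term.
MERGE = {"plants": "botany", "birds": "ornithology"}


def count_keywords(keyword_strings):
    """Count whole keyword phrases after case/whitespace cleanup and merging."""
    counts = Counter()
    for keyword in keyword_strings:
        # Keep the phrase intact instead of splitting it into single words.
        phrase = " ".join(keyword.lower().split())
        counts[MERGE.get(phrase, phrase)] += 1
    return counts


keywords = ["Botany", "Plants", "United States", "united  states", "Birds"]
keyword_counts = count_keywords(keywords)
print(keyword_counts.most_common())
```

Maintaining the merge table by hand does not scale, which is the same automation question raised in these notes; the sketch only shows where such a mapping would plug in.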
This shows the distribution of keywords for items scanned by the Ernst Mayr Library of the Museum of Comparative Zoology (good thing zoology shows up as a big piece). This was made using Tableau software; all of the tiny items can be identified but, as with Wordle, there is lots of irrelevant stuff. How can we automate the improvement and appropriate merging of metadata? http://public.tableausoftware.com/views/BHLViz/DigitizedSubjects