BiSciCol: Linking Information for Biodiversity Scientists
1. The BiSciCol Project
Linking Information for Biodiversity Scientists
John Deck, UC Berkeley
BiSciCol Team: Reed Beaman, Nico Cellinese, Jonathan Coddington, Tom Conlin, Neil Davies, John Deck,
Rob Guralnick, Bryan P. Heidorn, Chris Meyer, Tom Orrell, Rich Pyle, Brian Stucky, Rob Whitton
2. The Biodiversity Data Integration Challenge
Adapted from The Economist, by David Simonds
3. “I’m here to fight for truth, justice, and the American way.”
– Superman
Ontologies, vocabularies, and standards help provide a
common understanding of the structure of information,
allowing us to break data down to its fundamental parts.
4. “Your identity is your most valuable possession. Protect it.
And if anything goes wrong, use your powers.” - Elastigirl
Identifiers allow us to tag, track, or reference any object or
process. They must be awesome: persistent, unique, resolvable.
5. The BiSciCol Strategy
Spreadsheets / DwC Archives / Raw Data
Break down to fundamental parts
Assign awesome identifiers
Re-assemble and integrate
6. A Data Integration Experiment:
Link records between VertNet and GenBank using the Darwin Core
Triplet (InstitutionCode : CollectionCode : CatalogNumber)
• 1,400,000 VertNet records
• 460,739 GenBank records (filtered by VertNet institutions)
Question:
What % of harvested GenBank records could be linked to VertNet
voucher specimen records using the Darwin Core Triplet?
Back to Reality …
Less than 1%!
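A minimal sketch of how such a link rate can be computed with exact string matching on triplets. The records and the formatting variations below are made up for illustration; they are not data or findings from the experiment, and the function names are hypothetical.

```python
def triplet_key(institution_code, collection_code, catalog_number):
    """Build a Darwin Core Triplet string for exact matching."""
    return f"{institution_code}:{collection_code}:{catalog_number}"


def percent_linked(genbank_triplets, vertnet_triplets):
    """Return the percentage of GenBank triplets found verbatim in VertNet."""
    if not genbank_triplets:
        return 0.0
    vertnet_index = set(vertnet_triplets)
    linked = sum(1 for t in genbank_triplets if t in vertnet_index)
    return 100.0 * linked / len(genbank_triplets)


# Made-up records, for illustration only.
vertnet = [
    triplet_key("MVZ", "Mamm", "12345"),
    triplet_key("MVZ", "Herp", "678"),
]
genbank = [
    "MVZ:Mamm:12345",  # exact match
    "MVZ 678",         # collection code omitted, different delimiter
    "mvz:herp:678",    # different letter case
    "MVZ:Herp:0678",   # zero-padded catalog number
]

rate = percent_linked(genbank, vertnet)
print(f"{rate:.1f}% linked")  # prints "25.0% linked"
```

Exact matching treats any difference in case, delimiter, or padding as a miss, which is one way a free-text triplet can fail as a join key.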
7. Identifier Challenges
NONE of the identifiers (that we found) employ strategies to ensure
truly long-term persistence, decoupling metadata from the identifier
itself.
Darwin Core triplets (at least as currently specified in standards, and
implemented) do not work well for linking data.
Interim Solutions
Fix DwC Triplet standards/validation (that’s you, GenBank), build a
Triplet resolver
PURL
Awesome Solutions
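A minimal sketch of what a triplet resolver could look like, assuming an in-memory lookup, a hypothetical normalization rule (trim whitespace, ignore case), and UUIDs as stand-in opaque identifiers. It is not the BiSciCol resolver or the BCID scheme; it only illustrates keeping metadata out of the identifier itself.

```python
import uuid


class TripletResolver:
    """Hypothetical in-memory resolver: triplet -> opaque identifier."""

    def __init__(self):
        self._by_triplet = {}  # normalized triplet -> identifier
        self._metadata = {}    # identifier -> metadata dict

    @staticmethod
    def normalize(triplet):
        # Assumed rule: trim whitespace and ignore case in each part.
        parts = [p.strip().lower() for p in triplet.split(":")]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"not a three-part triplet: {triplet!r}")
        return ":".join(parts)

    def register(self, triplet, **metadata):
        """Return the identifier for a triplet, minting one if needed."""
        key = self.normalize(triplet)
        if key not in self._by_triplet:
            # The identifier carries no institution or catalog information.
            self._by_triplet[key] = uuid.uuid4().hex
        identifier = self._by_triplet[key]
        self._metadata.setdefault(identifier, {}).update(metadata)
        return identifier

    def resolve(self, triplet):
        """Return the identifier for a triplet, or None if unknown."""
        return self._by_triplet.get(self.normalize(triplet))

    def describe(self, identifier):
        """Return a copy of the metadata stored for an identifier."""
        return dict(self._metadata.get(identifier, {}))


resolver = TripletResolver()
ident = resolver.register("MVZ:Mamm:12345", institution="MVZ")
same = resolver.resolve(" mvz : MAMM : 12345 ")
```

Because the metadata lives beside the identifier rather than inside it, a specimen can change collection or catalog number without its identifier changing.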
8. BiSciCol Strategies to Address the Biodiversity Data Integration Challenge
Ontologies, vocabularies, standards
Biological Collections Ontology (http://code.google.com/p/bco)
Genomic, Biodiversity, and Ecological standards alignment
*+BCIDs
Free, persistent, scalable, resolvable and awesome identifiers for
biodiversity data, built on CDL’s EZID system (http://biscicol.org/bcid/)
*Triplifier
Chunks raw data into fundamental parts, then re-assembles them as RDF and
integrates with other data (http://biscicol.org/triplifier/)
*Learn more about these projects at the Software Bazaar
+More about BCIDs integrating with VertNet on Day 2
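A minimal sketch of the general idea of breaking a spreadsheet row into RDF triples, written as N-Triples lines. This is not the Triplifier's implementation or output format; the subject URI is a hypothetical example.org address, and the column names are assumed to already be Darwin Core term names.

```python
DWC = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core terms namespace


def escape_literal(value):
    """Escape characters that are not allowed raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_ntriples(subject_uri, row):
    """Turn one row (column -> value) into N-Triples lines, skipping blanks."""
    lines = []
    for column, value in row.items():
        if value is None or value == "":
            continue
        literal = escape_literal(str(value))
        lines.append(f'<{subject_uri}> <{DWC}{column}> "{literal}" .')
    return lines


# Made-up row, for illustration only.
row = {
    "catalogNumber": "12345",
    "scientificName": "Peromyscus maniculatus",
    "recordedBy": "",
}
triples = row_to_ntriples("http://example.org/specimen/1", row)
for line in triples:
    print(line)
```

Each non-empty cell becomes one statement about the same subject, so rows from different sources can be merged once they share a subject identifier.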
9.
10. Ontology / Vocabulary Challenges
Need to clarify assumptions behind concepts
• Individual / Material Sample / Specimen / Population
• Different interpretations across domains: MIxS, INSDC, DwC, OBI
Varying degrees of formalism
• Checklists, spreadsheets, RDF, OBO, OWL
Insufficient support for standards organizations
• They consist of tenuous structures maintained by informal networks of
active volunteers
Solutions:
• Continually improve clarity in definitions
• Work towards more robust standards governance frameworks
• Implement test beds and better understand use cases