The BiSciCol Project
Linking Information for Biodiversity Scientists
John Deck, UC Berkeley
BiSciCol Team: Reed Beaman, Nico Cellinese, Jonathan Coddington, Tom Conlin, Neil Davies, John Deck,
Rob Guralnick, Bryan P. Heidorn, Chris Meyer, Tom Orrell, Rich Pyle, Brian Stucky, Rob Whitton
Adapted from The Economist, by David Simonds
The Biodiversity Data Integration Challenge
“I’m here to fight for truth, justice, and the American way.”
– Superman
Ontologies, vocabularies, and standards help provide a
common understanding of the structure of information,
allowing us to break data down to its fundamental parts.
“Your identity is your most valuable possession. Protect it.
And if anything goes wrong, use your powers.” – Elastigirl
Identifiers allow us to tag, track, or reference any object or
process. They must be awesome: persistent, unique, resolvable.
The BiSciCol Strategy
Spreadsheets / DwC Archives / Raw Data
Break down to fundamental parts
Assign awesome identifiers
Re-assemble and integrate
A Data Integration Experiment:
Link records between VertNet and GenBank using the Darwin Core
Triplet (InstitutionCode : CollectionCode : CatalogNumber)
• 1,400,000 VertNet records
• 460,739 GenBank records (filtered by VertNet institutions)
Question:
What percentage of harvested GenBank records could be linked to VertNet
voucher specimen records using the Darwin Core Triplet?
Back to Reality …
Less than 1%!
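As a minimal sketch of how such a linking experiment might be scripted (not the BiSciCol team's actual code), the Python below parses triplets, normalizes them, and reports the share of voucher strings that match. The function names, the normalization rules (trimming whitespace, ignoring case), and the sample records are all assumptions made up for illustration; they are not data from the experiment.

```python
from typing import Iterable, Optional, Tuple

Triplet = Tuple[str, str, str]


def normalize_triplet(raw: str) -> Optional[Triplet]:
    """Parse 'InstitutionCode:CollectionCode:CatalogNumber' into a comparable key.

    Assumed normalization: trim whitespace around each part and ignore case.
    Returns None when the string does not have exactly three non-empty parts.
    """
    parts = [p.strip() for p in raw.split(":")]
    if len(parts) != 3 or not all(parts):
        return None
    institution, collection, catalog = parts
    return (institution.upper(), collection.upper(), catalog.upper())


def link_rate(specimen_triplets: Iterable[str], voucher_strings: Iterable[str]) -> float:
    """Percentage of voucher strings whose triplet matches a specimen record."""
    index = {k for k in map(normalize_triplet, specimen_triplets) if k is not None}
    vouchers = list(voucher_strings)
    if not vouchers:
        return 0.0
    linked = sum(1 for v in vouchers if normalize_triplet(v) in index)
    return 100.0 * linked / len(vouchers)


# Hypothetical sample records, invented for this sketch.
specimens = ["INST:Mamm:12345", "INST:Herp:777"]
vouchers = [
    "INST:Mamm:12345",       # exact match
    "inst : herp : 777",     # matches only after normalization
    "INST 98765",            # not a triplet at all
    "personal:J. Smith:12",  # well-formed, but no such specimen
]
rate = link_rate(specimens, vouchers)
print(f"{rate:.1f}% of voucher strings linked")
```

Even in this toy case, half of the voucher strings fail to link, either because they are not triplets at all or because the triplet points to no known specimen.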
Identifier Challenges
NONE of the identifiers (that we found) employ strategies to ensure
truly long-term persistence, such as decoupling metadata from the
identifier itself.
Darwin Core triplets (at least as currently specified in standards, and
implemented) do not work well for linking data.
Interim Solutions
Fix DwC Triplets standards/validation (that’s you, GenBank), build a
Triplet resolver
PURL
Awesome Solutions
BiSciCol Strategies to Address the Biodiversity Data Integration Challenge
Ontologies, vocabularies, standards
Biological Collections Ontology (http://code.google.com/p/bco)
Genomic, Biodiversity, and Ecological standards alignment
*+BCIDs
Free, persistent, scalable, resolvable and awesome identifiers for
biodiversity data, built on CDL’s EZID system (http://biscicol.org/bcid/)
*Triplifier
Chunks raw data into fundamental parts then re-assembles as RDF and
integrates with other data (http://biscicol.org/triplifier/)
*Learn more about these projects at the Software Bazaar
+More about BCIDs integrating with VertNet on Day 2
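To make the "chunk, then re-assemble as RDF" idea concrete, here is a minimal Python sketch that turns one spreadsheet row into N-Triples statements. It is not the Triplifier's implementation: the function name, the `http://example.org/specimen/` base URI, and the sample row are hypothetical, and it assumes each column header is already a Darwin Core term name and emits one plain literal per column rather than mapping to ontology classes.

```python
from typing import Dict, List

# Darwin Core terms namespace.
DWC = "http://rs.tdwg.org/dwc/terms/"
# Hypothetical base URI for record identifiers, used only in this sketch.
BASE = "http://example.org/specimen/"


def row_to_ntriples(row_id: str, row: Dict[str, str]) -> List[str]:
    """Emit one N-Triples statement per non-empty column of a record."""
    subject = f"<{BASE}{row_id}>"
    statements = []
    for term, value in row.items():
        if value == "":
            continue  # skip empty cells rather than asserting empty literals
        # Escape backslashes, quotes, and newlines for an N-Triples literal.
        escaped = (
            value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        statements.append(f'{subject} <{DWC}{term}> "{escaped}" .')
    return statements


# Hypothetical spreadsheet row, invented for this sketch.
row = {
    "institutionCode": "INST",
    "collectionCode": "Mamm",
    "catalogNumber": "12345",
    "scientificName": "",
}
triples = row_to_ntriples("row1", row)
for statement in triples:
    print(statement)
```

Once every source is expressed as statements of this shape, records from different spreadsheets that share a subject identifier can be merged by simple concatenation, which is why the identifier assigned in the middle step matters so much.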
Ontology / Vocabulary Challenges
Need to clarify assumptions behind concepts
• Individual / Material Sample / Specimen / Population
• Different interpretations across domains: MIxS, INSDC, DwC, OBI
Solutions:
• Continually improve clarity in definitions
• Work towards more robust standards governance frameworks
• Implement test beds and better understand use cases
Varying degrees of formalism
• Checklists, spreadsheets, RDF, OBO, OWL
Insufficient support for standards organizations
• Consisting of tenuous structures maintained by informal networks of
active volunteers
