Bio2RDF Release 2 is an updated version that provides improved coverage of life science linked data through additional datasets and properties. It features open source conversion scripts, a common URI pattern for integration, and dataset provenance information. The release includes 19 datasets with over 1 billion RDF triples that can be queried using SPARQL endpoints or through federated queries across datasets using the Semanticscience Integrated Ontology. Future work aims to incorporate additional large datasets and consolidate provenance.
1. Bio2RDF Release 2: Improved coverage,
interoperability and provenance of Life
Science Linked Data
Linked Data for the Life Sciences
Alison Callahan1, Jose Cruz-Toledo1, Peter Ansell2, Michel Dumontier1
1Carleton University, 2University of Queensland
ESWC2013::Bio2RDF Release 21
2. ESWC2013::Bio2RDF Release 2
is an open source framework
on the emerging semantic web
to produce and provide biological linked data
that uses simple conventions
2
4. Main features of Bio2RDF Release 2
• Bio2RDF conversion scripts, mapping files and web
application are open source and freely available at
http://github.com/bio2rdf
• Bio2RDF enables (syntactic) data integration within and
across datasets by using one language (RDF), having a
common URI pattern and a common resource registry
• 19 Release 2 datasets have provenance and endpoints
feature pre-computed graph summaries for fast lookup
• Bio2RDF web application enables entity resolution, query
federation across an expandable distributed network of
SPARQL endpoints
ESWC2013::Bio2RDF Release 24
6. Bio2RDF data are identified using
simple http URI patterns
Bio2RDF data are identified by Internationalized Resource Identifiers
(IRIs) of the form:
• http://bio2rdf.org/namespace:identifier
for source data with an assigned identifier
• http://bio2rdf.org/namespace_resource:identifier
for source data without an assigned identifier
• http://bio2rdf.org/namespace_vocabulary:identifier
for dataset-specific types and relations
Where namespace comes from a curated registry of datasets, hence
enabling simple syntactic-based integration in and across datasets.
ESWC2013::Bio2RDF Release 26
7. Data are described through machine-
understandable statements
ESWC2013::Bio2RDF Release 2
drugbank:DB00650
drugbank_vocabulary:Drug
rdf:type
drugbank_resource:DB00440_DB00650
drugbank_vocabulary:Drug-Drug-Interaction
rdf:type
drugbank_vocabulary:ddi-interactor-in
rdfs:label
DDI between Trimethoprim and
Leucovorin [drugbank_resource:
DB00440_DB00650]
rdfs:label
Leucovorin [drugbank:DB00650]
7
8. The linked data network expands
with inter-dataset statements
ESWC2013::Bio2RDF Release 2
drugbank:DB00650
pharmgkb_vocabulary:Drug
rdf:type
rdfs:label
Leucovorin [drugbank:DB00650]
pharmgkb:PA450198
drugbank_vocabulary:Drug
pharmgkb_vocabulary:xref
leucovorin [pharmgkb:PA450198]
rdfs:label
DrugBank
PharmGKB
8
10. You can get what links to it
ESWC2013::Bio2RDF Release 2
http://bio2rdf.org/linksns/drugbank/drugbank:DB00650
What links to DrugBank’s Leucovorin?
10
11. Every Bio2RDF dataset now contains
provenance metadata
ESWC2013::Bio2RDF Release 2
Features
- Entity-dataset link
- Creator
- Publisher
- Date created
- License & rights
- Source
- Availability
- SPARQL endpoint
- Data dump
Vocabularies
VoID
Dublin Core
W3C Provenance
Bio2RDF vocabulary
11
data item
Bio2RDF
dataset
Source
dataset
void:inDataset
prov:wasDerivedFrom
12. Dataset Namespace # of triples
Affymetrix affymetrix 44469611
Biomodels* biomodels 589753
Comparative Toxicogenomics Database ctd 141845167
DrugBank drugbank 1121468
NCBI Gene ncbigene 394026267
Gene Ontology Annotations goa 80028873
HUGO Gene Nomenclature Committee hgnc 836060
Homologene homologene 1281881
InterPro* interpro 999031
iProClass iproclass 211365460
iRefIndex irefindex 31042135
Medical Subject Headings mesh 4172230
NCBO BioPortal* bioportal 15384622
National Drug Code Directory* ndc 17814216
Online Mendelian Inheritance in Man omim 1848729
Pharmacogenomics Knowledge Base pharmgkb 37949275
SABIO-RK* sabiork 2618288
Saccharomyces Genome Database sgd 5551009
NCBI Taxonomy taxon 17814216
Total 19 1,010,758,291
Bio2RDF Release 2 – New and Updated Datasets
ESWC2013::Bio2RDF Release 212
20. You can use the SPARQLed query
assistant with updated endpoints
ESWC2013::Bio2RDF Release 2
http://sindicetech.com/sindice-suite/sparqled/
graph: http://sindicetech.com/analytics20
21. Use virtuoso’s built in faceted
browser to construct increasingly
complex queries with little effort
ESWC2013::Bio2RDF Release 221
22. Federated Queries over independent
SPARQL endpoints
# get all biochemical reactions in biomodels that are kinds of "protein catabolic
process“, as defined by the gene ontology (in bioportal endpoint)
SPARQL Endpoint: http://bioportal.bio2rdf.org/sparql
SELECT ?go ?label count(distinct ?x)
WHERE {
?go rdfs:label ?label .
?go rdfs:subClassOf ?tgo OPTION (TRANSITIVE) .
?tgo rdfs:label ?tlabel .
FILTER regex(?tlabel, "^protein catabolic process")
service <http://biomodels.bio2rdf.org/sparql> {
?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
} # end service
}
ESWC2013::Bio2RDF Release 222
23. Question: Find all proteins that interact with beta
amyloid
SELECT * WHERE {
?protein a bio2rdf:Protein .
?protein bio2rdf:interacts_with bio2rdf:beta-amyloid
.
}
Heterogeneous biological data on the
semantic web is difficult to query
UniProt Protein PDB Protein
iRefIndex Protein
?
Physical interaction?
Pathway interaction?
Genetic interaction?
ESWC2013::Bio2RDF Release 223
24. ontology as a
strategy to formally
represent and
integrate knowledge
ESWC2013::Bio2RDF Release 224
25. uniprot:P05067
uniprot:Protein
is a
sio:gene
is a is a
Semantic data integration, consistency checking and
query answering over Bio2RDF with the
Semanticscience Integrated Ontology (SIO)
dataset
ontology
Knowledge Base
ESWC2013::Bio2RDF Release 2
pharmgkb:PA30917
refseq:Protein
is a
is a
omim:189931
omim:Gene pharmgkb:Gene
Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and
Michel Dumontier. Bio-ontologies 2012.
25
27. Bio2RDF and SIO powered SPARQL 1.1 federated query:
Find chemicals in CTD and proteins in SGD that participate
in the same GO process
SELECT ?chem, ?prot, ?proc
FROM <http://bio2rdf.org/ctd>
WHERE {
?chemical a sio:chemical-entity.
?chemical rdfs:label ?chem.
?chemical sio:is-participant-in ?process.
?process rdfs:label ?proc.
FILTER regex (?process, "http://bio2rdf.org/go:")
SERVICE <http://sgd.bio2rdf.org/sparql> {
?protein a sio:protein .
?protein sio:is-participant-in ?process.
?protein rdfs:label ?prot .
}
}
ESWC2013::Bio2RDF Release 227
28. Bio2RDF RDFization guidelines are
available at our wiki
https://github.com/bio2rdf/bio2rdf-scripts/wiki/RDFization-Guide-v1.1
ESWC2013::Bio2RDF Release 228
29. What the future holds
• Aiming for twice yearly release schedule. Next release will
include large datasets (>15B RefSeq, Genbank, PubMed, PDB)
• Working with identifiers.org to create a common registry with
2200 entries
• Spiking in identifiers.org and original data uris, where
available
• Consolidating provenance/metrics with OpenPHACTS
• Incorporate W3C Linking Open Drug Data (LODD) effort
– OMIM (released), SIDER (beta), CHEMBL (beta), LinkedCT
(beta), DailyMed (RDF available), TCM (RDF available),
• Extended dataset coverage by tapping into existing endpoints
(uniprot, bioportal, ebi-rdf?)
• Showcase with other third party tools
ESWC2013::Bio2RDF Release 229
30. Bio2RDF Release 2 – A summary
• Updated data conversion source code to use PHP API (all
available through GitHub)
– http://github.com/bio2rdf/bio2rdf-scripts
– https://github.com/bio2rdf/bio2rdf-scripts/wiki/RDFization-
Guide-v1.1
• Simple Bio2RDF IRI design patterns that facilitate syntactic
consistency and interoperability backed by simple registry
• Dataset provenance and metrics
• We welcome comments, suggestions and contributions
• Join our mailing list at bio2rdf@googlegroups.com
ESWC2013::Bio2RDF Release 230
31. Acknowledgements
Bio2RDF Release 2
Allison Callahan, Jose Cruz-Toledo, Peter Ansell
Bio2RDF
Francois Belleau, Marc-Alexandre Nolin
Alex De Leon, Steve Etlinger, Nichealla Keath
Jacques Corbeil, James Hogan Jean Morissette, Nicole
Tourigny, Philippe Rigault and Paul Roe
ESWC2013::Bio2RDF Release 231
Bio2RDF facilitates data sharing and reuse in a number of ways: Access the scripts that generate our linked data. These scripts are openly licensed for modification, redistribution or use by anyone wishing to generate RDF data on their own We make use of syntactically consistent IRI patterns We do not impose a structural scheme for representing our data Bio2RDF allows for multiple mirrors host our data, a DNS “round robin” procedure balances incoming requests between participating mirrors, this prevents a single point of failure and allows for additional mirrors to be added to the network A centralized resource registry has been developed for consistent usage of namespaces and vocabularies- In the next slides I will go into a bit more detail about the first 3 points
The Bio2RDF project transforms silos of life science data into a globally distributed network of linked data for biological knowledge discovery.
In order to keep a clear link back to the original data, in our RDFized datasets we maintain the original data provider’s record identifiers by making use of the following URI pattern:- namespace: preferred short name for a biological dataset
All resources in every bio2RDF graph is linked to a graph that describes the provenance of the generated linked data. We make use of the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary and Dublin core vocabulary to describe items such as: URL to the source data Date of generation Licensing information (if available) URL to the script used to generate the data Download URL for linked data files URL of SPARQL endpoint
Here I present some numbers for the 19 updated datasets
Here I am showing with more detail the top 10 subject type – predicate - object type frequencies that can be found in our Drugbank endpoint So for example to find drugs that participate in drug-drug interactions one would use the ddi-interactor-in predicate that associates the 1074 resources of type drug to the 10891 resources typed as drug-drug-interaction.
We are currently in the process of exposing Bio2RDF data using a variety of 3rd party tools- specifically, all of our endpoints can be navigated using Virtuoso’s faceted browser- We have also generated Sindice metrics that enable the use of SPARQLedautomed query builder. (This tool allows users to semiautomatically construct SPARQL queries based on namespaces used therein)-- finally it is possible to use Sig.ma browser to aggreate data from multiple endpoints and construct mashups
However, types and predicates are dataset specific and therefore not consistentWouldn’t it be nice to be able to query all of these datasets with a common nomenclature? (our objective)