Semantic W3C standards provide a framework for creating knowledge bases that are extensible, coherent, and interoperable, and on which interactive analytics systems can be developed. A growing number of knowledge bases are being built on these standards, in particular as Linked Open Data (LOD) resources, and their availability has received increasing attention in industry and academia. Using LOD resources to provide value to industry is challenging, however, and early expectations have not always been met: issues arise from the alignment of public and experimental corporate standards, from inconsistent URI policies, and from the use of internal, non-formal application ontologies. In addition, the reliability of resources is often problematic, from service levels to SPARQL endpoint uptime to URI persistence. Not least, in many cases provenance issues have not been properly resolved, and there are serious funding concerns related to government grant-backed resources. For these reasons, an integrated data appliance (iDA) preloaded with semantically integrated public knowledgebases provides an enterprise-ready “Semantics In-a-box” solution that addresses those shortcomings.
4. • RDF HAS EVOLVED AS ACCEPTED FRAMEWORK
• DYNAMIC, EXTENSIBLE, INTEROPERABLE SOLUTIONS NEEDED FOR BIG DATA
• ADVANTAGE: DON’T NEED TO KNOW A PRIORI WHICH QUESTIONS TO ASK
• THE LOD CLOUD IS GROWING …
• SPARQL 1.1 IS DE-FACTO STANDARD
• MARCH 21, 2013 W3C RECOMMENDATION
• LOTS OF POCS, PILOT STUDIES …
BUT
• OVERLY IDEALISTIC EXPECTATIONS:
***** LINKED (OPEN) DATA ≠ ***** COLLABORATIVE USABILITY!
• DIVERGING DIRECTIONS:
• DIFFERENT VOCABULARIES, REGISTRIES, OBJECTIVES, DESCRIPTORS
• DIFFERENT APPROACHES, PROVENANCE METADATA (VOID, PROV-O, PAV, OPENPHACTS, BIO2RDF, BIODBCORE, SADI, MIRIAM)
• W3C HCLS TRIES TO RESOLVE THIS BY BUILDING CONSENSUS ON MAPPINGS
6. THINKING LLD / LOD
MYTH #1: PUBLIC SPARQL ENDPOINTS ARE EQUAL
• DIFFERENT VOCABULARIES, REGISTRIES, OBJECTIVES, DESCRIPTORS
• DIFFERENT CONCEPTUAL APPROACH (OPENPHACTS, BIO2RDF, BIODBCORE, SADI, MIRIAM, …)
MYTH #2: PUBLIC SPARQL ENDPOINTS ARE INTEROPERABLE
• VERSIONING AND PROVENANCE ISSUES (PROV-O, VOID, SKOS, PAV)
• CLINICAL INTEROPERABILITY (HL7, MEDDRA, CDISC, MESH, ICD9/10 …)
MYTH #3: PUBLIC RESOURCES ARE ALWAYS AVAILABLE
• RELIABILITY CONCERNS FROM SERVICE-LEVEL TO URI PERSISTENCE
• MORE AND MORE “OPEN DATA” ARE CLOSED FOR COMMERCIAL USE
• ISSUES OF ACCESS TRACEABILITY ON CONFIDENTIAL DATA
• SERIOUS FUNDING UNEASE ABOUT AVAILABILITY OF GOVERNMENT-BACKED RESOURCES
9. BEST PRACTICES CHECKLIST
• WHICH RESOURCES DO WE NEED?
• REVIEW BASICS (LICENSING, PROVENANCE, VERSIONING, HIGH INTERLINK QUALITY, PERSISTENCE)
• BUILD GENERALLY APPLICABLE SOLUTIONS (VOCABULARIES, COMMON PREDICATES)
• FOCUS ON TRUE ‘’ RESOURCES
• DYNAMIC “APPLICATIONS ONTOLOGY” FIRST!
• HAVE THE BIG PICTURE IN MIND, BUT DON’T WAIT FOR PERFECTION
• ALIGN WITH FORMAL ONTOLOGIES (OR PARTS OF THEM) WHENEVER POSSIBLE
• NCBO BIOPORTAL
• THINK INTEROPERABILITY FROM THE BEGINNING
11. THE IDA CONCEPT
• INTEGRATED, PERSISTENT, CURRENT SEMANTIC KBS
• GOAL: READY TO USE FOR ENRICHMENT OF EXPERIMENTAL / INTERNAL DATASETS
• COMBINING APPLICATIONS AND RESOURCES
• WEB QUERY SERVER, KNOWLEDGE EXPLORER PRO, VIRTUOSO
• ALL NECESSARY TOOLS INCLUDED FOR MAPPING AND QUERY
• PRE-CONFIGURED KNOWLEDGEBASE(S), CONTROLLED VERSIONING, PERIODIC UPDATES
• ENTERPRISE-READY APPLIANCE
• 64 GB RAM FOR FAST QUERY PERFORMANCE
• RAID-5 REDUNDANT ARCHIVING
65. ‘SEMANTICS IN A BOX’
PROS
• READY-TO-GO: NO SETUP AND INTEGRATION TIME, NO INTEROPERABILITY ISSUES
• PRECONFIGURED ENTERPRISE-READY HARDWARE WITH SEMANTICALLY INTEGRATED SETS OF PUBLIC KNOWLEDGEBASES OUT-OF-THE-BOX
• NO CONCERNS ABOUT UPTIME OF PUBLIC RESOURCES
• CONTROLLED VERSIONING AND MAINTENANCE CYCLES SOLVE RELIABILITY AND DATA INTEGRITY ISSUES
• NO TRACEABILITY WORRIES ON CONFIDENTIAL DATA
• INTEGRATED CLIENT AND WEB APPLICATIONS FOR GRAPH VISUALIZATION, EXPLORATION AND QUERY REDUCE BARRIERS TO ENTRY FOR END USERS AND KEEP THE FOCUS ON SCIENTIFIC UTILITY
CONS
• LIVE PUBLIC RESOURCES MAY UPDATE IN-BETWEEN SCHEDULED MAINTENANCE
• SELECTION OF RESOURCES MAY NOT SUFFICE FOR ALL USE CASES
66. CONCLUSIONS
• THE USE OF IDA-HOSTED PUBLIC RESOURCES COMBINED WITH EXPERIMENTAL DATA TO PROVIDE MODELS FOR CLASSIFICATION OF TOXICITY TYPES IN PRE-CLINICAL SETTINGS DEMONSTRATES A SUCCESSFUL AND FAST SEMANTIC INTEGRATION THAT PROVIDED BIOLOGICAL QUALIFICATION OF GENOMIC AND METABOLOMIC BIOMARKERS.
• BECAUSE THE RDF IS ALREADY PRE-ALIGNED AND CONTAINS PROVENANCE AND VERSIONING, A BETTER A PRIORI DETERMINATION OF ADVERSE EFFECTS OF DRUG COMBINATIONS CAN BE ACHIEVED MUCH FASTER AND WITH MUCH LESS EFFORT. RICH SPARQL QUERIES CORRELATE RESPONSES OF UNRELATED STUDIES WITH DIFFERENT EXPERIMENTAL MODELS AND VALIDATE SYSTEM CHANGES ASSOCIATED WITH KNOWN COMMON TOXICITY MECHANISMS.
• HAVING LINKED DATA AVAILABLE IN ONE APPLIANCE TOGETHER WITH EXPERIMENTAL RESULTS MAKES IT EASY TO EMPLOY SEMANTIC TECHNOLOGIES WORRY-FREE AND, AS SUCH, TO PROMOTE A BETTER UNDERSTANDING OF BIOLOGICAL SYSTEMS MORE READILY. THE TIME AND MONEY SAVED HAVE HUGE SOCIO-ECONOMIC BENEFITS IN DRUG DISCOVERY AND HEALTHCARE.
67. ACKNOWLEDGEMENTS
SUPPORT FOR TOXICITY STUDIES
NIST ATP #70NANB2H3009
NIAAA #HHSN281200510008C
W3C
HCLS LLD / PHARMACOGENOMICS SIG
Scott Marshall, Michel Dumontier
PATHOGEN PROJECT
FDA NARMS
Sherry Ayers
PUBLIC RESOURCES
SIB / UNIPROT CONSORTIUM
Jerven Bolleman
WIKIMEDIA FOUNDATION
Anja Jentzsch
BIO2RDF II
Michel Dumontier
BMIR / NCBO STANFORD
Mark Musen, Trish Whetzel
IDA DEVELOPMENT
SAGE-N
James Candlin, David Chiang
IO INFORMATICS
Andrea Splendiani, Jason Eshleman, Robert Stanley
TOXICITY PROJECT
COGENICS
Pat Hurban, Alan Higgins, Imran Shah, Hongkang Mei, Ed Lobenhofer
BOWLES CENTER FOR ALCOHOL STUDIES / UNC
Fulton Crews
Public datasets exist in many revisions over time and are registered and mirrored in many places, with registries that are often out of date or contain conflicting information. Several initiatives have therefore been proposed at the W3C and in consortia and industry alliances to align interlinked datasets, for example using the Vocabulary of Interlinked Datasets (VoID) or PROV-O. For the end user, having to deal with obstacles such as additional non-trivial data mapping, as well as the need to have rich authoring, licensing, provenance, and versioning information (such as that developed in PAV) included with the data, creates another barrier to the broad application of semantically contextualized, integrated experimental and public datasets.
This can be remedied. An iDA on preconfigured, enterprise-ready hardware, containing semantically integrated sets of public knowledgebases out of the box and providing controlled versioning and maintenance cycles, resolves this predicament. Integrated client and web applications for visualizing, exploring, and querying the RDF graphs from a common UI reduce barriers to entry for end users and keep the focus on scientific utility.
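The versioning and provenance information mentioned above could be recorded per hosted dataset as a small RDF description. The following is a minimal sketch, assuming VoID, PAV, and Dublin Core terms; the function name, the example.org URIs, the version string, and the retrieval date are hypothetical placeholders and do not describe the actual iDA contents.

```python
from datetime import datetime

# Namespaces of the vocabularies named in the text (VoID, PAV), plus Dublin Core terms.
PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "pav": "http://purl.org/pav/",
    "dcterms": "http://purl.org/dc/terms/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def describe_snapshot(uri, title, version, source, retrieved_on, license_uri):
    """Return a Turtle description of one hosted snapshot of a public dataset."""
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines += [
        "",
        f"<{uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
        f"    dcterms:license <{license_uri}> ;",
        f'    pav:version "{version}" ;',
        f"    pav:retrievedFrom <{source}> ;",
        f'    pav:retrievedOn "{retrieved_on.isoformat()}"^^xsd:dateTime .',
    ]
    return "\n".join(lines)


# All values below are illustrative placeholders.
snapshot = describe_snapshot(
    uri="http://example.org/ida/dataset/uniprot-2013_01",
    title="UniProt (hosted snapshot)",
    version="2013_01",
    source="http://example.org/mirror/uniprot.rdf",
    retrieved_on=datetime(2013, 1, 15, 0, 0, 0),
    license_uri="http://example.org/license/placeholder",
)
print(snapshot)
```

Keeping one such description per maintenance cycle would let a query state which snapshot of a public resource a result was derived from.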
Applying this approach to the understanding and characterization of toxicity, we show how, starting from semantically integrated experimental results from multi-year toxicology studies performed on different platforms (genomic and metabolic profiling), iDA-hosted public life sciences resources (UniProt, DrugBank, Diseasome, SIDER, Reactome, NCBI BioSystems) can be used to provide models for classification of toxicity types in pre-clinical settings. Because the RDF is already pre-aligned and carries detailed and accurate provenance and versioning, a better a priori determination of adverse effects of drug combinations can be achieved much faster and with much less effort. Rich SPARQL queries made it possible to quickly correlate responses across unrelated studies with different experimental models, and to validate system changes associated with known common toxicity mechanisms.
The time and money saved by such an approach have huge socio-economic benefits for drug companies and healthcare alike. Having linked data available in one appliance together with experimental results makes it easy to employ Semantic Web technologies worry-free and, as such, to promote a better understanding of biological systems more readily.
Step 1: Map to RDF – Term harmonization via one or multiple thesauri; select thesauri for classes during mapping
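Step 1 can be pictured with a small example. This is a minimal sketch, assuming a tabular result file and a single synonym thesaurus; the namespace, the predicate names (ofSample, measures, foldChange), and the thesaurus entries are hypothetical and are not taken from the tools used in the talk.

```python
from urllib.parse import quote

BASE = "http://example.org/kb/"  # hypothetical namespace
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
XSD_DOUBLE = "<http://www.w3.org/2001/XMLSchema#double>"

# Hypothetical thesaurus: lower-cased synonym -> preferred term.
THESAURUS = {
    "alt": "alanine aminotransferase",
    "gpt": "alanine aminotransferase",
    "alanine transaminase": "alanine aminotransferase",
}


def harmonize(term, thesaurus=THESAURUS):
    """Map a raw label to its preferred term; unknown labels are only normalized."""
    key = term.strip().lower()
    return thesaurus.get(key, key)


def to_uri(local_name):
    """Build an N-Triples URI reference in the example namespace."""
    return f"<{BASE}{quote(local_name.replace(' ', '_'))}>"


def row_to_triples(row):
    """Map one result row (sample, analyte, fold change) to N-Triples lines."""
    analyte = harmonize(row["analyte"])
    obs = to_uri(f"obs/{row['sample']}/{analyte}")
    value = row["fold_change"]
    return [
        f"{obs} {RDF_TYPE} {to_uri('Observation')} .",
        f"{obs} {to_uri('ofSample')} {to_uri(row['sample'])} .",
        f"{obs} {to_uri('measures')} {to_uri(analyte)} .",
        f'{obs} {to_uri("foldChange")} "{value}"^^{XSD_DOUBLE} .',
    ]


ROWS = [
    {"sample": "S1", "analyte": "ALT", "fold_change": 2.4},
    {"sample": "S2", "analyte": "GPT", "fold_change": 1.8},
]
triples = [line for row in ROWS for line in row_to_triples(row)]
print("\n".join(triples))
```

Both rows end up pointing at the same analyte URI even though the source files used different labels, which is the purpose of harmonizing terms before mapping.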
Step 2: Use public ontologies – BioPortal example; merge applications ontology with parts of formal ontologies to utilize their structure (applied VoID, PROV-O and elements of TMO to informal applications ontology)
Ontology import and merging: building from parts of well-formed public ontologies to final merged application-specific ontology with common vocabularies
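The import-and-merge step can be sketched with class hierarchies held as (child, parent) subclass pairs. This is a minimal illustration under that assumption; the class names and the `extract_branch` helper are hypothetical, and a real merge would also carry labels, axioms, and provenance.

```python
def extract_branch(subclass_edges, root):
    """Return the (child, parent) edges of the subtree rooted at `root`."""
    children = {}
    for child, parent in subclass_edges:
        children.setdefault(parent, []).append(child)
    branch, stack, seen = set(), [root], {root}
    while stack:
        parent = stack.pop()
        for child in children.get(parent, []):
            branch.add((child, parent))
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return branch


# Hypothetical excerpt of a formal public ontology.
PUBLIC = {
    ("pub:LiverDisease", "pub:Disease"),
    ("pub:Hepatitis", "pub:LiverDisease"),
    ("pub:Steatosis", "pub:LiverDisease"),
    ("pub:KidneyDisease", "pub:Disease"),
}

# Hypothetical informal application ontology.
APP = {("app:HepatotoxicityFinding", "app:Finding")}

# Import only the needed branch, then anchor the application class under it.
merged = (
    APP
    | extract_branch(PUBLIC, "pub:LiverDisease")
    | {("app:HepatotoxicityFinding", "pub:LiverDisease")}
)
for child, parent in sorted(merged):
    print(f"{child} rdfs:subClassOf {parent}")
```

Importing a branch rather than the whole ontology keeps the merged application ontology small while still reusing the structure of the formal source.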
Explore common relationships for experimental observations between treatments
Perform iterative visual SPARQL queries with perturbation ranges for each putative marker to establish a model pattern
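A pattern query of this kind can be generated from a table of perturbation ranges. The sketch below assumes a hypothetical `ex:` vocabulary (ofSample, measures, foldChange) and made-up marker names and ranges; it only builds the SPARQL text and does not reproduce the visual query tool shown in the slides.

```python
# Hypothetical putative markers with allowed fold-change ranges (low, high).
MARKER_RANGES = {
    "ex:GeneA": (1.5, 4.0),
    "ex:MetaboliteB": (-3.0, -1.2),
}


def pattern_query(ranges):
    """Build a SPARQL query selecting samples that match every marker range."""
    lines = [
        "PREFIX ex: <http://example.org/kb/>",
        "SELECT DISTINCT ?sample WHERE {",
    ]
    for i, (marker, (low, high)) in enumerate(sorted(ranges.items())):
        lines.append(
            f"  ?obs{i} ex:ofSample ?sample ; ex:measures {marker} ; ex:foldChange ?fc{i} ."
        )
        lines.append(f"  FILTER(?fc{i} >= {low} && ?fc{i} <= {high})")
    lines.append("}")
    return "\n".join(lines)


def tighten(ranges, marker, low, high):
    """Return a copy of the ranges with one marker's range replaced."""
    updated = dict(ranges)
    updated[marker] = (low, high)
    return updated


query = pattern_query(MARKER_RANGES)
print(query)

# One refinement step: narrow a range and regenerate the query.
refined = pattern_query(tighten(MARKER_RANGES, "ex:GeneA", 2.0, 3.5))
```

Each iteration narrows or widens one range and reruns the query, so the set of matching samples converges on a model pattern.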
Enrichment via queries:
Public SPARQL endpoints: UniProt, GO, DrugBank, Diseasome, SIDER, Reactome, ChEMBL – import results to enrich the network.
Drill out to NCBI BioSystems and Gene – import results to enrich further
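Importing query results into the network amounts to turning result bindings into new edges. The following is a minimal sketch, assuming the SPARQL 1.1 Protocol (query sent as a GET parameter) and the SPARQL 1.1 JSON results format; the endpoint URL, the `ex:` predicate, and the sample response are hypothetical, and the request is only prepared, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_request(endpoint, query):
    """Prepare (but do not send) a SPARQL 1.1 Protocol GET request."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def bindings_to_edges(results_json, subject_var, object_var, predicate):
    """Convert SPARQL JSON result bindings into (subject, predicate, object) edges."""
    doc = json.loads(results_json)
    edges = []
    for binding in doc["results"]["bindings"]:
        # Variables from OPTIONAL patterns may be unbound; skip incomplete rows.
        if subject_var in binding and object_var in binding:
            edges.append(
                (binding[subject_var]["value"], predicate, binding[object_var]["value"])
            )
    return edges


QUERY = "SELECT ?gene ?pathway WHERE { ?gene <http://example.org/kb/inPathway> ?pathway }"
request = build_request("http://example.org/sparql", QUERY)  # hypothetical endpoint

# Hypothetical response, shaped like the SPARQL 1.1 JSON results format.
SAMPLE_RESPONSE = """
{"head": {"vars": ["gene", "pathway"]},
 "results": {"bindings": [
   {"gene": {"type": "uri", "value": "http://example.org/kb/GeneA"},
    "pathway": {"type": "uri", "value": "http://example.org/pathway/P1"}},
   {"gene": {"type": "uri", "value": "http://example.org/kb/GeneB"}}
 ]}}
"""
edges = bindings_to_edges(SAMPLE_RESPONSE, "gene", "pathway", "ex:inPathway")
print(edges)
```

With the knowledgebases hosted on the appliance, the same request would target the local endpoint instead of a public one, which removes the dependency on public endpoint uptime.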
Common toxicity markers across 2 compounds (genes and metabolites) and their involvement in biological systems of diseases
Common toxicity markers (genes and metabolites) and their involvement in biological systems of diseases: 2 different treatments, pulled apart for better visual exploration
Common toxicity markers (genes and metabolites) and their involvement in biological systems of diseases: explore relationships in DrugBank and Diseasome and add them selectively to the knowledge base
All genes impacted by toxicant
Pharmacogenomic correlations are not necessarily aligned with biological functions – using integrated semantic KBs makes it possible to qualify and validate biomarkers for their biological validity
All genes impacted by toxicant – web-based toxicity screening
Rapid MS-based sequencing for pathogen id
Mapping samples to pathogens to disease outbreak
Web-based screening for different microbial caused diseases