Presentation given at the Open PHACTS project symposium.
The slides give an overview of the data in version 2.0 of the Open PHACTS drug discovery platform and the challenges the Open PHACTS project faced in reaching this stage.
3. Dataset | Downloaded | Version | Licence | Triples
Bio Assay Ontology | - | - | CC-By | 10,360
CALOHA | 8 Apr 2015 | 2014-01-22 | CC-By-ND | 14,552
ChEBI | 4 Mar 2015 | 125 | CC-By-SA | 1,012,056
ChEMBL | 18 Feb 2015 | 20.0 | CC-By-SA | 445,732,880
ConceptWiki | 12 Dec 2013 | - | CC-By-SA | 4,331,760
DisGeNET | 31 Mar 2015 | 2.1.0 | ODbL | 15,011,136
Disease Ontology | - | 2015-05-21 | CC-By | 188,062
DrugBank | 19 Feb 2015 | 4.1 | Non-commercial | 4,028,767
ENZYME | - | 2015_11 | CC-By-ND | 61,467
FDA Adverse Events | 9 Jul 2012 | - | CC0 | 13,557,070
Total: ~3 billion triples
4. Dataset | Downloaded | Version | Licence | Triples
Gene Ontology | 4 Mar 2015 | - | CC-By | 1,366,494
Gene Ontology Annotations | 17 Feb 2015 | - | CC-By | 879,448,347
NCATS OPDDR | Nov 2015 | Oct 2015 | - | 2,643
neXTProt (NP) | 1 Feb 2014 | 1.0 | CC-By-ND | 215,006,108
OPS Chemical Registry | 4 Nov 2014 | - | CC-By-SA | 241,986,722
HMDB | - | 3.6 | HMDB | -
MeSH | - | 2015 | MeSH | -
PDB Ligands | - | 2 | PDB | -
OPS Metadata | - | - | CC-By-SA | 2,053
UniProt | - | 2015_11 | CC-By-ND | 1,131,186,434
WikiPathways | - | 20151118 | CC-By | 11,781,627
Total: ~3 billion triples
5. Data Licensing (Or Lack Of!)
John Wilbanks consulted for us
A framework built around STANDARD well-understood Creative Commons licences – and how they interoperate
Deal with the problems by:
Interoperable licences
Appropriate terms
Declare expectations to users and data publishers
One size won't fit all requirements
12. P12047
X31045
GB:29384
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
16. Scientific Lens
A lens defines a conceptual view over the data
Specifies operational equivalence conditions
Consists of:
Identifier (URI)
Title (dct:title)
Description (dct:description)
Documentation link (dcat:landingPage)
Creator (pav:createdBy)
Timestamp (pav:createdOn)
Equivalence rules (bdb:linksetJustification)
Lenses: 34 in total
7 Public
25 Chemistry
2 Gene
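The lens fields listed on this slide can be pictured as a small record. The following is only a minimal sketch, not the Open PHACTS implementation: the class name `Lens`, the method `allows`, the example URIs and the justification labels are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Lens:
    """Illustrative record mirroring the lens fields named on the slide."""
    uri: str                   # Identifier (URI)
    title: str                 # dct:title
    description: str           # dct:description
    landing_page: str          # dcat:landingPage
    created_by: str            # pav:createdBy
    created_on: datetime       # pav:createdOn
    justifications: frozenset  # bdb:linksetJustification values the lens enables

    def allows(self, justification: str) -> bool:
        """Return True if links with this justification are active under the lens."""
        return justification in self.justifications


# Hypothetical example values; none of these URIs or labels come from the slides.
structure_lens = Lens(
    uri="http://example.org/lens/structure",
    title="Structure Lens",
    description="Treats records as equivalent only when the structure matches.",
    landing_page="http://example.org/docs/lens/structure",
    created_by="http://example.org/people/curator",
    created_on=datetime(2015, 11, 1),
    justifications=frozenset({"http://example.org/justification/same-structure"}),
)
```

Keeping the equivalence rules as a set of enabled justifications means that switching lens changes which links are followed without touching the links themselves.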
22. Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
www.macs.hw.ac.uk/~ajg33/
@gray_alasdair
Open PHACTS
contact@openphacts.org
openphacts.org
@open_phacts
Editor's notes
Data provided by many publishers: some cover other sets, e.g. ChemSpider
Originally in many formats: relational, SD files and RDF
Worked closely with publishers getting them to publish
Raw RDF
Metadata descriptions of their data
Links between their data and others
API: Complex data interactions/relationships
Interactions needed to satisfy use cases
Gradually added additional types of data and interactions
Quantitative Data Challenges
No standard units
Even in curated sources!
Feedback issues to data providers
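One way to picture the units problem is a normalisation step that converts every reported concentration to a single unit and refuses anything it does not recognise. This is a minimal sketch under stated assumptions: the function `to_nanomolar`, the table `TO_NANOMOLAR` and the choice of nanomolar as the target unit are illustrative and not taken from the platform.

```python
# Conversion factors from common molar concentration units to nanomolar.
# The unit spellings accepted here are an assumption for illustration.
TO_NANOMOLAR = {
    "M": 1e9,
    "mM": 1e6,
    "uM": 1e3,
    "µM": 1e3,
    "nM": 1.0,
    "pM": 1e-3,
}


def to_nanomolar(value: float, unit: str) -> float:
    """Convert a concentration to nanomolar, rejecting unknown units."""
    try:
        factor = TO_NANOMOLAR[unit.strip()]
    except KeyError:
        # Unknown units are surfaced rather than guessed, so they can be
        # reported back to the data provider.
        raise ValueError(f"unrecognised unit: {unit!r}") from None
    return value * factor
```

Raising on an unknown unit, instead of passing the value through, keeps unconverted numbers out of the results and produces a list of issues to feed back to providers.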
Quality Assurance
Validation & Standardization Platform
Developed by Royal Society of Chemistry
http://bit.ly/NZF5VB
CRS Dataset Generation
Validate structure: Source data is messy!
Identify common problems:
Charge imbalance
Stereochemistry
Compute physicochemical properties
Identify related properties based on structure
17 relationship types
Example drug: Gleevec, a cancer drug for leukemia
Lookup in three popular public chemical databases: different results
Chemistry is complicated, often simplified for convenience
Data is messy!
Are these records the same? It depends on what you are doing with the data!
Each captures a subtly different view of the world
Structure Lens
Interested in physicochemical properties of Gleevec
Name Lens
Interested in biomedical and pharmacological properties
sameAs != sameAs: it depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into Linksets, each with a VoID header providing provenance and justification for its links.
Lens enables certain relationships and disables others
Alters links between the data
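The effect of a lens on links can be sketched as a filter over linksets followed by a walk over the links that remain. This is a minimal sketch, not the platform's identity mapping service: the classes `Link` and `Linkset`, the function `equivalents`, the identifiers, the publisher names and the justification labels are hypothetical, and treating enabled links as symmetric and transitive is an assumption made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str


@dataclass(frozen=True)
class Linkset:
    publisher: str      # provenance, as a VoID header would record
    justification: str  # reason the links in this set hold
    links: tuple


def equivalents(start, linksets, enabled_justifications):
    """Return all identifiers reachable from start via links the lens enables."""
    neighbours = {}
    for linkset in linksets:
        if linkset.justification not in enabled_justifications:
            continue  # the lens disables this whole linkset
        for link in linkset.links:
            neighbours.setdefault(link.source, set()).add(link.target)
            neighbours.setdefault(link.target, set()).add(link.source)
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first walk over the enabled links
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical data loosely modelled on the Gleevec example.
LINKSETS = (
    Linkset("example-publisher-a", "same-structure",
            (Link("ex:chembl/imatinib", "ex:drugbank/imatinib", "skos:exactMatch"),)),
    Linkset("example-publisher-b", "same-name",
            (Link("ex:drugbank/imatinib", "ex:chemspider/imatinib-mesylate",
                  "skos:closeMatch"),)),
)
STRUCTURE_LENS = {"same-structure"}
NAME_LENS = {"same-structure", "same-name"}
```

With this data, `equivalents("ex:chembl/imatinib", LINKSETS, STRUCTURE_LENS)` stops at the DrugBank record, while the same call with `NAME_LENS` also reaches the mesylate salt record: the same links, viewed through a different lens.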
Builds on OPS document: Checklist and guidance notes!
Covers a wider range of use cases
Large community buy-in – including EBI
Verifying data
Verifying linkages
Investigating unexpected answers
Not to be