Alice: "What version of ChEMBL are we using?"
Bob: "Er…let me check. It's going to take a while, I'll get back to you."
This simple question took us the best part of a month to resolve and involved several individuals. Knowing the provenance of your data is essential, especially when using large complex systems that process multiple datasets.
The issues underlying this simple question motivated us to improve the provenance data in the Open PHACTS project. We developed a guideline for dataset descriptions in which the metadata is carried with the data. In this talk I will highlight the challenges we faced and give an overview of our metadata guidelines.
Presentation given to the W3C Semantic Web for Health Care and Life Sciences Interest Group on 14 January 2013.
1. Dataset Descriptions in
Open PHACTS
Alasdair J G Gray
University of Manchester
W3C HCLS Call – 14 January 2013
www.openphacts.org/specs/datadesc/
Authors:
Christian Y. A. Brenninkmeijer, Chris Evelo, Carole Goble,
Alasdair J. G. Gray, Andra Waagmeester and
Egon L. Willighagen
2. Public Domain Drug Discovery Data:
Pharma are accessing, processing, storing & re-processing
• Sources: Literature, Patents, Genbank, PubChem, Databases, Downloads
• Data Integration, Data Analysis
• Firewalled Databases
• Repeated at each company
• Why?
3. The Innovative Medicines Initiative
• EC funded public-private partnership for pharmaceutical research
• Focus on key problems
– Efficacy, Safety, Education & Training, Knowledge Management
The Open PHACTS Project
• Create a semantic integration hub (“Open Pharmacological Space”)…
• Delivering services to support on-going drug discovery programs in pharma and public domain
• Not just another project; Leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements
• 13 academic partners, 9 pharmaceutical companies, 6 SMEs
• Work split into clusters:
– Technical Build (focus here)
– Scientific Drive
– Community & Sustainability
4. Architecture
• User Interfaces & Applications
• Linked Data API
• Linked Data Cache
• Identity Mapping Service
• Identity Resolution Service
• Domain Specific Services
• Data
6. ChemSpider
• ChemSpider aggregates data from
over 400 sources
• Central integration point for
chemicals in OPS
• OPS data covers
– ChEBI
– ChEMBL
– DrugBank
14 January 2013 OPS Dataset Descriptions – A. J. G. Gray 5
7. What version of ChEMBL?
~Jan 2012
• ChemSpider: EBI SDF file
– ChEMBL 13
• Data Cache: Chem2Bio2RDF ChEMBL RDF
– File downloaded May 2011
– Chem2Bio2RDF metadata webpages:
ChEMBL 8
– File: ChEMBL 2
• Mapping Server: Kasabi ChEMBL RDF file
– ChEMBL 12
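The mismatch on this slide can be made concrete with a small sketch. It records each version claim exactly as listed above and reports where the claims disagree. The function name and the tuple layout are illustrative assumptions, not part of the Open PHACTS platform.

```python
from collections import defaultdict

# (component, where the claim comes from, ChEMBL version), as listed on the slide.
reported = [
    ("ChemSpider", "EBI SDF file", 13),
    ("Data Cache", "Chem2Bio2RDF metadata webpages", 8),
    ("Data Cache", "Chem2Bio2RDF file", 2),
    ("Mapping Server", "Kasabi ChEMBL RDF file", 12),
]


def find_conflicts(claims):
    """Return components whose own sources disagree, and every version seen."""
    by_component = defaultdict(set)
    for component, _source, version in claims:
        by_component[component].add(version)
    # A component is internally inconsistent if it reports more than one version.
    internal = {c: sorted(v) for c, v in by_component.items() if len(v) > 1}
    all_versions = sorted({version for _, _, version in claims})
    return internal, all_versions


internal, all_versions = find_conflicts(reported)
print(internal)      # {'Data Cache': [2, 8]}
print(all_versions)  # [2, 8, 12, 13]
```

Three components report four different versions, and one component disagrees with itself.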
8. For the record
• OPS currently uses ChEMBL 13
– RDF generated from EBI database
dump
– Published at linkedchemistry.info
• Credit: Egon Willighagen
• Soon moving to ChEMBL 15
– RDF published by EBI
9. Challenges
• Datasets available
– In many versions over time
– In different formats
– From many mirrors/registries
• Files do not carry metadata
• Registries
– Can be out-of-date
– Can contain conflicting information
10. VoID: Vocabulary of Interlinked Datasets
• Describes RDF datasets
– W3C Note: http://www.w3.org/TR/void/
• Metadata carried with data
– Directly embedded or
linked (void:inDataset)
• Problems
– Very generic
– No checklist of requisite fields
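As a rough illustration of "metadata carried with data", the sketch below emits a minimal VoID description as N-Triples and links a data document back to it with void:inDataset. The VoID and Dublin Core namespaces are real; the example.org URIs, the title, and the helper names are hypothetical placeholders.

```python
# Real vocabulary namespaces; all example.org URIs below are made up.
VOID = "http://rdfs.org/ns/void#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triple(subject, predicate, obj):
    """Format one N-Triples statement; obj must already be a <uri> or "literal"."""
    return f"<{subject}> <{predicate}> {obj} ."


def describe(dataset_uri, title, document_uri):
    """Return a minimal description plus the link tying a data document to it."""
    return [
        triple(dataset_uri, RDF_TYPE, f"<{VOID}Dataset>"),
        # Assumes the title contains no characters that need escaping.
        triple(dataset_uri, DCTERMS + "title", f'"{title}"'),
        # void:inDataset points from a document to the dataset it belongs to.
        triple(document_uri, VOID + "inDataset", f"<{dataset_uri}>"),
    ]


lines = describe(
    "http://example.org/void#chembl",
    "ChEMBL RDF",
    "http://example.org/data/chembl.ttl",
)
print("\n".join(lines))
```

Because the last statement travels inside the data file, anyone holding the file can find its description without consulting a registry.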
11. Provenance Vocabularies
• Dublin Core Terms
– Widely used
– Terms too generic to give proper credit
• “Date: A point or period of time associated with
an event in the lifecycle of the resource.”
• PROV
– New W3C standard: www.w3.org/2011/prov
– Generic framework for exchanging provenance data
– Does not contain required predicates
12. PAV: Provenance, Authoring and
Versioning Vocabulary
http://code.google.com/p/pav-ontology/wiki/Homepage
• Easy to understand predicates
– http://purl.org/pav/
• Right level of granularity
– Distinguishes: author/creator/curator
– Captures source of data:
• import/derived/accessed
• version/previousVersion
• Being aligned with PROV-O
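The granularity listed above can be sketched as a small record that keeps author, creator and curator apart and states where the data was imported from. The predicate names (version, previousVersion, importedFrom, authoredBy, createdBy, curatedBy) are PAV terms under http://purl.org/pav/; the class, the example.org URIs and the choice of fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

PAV = "http://purl.org/pav/"


@dataclass
class Provenance:
    """Illustrative container for a handful of PAV properties."""
    version: str
    previous_version: Optional[str] = None  # URI of the earlier release
    imported_from: Optional[str] = None     # URI of the source that was imported
    authored_by: Optional[str] = None       # who produced the knowledge
    created_by: Optional[str] = None        # who produced this representation
    curated_by: Optional[str] = None        # who maintains it

    def statements(self, subject: str) -> List[str]:
        """Render the populated fields as N-Triples about `subject`."""
        def uri(value):
            return None if value is None else f"<{value}>"

        pairs = [
            ("version", f'"{self.version}"'),
            ("previousVersion", uri(self.previous_version)),
            ("importedFrom", uri(self.imported_from)),
            ("authoredBy", uri(self.authored_by)),
            ("createdBy", uri(self.created_by)),
            ("curatedBy", uri(self.curated_by)),
        ]
        return [
            f"<{subject}> <{PAV}{name}> {obj} ."
            for name, obj in pairs
            if obj is not None
        ]


# Hypothetical values: only the version number echoes the slides.
prov = Provenance(
    version="13",
    imported_from="http://example.org/source/chembl-13-dump",
    created_by="http://example.org/people/rdf-creator",
)
stmts = prov.statements("http://example.org/void#chembl")
print("\n".join(stmts))
```

Keeping createdBy separate from authoredBy lets the person who converted a database dump to RDF be credited without claiming authorship of the underlying data.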
13. Dataset Descriptions in the
Open Pharmacological Space
14. Related Work
• Registries: DataHub, MIRIAM
– Do not tie metadata with the data
– No checklist of attributes
• BioDBCore
– Checklist
• Similar information captured
• Includes point of contact information
– Not tied to the data
15. Realisation of Dataset
Descriptions
• Needs to be incorporated into data
publishing pipeline
• Hard for publishers to provide
conformant descriptions
– Datasets are complex
– Evolve over time
– Seen as yet another burden
• Validation tool provided
– http://openphacts.cs.man.ac.uk:9090/OPS-IMS/validate
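A checklist-style validation could look roughly like the sketch below. It is not the validation tool referenced above: the list of required predicates is a hypothetical stand-in for the guideline's checklist, and the description is modelled as a plain dictionary from prefixed predicate names to values.

```python
# Hypothetical checklist; the actual guideline defines its own required fields.
REQUIRED = (
    "dcterms:title",
    "dcterms:license",
    "pav:version",
    "pav:importedFrom",
)


def missing_fields(description, required=REQUIRED):
    """Return the required predicates that are absent or blank, in checklist order."""
    return [
        predicate
        for predicate in required
        if not str(description.get(predicate, "")).strip()
    ]


# Example description with one field absent and one left empty.
description = {
    "dcterms:title": "ChEMBL RDF",
    "pav:version": "",
    "pav:importedFrom": "http://example.org/source/chembl-13-dump",
}

problems = missing_fields(description)
print(problems)  # ['dcterms:license', 'pav:version']
```

Running such a check inside the publishing pipeline means a release cannot go out with its version left blank.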
16. Future Vision
• Provide rich and accurate
provenance trail of data
– Alignment with BioDBCore
• One standard to rule them all
– Automatic pipeline from VoID file to
registries
• Write once, use many times
17. Thank you
A.Gray@cs.man.ac.uk
www.cs.man.ac.uk/~graya/
www.openphacts.org
Editor's Notes
This is what convinced us that metadata needs to be carried in the data files
Specifies VoID and PAV predicates; MIM checklist
Open PHACTS: 28 partners: 9 pharmaceutical companies, 3 biotechs, 1 triplestore firm, 15 academic