This document describes the need for standardized metadata to describe datasets in the health and life sciences domain. It summarizes challenges with current practices, such as datasets having multiple versions and formats without consistent metadata. The document then introduces the Health Care and Life Sciences Community Profile for Dataset Descriptions, which defines a set of core and optional metadata properties from existing vocabularies to provide comprehensive, standardized descriptions of datasets. Implementing this profile will help address scientists' need for clear provenance about the data they use.
The HCLS Community Profile: Describing Datasets, Versions, and Distributions
1. The HCLS Community Profile: Describing Datasets, Versions, and Distributions
Alasdair J G Gray
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Michel Dumontier
Stanford University
M. Scott Marshall
MAASTRO Clinic
30/11/2016
6. Historic Use Case (~January 2012)
[Figure: Open PHACTS Discovery Platform architecture. Components shown: Data Cache (Triple Store); Semantic Workflow Engine; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Identity Resolution Service; Identifier Management Service; Core Platform. Example inputs: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Data source labels: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD, v13, v12, v2 or v8.]
Open PHACTS v2.1: ChEMBL 20 (http://tiny.cc/ops-datasets)
Which ChEMBL version?
7. Challenges
• Datasets available
– In many versions over time
– In different formats
– From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
– Can be out-of-date
– Can contain conflicting information
Scientists require data provenance!
8. Dublin Core Metadata Initiative
Widely used
Broadly applicable
– Documents
– Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties
“Date: A point or period of time associated with an event in the lifecycle of the resource.”
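To illustrate why such a generic definition is a problem for provenance, the following minimal Python sketch contrasts a record that only states dct:date with one that uses the more specific DCMI refinements dct:created and dct:issued. The record contents and the helper name lifecycle_events are hypothetical, invented purely for illustration.

```python
# Hypothetical dataset records, written as plain dicts of property -> value.
# The titles, dates, and helper below are illustrative assumptions only.

# dct:created, dct:issued and dct:modified are DCMI refinements of dct:date.
SPECIFIC_DATE_PROPERTIES = {"dct:created", "dct:issued", "dct:modified"}

generic_record = {"dct:title": "Example dataset", "dct:date": "2012-01-15"}
specific_record = {
    "dct:title": "Example dataset",
    "dct:created": "2011-12-01",
    "dct:issued": "2012-01-15",
}


def lifecycle_events(record):
    """Return the date properties in a record that name a specific lifecycle event."""
    return sorted(p for p in record if p in SPECIFIC_DATE_PROPERTIES)


print(lifecycle_events(generic_record))   # dct:date alone names no specific event
print(lifecycle_events(specific_record))
```

With only dct:date, a consumer cannot tell whether the date refers to creation, release, or a later modification, which is the gap a stricter profile is meant to close.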
13. Prescribed Usage
Element              Property          Value                                Summary Level   Version Level   Distribution Level
Core Metadata
Type declaration     rdf:type          dctypes:Dataset                      MUST            MUST            SHOULD
Type declaration     rdf:type          void:Dataset or dcat:Distribution    MUST NOT        MUST NOT        MUST
Title                dct:title         rdf:langString                       MUST            MUST            MUST
Alternative titles   dct:alternative   rdf:langString                       MAY             MAY             MAY
Description          dct:description   rdf:langString                       MUST            MUST            MUST
…                    …                 …                                    …               …               …
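As a rough illustration of how the prescribed usage could be checked mechanically, here is a minimal Python sketch that encodes only the rows shown in the table and reports missing or forbidden entries for a description at a given level. The function names, the dict-based description format, and the example descriptions are assumptions made for this sketch; it is not an official validator for the profile.

```python
# Requirement levels transcribed from the table rows shown on the slide.
# Everything else here (names, data layout, messages) is a hypothetical sketch.
PROPERTY_RULES = {
    "dct:title":       {"summary": "MUST", "version": "MUST", "distribution": "MUST"},
    "dct:alternative": {"summary": "MAY",  "version": "MAY",  "distribution": "MAY"},
    "dct:description": {"summary": "MUST", "version": "MUST", "distribution": "MUST"},
}

# Each rdf:type rule is satisfied by any one of the listed types.
TYPE_RULES = [
    (frozenset({"dctypes:Dataset"}),
     {"summary": "MUST", "version": "MUST", "distribution": "SHOULD"}),
    (frozenset({"void:Dataset", "dcat:Distribution"}),
     {"summary": "MUST NOT", "version": "MUST NOT", "distribution": "MUST"}),
]


def judge(requirement, present, label):
    """Turn one requirement into a finding, or None if it is satisfied."""
    if requirement == "MUST" and not present:
        return ("error", f"{label} is required")
    if requirement == "MUST NOT" and present:
        return ("error", f"{label} is not allowed")
    if requirement == "SHOULD" and not present:
        return ("warning", f"{label} is recommended")
    return None  # MAY, or a satisfied MUST / SHOULD / MUST NOT


def check(description, level):
    """Check a description (dict of property -> value) at one profile level."""
    types = set(description.get("rdf:type", []))
    candidates = [
        (rule[level], bool(types & allowed),
         "rdf:type " + " or ".join(sorted(allowed)))
        for allowed, rule in TYPE_RULES
    ]
    candidates += [
        (rule[level], prop in description, prop)
        for prop, rule in PROPERTY_RULES.items()
    ]
    findings = (judge(*c) for c in candidates)
    return [f for f in findings if f is not None]


# Hypothetical descriptions used only to exercise the checker.
summary = {"rdf:type": ["dctypes:Dataset"], "dct:title": "ChEMBL"}
distribution = {
    "rdf:type": ["dcat:Distribution"],
    "dct:title": "ChEMBL RDF dump",
    "dct:description": "Example distribution description",
}

print(check(summary, "summary"))
print(check(distribution, "distribution"))
```

In this sketch a missing MUST or a present MUST NOT is reported as an error, a missing SHOULD as a warning, and MAY properties are never reported.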
Large community buy-in
– 27 authors
– Major data providers: EBI, RIKEN, SIB
Weekly telcons, collaborative editing
2-3 year process
Wide range of use cases
Summary level: information that does not change over time, e.g. name, description, publisher
Version level: version-specific information, e.g. version number, creator
Distribution level: file-specific information, e.g. file location and format, number of triples
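The three levels can be pictured as three linked descriptions. The Python sketch below models them as simple triples and walks from a file back to its version and dataset, which is the kind of lookup needed to answer "Which ChEMBL version?". All URIs and values are invented, and the choice of linking properties (dct:isVersionOf, dcat:distribution) and of pav:version for the version number is an assumption of this sketch rather than something stated on the slides.

```python
# Hypothetical three-level description; every URI and value is invented.
SUMMARY = "http://example.org/dataset/chembl"
VERSION = "http://example.org/dataset/chembl/20"
DISTRIBUTION = "http://example.org/dataset/chembl/20/rdf"

TRIPLES = [
    # Summary level: information that stays the same across versions.
    (SUMMARY, "dct:title", "ChEMBL"),
    (SUMMARY, "dct:publisher", "http://example.org/org/publisher"),
    # Version level: information specific to one release.
    (VERSION, "dct:isVersionOf", SUMMARY),
    (VERSION, "pav:version", "20"),
    (VERSION, "dct:creator", "http://example.org/person/creator"),
    (VERSION, "dcat:distribution", DISTRIBUTION),
    # Distribution level: information about one concrete file.
    (DISTRIBUTION, "dcat:downloadURL", "http://example.org/files/chembl-20.ttl"),
    (DISTRIBUTION, "dct:format", "text/turtle"),
    (DISTRIBUTION, "void:triples", 1000),  # placeholder count, not a real figure
]


def objects(subject, prop):
    """All values of a property for a given subject."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def subjects(prop, obj):
    """All subjects that point at obj through the given property."""
    return [s for s, p, o in TRIPLES if p == prop and o == obj]


def provenance_of(distribution_uri):
    """Walk distribution -> version -> summary and collect key provenance."""
    version_uri = subjects("dcat:distribution", distribution_uri)[0]
    summary_uri = objects(version_uri, "dct:isVersionOf")[0]
    return {
        "dataset": objects(summary_uri, "dct:title")[0],
        "version": objects(version_uri, "pav:version")[0],
        "format": objects(distribution_uri, "dct:format")[0],
    }


print(provenance_of(DISTRIBUTION))
```

Keeping the levels separate means the title and publisher are stated once, while each new release or file format only adds a version or distribution description.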
Reuse vocabularies: DCTerms, DCAT, VoID, FOAF, …
Prescribed properties: MUST, SHOULD, MAY, MUST NOT for each level
61 properties from 18 vocabularies
Number of MUST/SHOULD properties minimised to those needed for interoperability
MAYs are recommended terms
21 properties
4 MUST
4 SHOULD
13 MAY
Acknowledge W3C HCLS IG