Model for capturing provenance of chemical assertions

A model for capturing provenance of assertions about chemical substances
Kody Moodley¹, Amrapali Zaveri¹, Chunlei Wu², Michel Dumontier¹
¹ Institute of Data Science @ Maastricht University
² Scripps Research Institute

Chemical data resources on the Web

Provenance for scientific assertions
“Acetaminophen inhibits Prostaglandin G/H synthase 1”
Publication
Report Extract
Who? When? Publication? Experiment? Evidence?
"...information about entities, activities, and people involved in producing a piece of data or thing,
which can be used to form assessments about its quality, reliability or trustworthiness" - W3C

Challenges
1. No guidelines on which provenance should be reported and extracted (which attributes are
compulsory, recommended and optional?)
2. Different terminologies (ontologies) are used to capture the same provenance information across
different databases
Affects data producers, curators & API developers

Related Work
Datasets:
● DAta Tag Suite (DATS)
● HCLS Dataset Descriptions
● Bioschemas.org
Assertions:
● SEPIO
● PROV-O
● Dublin Core
● PAV ontology

Limitations
Current provenance models:
● Do not define compulsory, recommended and optional provenance information
● Are “top-down”: propose new terminology for capturing provenance
Requirement level guidelines are needed to improve consistency and compliance in
provenance specification
“Bottom-up” approaches, which take into account how users currently specify provenance, can
help researchers standardize its specification

Our Approach
● Which provenance properties are being specified by data publishers?
○ Analyse provenance coverage and propose attributes to consider
● Which ontology terms do they use to capture these properties?
○ Propose ontology terms to use for defining provenance attributes
● Which properties are more frequently used?
○ Propose compulsory, recommended and optional attributes
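
The first and third questions above come down to counting how often each provenance property appears across records. A minimal sketch of such a count, assuming each record has already been reduced to the list of provenance properties it specifies (the sample records and the function name are hypothetical, not taken from the study):

```python
from collections import Counter


def property_frequencies(records):
    """Count in how many records each provenance property appears."""
    counts = Counter()
    for record in records:
        # A set ensures a property repeated within one record is counted once.
        counts.update(set(record))
    return counts


# Hypothetical input: one list of provenance property names per record.
records = [
    ["created", "contributor", "references"],
    ["created", "references"],
    ["created"],
]

freq = property_frequencies(records)
for prop, n in freq.most_common():
    print(f"{prop}: {n}/{len(records)} records")
```

Counting per record rather than per occurrence keeps a single verbose record from inflating a property's apparent popularity.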

Datasets
Analyzed provenance usage in:
● Nanopublications
● Wikidata
And provenance advocated in:
● Minimal Information Standards (MISTS) for reporting biomedical studies

Nanopublications
● Extracted 333 properties from 10,803,231 nanopubs (05/04/2018)
● Manually pruned to 37 unique provenance properties
● Most frequently used:
Property               Ontology
hasPublicationInfo     Nanopubs nschema
created (date)         DC terms
references             SIO
contributor (author)   DC terms

Wikidata
● All 157,522 records of type “Chemical Compound” (extracted 24/09/2018)
● Identified 76 unique relation / metadata types
● Pruned to 37 provenance-related elements
● Most frequently used:
Property                    Ontology
statedin (database)         Wikidata
retrieved (date extracted)  Wikidata
referenceurl (website)      Wikidata
publicationdate             Wikidata

MISTS
● 14 reporting guidelines for biomedical studies in FAIRsharing.org
● Read publications to identify 347 required metadata elements for studies
● Pruned to only provenance-related elements (44 in total)
● Examples:
○ ethics statement
○ apparatus
○ duration
○ location

Analysis
74 provenance properties naturally fell into 5 dimensions
[Figure: Coverage of provenance metadata types in Wikidata & Nanopubs]

Analysis
Partitioned 74 properties into 23 MUST, 23 SHOULD & 28 OPTIONAL by:
● Frequency of use in Nanopubs and Wikidata
● Advocacy in selected MISTS
● Relative importance in determining the veracity of an assertion (e.g. it is more important
to know the title of a publication than its chapter or page number)
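
One way the three criteria above could be combined into a requirement level is sketched below. The decision rule, the thresholds, and the sample inputs are assumptions made for illustration only; the slides do not state how the criteria were actually weighted.

```python
def requirement_level(usage_fraction, advocated_by_mists, high_importance,
                      must_threshold=0.5, should_threshold=0.1):
    """Assign MUST / SHOULD / OPTIONAL to a provenance property.

    usage_fraction: share of records (0..1) that specify the property.
    advocated_by_mists: True if a reporting guideline asks for it.
    high_importance: True if judged important for assessing veracity.
    The thresholds are hypothetical, not values from the study.
    """
    if high_importance and (usage_fraction >= must_threshold or advocated_by_mists):
        return "MUST"
    if usage_fraction >= should_threshold or advocated_by_mists or high_importance:
        return "SHOULD"
    return "OPTIONAL"


# Hypothetical inputs for a publication title and a page number.
print(requirement_level(0.8, True, True))     # publication title
print(requirement_level(0.02, False, False))  # page number
```

Keeping the thresholds as parameters makes it easy to see how sensitive the partition is to them.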

Results
https://github.com/MaastrichtU-IDS/biothingsprovenancemodel
{
  "@context" : "http://biothings.io/biothingspovenancemodel.jsonld",
  "id" : "assertion1",
  "type" : "sio:assertion",
  "label" : "Acetaminophen inhibits Prostaglandin G/H synthase 1",
  "subject" : "drugbank:DB00316",
  "predicate" : "sio:inhibits",
  "object" : "drugbankb:BE0000017",
  "assertedBy" : "Regina M. Botting",
  "coAssertedBy" : "Samir S. Ayoub",
  "assertedOn" : "2004/11/23",
  "publishedOn" : "2004/11/23",
  "wasInferredFrom" : "assertion2",
  "publishedIn" : "Prostaglandins, Leukotrienes and Essential Fatty Acids (PLEFA)",
  "publicationTitle" : "COX-3 and the mechanism of action of paracetamol/acetaminophen"
}
Examples in JSON-LD & RDF are provided in the GitHub repository, along with a
JSON-LD context file for mapping to ontology terms

Future Work
● Implement model on data sources in BioThings API - MyChem.info
● Evaluate the feasibility of the model on these data sources by:
○ Measuring how much compulsory provenance is actually specified in these sources
● Discuss possible revisions and uptake within the biomedical community
○ Choice of compulsory provenance may differ across subdomains in chemical substance
research
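
The coverage measurement proposed above could be sketched as follows: for each compulsory attribute, compute the fraction of records that give it a non-empty value. The attribute names reuse keys from the JSON-LD example, but treating them as the compulsory set is an assumption, as are the sample records.

```python
def compulsory_coverage(records, compulsory):
    """Return, per compulsory attribute, the fraction of records specifying it."""
    total = len(records)
    if total == 0:
        return {attr: 0.0 for attr in compulsory}
    coverage = {}
    for attr in compulsory:
        # Missing keys and empty values both count as "not specified".
        present = sum(1 for r in records if r.get(attr) not in (None, "", []))
        coverage[attr] = present / total
    return coverage


# Hypothetical compulsory set and records, for illustration only.
compulsory = ["assertedBy", "publishedOn", "publicationTitle"]
records = [
    {"assertedBy": "Regina M. Botting", "publishedOn": "2004/11/23",
     "publicationTitle": "COX-3 and the mechanism of action of paracetamol/acetaminophen"},
    {"assertedBy": "", "publishedOn": "2004/11/23"},
]

coverage = compulsory_coverage(records, compulsory)
for attr, fraction in coverage.items():
    print(f"{attr}: {fraction:.0%}")
```

Attributes with low coverage are candidates for the revision discussion, since a compulsory attribute that sources rarely provide may be hard to enforce.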

Thank you!
Questions?
https://github.com/MaastrichtU-IDS/biothingsprovenancemodel
kody.moodley@maastrichtuniversity.nl
@MoodleyKody
Funded by the Biomedical Data Translator program
https://ncats.nih.gov/translator
