Chemical substance resources on the Web are often made accessible to researchers through public APIs (Application Programming Interfaces). Provenance information is frequently missing when data is extracted and integrated through such APIs. Even when provenance is stated, it rarely follows a prescribed template or terminology. This creates a burden on data producers and makes it challenging for API developers to automatically extract and analyse this information. Downstream, this hinders efforts to automatically determine the veracity and quality of extracted data, which is critical for proving the integrity of associated research findings. In this paper, we propose a model for capturing provenance of assertions about chemical substances by systematically analysing three sources: (i) Nanopublications, (ii) Wikidata and (iii) selected Minimal Information Standards (MISTS) for reporting biomedical studies. We analyse the provenance terms used in these sources along with their frequency of use, and synthesize our findings into a preliminary model for capturing provenance.
Model for capturing provenance of chemical assertions
1. A model for capturing provenance of assertions about chemical substances
Kody Moodley¹, Amrapali Zaveri¹, Chunlei Wu², Michel Dumontier¹
¹ Institute of Data Science @ Maastricht University
² Scripps Research Institute
3. Provenance for scientific assertions
“Acetaminophen inhibits Prostaglandin G/H synthase 1”
[Figure: diagram with the labels Publication, Report and Extract]
Who? When? Publication? Experiment? Evidence?
"...information about entities, activities, and people involved in producing a piece of data or thing,
which can be used to form assessments about its quality, reliability or trustworthiness" - W3C
4. Challenges
1. No guidelines on which provenance should be reported and extracted (which attributes are
compulsory, recommended and optional?)
2. Different terminologies (ontologies) are used to capture the same provenance information across
different databases
Affects data producers, curators & API developers
5. Related Work
Datasets:
● DAta Tag Suite (DATS)
● HCLS Dataset Descriptions
● Bioschemas.org
Assertions:
● SEPIO
● PROV-O
● Dublin Core
● PAV ontology
6. Limitations
Current provenance models:
● Do not define compulsory, recommended and optional provenance information
● Are “top-down”: propose new terminology for capturing provenance
Requirement-level guidelines are needed to improve consistency and compliance in
provenance specification.
“Bottom-up” approaches, which take into account how users currently specify provenance,
can help standardize its specification.
7. Our Approach
● Which provenance properties are being specified by data publishers?
○ Analyse provenance coverage and propose attributes to consider
● Which ontology terms do they use to capture these properties?
○ Propose ontology terms to use for defining provenance attributes
● Which properties are more frequently used?
○ Propose compulsory, recommended and optional attributes
8. Datasets
Analyzed provenance usage in:
● Nanopublications
● Wikidata
And provenance advocated in:
● Minimal Information Standards (MISTS) for reporting biomedical studies
9. Nanopublications
● Extracted 333 properties from 10,803,231 nanopubs (05/04/2018)
● Manually pruned to 37 unique provenance properties
● Most frequently used:
Property | Ontology
hasPublicationInfo | Nanopubs nschema
created (date) | DC terms
references | SIO
contributor (author) | DC terms
10. Wikidata
● All 157,522 records of type “Chemical Compound” (extracted 24/09/2018)
● Identified 76 unique relation / metadata types
● Pruned to 37 provenance related elements
● Most frequently used:
Property | Ontology
statedin (database) | Wikidata
retrieved (date extracted) | Wikidata
referenceurl (website) | Wikidata
publicationdate | Wikidata
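A tally of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: it assumes entity records in the Wikidata JSON layout (claims, then statements, then references, then snaks), and the sample record, including the placeholder IDs Q_EXAMPLE and P_EXAMPLE, is made up. P248, P813, P854 and P577 are the Wikidata IDs of the four properties listed above.

```python
from collections import Counter

# Wikidata property IDs for the four reference properties listed above.
LABELS = {
    "P248": "stated in",
    "P813": "retrieved",
    "P854": "reference URL",
    "P577": "publication date",
}

def count_reference_properties(entities):
    """Count how many references use each property, across all statements."""
    counts = Counter()
    for entity in entities:
        for statements in entity.get("claims", {}).values():
            for statement in statements:
                for reference in statement.get("references", []):
                    # Each reference maps property IDs to lists of snaks.
                    counts.update(reference.get("snaks", {}).keys())
    return counts

# Hypothetical, heavily trimmed record (not real Wikidata content).
sample = [{
    "id": "Q_EXAMPLE",
    "claims": {
        "P_EXAMPLE": [
            {"references": [{"snaks": {"P248": [{}], "P813": [{}]}}]},
            {"references": [{"snaks": {"P248": [{}], "P854": [{}]}}]},
        ]
    },
}]

counts = count_reference_properties(sample)
for pid, n in counts.most_common():
    print(LABELS.get(pid, pid), n)
```

Counting per reference rather than per statement is a design choice of this sketch; a statement with several references contributes once for each of them.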
11. MISTS
● 14 reporting guidelines for biomedical studies in FAIRsharing.org
● Read publications to identify 347 required metadata elements for studies
● Pruned to only provenance related elements (44 in total)
● Examples:
○ ethics statement
○ apparatus
○ duration
○ location
13. Analysis
Partitioned 74 properties into 23 MUST, 23 SHOULD & 28 OPTIONAL by:
● Frequency of use in Nanopubs and Wikidata
● Advocacy in selected MISTS
● Relative importance in determining veracity of assertion (e.g. more important
to know title of publication than the chapter or page number)
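The first two criteria lend themselves to a mechanical rule, sketched below in Python. The thresholds, the usage fractions and the attribute name pageNumber are assumptions made purely for illustration; they are not values from the study, and the third criterion (relative importance) was a judgment call that this sketch does not attempt to model.

```python
def requirement_level(usage_fraction, advocated_by_mists,
                      must_threshold=0.5, should_threshold=0.1):
    """Assign MUST / SHOULD / OPTIONAL from usage frequency and MIST advocacy.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    if usage_fraction >= must_threshold:
        return "MUST"
    if usage_fraction >= should_threshold or advocated_by_mists:
        return "SHOULD"
    return "OPTIONAL"

# Hypothetical inputs: (fraction of records using the property, advocated in a MIST?)
observed = {
    "publicationTitle": (0.82, True),
    "location": (0.03, True),
    "pageNumber": (0.02, False),
}

levels = {
    name: requirement_level(fraction, advocated)
    for name, (fraction, advocated) in observed.items()
}
print(levels)
```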
14. Results https://github.com/MaastrichtU-IDS/biothingsprovenancemodel
{
  "@context" : "http://biothings.io/biothingspovenancemodel.jsonld",
  "id" : "assertion1",
  "type" : "sio:assertion",
  "label" : "Acetaminophen inhibits Prostaglandin G/H synthase 1",
  "subject" : "drugbank:DB00316",
  "predicate" : "sio:inhibits",
  "object" : "drugbankb:BE0000017",
  "assertedBy" : "Regina M. Botting",
  "coAssertedBy" : "Samir S. Ayoub",
  "assertedOn" : "2004/11/23",
  "publishedOn" : "2004/11/23",
  "wasInferredFrom" : "assertion2",
  "publishedIn" : "Prostaglandins, Leukotrienes and Essential Fatty Acids (PLEFA)",
  "publicationTitle" : "COX-3 and the mechanism of action of paracetamol/acetaminophen"
}
Examples are available in JSON-LD and RDF.
A JSON-LD context file for mapping to ontology terms is also provided on GitHub.
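A record in this shape can be checked for missing compulsory attributes with a few lines of Python. The sketch below is not part of the published model: the set ASSUMED_MUST is a hypothetical stand-in for the actual MUST list, which is maintained in the project's repository, and the record is a trimmed copy of the example above with assertedOn deliberately left out.

```python
import json

# Hypothetical MUST attributes, for illustration only.
ASSUMED_MUST = {"assertedBy", "assertedOn", "publicationTitle"}

record_text = """
{
  "id": "assertion1",
  "label": "Acetaminophen inhibits Prostaglandin G/H synthase 1",
  "assertedBy": "Regina M. Botting",
  "publishedIn": "Prostaglandins, Leukotrienes and Essential Fatty Acids (PLEFA)",
  "publicationTitle": "COX-3 and the mechanism of action of paracetamol/acetaminophen"
}
"""

def missing_attributes(record, required):
    """Return the required attributes that are absent or empty, sorted by name."""
    return sorted(key for key in required if not record.get(key))

record = json.loads(record_text)
missing = missing_attributes(record, ASSUMED_MUST)
print(missing)
```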
15. Future Work
● Implement model on data sources in BioThings API - MyChem.info
● Evaluate the feasibility of the model on these data sources by:
○ Measuring how much compulsory provenance is actually specified in these sources
● Discuss possible revisions and uptake within the biomedical community
○ Choice of compulsory provenance may differ across subdomains in chemical substance
research
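One way to measure how much compulsory provenance a source actually specifies is a simple coverage ratio, sketched here in Python. The attribute list and the records are hypothetical placeholders rather than MyChem.info data, and the metric itself is an assumption about how such an evaluation might be set up.

```python
def must_coverage(records, required):
    """Fraction of required attribute slots that are filled across all records."""
    if not records or not required:
        return 0.0
    filled = sum(
        1 for record in records for key in required if record.get(key)
    )
    return filled / (len(records) * len(required))

# Hypothetical compulsory attributes and made-up records.
ASSUMED_MUST = ["assertedBy", "publishedOn"]
records = [
    {"assertedBy": "A. Author", "publishedOn": "2004/11/23"},
    {"assertedBy": "B. Author"},
    {},
    {"assertedBy": "", "publishedOn": "2010/01/01"},
]

coverage = must_coverage(records, ASSUMED_MUST)
print(f"{coverage:.2f}")
```

Empty strings are treated as unspecified here; a stricter evaluation might also validate the format of each value.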