In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when analyses are executed over live SPARQL endpoints, the answers differ depending on when the notebook was run. In this paper, we identify some of the issues discovered while developing a reproducible analysis over a collection of biomedical data sources and suggest some best practices to overcome them.
Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources
1. Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
www.macs.hw.ac.uk/~ajg33
@gray_alasdair
4. Aim
Use a computational notebook to:
1. Perform an analysis over Semantic Web resources
– Reproduce an analysis performed through a website
– Exploit the recent Guide to Pharmacology RDF data publication and other Linked Open Data endpoints
2. Publish the analysis for ease of reproducibility
3. Embed semantics into the notebook
9 October 2018 www.macs.hw.ac.uk/SWeL – @hw_swel 4
5. Pharmacology Analysis to Reproduce
• Using PubChem
• Compound count in several datasets
– ChEBI
– ChEMBL
– DrugBank
– GtP
• Intersection of compounds across datasets
– Results reproduced
15 March 2018
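The intersection step can be sketched as a set operation over InChI Keys. This is a minimal illustration, not the notebook's actual code. It assumes that the representation differences mentioned in the notes are formatting variants (an `InChIKey=` prefix, letter case, stray whitespace), which the source does not confirm. The function names and the toy keys are hypothetical placeholders.

```python
from __future__ import annotations

from itertools import combinations


def normalise_inchikey(key: str) -> str:
    """Trim whitespace, drop an optional 'InChIKey=' prefix, and upper-case.

    Assumption: these are the only formatting differences between datasets.
    """
    key = key.strip()
    prefix = "inchikey="
    if key.lower().startswith(prefix):
        key = key[len(prefix):]
    return key.upper()


def pairwise_overlap(datasets: dict[str, set[str]]) -> dict[tuple[str, str], int]:
    """Count compounds shared by each pair of datasets, after normalising keys."""
    cleaned = {
        name: {normalise_inchikey(k) for k in keys}
        for name, keys in datasets.items()
    }
    return {
        (a, b): len(cleaned[a] & cleaned[b])
        for a, b in combinations(sorted(cleaned), 2)
    }


# Placeholder keys for illustration only; these are not real compounds.
toy = {
    "ChEBI": {"InChIKey=AAAAAAAAAAAAAA-AAAAAAAAAA-N", "BBBBBBBBBBBBBB-BBBBBBBBBB-N"},
    "DrugBank": {"aaaaaaaaaaaaaa-aaaaaaaaaa-n"},
    "GtP": {" AAAAAAAAAAAAAA-AAAAAAAAAA-N", "BBBBBBBBBBBBBB-BBBBBBBBBB-N"},
}
overlap = pairwise_overlap(toy)
for pair, shared in overlap.items():
    print(pair, shared)
```

Normalising before intersecting matters because two datasets that write the same key differently would otherwise appear to share nothing.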
8. Jupyter Notebook Experience
• Easy to interlace explanation and code
• Writing style:
– Papers tend to be formal
– Code explanation informal
• How to represent results at time of writing vs live results?
– Used static table
• Embed myBinder link
• No referencing support (out of the box)
– cite2c plugin: https://github.com/takluyver/cite2c
• No standard metadata
– Metadata not displayed
– No markup, e.g. ORCID
• Couldn’t include environment details
• Generating HTML using print dialogue
– LaTeX generation didn’t work
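One possible way to record environment details is to have a notebook cell print them when it runs. The sketch below uses only the Python standard library. The package names passed to `describe_environment` are assumptions for illustration, since the slides do not say which libraries the notebook used, and the function itself is hypothetical.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def describe_environment(packages=("notebook", "SPARQLWrapper", "pandas")):
    """Collect interpreter, platform, and package versions at execution time.

    The default package names are illustrative assumptions; a package that
    is not installed is reported as None rather than raising an error.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


env = describe_environment()
print(json.dumps(env, indent=2))
```

Because the cell output is saved with the notebook, the published HTML would then show the environment that produced the results.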
9. Conclusions
Use a computational notebook to:
1. Perform an analysis over Semantic Web resources
– Reproduce an analysis performed through a website
– Exploit the recent Guide to Pharmacology RDF data publication and other Linked Open Data endpoints
2. Publish the analysis for ease of reproducibility
– https://mybinder.org/v2/gh/AlasdairGray/SemSci2018/master?filepath=SemSci2018%20Publication.ipynb
3. Embed semantics into the notebook
Editor's notes
Nature survey of over 1,500 scientists, published May 2016
Includes nice video
Literate programming: combines narrative with computation
Click screenshot to launch HTML version of the Notebook
Click on Binder link to launch executable version
Have a local backup in case of network issues
Each execution captures a point in time
Datasets are constantly evolving
Intersection: differences in InChI Key representation
Datasets too large to load all data into Notebook (standard configuration)
Federated queries timed out
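Since each execution captures a point in time, one option is to store the counts together with a timestamp and compare a later run against that record. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the counts are made-up numbers rather than results from the analysis.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot(counts, executed_at=None):
    """Bundle per-dataset counts with the execution time and a content hash."""
    executed_at = executed_at or datetime.now(timezone.utc)
    payload = json.dumps(counts, sort_keys=True)
    return {
        "executed_at": executed_at.isoformat(),
        "counts": dict(counts),
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }


def changed_datasets(recorded, live):
    """List the datasets whose counts differ between two snapshots."""
    names = set(recorded["counts"]) | set(live["counts"])
    return sorted(
        name
        for name in names
        if recorded["counts"].get(name) != live["counts"].get(name)
    )


# Made-up counts for illustration only.
at_writing = snapshot({"ChEBI": 100, "ChEMBL": 200, "DrugBank": 50})
live_run = snapshot({"ChEBI": 100, "ChEMBL": 205, "DrugBank": 50})
print(changed_datasets(at_writing, live_run))
```

A static table in the notebook can then present the recorded snapshot, while a live cell reports which datasets have changed since it was taken.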