Slides for a talk given at the BioScience Seminar at the Dept. of Pharmaceutical Biosciences, Uppsala University, on December 16, 2016.
The event webpage: http://www.farmbio.uu.se/calendar/kalendarium-detaljsida/?eventId=22496
Structure of the talk:
Reproducibility in Scientific Data Analysis ...
● What is it?
● Why is it important?
● Why is it a problem?
● What can we do about it?
● What does pharmb.io do about it?
Reproducibility in Scientific Data Analysis - BioScience Seminar
1. Reproducibility
in Scientific Data Analysis
Samuel Lampa @smllmp
PhD Student
Pharmaceutical Bioinformatics at pharmb.io
with Assoc. Prof. Ola Spjuth @ola_spjuth
@ Dept. of Pharm. Biosci. / Uppsala University
Farmbio BioScience Seminar – Dec 16 2016
3. Structure of this talk
Reproducibility in Scientific Data Analysis …
● What is it?
● Why is it important?
● Why is it a problem?
● What can we do about it?
● What does pharmb.io do about it?
7. Why is it important?
“it” = reproducibility in scientific data analysis
8. Why is it important?
● More and more data generation automated
→ More and more focus on data analysis
● Culture of replicability not (yet) as established
in computational as in classical disciplines
● “it is the only thing that an investigator can
guarantee about a study”
simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important
9. Why is it a problem?
“it” = reproducibility in scientific data analysis
11. Why is it a problem?
● Complexity of computing environment
– Software versions, Data versions ...
● More black box components
● Assumptions on computing
environment often left out
● Manual steps often left out
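One small way to avoid leaving out assumptions about the computing environment is to record it next to the results. A minimal sketch in Python, standard library only; the function name `environment_snapshot`, the chosen fields, and the example package `numpy` are assumptions for illustration, not part of the talk:

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages=()):
    """Collect interpreter, OS, and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of silently skipping it.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" is only an example; list whatever the analysis actually imports.
    print(json.dumps(environment_snapshot(["numpy"]), indent=2, sort_keys=True))
```

Saving this output together with each result file makes the software versions part of the record rather than something to be remembered later.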
12. What can we do about it?
“it” = reproducibility in scientific data analysis
13. What can we do about it?
Utopia: Infrastructure for all data and
computations to be inspected and re-run
with other data and parameters by anyone
But: We can’t wait for that
In the meantime: Even small steps towards
reproducibility will help. Start today!
14. General themes
Know exactly what data and results mean
Know exactly how results were obtained
Be able to get same result independently
15. More concretely ...
Know exactly what data and results mean
– Open standards, Ontologies, Data formats
Know exactly how results were obtained
– Keeping track of manual steps, parameters, versions of
software and data ...
– Version control
– Automation (scripts)
Be able to get same result independently
– code, data, and scripts … make it all available!
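Keeping track of parameters, data versions and manual steps can start as a small "sidecar" file written next to each result. A minimal sketch, assuming a JSON file is an acceptable record; the names `sha256_of` and `write_provenance` and the `.provenance.json` suffix are hypothetical choices for illustration:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_provenance(result_path, input_paths, parameters, manual_steps=()):
    """Write a JSON record describing how a result file was obtained."""
    record = {
        "result": str(result_path),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        # A checksum identifies the exact data version that was used.
        "inputs": {str(p): sha256_of(p) for p in input_paths},
        "parameters": parameters,
        # Manual steps are free text; the point is that they are written down.
        "manual_steps": list(manual_steps),
    }
    sidecar = Path(str(result_path) + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar
```

Putting the sidecar files under version control together with the scripts covers the remaining points on the slide: what was run, on which data, with which settings.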
16. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten Simple Rules for
Reproducible Computational Research. PLoS Comput Biol.
2013;9(10):1-4. dx.doi.org/10.1371/journal.pcbi.1003285
17. FAIR Principles
for data and metadata
F - Findable
A - Accessible
I - Interoperable
R - Reusable
Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al.
The FAIR Guiding Principles for scientific data management and
stewardship. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18.
18. What does pharmb.io do about it?
“it” = reproducibility in scientific data analysis
19. What does pharmb.io do about it?
● Open data, open source, open standards
Promoting and using as much as possible
● BioImg.org
Store Virtual Machines & Containers
● Semantic Data Technologies
Machine readability - Avoiding ambiguity
● Re-runnable computational experiments
Via workflows, containers, infrastructure as code
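The idea of a re-runnable step can be sketched without any workflow tool: a step names its command and its output, and is skipped when the output already exists. This is only an illustration under those assumptions, not the workflow system used at pharmb.io; the function name `run_step` and the `.tmp` suffix are hypothetical:

```python
import os
import subprocess
import sys
from pathlib import Path


def run_step(command, output_path):
    """Run `command` (an argument list) and store its stdout in `output_path`.

    Returns True if the command was run, False if the output already existed.
    """
    output_path = Path(output_path)
    if output_path.exists():
        return False
    # Write to a temporary name first, so that a failed or interrupted run
    # never leaves a half-written file under the final name.
    tmp_path = output_path.with_name(output_path.name + ".tmp")
    with open(tmp_path, "wb") as out:
        subprocess.run(command, stdout=out, check=True)
    os.replace(tmp_path, output_path)
    return True


if __name__ == "__main__":
    # Example step: the current Python interpreter prints one line.
    ran = run_step([sys.executable, "-c", "print('hello')"], "hello.txt")
    print("ran" if ran else "skipped, output already present")
```

Because every step is a plain function call with explicit inputs and outputs, the whole analysis can be re-run from a single script, which is the property that dedicated workflow tools and containers provide at larger scale.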
20. O’Boyle NM, Guha R, Willighagen EL, et al.
Open Data, Open Source and Open Standards in chemistry: The
Blue Obelisk five years on. J Cheminform. 2011;3(10):1-16.
doi:10.1186/1758-2946-3-37
21. BioImg.org
Dahlö M, Haziza F, Kallio A, Korpelainen E, Bongcam-Rudloff E, Spjuth O.
BioImg.org: A catalog of virtual machine images for the life sciences.
Bioinform Biol Insights. 2015;9:125-128. doi:10.4137/BBI.S28636.
Martin Dahlö
22. Semantic Data Technologies
Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R,
Spjuth O. RDFIO: Extending Semantic MediaWiki for interoperable
biomedical data management. J Biomed Sem. Submitted.
25. Lampa S, Alvarsson J, Spjuth O. Towards agile large-scale predictive modelling in
drug discovery with flow-based programming design principles.
J Cheminform. 2016;8(1):67. doi:10.1186/s13321-016-0179-6.