Talk given at CHOP bioinformatics retreat where I describe the technical and cultural ingredients needed to foster a reproducible research culture in a bioinformatics core
13. Why Snakemake?
Addresses Makefile weaknesses:
Difficult to implement control flow
No cluster support
Inflexible wildcards
Too much reliance on sentinel files
No reporting mechanism
Keeps the good stuff:
Implicit dependency resolution
Johannes Köster
18. Acknowledgements
BiG
Deanne Taylor
Batsal Devkota
Noor Dawany
Perry Evans
Juan Perin
Pichai Raman
Ariella Sasson
Hongbo Xie
Zhe Zhang
TIU
Jeff Pennington
Byron Ruth
Kevin Murphy
DBHi
Bryan Wolf
Mark Porter
Editor's notes
---
---
Slide 2
Because I'm giving a talk on reproducible research, I am legally obligated to open with the cautionary tale of the Anil Potti scandal. This is the Duke researcher who, from 2006 to 2009, used microarray expression data to predict the sensitivity of clinical tumor samples to various chemotherapy treatments.
As early as 2007 Keith Baggerly and Kevin Coombes from MD Anderson tried to reproduce the initial analyses and obtained radically different results.
Lacking much of the code, they had to reverse-engineer what they believed had occurred. They discovered a number of off-by-one errors, one of which was induced by a program that didn’t want a header row; genes missing from the test set that were in the training set; genes that appeared in a later email from the authors but were not even in the probe set; and a huge number of duplicated samples, some of which were labeled both sensitive and resistant.
Despite years of efforts by Baggerly and Coombes to have the authors come clean with a working analysis this went all the way to clinical trials, which were halted only when it was discovered Anil Potti lied on his CV and was not in fact a Rhodes Scholar.
I really encourage everyone to view the presentation by Keith Baggerly on Youtube - “The Importance of Reproducible Research in High-Throughput Biology”
The real point of the story is that the manipulation of data might have begun with a cell shift and off-by-one errors: stupid mistakes that anyone can make, but that are virtually impossible to detect unless authors submit a reproducible workflow with their paper and allow reviewers to run it on a novel dataset.
While the later withholding of data was likely a sign of active obfuscation, I suspect many of the initial errors in these papers were due to stupid mistakes.
So for me preventing the outright falsification of data is not even in my top reasons for reproducible research. If someone is determined to lie they're going to find a way to do it.
The thing is, Duke fired Anil Potti, but they didn’t fire Microsoft Excel. Excel still has tenure at Duke for all I know. And that’s a shame, because Excel was likely a partner in this crime.
---
Slide 3
Reproducible research also tends to get conflated with a bunch of hot topics: open access, open data, software carpentry, and a bunch of other stuff people don't want to do. This is what I call the “reproducible research guilt trip”. In the big bad mean world, the relationship between journals, funding institutions, and reviewers is essentially adversarial.
Inside a group like ours there are a lot more incentives and ways of enforcing reproducibility, and if an investigator wants to say publish an open data set and a fully reproducible analysis it should be possible. If not, that’s fine, these things still benefit us.
I tend to blur the lines between good practice, automation, and reproducibility. I consider this a branch of software or process engineering rather than ethics. It goes hand in hand with optimizing our practices and becoming a more efficient and productive group, and with producing analyses that will live up to the increased scrutiny coming from the journals. So this is not feel-good reproducibility, and it is not just for the benefit of the people we work with.
So I want to talk about our values and our practices, and our habits.
If I were a biologist, I wouldn't work with a core that didn't have reproducibility as a standard. That might be because I've seen how the sausage is made, but I think there are sound reasons why this should be a guiding principle for how we do work.
---
Slide 4
The whys of reproducible research, other than “umm, this is science”, are for me:
Sanity – being able to reliably derive results from raw data, because if I can’t do that then I don’t have a leg to stand on
Reuse – if others or my future self want to reproduce an analysis from the start, that should be possible
Redundancy – if someone gets hit by a bus, the rest of us can pick up their work
Evaluation – we have a group with a lot of different strengths and weaknesses: software development, statistics, systems biology, sequencing, and disease domains. If we don’t have a codebase that is shared, open, and reproducible, I really don’t see any reason to have a group at all. We might as well just be fully embedded analysts. I’m sure there are some people here who feel this would be a better arrangement, and I can respect that viewpoint, but I think there is real synergistic benefit to having a group of analysts rather than eight scattered throughout an organization.
---
Slide 5
I'm here today to speak about two very domestic areas of interest to me and where I think DBHi can be a standard bearer and innovator.
First is the marriage, or “bondage”, between code, data and tracking metadata (which is called data provenance), and results (or what are sometimes called deliverables): so version control, reproducible reports, MyBiC, tools that are more or less in place.
My other interest is in accelerating what I call the “edit cycle”: the spin cycle that occurs when you present results to an investigator and they want to tweak parameters and redo everything. The challenge is keeping the work in a reproducible context while still allowing people to explore their own data.
---
Slide 6
So a couple years ago there was this paper outlining the ten simple rules for reproducible computational research.
And these are great rules I can’t argue with any of them:
Rule 1: For Every Result, Keep Track of How It Was Produced
Rule 2: Avoid Manual Data Manipulation Steps
Rule 3: Archive the Exact Versions of All External Programs Used
Rule 4: Version Control All Custom Scripts
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
Etc., etc. Ten rules is actually a bit much to take in at once, but I think we can melt them down to one principal component, which is…
---
Slide 7
You wouldn’t think you’d need to say that.
But a lot of people come from life science disciplines with very strict traditions of keeping lab notebooks, and when they come to bioinformatics they just say, “well, I’ll do whatever the hell I want.”
This is quite dangerous, because the less experienced you are as a programmer, the more prone you are to making stupid mistakes: you write repetitive code, and you don’t write sanity checks or test cases.
So where do we write “stuff” down?
---
Slide 8
So where do we write “stuff” down? In version control. We use Git for version control and we’re lucky to have a 40 seat Github enterprise instance behind the firewall.
If your code is not in Github it doesn’t exist.
You’ve probably used something akin to automatic version control in Google Docs…
---
Slide 9
The difference with Git is that the changes are explicit: manually denoted by commits, to which you attach a message saying what the significance of the change was.
Repositories are distributed, so you can work on something without a constant internet connection to a central server.
Commits can be organized into branches. This is a branching pattern we use in software development where you have a master branch of major releases, a development branch, and then feature branches that become more experimental as you move left here.
What distinguishes Git from older version control systems is the ability to do very graceful merges. Person A and person B can both make changes and then merge those changes into a common branch; if the changes are mutually exclusive, the merge is transparent. If there are conflicts, if two people modify the exact same line of code, Git will flag it and ask you to decide how to proceed.
It’s not unusual for projects to have 20 or 30 branches that come and go – one for each feature and one for each bug.
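To make the branch-and-merge workflow concrete, here is a throwaway sketch at the command line. Everything here is hypothetical (a scratch repository in a temp directory, made-up branch and file names); it just shows a feature branch being created, committed to, and merged back cleanly.

```shell
# Scratch repository in a temp directory (hypothetical names throughout).
cd "$(mktemp -d)"
git -c init.defaultBranch=main init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Do some work on a feature branch.
git checkout -q -b feature/heatmap
echo "plot_heatmap()" > analysis.R
git add analysis.R
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add heatmap script"

# Merge the feature back into main. Because nothing on main touched
# the same lines, the merge goes through without any intervention.
git checkout -q main
git merge -q feature/heatmap
cat analysis.R
```

After the merge, the feature branch can be deleted; this create-merge-delete rhythm is why a project can see dozens of branches come and go.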
---
Slide 10
This is a non-normalized heatmap of Git commits on our enterprise GitHub from members of the bioinformatics group. What I like most is seeing who is committing stuff on Saturdays and Sundays. Typically it's people without kids.
I want to discuss this because reproducible research culture has to come from the top down.
The previous director gave us zero support for this initiative, for him enforcing any software development standards was just a brake on the wheel. He said “You don't want to get bogged down in process.”
The ball didn’t get rolling until Mark Porter stepped in as interim director. He held people’s feet to the fire and said, hey, you need to push your damn code.
Despite being thrust unwillingly into the director position, he actually wound up being one of the best directors we ever had.
And pretty soon cores won't be hiring programmers and analysts who don't have a body of work on GitHub. This does factor into hiring. That's just where it's headed.
We had one applicant who, for whatever reason, apparently just to acquiesce to a journal, put her code in the README of the repository, the section where you describe your repository. And the code was awful.
So for me this is fundamental. If you value reproducibility it starts here.
I'd like to hear your thoughts if you would like to join me at the roundtable.
---
Slide 11
So it’s not enough that we just push reproducible code if it’s some inscrutable black box. We need something that is both reproducible and literate.
That’s where Sweave comes in. Sweave is actually pronounced S-weave (S is the predecessor to R).
In Sweave you wrap your R code chunks in tags, and the code is embedded, or “woven”, into LaTeX markup that describes what the code is doing; we use this to produce PDF reports. This is generally the last step in an analysis pipeline, but the one that involves the vast majority of tweaks and edits and the actual analysis and statistics.
So when people ask why Excel is such a pariah in the bioinformatics community, why no self-respecting analyst would use it: is it because it mangles files or turns gene names into dates? No. They could fix all that and it would still suck. The reason Excel is not a tool we use for scientific research is that what you do in it is not reproducible, it's not automated, and it's not literate.
Jim has really run with the report concept and made great use of it both for NGS and microarray expression reports that he has developed with Deborah Watson.
In the R community Sweave has been supplanted by a package called knitr which has a lot more features in terms of caching and variable execution of chunks as well as support for Markdown so you can easily produce web reports.
I’m still somewhat partial to PDFs because they have a beginning and an end and you can print them out and you can time stamp them and you can stamp them with git commit hashes.
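As a concrete sketch of what “literate” means here, a Sweave document interleaves prose with R chunks delimited by `<<…>>=` and `@`. This fragment is hypothetical (the model and data frame names are made up), but it shows the shape:

```
\section{Differential expression}
We fit a linear model per gene and report the treatment coefficients.

<<top-genes, echo=TRUE>>=
fit <- lm(expression ~ treatment, data = samples)
summary(fit)$coefficients
@
```

When the document is woven, the chunk's code and its output are rendered inline with the surrounding LaTeX prose, so the report and the analysis cannot drift apart.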
---
Slide 12
This Git hash looks pretty hairy, but it actually provides a hook by which we can really hold onto provenance.
So we can include metadata we get from the LIMS, the alignments and variant calls from CBMi-Seq, and everything downstream from there.
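A minimal sketch of what stamping a deliverable with its commit hash might look like. This is illustrative, not our actual pipeline code: the report text is hypothetical, and the fallback to "unknown" outside a repository is my own assumption.

```python
import subprocess


def current_commit() -> str:
    """Return the current Git commit hash, or 'unknown' outside a repo."""
    try:
        return subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def stamp(report: str, commit: str) -> str:
    """Append a provenance footer so the report is traceable to the code."""
    return f"{report}\n---\nGenerated from commit {commit}\n"


print(stamp("Top differentially expressed genes: ...", current_commit()))
```

The same idea extends upstream: the LIMS metadata and pipeline versions can ride along in the same footer, so any PDF can be traced back to exactly the code and inputs that produced it.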
---
Slide 13
So what’s the glue that brings us from raw data to Sweave reports?
For a long time it was Make. Make is a build system designed to compile C programs, but it has been really useful for bioinformatics because it provides a syntax for describing how to convert one type of file to another based on filename suffixes.
That is 95% of all analytical pipelines.
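That suffix-to-suffix idea can be sketched in a couple of Make pattern rules. The file names and commands here are hypothetical, just to show the shape: any `.fastq` can become a `.bam`, and any `.bam` can become a `.vcf`.

```
# Align reads: any sample.fastq can become sample.bam
%.bam: %.fastq
	bwa mem ref.fa $< | samtools sort -o $@ -

# Call variants: any sample.bam can become sample.vcf
%.vcf: %.bam
	bcftools mpileup -f ref.fa $< | bcftools call -mv -o $@
```

Ask Make for `sample.vcf` and it walks the chain backwards, running only the steps whose outputs are missing or out of date.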
But Make has some limitations that are frustrating to work with. And that’s where this genius Johannes Köster comes in: he solved essentially all of them by subsuming the entirety of Python into a domain-specific language.
---
Slide 14
But what is really attractive for me is the ability to keep an entire workflow encapsulated in the Snakefile.
So the inputs, outputs, and intermediates are all first-class citizens in the Snakefile: the same code that runs the alignments can also kick off the Sweave scripts and produce Markdown web pages that we can display in a portal.
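A minimal sketch of that encapsulation. The rule names, file paths, and commands here are hypothetical; the point is that alignment and reporting live in one Snakefile, end to end.

```
# Hypothetical Snakefile: raw data -> alignments -> Sweave report
SAMPLES = ["tumor", "normal"]

rule all:
    input: "report.pdf"

rule align:
    input: "fastq/{sample}.fastq"
    output: "bam/{sample}.bam"
    shell: "bwa mem ref.fa {input} | samtools sort -o {output} -"

rule report:
    input: expand("bam/{sample}.bam", sample=SAMPLES)
    output: "report.pdf"
    shell:
        """Rscript -e 'Sweave("report.Rnw")' && pdflatex report.tex"""
```

Running `snakemake` resolves the dependencies implicitly, just like Make, but with real Python available for control flow, cluster submission, and everything else Make makes painful.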
---
Slide 15
So what is that portal? Where do we put the deliverables?
For a long time they were being emailed or put in Dropbox. Both present big disadvantages in terms of persistence and access, and technically I’m not sure we’re even supposed to use Dropbox.
We could in theory use GitHub, since it has a lot of utility for displaying Markdown and it has issue tracking, but GitHub’s authentication is crude, you can’t really put big files in it, and it’s just not organized the way an investigator would want to use it.
MyBiC is a Django-based portal that I created to serve as a delivery point for analyses. MyBiC provides a Users/Labs/Groups authentication scheme, search, news, and tracking. Projects within MyBiC can be created in Markdown or HTML and can be loaded directly from GitHub or from disk. The MyBiC server has a read-only mount to the Isilon, so even very large files can be served.
---
Slide 16
OS-level virtualization
Unlike a virtual machine, which talks to the hardware through a translator, the Docker engine is much more lightweight.
And unlike a VM, which is a kind of stateful machine you massage into place, a Docker container is run off of a reproducible configuration script called a Dockerfile, which makes it “literate”, if that were a term DevOps people used.
Once a program is dockerized, it can be run without installing it.
It lives inside a container, and it only knows what you tell it as far as network ports, permissions, and file volumes.
This has great appeal for anyone who has ever tried to install software.
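As an illustration of that “literate” configuration, a Dockerfile is a short, readable recipe. This sketch is hypothetical (base image, tools, and paths are made up for the example), but it shows how an environment gets pinned alongside the code:

```
# Hypothetical Dockerfile: pin the pipeline's environment with its code
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y bwa samtools
COPY pipeline/ /opt/pipeline/
WORKDIR /opt/pipeline
ENTRYPOINT ["./run.sh"]
```

Anyone with Docker can build and run this container and get the same environment, without installing bwa or samtools on their own machine.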
---
Slide 17
The question is: what do we get from dockerizing an entire analysis?
For one, it becomes trivial to send an entire workflow to a colleague or a journal, say “have at it”, and let them hit the ground running.
But even if we don’t do that it should be much easier for a website like MyBiC to execute a workflow.
In my mind I can see how this could accelerate the edit cycle I mentioned earlier.
Sometimes investigators are hunting for p-values, but often they are just exploring the data. The problem is that it ties up a lot of our time just reading emails and re-running analyses. This is what I call “parameter purgatory”.
Traditionally we could build full-scale web apps in Shiny (Pichai and Jim have done this), but that’s really better suited to traditional database-driven portals; sometimes we just want to write an analysis once and then choose parameters from there. You don’t want to build a full application just to do some kind of meta-analysis.