Talk given at Bio-IT 2016, Cloud Computing track
Abstract:
As bioinformatics scientists, we tend to write custom tools for managing our workflows, even when viable, open-source alternatives are available from the tech community. Our field has, however, begun to adopt Docker containers to stabilize compute environments. In this talk, I will introduce Luigi, a workflow system built by engineers at Spotify to manage long-running big data processing jobs with complex dependencies. Focusing on a case study of next generation sequencing analysis in cancer genomics research, I will show how Luigi can connect simple, containerized applications into complex bioinformatics pipelines that can be easily integrated with compute, storage, and data warehousing on the cloud.
Building cloud-enabled genomics workflows with Luigi and Docker
1. Building cloud-enabled cancer genomics
workflows with Luigi and Docker
Jake Feala, PhD
Principal Scientist, Bioinformatics @ Caperna
Founder @ Outlier Bio
2. Building cloud-enabled cancer genomics
workflows with Luigi and Docker
Jake Feala, PhD
Principal Scientist, Bioinformatics @ Caperna
Founder @ Outlier Bio
Much of the talk will be about the fundamental practice of engineering data workflows, using cancer genomics and the cloud as the applications
Let’s start with philosophy. In my opinion we have absolutely just scratched the surface of the cancer genomics data available to us.
Huge insights can be found by asking simple questions of huge amounts of data.
Algorithms are important, but more data almost always trumps better algorithms
And huge amounts of data we have. Everyone has seen this. Sequencing tech crushed Moore’s Law in the last decade
Couple that with the fact that we can now get rich information from sequencing data that we couldn't have imagined 10 years ago.
We now default to sequencing for almost all applications.
Anyone not taking full advantage of their DNA sequencing data risks falling behind
But all this is easier said than done. Asking simple questions of big data means being able to manage big data. With the cloud, individuals and small teams have access to tremendous computing power, but bioinformatics scientists are generally not trained to be able to use it
Someone not in bioinformatics recently asked why it was such a big deal to string a few NGS tools together. That reminded me that it does look easy on paper, but under the hood it usually involves many more steps.
It is very easy for this to devolve into a mess, especially if there are multiple samples or if you want to make a tweak in a parameter and run it again
Because you want to reprocess as little as possible when you have huge datasets.
In my view pipelines are extremely important. Pipelines are where the rubber meets the road; they distinguish what you COULD do from what you CAN do.
When I was a naïve new grad, my boss used to use "capabilities" as a deliverable of the team and put specific capabilities into yearly objectives. I used to think that was silly, that we're capable of doing ANYTHING! But it eventually crystallized that getting tools to work together, scale, and integrate with the rest of the company's data takes a concerted effort, one that can only be proven by a push-button pipeline, and those are hard to build
Which brings us to the true thesis of this presentation
To do things right in data science, you need data engineers. In the real world your analysis needs to work AGAIN and AGAIN
The data needs to be correct at every stage
Messy data needs to be cleaned and harmonized.
All this is the domain of engineers. If you are building a bioinformatics team it is crucial to include engineering expertise to support your computational scientists
I need to mention that there is a single constraint that drives all bioinformatics pipelines.
Everything fans in and out of the Linux command-line, using custom file formats
If you’re developing a tool, you’re in an isolated environment, detached from the context of how it will be used.
Small test datasets
Incentivized by publication => novelty of algorithm
No requirement for maintenance, robustness
When building a pipeline you have to make sure the inputs and outputs work together, that the dependencies don’t clash, and you have to manage shuttling the data to your compute environment if you want to scale to the cloud. You need to manage a lot more pieces.
This is a common first solution, and it works fine if you have one small analysis you want to do once. Hacked together and fragile, but
low cost,
quick startup,
control,
flexibility, and
debugging via SSH
make it worth it (compared to commercial services) when doing rapid iterations of pipeline development
So what do we actually need from our pipeline technology?
You quickly run into problems when scaling up and moving to production
Compute/SGE: When something breaks, you don't want to be commenting out lines of your bash script to restart!
Environment: Developing with VMs is slow, difficult to iterate
The real missing piece is a workflow system.
Let's step back and define the ideal system. We want to build workflows from composable, well-tested modules with tight APIs and predictable behavior, that seamlessly transition to the cloud.
Modules can be any language. Often Linux command-line for practical purposes but not necessarily
Scale for free, never think about parallelization
In enterprise, there's probably a data ecosystem you need to integrate with
Versioning of data, code, parameters, and environmental dependencies (see reproducible)
Pick up where you left off, important for error tolerance. If a module fails, maybe some service was down and it will work again on retry. Don’t want to re-run the entire 10-hour pipeline. Scripts don’t get you there.
Hit the “freeze” button and be able to get the exact same result 6 months from now (ideally even on a different compute architecture). Important for academia but also pharma/clinical. Throwaway analyses are never really that, most analyses will need to be re-run
Some best practices are starting to emerge for meeting these requirements
What people are starting to do is decompose pipelines into workflows of containerized modules. Each module can be any language, and can be tested locally and deployed anywhere.
But as you add these features, it gets more and more complex to manage the system.
Eventually, you might want to subscribe to a commercial platform to manage all this complexity. Lots of great options, highly recommended if you don't have the engineering capabilities to manage all of this, and can afford it, and if legal lets you put your data up there
Most commercial cloud bioinformatics platforms are using a similar pattern, and are converging on the same architecture.
In many cases you may need to build your own system. Luckily, you can build this architecture using free/open-source tools.
As a quick aside, there is one alternative architecture out there. Right now our field is stuck in what I think of as a local maximum, where we're approaching the best we can do under the constraint of VCF and BAM file formats.
Someday we may jump out of it and embrace a better framework. The best contender in my eyes is the Hadoop/Spark framework, and there is an NGS implementation of common GATK and samtools algorithms called ADAM, out of the Berkeley AMPLab, that I highly recommend checking out. GATK will also soon release a version designed for Spark. But the roadblock there is that we need all of our other tools to use this framework, so there needs to be a critical mass of adoption first.
Many free options for building a containerized app-workflow architecture. The major components (in my favored implementation) are S3 for storage, Docker for environments, and Luigi for workflows. Docker most of you have heard of, Luigi I may be introducing to many of you for the first time.
No need to talk much about S3. There is a slight learning curve to get used to object stores, and a little extra code, but you’ll never look back.
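For example, with boto3 (the bucket and key names below are placeholders, and AWS credentials are assumed to be configured already), uploads and downloads are a couple of lines:

import boto3

s3 = boto3.client("s3")
# Push a result file up to the object store, then pull it back down elsewhere
s3.upload_file("data/aligned/S1.sam", "my-bucket", "aligned/S1.sam")
s3.download_file("my-bucket", "aligned/S1.sam", "/tmp/S1.sam")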
What is Docker? Essentially a nice API for developing with linux containers. Containers are kind of like a VM, but bootstrap on top of an existing linux OS to provide all the benefits of virtualization without the heavy VM
I probably don't have to go too deep here because the field is starting to learn about Docker, and it’s been a while since I’ve gotten a confused look when I say the word “Dockerized”.
The technology that may need the most introduction is Luigi
Before Luigi, Make was actually a really great option for building data pipelines.
With make, every task specifies its dependencies and its output target, and you start with what you want to build and work backwards through the dependencies.
Luigi basically takes that idea and moves it into Python, where now Tasks and Targets are objects that can do anything Python can do (which is everything)
I've found that Luigi is flexible enough to be the glue for EVERYTHING in your workflow
Luigi can connect any type of task from disparate systems and locations into a unified workflow. If Python can talk to it, you can make it part of a Luigi workflow, and that includes the commercial platforms I mentioned earlier. We've connected internal workflows to apps on commercial cloud platforms via their REST APIs, which worked like a charm.
Even manual steps can be wrapped by Luigi. One cool pattern I’ve seen is to wrap a manual step in a Luigi Task, which always raises an exception when it runs. The user creates the output file, and the task never runs again as long as that Target exists.
You can then define SOPs with mixed automatic/manual processes, including various types of activities (compute, manually verify, database ETL, publish visualizations, etc.) into a unified workflow with a single “start” button.
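A minimal sketch of that manual-step pattern, with a hypothetical review file as the output Target:

import luigi

class ManualReviewStep(luigi.Task):
    # A human must create the output file (e.g. after visually QC-ing results).
    # If it is missing, run() raises and halts the workflow with instructions;
    # once the file exists, Luigi treats the task as complete and never runs it again.
    sample_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("reviews/{}.approved.txt".format(self.sample_id))

    def run(self):
        raise RuntimeError(
            "Manual step: review sample {} and create {} to continue".format(
                self.sample_id, self.output().path))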
Luigi's power is its API. It provides a simple way to separate the work from dependencies and outputs. Use inheritance/subclassing to reuse code, often overriding the input/output methods.
Encode all of your filepaths into the “output” method, and never think too much about filepaths again. Completely separate the “what” and “how” from the “where”
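A minimal sketch of what that looks like (the task names, paths, and placeholder logic here are hypothetical, not from the talk):

import luigi

class DownloadFastq(luigi.Task):
    # Hypothetical upstream task that fetches raw reads for one sample
    sample_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("data/raw/{}.fastq".format(self.sample_id))

    def run(self):
        with self.output().open("w") as f:
            f.write("@read1\nACGT\n+\nIIII\n")  # stand-in for a real download

class AlignSample(luigi.Task):
    # Dependencies (requires), paths (output), and work (run) live in separate methods
    sample_id = luigi.Parameter()

    def requires(self):
        return DownloadFastq(sample_id=self.sample_id)

    def output(self):
        # All path logic lives here; nothing else hard-codes a filepath
        return luigi.LocalTarget("data/aligned/{}.sam".format(self.sample_id))

    def run(self):
        with self.input().open() as reads, self.output().open("w") as out:
            out.write(reads.read())  # placeholder for a real aligner call

With this saved as, say, align_example.py, running luigi --module align_example AlignSample --sample-id S1 --local-scheduler builds the whole chain and skips any task whose output Target already exists.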
If you want to write to S3 instead, just swap the output path
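Continuing the hypothetical AlignSample sketch above, only the output method changes (the bucket name is a placeholder; older Luigi releases expose this class as luigi.s3.S3Target):

from luigi.contrib.s3 import S3Target

class AlignSampleToS3(AlignSample):
    # Same dependencies and work; only the "where" is different
    def output(self):
        return S3Target("s3://my-bucket/aligned/{}.sam".format(self.sample_id))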
Python can do anything, including spawn a process to run Docker
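A sketch of that pattern, with a made-up image name and command: the task's run() method shells out to docker run, mounting a host directory so the container can read its input and write the output Target.

import os
import subprocess
import luigi

class DockerizedCounts(luigi.Task):
    sample_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("data/counts/{}.tsv".format(self.sample_id))

    def run(self):
        os.makedirs("data/counts", exist_ok=True)  # container writes straight to the target path
        subprocess.check_call([
            "docker", "run", "--rm",
            "-v", "{}:/data".format(os.path.abspath("data")),  # share the host data dir
            "my-registry/rnaseq-counts:1.0",   # hypothetical image
            "count-reads",                     # hypothetical command inside the image
            "--input", "/data/aligned/{}.sam".format(self.sample_id),
            "--output", "/data/counts/{}.tsv".format(self.sample_id),
        ])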
Infrastructure requires some design and must be adapted to every situation.
One critique might be that you shouldn't need to develop too many pipelines, that gold standard workflows exist on all the commercial and academic platforms.
To counter, I want to show a simple example. What if we want to use RNA-seq to validate mutations across TCGA?
Break it down into tasks and outputs. Merge steps get their own task. For loops are allowed in luigi.Task.requires methods and can be used to loop over a predefined set of patients.
How would we do this? Start by building Dockerized apps that talk to S3
Here we put S3 credentials in the base image so they are inherited in the derived Docker images. Don’t bake credentials into any images you want to share on a public registry like DockerHub!
Connect the dots with Luigi.
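A minimal sketch of that decomposition (patient IDs, paths, and the validation logic are placeholders): each patient gets its own task, and the merge step is a separate task whose requires() loops over the cohort.

import luigi

TCGA_PATIENTS = ["TCGA-AB-0001", "TCGA-AB-0002"]  # stand-in for the real cohort list

class ValidateMutations(luigi.Task):
    # Hypothetical per-patient task: check DNA variant calls against RNA-seq evidence
    patient_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("results/{}.validated.vcf".format(self.patient_id))

    def run(self):
        with self.output().open("w") as out:
            out.write("##placeholder for real validation logic\n")

class MergeCohort(luigi.Task):
    # The merge step gets its own task; requires() loops over the predefined patient set
    def requires(self):
        return [ValidateMutations(patient_id=p) for p in TCGA_PATIENTS]

    def output(self):
        return luigi.LocalTarget("results/cohort_summary.tsv")

    def run(self):
        with self.output().open("w") as out:
            for target in self.input():
                out.write(target.path + "\n")  # stand-in for real aggregation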
You can imagine if you don’t allow overwriting S3 objects, and you build a well-tested, stateless, side-effect-free container, that always produces the same output with any input, it’s kind of like functional programming.
And the benefit of functional programming is easy scalability and composability. When you have that kind of confidence at the component level, you can build complex pipelines with tight modules
The engineers at AdRoll wrote a fantastic blog post about using a very similar architecture. This analogy between S3 -> container -> S3 patterns and functional programming was stolen from ideas in this blog post http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html
From a user perspective, there are huge advantages, not least the ability to specify the environment as code. You can build up an image from slices, and push and pull those slices to a central registry
You often need a bit of massaging to get data ready for your workflow. You can use specific tasks for each data source, then dispatch the correct task in the “requires” method from within a central “fan-in” task that all downstream workflows depend on.
The choice of which loading task to dispatch can depend on data source URLs or metadata linked to the sample ID
Some extra code will be needed to make sure the output of the upstream tasks is passed through as the output of your central task.
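A sketch of that fan-in pattern, with made-up loader tasks: the central task dispatches on a source value in requires() and passes the chosen loader's output through as its own.

import luigi

class LoadFromSRA(luigi.Task):
    # Hypothetical loader for samples hosted at SRA
    sample_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("data/raw/{}.fastq".format(self.sample_id))

    def run(self):
        with self.output().open("w") as f:
            f.write("")  # placeholder for a real SRA download

class LoadFromCollaborator(luigi.Task):
    # Hypothetical loader for samples delivered by a collaborator
    sample_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("data/raw/{}.fastq".format(self.sample_id))

    def run(self):
        with self.output().open("w") as f:
            f.write("")  # placeholder for a real transfer and cleanup step

class LoadReads(luigi.Task):
    # Central fan-in task that all downstream workflows depend on
    sample_id = luigi.Parameter()
    source = luigi.Parameter()  # e.g. looked up from sample metadata or a source URL

    def requires(self):
        loaders = {"sra": LoadFromSRA, "collaborator": LoadFromCollaborator}
        return loaders[self.source](sample_id=self.sample_id)

    def output(self):
        # The "extra code": pass the upstream output through as this task's output
        return self.input()

    def run(self):
        pass  # the required loader already produced the output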