SlideShare ist ein Scribd-Unternehmen logo
1 von 55
Building cloud-enabled cancer genomics
workflows with Luigi and Docker
Jake Feala, PhD
Principal Scientist, Bioinformatics @ Caperna
Founder @ Outlier Bio
Building cloud-enabled cancer genomics
workflows with Luigi and Docker
Jake Feala, PhD
Principal Scientist, Bioinformatics @ Caperna
Founder @ Outlier Bio
Outline
Pipelines
Requirements for
the ideal workflow
platform
Challenges
Practical, free
alternatives
Convergence of
cloud bioinformatics
platforms
Luigi
Docker
S3*
Application:
Confirming cancer
mutations in RNA-seq
*Not technically free but nearly
**FOSS = Free and Open Source Software
Some enabling
infrastructure
Takeaways:
1. Engineer your workflows
2. Everyone is converging on architecture
3. You can build scalable pipelines with FOSS** tools
Copyright © 2016 Outlier Bio
data >> algorithms
Copyright © 2016 Outlier Bio
When in doubt, just sequence it!
Plummeting
sequencing costs
+
Powerful insights
= Mounds of data
Copyright © 2016 Outlier Bio
managing big genomics data is hard
Copyright © 2016 Outlier Bio
Pipelines
What you show your boss
AlignFastQ BAM
What you actually do
Align each lane
Merge SAM
Build index
Update IGV registry
Index BAM
SAM  BAM
Sort BAM
Copyright © 2016 Outlier Bio
Pipelines
What you show your boss
Align
And sometimes….
FastQ BAM
What you actually do
Align each lane
Merge SAM
Build index
Update IGV registry
Index BAM
SAM  BAM
Sort BAM
Copyright © 2016 Outlier Bio
Pipelines
What you show your boss
Align
And sometimes….
FastQ BAM
What you actually do
Align each lane
Merge SAM
Build index
Update IGV registry
Index BAM
SAM  BAM
Sort BAM
Copyright © 2016 Outlier Bio
And BTW these files are massive
Capability = automated pipeline*
• Pipelines are at the core of a bioinformatics group
• Best way to show a capability is a complete pipeline
– Any suitable input  quality-assured output
– Sure, you may be able to do a certain analysis,
but a working, tested pipeline proves it
• Lots of tools are available, but putting them together
correctly into a pipeline is hard
*”workflow” is synonymous with “pipeline” for the purposes of this talk
Lots of possible
input sources
and types
Lots of possible
downstream
analyses
Core capability
Copyright © 2016 Outlier Bio
good data science requires good data engineering
Copyright © 2016 Outlier Bio
Linux commands and custom file formats are the
“narrow waist” of bioinformatics
Copyright © 2016 Outlier Bio
This constraint underlies the design of bioinformatics pipelines
This works great for tool developers…
Copyright © 2016 Outlier Bio
…but quickly gets complicated for users
Common homegrown solution:
- Bash scripts
- VMs (AMIs)
- StarCluster or similar
- SunGrid Engine or similar
- Parallelize by sample
Copyright © 2016 Outlier Bio
B
Outline
Pipelines
Requirements for
the ideal workflow
platform
Challenges
Practical, free
alternatives
Convergence of
cloud bioinformatics
platforms
Luigi
Docker
S3*
Application:
Confirming cancer
mutations in RNA-seq
*Not technically free but nearly
**FOSS = Free and Open Source Software
Some enabling
infrastructure
Takeaways:
1. Engineer your workflows
2. Everyone is converging on architecture
3. You can build scalable pipelines with FOSS tools
Copyright © 2016 Outlier Bio
Limitations with commonly used technology
• Compute
– SGE is flat, just a job scheduler, bash only, no dependencies
• Environment
– AMIs and VMs are slow and heavyweight, no code, not good for
ongoing dev
• Workflow
– Makefiles have ugly syntax, static, inflexible
– GUIs and domain-specific systems like Galaxy/Taverna are not
easily programmable or general-purpose
Copyright © 2016 Outlier Bio
Properties of the ideal pipeline system
• General purpose: familiar language, can apply to any task
• Modular: any language, well-tested components with tight APIs
• Scalable: parallelize for free, independent of components
• Integrated: LIMS, metadata, viz, versioning, reporting
• Versioned: reproduce from snapshots in time
• Idempotent: resume from failure, guarantee outputs
Copyright © 2016 Outlier Bio
Outline
Pipelines
Requirements for
the ideal workflow
platform
Challenges
Practical, free
alternatives
Convergence of
cloud bioinformatics
platforms
Luigi
Docker
S3*
Application:
Confirming cancer
mutations in RNA-seq
*Not technically free but nearly
**FOSS = Free and Open Source Software
Some enabling
infrastructure
Takeaways:
1. Engineer your workflows
2. Everyone is converging on architecture
3. You can build scalable pipelines with FOSS tools
Copyright © 2016 Outlier Bio
Emerging best practice is to containerize and
decompose into atomic steps
Copyright © 2016 Outlier Bio
But atomicity and scaling bring added complexity
Copyright © 2016 Outlier Bio
Everyone is doing this same basic architecture
Copyright © 2016 Outlier Bio
That is, everyone except for ADAM/Spark. They are
shifting the paradigm
• Pipeline = sequence of transformations on a data model,
not a string of shell commands
• Separate method from implementation
• Distribute records (i.e. alignments) for horizontal scaling
• Needs more community tools before adoption
Fig 1, Nothaft et al. (2015) Rethinking
Data-Intensive Science Using Scalable
Analytics Systems
Copyright © 2016 Outlier Bio
https://bdgenomics.org
Outline
Pipelines
Requirements for
the ideal workflow
platform
Challenges
Practical, free
alternatives
Convergence of
cloud bioinformatics
platforms
Luigi
Docker
S3*
Application:
Confirming cancer
mutations in RNA-seq
*Not technically free but nearly
**FOSS = Free and Open Source Software
Some enabling
infrastructure
Takeaways:
1. Engineer your workflows
2. Everyone is converging on architecture
3. You can build scalable pipelines with FOSS tools
Copyright © 2016 Outlier Bio
A free alternative architecture
Components
• Storage: S3
• Environment: Docker
• Workflows: Luigi
• Compute: EC2 auto-scaling groups
• Infrastructure: to put the pieces together
Benefits
• Free (almost)
• Huge communities, battle tested
• Separate concerns, use best tool for each job
Copyright © 2016 Outlier Bio
S3 and object storage
Steps:
1. Learn how object store is different from a filesystem
2. Write a bit of extra code to manage transfers to/from S3
3. Never think about long-term storage again*
*within reason, if IT director is in the room.
Copyright © 2016 Outlier Bio
Docker
Copyright © 2016 Outlier BioFrom https://www.docker.com/what-docker
• Lightweight environment management
• Specify environment as code (Dockerfile)
• Portable (laptop  cluster with same behavior)
• Wide adoption in tech
• Gaining ground in bioinformatics
Luigi
• Like a Makefile but in pure Python*
• Executes locally => easy to debug
• OOP declarative framework with Tasks and Targets
– Tasks are Python objects, they can do anything
– Targets are Python objects, they can be anything
• Idempotent (resume where you left off) and atomic (no half-finished tasks)
• Batteries included (graph viz, CLI integration, S3, MySQL, Hadoop, Spark, Redshift, SGE, …
*i.e., not another DSL
What’s a Makefile?
• Build DAG of tasks
• Each task specifies dependencies and output
• Start with what you want to build, work up
Copyright © 2016 Outlier Bio
Luigi is flexible enough to do pretty much anything within a workflow
S3
EC2/Docker
Python code
S3
Database
Hadoop/Spark
Local client
Local file
storage
compute
data flows
Copyright © 2016 Outlier Bio
Local file
Manual
operation!
Luigi is flexible enough to do pretty much anything within a workflow
S3
EC2/Docker
Python code
S3
Database
Hadoop/Spark
Local client
Local file
storage
compute
data flows
Copyright © 2016 Outlier Bio
Local file
Manual
operation!
luigi.Target
luigi.Task
luigi.Task.requires
API
Anatomy of a Luigi Task
Copyright © 2016 Outlier Bio
Anatomy of a Luigi Task
Copyright © 2016 Outlier Bio
Build your filepaths in Task.output
Declare your parameters (can be passed from CLI)
Do the thing
Read the input
Write the output
Anatomy of a Luigi Task
Copyright © 2016 Outlier Bio
Now output to S3 instead
Anatomy of a Luigi Task
Copyright © 2016 Outlier Bio
Run a Docker container instead
Outline
Pipelines
Requirements for
the ideal workflow
platform
Challenges
Practical, free
alternatives
Convergence of
cloud bioinformatics
platforms
Luigi
Docker
S3*
Application:
Confirming cancer
mutations in RNA-seq
*Not technically free but nearly
**FOSS = Free and Open Source Software
Some enabling
infrastructure
Takeaways:
1. Engineer your workflows
2. Everyone is converging on architecture
3. You can build scalable pipelines with FOSS tools
Copyright © 2016 Outlier Bio
Some assembly required
Copyright © 2016 Outlier Bio
ECS
EC2 auto-scaling group
Local client
Container
registry
EC2
Docker
ECS agent
CloudWatch
Luigi
EC2
Docker
ECS agent
SQS
SGE
But lots of ways to do it!
Outline
Application:
Confirming cancer
mutations in RNA-seq
*Not technically free but nearly
**FOSS = Free and Open Source Software
Some enabling
infrastructure
Takeaways:
1. Engineer your workflows
2. Everyone is converging on architecture
3. You can build scalable pipelines with FOSS tools
Copyright © 2016 Outlier Bio
Pipelines
Requirements for
the ideal workflow
platform
Challenges
Practical, free
alternatives
Convergence of
cloud bioinformatics
platforms
Luigi
Docker
S3*
• Problem:
– Calling somatic cancer mutations is difficult
• Low allele frequencies
• Complex variants
• Purity, ploidy issues
– RNA-seq can add confidence, but
• TCGA does not provide RNA-seq variants
• Manual review is slow
• Automating and scaling requires a complex, custom workflow
An interesting* cancer genomics application
Copyright © 2016 Outlier Bio
*to me
Solution: RNA-seq mutation validation pipeline
Copyright © 2016 Outlier Bio
CGHub
GDAC
portal
Note: this workflow should soon be feasible with the NCI cloud pilot platforms
Download
RNA-seq
Download
somatic
mutations
RNA-seq
(BAM)
DNA
variants
(VCF)
Extract
RNA-seq
regions
Local region
(BAM)
MuTect
RNA
variants
(VCF)
Merge
RNA/DNA
variants
for each variant
for each patient
Merge
patient
variantsMerge cohort
variants
Dockerizing the apps
Copyright © 2016 Outlier Bio
GTDownload ./bioit/apps/gtdownload/Dockerfile
Samtools ./bioit/apps/samtools/Dockerfile
See https://github.com/outlierbio/bio-it for full project
$ docker build –t outlierbio/bioit .
Base image ./Dockerfile
$ cd bioit/apps/gtdownload
$ docker build –t outlierbio/gtdownload .
$ cd ../samtools
$ docker build –t outlierbio/samtools .
Luigi-izing the workflow
Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project
Luigi-izing the workflow
Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project
Luigi-izing the workflow
Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project
Run it!
Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project
$ luigi --module bioit.validate_rnaseq RunSample --sample-id=TCGA-AB-2929
Thanks! Questions?
Pipelines
Requirements for
the ideal workflow
platform
Challenges
Practical, free
alternatives
Convergence of
cloud bioinformatics
platforms
Luigi
Docker
S3*
Application:
Confirming cancer
mutations in RNA-seq
*Not technically free but nearly
**FOSS = Free and Open Source Software
Some enabling
infrastructure
Takeaways:
1. Engineer your workflows
2. Everyone is converging on architecture
3. You can build scalable pipelines with FOSS tools
Copyright © 2016 Outlier Bio
Further reading
• My blog: https://medium.com/outlier-bio-blog
• GitHub repo for this talk:
https://github.com/outlierbio/bioit
• Luigi: https://github.com/spotify/luigi
• Docker: https://docker.com
• Amazon AWS: http://lmgtfy.com/?q=aws
Extra slides
Copyright © 2016 Outlier Bio
More on reproducibility: 3 major components and
some possible solutions
• Code
– VCS
– Notebooks
• Data
– Metadata store
– S3
– Synapse
– Versioned data APIs
• Environment
– Package managers (pip –r)
– VM
– Docker
Copyright © 2016 Outlier Bio
The slide about how this is kind of like functional
programming
function
Immutable
output
• No side effects (pure)
• Stateless
• Same output for every input
• Compose complex, scalable functionality from small components
Immutable
input
containerized
data
transformation
Immutable
S3 object
Immutable
S3 object
Copyright © 2016 Outlier BioSee http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html for more on this idea
• Specify environment as code (Dockerfile)
• A little like git
– commits (each with uuid hash)
– pull
– push
• Isolated like a VM
• Run like an app
– Takes arguments
– CLI allows interactive use
– Hop inside the container to debug
Docker (user perspective)
Copyright © 2016 Outlier Bio
EC2 auto-scaling groups
Copyright © 2016 Outlier BioFrom http://docs.aws.amazon.com/AutoScaling/
Dispatching pattern allows fan-in from data sources, fan-out to pipelines
Load sample BAM
CGHub download
S3 download
Local file
NCBI SRA
download
CGHub extract
FQ
SRA extract FQ
HTTP/FTP
download
exome
RNA-seq
xenograft
WGS
shRNA
Copyright © 2016 Outlier Bio
Dispatching pattern allows fan-in from data sources, fan-out to pipelines
Load sample
CGHub download
S3 download
Local file
NCBI SRA
download
CGHub extract
FQ
SRA extract FQ
HTTP/FTP
download
exome
RNA-seq
xenograft
WGS
shRNA
Copyright © 2016 Outlier Bio
Luigi makes this easy
• Base Dockerfile with AWS credentials and helpers
• Simple metadata DB to map sample IDs  raw data locations
– All other filepaths managed by pipeline code
– Level-up: store pipeline versions, parameters, batches
• Small test dataset for constant integration tests (use Jenkins)
Practical implementation notes
Copyright © 2016 Outlier Bio
Challenges
• Filesystem: Still need a BIG, FAST local filesystem
for NGS analysis
• Luigi has warts
– Parameter passing
– Delete files to re-run
– Lots of code edits to rewire the DAG
Copyright © 2016 Outlier Bio
5-year timeline (wishlist?)
• More distributed algorithms with ADAM/Spark
• Integration with Jupyter notebooks
• Open source template for spinning up a secure cloud
infrastructure
Copyright © 2016 Outlier Bio
Live demo???
Copyright © 2016 Outlier Bio

Weitere ähnliche Inhalte

Was ist angesagt?

How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowLaura Lorenz
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Building Robust Pipelines with Airflow
Building Robust Pipelines with AirflowBuilding Robust Pipelines with Airflow
Building Robust Pipelines with AirflowErin Shellman
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinFlink Forward
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Flink Forward
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learningfelixcss
 
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Databricks
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
 
Powering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and PythonPowering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and PythonTatiana Al-Chueyr
 
Sparklyr: Recap, Updates, and Use Cases with Javier Luraschi
Sparklyr: Recap, Updates, and Use Cases with Javier LuraschiSparklyr: Recap, Updates, and Use Cases with Javier Luraschi
Sparklyr: Recap, Updates, and Use Cases with Javier LuraschiDatabricks
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesDatabricks
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierDatabricks
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...Flink Forward
 
Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, St...
Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, St...Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, St...
Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, St...InfluxData
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JFlink Forward
 
Natural Language Query and Conversational Interface to Apache Spark
Natural Language Query and Conversational Interface to Apache SparkNatural Language Query and Conversational Interface to Apache Spark
Natural Language Query and Conversational Interface to Apache SparkDatabricks
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Dan Halperin
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Databricks
 

Was ist angesagt? (20)

How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Building Robust Pipelines with Airflow
Building Robust Pipelines with AirflowBuilding Robust Pipelines with Airflow
Building Robust Pipelines with Airflow
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache Flink
 
Powering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and PythonPowering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and Python
 
Sparklyr: Recap, Updates, and Use Cases with Javier Luraschi
Sparklyr: Recap, Updates, and Use Cases with Javier LuraschiSparklyr: Recap, Updates, and Use Cases with Javier Luraschi
Sparklyr: Recap, Updates, and Use Cases with Javier Luraschi
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
 
Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, St...
Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, St...Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, St...
Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, St...
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
 
Natural Language Query and Conversational Interface to Apache Spark
Natural Language Query and Conversational Interface to Apache SparkNatural Language Query and Conversational Interface to Apache Spark
Natural Language Query and Conversational Interface to Apache Spark
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Google cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache FlinkGoogle cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache Flink
 

Ähnlich wie Building cloud-enabled genomics workflows with Luigi and Docker

CNCF Introduction - Feb 2018
CNCF Introduction - Feb 2018CNCF Introduction - Feb 2018
CNCF Introduction - Feb 2018Krishna-Kumar
 
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a serviceCOMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a serviceAntonio García-Domínguez
 
Docker Application to Scientific Computing
Docker Application to Scientific ComputingDocker Application to Scientific Computing
Docker Application to Scientific ComputingPeter Bryzgalov
 
Drupal Dev Days Vienna 2023 - What is the secure software supply chain and th...
Drupal Dev Days Vienna 2023 - What is the secure software supply chain and th...Drupal Dev Days Vienna 2023 - What is the secure software supply chain and th...
Drupal Dev Days Vienna 2023 - What is the secure software supply chain and th...sparkfabrik
 
What is the Secure Supply Chain and the Current State of the PHP Ecosystem
What is the Secure Supply Chain and the Current State of the PHP EcosystemWhat is the Secure Supply Chain and the Current State of the PHP Ecosystem
What is the Secure Supply Chain and the Current State of the PHP Ecosystemsparkfabrik
 
Using Docker container technology with F5 Networks products and services
Using Docker container technology with F5 Networks products and servicesUsing Docker container technology with F5 Networks products and services
Using Docker container technology with F5 Networks products and servicesF5 Networks
 
Introduction to License Compliance and My research (D. German)
Introduction to License Compliance and My research (D. German)Introduction to License Compliance and My research (D. German)
Introduction to License Compliance and My research (D. German)dmgerman
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSSteve Wong
 
56K.cloud Docker Training
56K.cloud Docker Training56K.cloud Docker Training
56K.cloud Docker TrainingBrian Christner
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sangerChris Dwan
 
Containers: DevOp Enablers of Technical Solutions
Containers: DevOp Enablers of Technical SolutionsContainers: DevOp Enablers of Technical Solutions
Containers: DevOp Enablers of Technical SolutionsJules Pierre-Louis
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmDmitri Zimine
 
Cloud Foundry Technical Overview at IBM Interconnect 2016
Cloud Foundry Technical Overview at IBM Interconnect 2016Cloud Foundry Technical Overview at IBM Interconnect 2016
Cloud Foundry Technical Overview at IBM Interconnect 2016Stormy Peters
 
What's New in Docker - February 2017
What's New in Docker - February 2017What's New in Docker - February 2017
What's New in Docker - February 2017Patrick Chanezon
 
How to Contribute to Cloud Native Computing Foundation
How to Contribute to Cloud Native Computing FoundationHow to Contribute to Cloud Native Computing Foundation
How to Contribute to Cloud Native Computing FoundationCodeOps Technologies LLP
 

Ähnlich wie Building cloud-enabled genomics workflows with Luigi and Docker (20)

CNCF Introduction - Feb 2018
CNCF Introduction - Feb 2018CNCF Introduction - Feb 2018
CNCF Introduction - Feb 2018
 
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a serviceCOMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
 
Docker Application to Scientific Computing
Docker Application to Scientific ComputingDocker Application to Scientific Computing
Docker Application to Scientific Computing
 
Beyond static configuration
Beyond static configurationBeyond static configuration
Beyond static configuration
 
Drupal Dev Days Vienna 2023 - What is the secure software supply chain and th...
Drupal Dev Days Vienna 2023 - What is the secure software supply chain and th...Drupal Dev Days Vienna 2023 - What is the secure software supply chain and th...
Drupal Dev Days Vienna 2023 - What is the secure software supply chain and th...
 
What is the Secure Supply Chain and the Current State of the PHP Ecosystem
What is the Secure Supply Chain and the Current State of the PHP EcosystemWhat is the Secure Supply Chain and the Current State of the PHP Ecosystem
What is the Secure Supply Chain and the Current State of the PHP Ecosystem
 
Using Docker container technology with F5 Networks products and services
Using Docker container technology with F5 Networks products and servicesUsing Docker container technology with F5 Networks products and services
Using Docker container technology with F5 Networks products and services
 
Clean sw 3_architecture
Clean sw 3_architectureClean sw 3_architecture
Clean sw 3_architecture
 
Introduction to License Compliance and My research (D. German)
Introduction to License Compliance and My research (D. German)Introduction to License Compliance and My research (D. German)
Introduction to License Compliance and My research (D. German)
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OS
 
56K.cloud Docker Training
56K.cloud Docker Training56K.cloud Docker Training
56K.cloud Docker Training
 
Cloud to Edge
Cloud to EdgeCloud to Edge
Cloud to Edge
 
56k.cloud training
56k.cloud training56k.cloud training
56k.cloud training
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sanger
 
Containers: DevOp Enablers of Technical Solutions
Containers: DevOp Enablers of Technical SolutionsContainers: DevOp Enablers of Technical Solutions
Containers: DevOp Enablers of Technical Solutions
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
 
Modern Software Development
Modern Software DevelopmentModern Software Development
Modern Software Development
 
Cloud Foundry Technical Overview at IBM Interconnect 2016
Cloud Foundry Technical Overview at IBM Interconnect 2016Cloud Foundry Technical Overview at IBM Interconnect 2016
Cloud Foundry Technical Overview at IBM Interconnect 2016
 
What's New in Docker - February 2017
What's New in Docker - February 2017What's New in Docker - February 2017
What's New in Docker - February 2017
 
How to Contribute to Cloud Native Computing Foundation
How to Contribute to Cloud Native Computing FoundationHow to Contribute to Cloud Native Computing Foundation
How to Contribute to Cloud Native Computing Foundation
 

Kürzlich hochgeladen

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...gajnagarg
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...gajnagarg
 

Kürzlich hochgeladen (20)

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 

Building cloud-enabled genomics workflows with Luigi and Docker

  • 1. Building cloud-enabled cancer genomics workflows with Luigi and Docker Jake Feala, PhD Principal Scientist, Bioinformatics @ Caperna Founder @ Outlier Bio
  • 2. Building cloud-enabled cancer genomics workflows with Luigi and Docker Jake Feala, PhD Principal Scientist, Bioinformatics @ Caperna Founder @ Outlier Bio
  • 3. Outline Pipelines Requirements for the ideal workflow platform Challenges Practical, free alternatives Convergence of cloud bioinformatics platforms Luigi Docker S3* Application: Confirming cancer mutations in RNA-seq *Not technically free but nearly **FOSS = Free and Open Source Software Some enabling infrastructure Takeaways: 1. Engineer your workflows 2. Everyone is converging on architecture 3. You can build scalable pipelines with FOSS** tools Copyright © 2016 Outlier Bio
  • 4. data >> algorithms Copyright © 2016 Outlier Bio
  • 5. When in doubt, just sequence it! Plummeting sequencing costs + Powerful insights = Mounds of data Copyright © 2016 Outlier Bio
  • 6. managing big genomics data is hard Copyright © 2016 Outlier Bio
  • 7. Pipelines What you show your boss AlignFastQ BAM What you actually do Align each lane Merge SAM Build index Update IGV registry Index BAM SAM  BAM Sort BAM Copyright © 2016 Outlier Bio
  • 8. Pipelines What you show your boss Align And sometimes…. FastQ BAM What you actually do Align each lane Merge SAM Build index Update IGV registry Index BAM SAM  BAM Sort BAM Copyright © 2016 Outlier Bio
  • 9. Pipelines What you show your boss Align And sometimes…. FastQ BAM What you actually do Align each lane Merge SAM Build index Update IGV registry Index BAM SAM  BAM Sort BAM Copyright © 2016 Outlier Bio And BTW these files are massive
  • 10. Capability = automated pipeline* • Pipelines are at the core of a bioinformatics group • Best way to show a capability is a complete pipeline – Any suitable input  quality-assured output – Sure, you may be able to do a certain analysis, but a working, tested pipeline proves it • Lots of tools are available, but putting them together correctly into a pipeline is hard *”workflow” is synonymous with “pipeline” for the purposes of this talk Lots of possible input sources and types Lots of possible downstream analyses Core capability Copyright © 2016 Outlier Bio
  • 11. good data science requires good data engineering Copyright © 2016 Outlier Bio
  • 12. Linux commands and custom file formats are the “narrow waist” of bioinformatics Copyright © 2016 Outlier Bio This constraint underlies the design of bioinformatics pipelines
  • 13. This works great for tool developers… Copyright © 2016 Outlier Bio
  • 14. …but quickly gets complicated for users Common homegrown solution: - Bash scripts - VMs (AMIs) - StarCluster or similar - SunGrid Engine or similar - Parallelize by sample Copyright © 2016 Outlier Bio B
  • 15. Outline Pipelines Requirements for the ideal workflow platform Challenges Practical, free alternatives Convergence of cloud bioinformatics platforms Luigi Docker S3* Application: Confirming cancer mutations in RNA-seq *Not technically free but nearly **FOSS = Free and Open Source Software Some enabling infrastructure Takeaways: 1. Engineer your workflows 2. Everyone is converging on architecture 3. You can build scalable pipelines with FOSS tools Copyright © 2016 Outlier Bio
  • 16. Limitations with commonly used technology • Compute – SGE is flat, just a job scheduler, bash only, no dependencies • Environment – AMIs and VMs are slow and heavyweight, no code, not good for ongoing dev • Workflow – Makefiles have ugly syntax, static, inflexible – GUIs and domain-specific systems like Galaxy/Taverna are not easily programmable or general-purpose Copyright © 2016 Outlier Bio
  • 17. Properties of the ideal pipeline system • General purpose: familiar language, can apply to any task • Modular: any language, well-tested components with tight APIs • Scalable: parallelize for free, independent of components • Integrated: LIMS, metadata, viz, versioning, reporting • Versioned: reproduce from snapshots in time • Idempotent: resume from failure, guarantee outputs Copyright © 2016 Outlier Bio
  • 18. Outline Pipelines Requirements for the ideal workflow platform Challenges Practical, free alternatives Convergence of cloud bioinformatics platforms Luigi Docker S3* Application: Confirming cancer mutations in RNA-seq *Not technically free but nearly **FOSS = Free and Open Source Software Some enabling infrastructure Takeaways: 1. Engineer your workflows 2. Everyone is converging on architecture 3. You can build scalable pipelines with FOSS tools Copyright © 2016 Outlier Bio
  • 19. Emerging best practice is to containerize and decompose into atomic steps Copyright © 2016 Outlier Bio
  • 20. But atomicity and scaling bring added complexity Copyright © 2016 Outlier Bio
  • 21. Everyone is doing this same basic architecture Copyright © 2016 Outlier Bio
  • 22. That is, everyone except for ADAM/Spark. They are shifting the paradigm • Pipeline = sequence of transformations on a data model, not a string of shell commands • Separate method from implementation • Distribute records (i.e. alignments) for horizontal scaling • Needs more community tools before adoption Fig 1, Nothaft et al. (2015) Rethinking Data-Intensive Science Using Scalable Analytics Systems Copyright © 2016 Outlier Bio https://bdgenomics.org
  • 23. Outline Pipelines Requirements for the ideal workflow platform Challenges Practical, free alternatives Convergence of cloud bioinformatics platforms Luigi Docker S3* Application: Confirming cancer mutations in RNA-seq *Not technically free but nearly **FOSS = Free and Open Source Software Some enabling infrastructure Takeaways: 1. Engineer your workflows 2. Everyone is converging on architecture 3. You can build scalable pipelines with FOSS tools Copyright © 2016 Outlier Bio
  • 24. A free alternative architecture Components • Storage: S3 • Environment: Docker • Workflows: Luigi • Compute: EC2 auto-scaling groups • Infrastructure: to put the pieces together Benefits • Free (almost) • Huge communities, battle tested • Separate concerns, use best tool for each job Copyright © 2016 Outlier Bio
  • 25. S3 and object storage Steps: 1. Learn how object store is different from a filesystem 2. Write a bit of extra code to manage transfers to/from S3 3. Never think about long-term storage again* *within reason, if IT director is in the room. Copyright © 2016 Outlier Bio
  • 26. Docker Copyright © 2016 Outlier BioFrom https://www.docker.com/what-docker • Lightweight environment management • Specify environment as code (Dockerfile) • Portable (laptop  cluster with same behavior) • Wide adoption in tech • Gaining ground in bioinformatics
  • 27. Luigi • Like a Makefile but in pure Python* • Executes locally => easy to debug • OOP declarative framework with Tasks and Targets – Tasks are Python objects, they can do anything – Targets are Python objects, they can be anything • Idempotent (resume where you left off) and atomic (no half-finished tasks) • Batteries included (graph viz, CLI integration, S3, MySQL, Hadoop, Spark, Redshift, SGE, … *i.e., not another DSL What’s a Makefile? • Build DAG of tasks • Each task specifies dependencies and output • Start with what you want to build, work up Copyright © 2016 Outlier Bio
  • 28. Luigi is flexible enough to do pretty much anything within a workflow S3 EC2/Docker Python code S3 Database Hadoop/Spark Local client Local file storage compute data flows Copyright © 2016 Outlier Bio Local file Manual operation!
  • 29. Luigi is flexible enough to do pretty much anything within a workflow S3 EC2/Docker Python code S3 Database Hadoop/Spark Local client Local file storage compute data flows Copyright © 2016 Outlier Bio Local file Manual operation! luigi.Target luigi.Task luigi.Task.requires API
  • 30. Anatomy of a Luigi Task Copyright © 2016 Outlier Bio
  • 31. Anatomy of a Luigi Task Copyright © 2016 Outlier Bio Build your filepaths in Task.output Declare your parameters (can be passed from CLI) Do the thing Read the input Write the output
  • 32. Anatomy of a Luigi Task Copyright © 2016 Outlier Bio Now output to S3 instead
  • 33. Anatomy of a Luigi Task Copyright © 2016 Outlier Bio Run a Docker container instead
  • 34. Outline Pipelines Requirements for the ideal workflow platform Challenges Practical, free alternatives Convergence of cloud bioinformatics platforms Luigi Docker S3* Application: Confirming cancer mutations in RNA-seq *Not technically free but nearly **FOSS = Free and Open Source Software Some enabling infrastructure Takeaways: 1. Engineer your workflows 2. Everyone is converging on architecture 3. You can build scalable pipelines with FOSS tools Copyright © 2016 Outlier Bio
  • 35. Some assembly required Copyright © 2016 Outlier Bio ECS EC2 auto-scaling group Local client Container registry EC2 Docker ECS agent CloudWatch Luigi EC2 Docker ECS agent SQS SGE But lots of ways to do it!
  • 36. Outline Application: Confirming cancer mutations in RNA-seq *Not technically free but nearly **FOSS = Free and Open Source Software Some enabling infrastructure Takeaways: 1. Engineer your workflows 2. Everyone is converging on architecture 3. You can build scalable pipelines with FOSS tools Copyright © 2016 Outlier Bio Pipelines Requirements for the ideal workflow platform Challenges Practical, free alternatives Convergence of cloud bioinformatics platforms Luigi Docker S3*
  • 37. • Problem: – Calling somatic cancer mutations is difficult • Low allele frequencies • Complex variants • Purity, ploidy issues – RNA-seq can add confidence, but • TCGA does not provide RNA-seq variants • Manual review is slow • Automating and scaling requires a complex, custom workflow An interesting* cancer genomics application Copyright © 2016 Outlier Bio *to me
  • 38. Solution: RNA-seq mutation validation pipeline Copyright © 2016 Outlier Bio CGHub GDAC portal Note: this workflow should soon be feasible with the NCI cloud pilot platforms Download RNA-seq Download somatic mutations RNA-seq (BAM) DNA variants (VCF) Extract RNA-seq regions Local region (BAM) MuTect RNA variants (VCF) Merge RNA/DNA variants for each variant for each patient Merge patient variantsMerge cohort variants
  • 39. Dockerizing the apps Copyright © 2016 Outlier Bio GTDownload ./bioit/apps/gtdownload/Dockerfile Samtools ./bioit/apps/samtools/Dockerfile See https://github.com/outlierbio/bio-it for full project $ docker build –t outlierbio/bioit . Base image ./Dockerfile $ cd bioit/apps/gtdownload $ docker build –t outlierbio/gtdownload . $ cd ../samtools $ docker build –t outlierbio/samtools .
  • 40. Luigi-izing the workflow Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project
  • 41. Luigi-izing the workflow Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project
  • 42. Luigi-izing the workflow Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project
  • 43. Run it! Copyright © 2016 Outlier BioSee https://github.com/outlierbio/bio-it for full project $ luigi --module bioit.validate_rnaseq RunSample --sample-id=TCGA-AB-2929
  • 44. Thanks! Questions? Pipelines Requirements for the ideal workflow platform Challenges Practical, free alternatives Convergence of cloud bioinformatics platforms Luigi Docker S3* Application: Confirming cancer mutations in RNA-seq *Not technically free but nearly **FOSS = Free and Open Source Software Some enabling infrastructure Takeaways: 1. Engineer your workflows 2. Everyone is converging on architecture 3. You can build scalable pipelines with FOSS tools Copyright © 2016 Outlier Bio Further reading • My blog: https://medium.com/outlier-bio-blog • GitHub repo for this talk: https://github.com/outlierbio/bioit • Luigi: https://github.com/spotify/luigi • Docker: https://docker.com • Amazon AWS: http://lmgtfy.com/?q=aws
  • 45. Extra slides Copyright © 2016 Outlier Bio
  • 46. More on reproducibility: 3 major components and some possible solutions • Code – VCS – Notebooks • Data – Metadata store – S3 – Synapse – Versioned data APIs • Environment – Package managers (pip –r) – VM – Docker Copyright © 2016 Outlier Bio
  • 47. The slide about how this is kind of like functional programming function Immutable output • No side effects (pure) • Stateless • Same output for every input • Compose complex, scalable functionality from small components Immutable input containerized data transformation Immutable S3 object Immutable S3 object Copyright © 2016 Outlier BioSee http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html for more on this idea
  • 48. • Specify environment as code (Dockerfile) • A little like git – commits (each with uuid hash) – pull – push • Isolated like a VM • Run like an app – Takes arguments – CLI allows interactive use – Hop inside the container to debug Docker (user perspective) Copyright © 2016 Outlier Bio
  • 49. EC2 auto-scaling groups Copyright © 2016 Outlier BioFrom http://docs.aws.amazon.com/AutoScaling/
  • 50. Dispatching pattern allows fan-in from data sources, fan-out to pipelines Load sample BAM CGHub download S3 download Local file NCBI SRA download CGHub extract FQ SRA extract FQ HTTP/FTP download exome RNA-seq xenograft WGS shRNA Copyright © 2016 Outlier Bio
  • 51. Dispatching pattern allows fan-in from data sources, fan-out to pipelines Load sample CGHub download S3 download Local file NCBI SRA download CGHub extract FQ SRA extract FQ HTTP/FTP download exome RNA-seq xenograft WGS shRNA Copyright © 2016 Outlier Bio Luigi makes this easy
  • 52. • Base Dockerfile with AWS credentials and helpers • Simple metadata DB to map sample IDs  raw data locations – All other filepaths managed by pipeline code – Level-up: store pipeline versions, parameters, batches • Small test dataset for constant integration tests (use Jenkins) Practical implementation notes Copyright © 2016 Outlier Bio
  • 53. Challenges • Filesystem: Still need a BIG, FAST local filesystem for NGS analysis • Luigi has warts – Parameter passing – Delete files to re-run – Lots of code edits to rewire the DAG Copyright © 2016 Outlier Bio
  • 54. 5-year timeline (wishlist?) • More distributed algorithms with ADAM/Spark • Integration with Jupyter notebooks • Open source template for spinning up a secure cloud infrastructure Copyright © 2016 Outlier Bio
  • 55. Live demo??? Copyright © 2016 Outlier Bio

Hinweis der Redaktion

  1. Much of the talk will be about fundamental practice of engineering data workflows, using cancer genomics and cloud as applications
  2. Let’s start with philosophy. In my opinion we have absolutely just scratched the surface of the cancer genomics data available to us. Huge insights can be found by asking simple questions of huge amounts of data. Algorithms are important, but more data almost always trumps better algorithms
  3. And huge amounts of data we have. Everyone has seen this. Sequencing tech crushed Moore’s Law in the last decade Coupled to the fact that we can get rich information from seq data we could never imagine 10 years ago. We now default to sequencing for almost all applications. Anyone not taking full advantage of their DNA sequencing data risks falling behind
  4. But all this is easier said than done. Asking simple questions of big data means being able to manage big data. With the cloud, individuals and small teams have access to tremendous computing power, but bioinformatics scientists are generally not trained to be able to use it
  5. Someone not in bioinformatics recently asked why it was such a big deal to string a few NGS tools together. That reminded me that it does look easy on paper. But under the hood it usually involves many more steps behind the scenes
  6. It is very easy for this to devolve into a mess, especially if there are multiple samples or if you want to make a tweak in a parameter and run it again
  7. Because you want to reprocess as little as possible when you have huge datasets.
  8. In my view pipelines are extremely important. Pipelines are where rubber meets the road, and distinguishes what you COULD do vs what you CAN do. When I was a naïve new grad, my boss used to use “capabilities” as a deliverable of the team and put specific capabilities as yearly objectives. I used to think that was silly, that we’re capable of doing ANYTHING! But it eventually crystallized that getting tools to work together, scale, and integrate with the rest of the company’s data takes a concerted effort, that can only be proven by a push-button pipeline, which are hard to build
  9. Which brings us to the true thesis of this presentation To do things right in data science, you need data engineers. In the real world your analysis needs to work AGAIN and AGAIN The data needs to be correct at every stage Messy data needs to be cleaned and harmonized. All this is the domain of engineers. If you are building a bioinformatics team it is crucial to include engineering expertise to support your computational scientists
  10. I need to mention that there is a single constraint that drives all bioinformatics pipelines. Everything fans in and out of the Linux command-line, using custom file formats
  11. If you’re developing a tool, you’re in an isolated environment, detached from the context of how it will be used. Small test datasets Incentivized by publication => novelty of algorithm No requirement for maintenance, robustness
  12. When building a pipeline you have to make sure the inputs and outputs work together, that the dependencies don’t clash, and you have to manage shuttling the data to your compute environment if you want to scale to the cloud. You need to manage a lot more pieces. This is a common first solution, and it works fine if you have one small analysis you want to do once. Hacked together and fragile, but low cost, quick startup, control, flexibility, and debugging via SSH makes it worth it (compared to commercial services) when doing rapid iterations of pipeline development
  13. So what do we actually need from our pipeline technology?
  14. You quickly run into problems when scaling up moving to production Compute/SGE: When something breaks, you don't want to be commenting out lines of your bash script to restart! Environment: Developing with VMs is slow, difficult to iterate The real missing piece is a workflow system.
  15. Let's step back and define the ideal system. We want to build workflows from composable modules of well-tested components with tight APIs and behavior, that seamlessly transition to the cloud. Modules can be any language. Often Linux command-line for practical purposes but not necessarily Scale for free, never think about parallelization In enterprise, there's probably a data ecosstem you need to integrate with Version of data, code, parameters, environmental dependencies (see reproducible) Pick up where you left off, important for error tolerance. If a module fails, maybe some service was down and it will work again on retry. Don’t want to re-run the entire 10-hour pipeline. Scripts don’t get you there. Hit the “freeze” button and be able to get the exact same result 6 months from now (ideally even on a different compute architecture). Important for academia but also pharma/clinical. Throwaway analyses are never really that, most analyses will need to be re-run
  16. Some best practices are starting to emerge for meeting these req'ts
  17. What people are starting to do is decompose pipelines into workflows of containerized modules. Each module can be any language, and can be tested locally and deployed anywhere.
  18. But as you add these features, it gets more and more complex to manage the system. Eventually, you might want to subscribe to a commercial platform to manage all this complexity. Lots of great options, highly recommended if you don't have the engineering capabilities to manage all of this, and can afford it, and if legal lets you put your data up there
  19. Most commercial cloud bioinformatics platforms are using a similar pattern, and are converging on the same architecture. In many cases you may need to build your own system. Luckily, and you can build this architecture using free/open source tools.
  20. As a quick aside, there is one alternative architecture out there. Right now our field is stuck in what I think of as a local maximum where we're approaching the best we can do with under the constraint of VCF and BAM file formats. Someday we may jump out of it and embrace a better framework. The best contender in my eyes is the Hadoop/Spark framework, and there is an NGS implementation of common GATK and samtools algorithms called ADAM, out of the Berkeley AMPLab, that I highly recommend checking out. GATK will also soon release a version designed for Spark. But the roadblock there is that we need all of our other tools to use this framework, so there needs to be a critical mass of adoption first.
  21. Many free options for building a containerized app-workflow architecture. The major components (in my favored implementation) are S3 for storage, Docker for environments, and Luigi for workflows. Docker most of you have heard of, Luigi I may be introducing to many of you for the first time.
  22. No need to talk much about S3. There is a slight learning curve to get used to object stores, and a little extra code, but you’ll never look back.
  23. What is Docker? Essentially a nice API for developing with linux containers. Containers are kind of like a VM, but bootstrap on top of an existing linux OS to provide all the benefits of virtualization without the heavy VM I probably don't have to go too deep here because the field is starting to learn about Docker, and it’s been a while since I’ve gotten a confused look when I say the word “Dockerized”.
  24. The technology that may need the most introduction is Luigi Before Luigi, Make was actually a really great option for building data pipelines. With make, every task specifies its dependencies and its output target, and you start with what you want to build and work up. Luigi basically takes that idea and moves it into Python, where now Tasks and Targets are objects that can do anything Python can do (which is everything) I've found that Luigi is flexible enough to be the glue for EVERYTHING in your workflow
  25. Luigi can connect any type of task from disparate systems and locations into a unified workflow. If Python can talk to it, you can make it part of a Luigi workflow, and that includes the commercial platforms I mentioned below. We’ve connected internal workflows to apps on commercial cloud platforms via their REST APIs, which worked like a charm. Even manual steps can be wrapped by Luigi. One cool pattern I’ve seen is to wrap a manual step in a Luigi Task, which always raises an exception when it runs. The user creates the output file, and the task never runs again as long as that Target exists.
  26. You can then define SOPs with mixed automatic/manual processes, including various types of activities (compute, manually verify, database ETL, publish visualizations, etc.) into a unified workflow with a single “start” button.
  27. Luigi’s power is its API. Provides a simple way to separate work from dependencies/outputs. Use inheritance/subclassing to reuse code, often overriding input/output methods
  28. Encode all of your filepaths into the “output” method, and never think too much about filepaths again. Completely separate the “what” and “how” from the “where”
  29. If you want to write to S3 instead, just swap the output path
  30. Python can do anything, including spawn a process to run Docker
  31. Infrastructure requires some design and must be adapted to every situation.
  32. One critique might be that you shouldn't need to develop too many pipelines, that gold standard workflows exist on all the commercial and academic platforms. To counter, I want to show a simple example. What if we want to use RNA-seq to validate mutations across TCGA?
  33. Break it down into tasks and outputs. Merge steps get their own task. For loops are allowed in luigi.Task.requires methods and can be used to loop over a predefined set of patients.
  34. How would we do this? Start by building Dockerized apps that talk to S3 Here we put S3 credentials in the base image so they are inherited in the derived Docker images. Don’t bake credentials into any images you want to share on a public registry like DockerHub!
  35. Connect the dots with Luigi.
  36. Connect the dots with Luigi.
  37. Connect the dots with Luigi.
  38. You can imagine if you don’t allow overwriting S3 objects, and you build a well-tested, stateless, side-effect-free container, that always produces the same output with any input, it’s kind of like functional programming. And the benefit of functional programming is easy scalability and composability. When you have that kind of confidence at the component level, you can build complex pipelines with tight modules The engineers at AdRoll wrote a fantastic blog post about using a very similar architecture. This analogy between S3 -> container -> S3 patterns and functional programming was stolen from ideas in this blog post http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html
  39. From a user perpective, there are huge advantages, not least the ability to specify environment as code. You can build up an image from slices, and push and pull those slices to a central registry
  40. You often need a bit of massaging to get data ready for your workflow. You can use specific tasks for each data source, then dispatch the correct task in the “requires” method from within a central “fan-in” task that all downstream workflows depend on.
  41. The choice of which loading task to dispatch can depend on data source URLs or metadata linked to the sample ID Some extra code will be needed to make sure the output of the upstream tasks is passed through as the output of your central task.