Reproducible Bioinformatics Pipelines
with Docker & Anduril
Christian Frech, PhD
Bioinformatician at Children's Cancer Research Institute, Vienna
CeMM Special Seminar
September 25th, 2015
Why care about reproducible pipelines in bioinformatics?
- For your (future) self
  - Quickly re-run an analysis with different parameters/tools
  - Best documentation of how results were produced
- For others
  - Allow others to easily reproduce your findings (“reproducibility crisis”)*
  - Code re-use between projects and colleagues
*) http://theconversation.com/science-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998
Obstacles to computational reproducibility
- Software/script not available (even upon request)
- Black box: code (or even a virtual machine) is available, but there is no documentation on how to run it
- Dependency hell: software and documentation are available, but it is (too) difficult to get it running
- Code rot: code breaks over time due to software updates
- 404 Not Found: unstable URLs, e.g. links to lab homepages
Go figure…
Computational pipelines to the rescue
- In bioinformatics, data analysis typically consists of a series of heterogeneous programs strung together via file-based inputs and outputs
  - Example: FASTQ -> alignment (BWA) -> variant calling (GATK) -> variant annotation (SnpEff) -> custom R script
- Simple automation via (Bash/R/Python/Perl) scripting has its limitations
  - No error checking
  - No partial execution
  - No parallelization
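To make two of these limitations concrete, here is a minimal Python sketch (not taken from the talk) of what a pipeline framework adds on top of a plain script: error checking and partial execution. The function name `run_step` and its arguments are hypothetical; parallelization is deliberately left out.

```python
import subprocess
from pathlib import Path


def run_step(name, command, inputs, output):
    """Run one pipeline step unless its output is already up to date.

    `command` is an argument list (e.g. ["bwa", "mem", ...]); the step is
    assumed to write exactly one file, `output`.
    """
    output = Path(output)
    inputs = [Path(p) for p in inputs]

    # Error checking: fail early if an input file is missing.
    missing = [str(p) for p in inputs if not p.exists()]
    if missing:
        raise FileNotFoundError(f"{name}: missing input(s): {missing}")

    # Partial execution: skip the step if the output is at least as new
    # as every input (the same timestamp logic that make uses).
    if output.exists() and all(
        output.stat().st_mtime >= p.stat().st_mtime for p in inputs
    ):
        return "skipped"

    # Error checking: a non-zero exit status raises CalledProcessError.
    try:
        subprocess.run(command, check=True)
    except subprocess.CalledProcessError:
        # Remove a partially written output so a re-run does not skip the step.
        output.unlink(missing_ok=True)
        raise
    return "ran"
```

Calling `run_step` once per stage of the FASTQ -> alignment -> variant calling chain re-runs only the stages whose inputs changed, and stops at the first stage that fails.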
No shortage of pipeline frameworks
- Script-based
  - GNU Make, Snakemake, Bpipe, Ruffus, Drake, Rake, Nextflow, …
- GUI-based
  - Galaxy, GenePattern, Chipster, Taverna, Pegasus, …
- Various commercial solutions for more standardized workflows (e.g. RNA-seq)
  - Geared toward biologists without programming skills (“point-and-click”)
See also https://www.biostars.org/p/79, https://www.biostars.org/p/91301/
Personal wish list for a pipeline framework
- Script-based (maximum flexibility, minimum overhead)
- Powerful scripting language
- Cluster integration (preferably via Slurm)
- Modular (allow code re-use between projects and colleagues)
- Component library for frequent tasks (e.g. join two CSV files)
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment
What’s wrong with good ol’ GNU make?
PRO
- Available on all Linux platforms
- Stood the test of time (make dates back to the 1970s)
- Rapid development (Bash scripting + target rules)
- Parallel execution of independent targets (-j parameter)
CON
- No cluster support
- Arcane syntax, cryptic pattern rules
- Half-baked multi-output rules
- No type checking (everything is a generic file)
- Difficult to modularize (code re-use)
- Rebuild not triggered by recipe change
- No reporting
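The point about recipe changes can be illustrated with a small Python sketch. make decides what to rebuild from file timestamps alone, so editing a recipe leaves existing targets untouched. One common alternative, sketched here under my own assumptions (it is not how any particular framework is confirmed to work), is to store a hash of the recipe text plus the input contents and rebuild whenever that hash changes. All function names and the JSON state file are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def step_signature(recipe, input_paths):
    """Hash the recipe text together with the contents of all inputs."""
    digest = hashlib.sha256()
    digest.update(recipe.encode("utf-8"))
    for path in sorted(str(p) for p in input_paths):
        digest.update(path.encode("utf-8"))
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def _load_state(state_file):
    """Read the stored signatures, or start empty if none were saved yet."""
    state_file = Path(state_file)
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def needs_rebuild(target, recipe, input_paths, state_file):
    """True if the target is missing or recipe/inputs changed since the last build."""
    if not Path(target).exists():
        return True
    stored = _load_state(state_file).get(str(target))
    return stored != step_signature(recipe, input_paths)


def record_build(target, recipe, input_paths, state_file):
    """Store the current signature after a successful build."""
    state = _load_state(state_file)
    state[str(target)] = step_signature(recipe, input_paths)
    Path(state_file).write_text(json.dumps(state, indent=2))
```

With this scheme, changing a single command-line option in the recipe is enough to mark the target as stale, which timestamp comparison cannot detect.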
Anduril
http://www.anduril.org
- Developed since 2008 at the Biomedicum Systems Biology Laboratory, Helsinki, Finland
  - http://research.med.helsinki.fi/gsb/hautaniemi/
- Built for scientific data analysis with a focus on bioinformatics
- Custom (Anduril-specific) workflow scripting language “Anduril script”
  - Possibility to embed native code (Bash/R/Python/Perl)
  - Version 2 will switch to Scala
- Open source & free
  - Significo (http://www.significo.fi/) is a commercial spin-off offering Anduril consulting services
- No widespread adoption (yet?)
Anduril features
- Script-based (maximum flexibility, less overhead)
- Expressive scripting language
- Cluster integration (preferably via Slurm)
- Modular to allow code re-use (between projects and colleagues)
- Ready-made component library for frequent analysis steps
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment [X]
Example workflow: RNA-seq alignment with GSNAP
Anduril script:
// collect the BAM files whose names match the pattern
inputBamDir = INPUT(path="/data/bam", recursive=false)
inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")
alignedBams = record()
// run one GSNAP alignment per BAM file
for bam : std.iterArray(inputBamFiles) {
    gsnap = GSNAP (
        reads = INPUT(path=bam.file),
        options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
        @cpu = 10,
        @memory = 40000,
        @name = "gsnap_" + bam.key
    )
    alignedBams[bam.key] = gsnap.alignment
}
Execute with (distributed execution on cluster):
$ anduril run workflow.and --exec-mode slurm
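The file pattern in the workflow contains one capture group, `([^_]+)`. The Python sketch below shows what that regular expression extracts from a file name. It assumes that the captured text is what ends up as `bam.key` in the loop, which is inferred from the slide rather than confirmed by the Anduril documentation; the file names in the usage note are made up.

```python
import re

# Same pattern as in the Anduril workflow: the first capture group picks up
# the text between "C57C3ACXX_CV_" and the next underscore.
FILE_PATTERN = re.compile(r"C57C3ACXX_CV_([^_]+)_.*[.]bam$")


def files_to_record(file_names):
    """Map the captured key of every matching file name to that file name."""
    record = {}
    for name in file_names:
        match = FILE_PATTERN.search(name)
        if match:  # files that do not match (e.g. BAM index files) are ignored
            record[match.group(1)] = name
    return record
```

For the hypothetical names `C57C3ACXX_CV_sampleA_lane1.bam` and `C57C3ACXX_CV_sampleA_lane1.bam.bai`, only the first matches and its key is `sampleA`, so each alignment job would be named `gsnap_sampleA`.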
Embedding native R code in Anduril script
Convert UCSC to Ensembl chromosome names in a CSV file containing the column ‘chrom’:
ensembl = REvaluate(
    table1 = ucsc,
    script = StringInput(content=
        '''
        table.out <- table1
        table.out$chrom <- gsub("^chr", "", table.out$chrom)
        '''
    )
)
Inlining of Bash, Python, Java, and Perl scripts is also supported.
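For comparison, the same conversion can be sketched in plain Python. Two things here are assumptions of mine and not part of the slide: the input is treated as comma-delimited, and the mitochondrial chromosome gets special handling, because UCSC calls it `chrM` while Ensembl calls it `MT` (stripping the prefix alone would yield `M`).

```python
import csv
import io


def ucsc_to_ensembl(chrom):
    """Convert one UCSC chromosome name (e.g. 'chr1') to Ensembl style ('1')."""
    if chrom == "chrM":
        return "MT"  # Ensembl names the mitochondrial genome MT
    return chrom[3:] if chrom.startswith("chr") else chrom


def convert_table(csv_text):
    """Rewrite the 'chrom' column of a CSV table given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["chrom"] = ucsc_to_ensembl(row["chrom"])
        writer.writerow(row)
    return out.getvalue()
```

Names that carry no `chr` prefix are passed through unchanged, so running the conversion twice is harmless.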
Anduril features
- Script-based (maximum flexibility, less overhead)
- Expressive scripting language
- Cluster integration (preferably via Slurm)
- Modular to allow code re-use (between projects and colleagues)
- Ready-made component library for frequent analysis steps
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment [?]
Docker
- “Lightweight” virtualization technology for Linux-based systems
- Processes run in isolated namespaces (“containers”) but share the same kernel
- Like VMs: containers are portable between systems -> reproducibility!
- Unlike VMs: instant startup, no resource pre-allocation -> better hardware utilization
[Figure: VM vs. container]
How to bundle workflow with execution environment?
Solution 1 (diagram): one container holding Anduril, the workflow, and Components 1 to 3
- Pro: Single container, easy to maintain
- Con: VM-like approach; huge, monolithic container, difficult to share (against Docker philosophy)
Solution 2 (diagram): Anduril and the workflow, with Component 1 in Container A, Component 2 in Container B, and Component 3 in Container C
- Pro: Completely modularized, easy to re-use/share workflow components
- Con: “container hell”?
Hybrid solution
- Pro: Workflow completely containerized (= portable); only shared components in common containers
- Con: Still (but greatly reduced) overhead for container maintenance
Diagram (master container with Anduril and the workflow, Container A, Components 1 to 3):
- Master container: project- and user-specific components are installed in the master container
- Common container (e.g. container “RNA-seq”): shared components are installed here (“Docker inside Docker”)
Dockerized GSNAP in Anduril
inputBamDir = INPUT(path="/data/bam", recursive=false)
inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")
alignedBams = record()
for bam : std.iterArray(inputBamFiles) {
    gsnap = GSNAP (
        reads = INPUT(path=bam.file),
        options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
        // Docker image in which this component is executed
        docker = "cfrech/anduril-gsnap-2015-09-21",
        @cpu = 10,
        @memory = 40000,
        @name = "gsnap_" + bam.key
    )
    alignedBams[bam.key] = gsnap.alignment
}
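How the `docker` parameter is turned into a container invocation is not shown in the slides. As a rough illustration only, the Python sketch below builds a `docker run` command line for one component. The helper name `docker_command`, the mount points, the reading of `@memory` as megabytes, and the assumption that the image has `gsnap` on its PATH are all mine; the Docker options used (`--rm`, `-v`, `--cpus`, `--memory`) are standard `docker run` options.

```python
import shlex


def docker_command(image, command, mounts, cpus=None, memory_mb=None):
    """Build the argument list for running `command` inside `image`.

    `mounts` is a list of (host_dir, container_dir) pairs that makes the
    input and output directories visible inside the container.
    """
    argv = ["docker", "run", "--rm"]  # --rm: remove the container when it exits
    for host_dir, container_dir in mounts:
        argv += ["-v", f"{host_dir}:{container_dir}"]
    if cpus is not None:
        argv += ["--cpus", str(cpus)]
    if memory_mb is not None:
        argv += ["--memory", f"{memory_mb}m"]
    argv.append(image)
    argv += shlex.split(command)
    return argv


if __name__ == "__main__":
    # Hypothetical call mirroring the parameters of the GSNAP component above.
    print(" ".join(docker_command(
        "cfrech/anduril-gsnap-2015-09-21",
        "gsnap --npaths=1 --max-mismatches=1 --novelsplicing=0",
        mounts=[("/data/bam", "/data/bam")],
        cpus=10,
        memory_mb=40000,
    )))
```

Mounting the data directory under the same path inside the container keeps the file paths used by the workflow valid on both sides.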
So, Anduril is great… but
- Custom (Anduril-specific) scripting language
  - Biggest hurdle for widespread adoption IMO
  - Will likely improve with version 2 (which uses Scala)
- Documentation opaque for beginners
  - WANTED: Simple step-by-step guide to build your first Anduril workflow
- High upfront investment to get going (because of the above)
- In-lining Bash/R/Perl/Python should be simpler
  - Currently too much clutter when using “BashEvaluate” and the like
- Coding in Anduril sometimes “feels heavy” compared to other frameworks (e.g. GNU Make)
  - Will improve with fluency in the workflow scripting language
Anduril RNA-seq case study
Step 1: Configure Anduril workflow
title = "My project long title"
shortName = "My project short title"
authors = "Christian Frech"
// analyses to run
runNetworkAnalysis = true
runMutationAnalysis = true
runGSEA = true
// constants
PROJECT_BASE = "/mnt/projects/myproject"
gtf = INPUT(path=PROJECT_BASE+"/data/Homo_sapiens.GRCh37.75.etv6runx1.gtf.gz")
referenceGenomeFasta = INPUT(path="/data/reference/human_g1k_v37.fasta")
...
+ description of samples, sample groups, and group comparisons in an external CSV file
Step 2: Run Anduril workflow on cluster
$ anduril run main.and --exec-mode slurm
Step 3: Go for lunch
Step 4: Study PDF report
What follows are screenshots from this PDF report
QC: Read counts
QC: Gene body coverage
QC: Distribution of expression values per sample
QC: Sample PCA & heatmap
Volcano plot for each comparison
Table report of DEGs for each comparison
Expression values of top differentially expressed genes per comparison
GO term enrichment for each comparison
Interaction network of DEGs for each comparison
Chromosomal distribution of DEGs
GSEA heat map summarizing all comparisons
Rows = enriched gene sets
Columns = comparisons
Value = normalized enrichment score (NES)
Red = enriched for up-regulated genes
Blue = enriched for down-regulated genes
* = significant (FDR < 0.05)
** = highly significant (FDR < 0.01)
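The legend above amounts to a small mapping from a normalized enrichment score (NES) and its FDR to a cell colour and label. A minimal Python sketch of that mapping follows; the function names and the two-decimal label format are my own choices, while the thresholds and colours are the ones given in the legend.

```python
def significance_label(fdr):
    """Return the star annotation used in the heat map legend."""
    if fdr < 0.01:
        return "**"  # highly significant
    if fdr < 0.05:
        return "*"   # significant
    return ""


def heatmap_cell(nes, fdr):
    """Return (colour, label) for one gene set / comparison cell."""
    if nes > 0:
        colour = "red"    # enriched for up-regulated genes
    elif nes < 0:
        colour = "blue"   # enriched for down-regulated genes
    else:
        colour = "white"
    return colour, f"{nes:.2f}{significance_label(fdr)}"
```

A gene set with NES 1.8 and FDR 0.003 would thus be drawn as a red cell labelled `1.80**`.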
Future developments
- Push new Anduril components to public repository (needs some refactoring, documentation, test cases)
- Help on Anduril2 manuscript
- Port custom Makefiles to Anduril (ongoing)
- Cloud deployment of dockerized workflow
  - Couple Slurm to AWS EC2
  - Automatic spin-up of Docker-enabled AMIs serving as computing nodes
In the (not so) distant future …
$ docker pull cfrech/frech2015_et_al
$ docker run cfrech/frech2015_et_al --use-cloud --max-nodes 300 --out output
$ evince output/figure1.pdf
Further reading
- Discussion thread on Docker & Anduril: https://groups.google.com/forum/#!msg/anduril-dev/Et8-YG9O-Aw
Acknowledgement
- Marko Laakso (Significo)
- Sirku Kaarinen (Significo)
- Kristian Ovaska (Valuemotive)
- Pekka Lehti (Valuemotive)
- Ville Rantanen (University of Helsinki, Hautaniemi lab)
- Nuno Andrade (CCRI)
- Andreas Heitger (CCRI)

Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxMedical College
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxkumarsanjai28051
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxzaydmeerab121
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPRPirithiRaju
 
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...HafsaHussainp
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 
Abnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxAbnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxzeus70441
 

Kürzlich hochgeladen (20)

bonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlsbonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girls
 
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clone
 
linear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annovalinear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annova
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPR
 
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
 
Interferons.pptx.
Interferons.pptx.Interferons.pptx.
Interferons.pptx.
 
How we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxHow we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptx
 
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptx
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptx
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
PLASMODIUM. PPTX
PLASMODIUM. PPTXPLASMODIUM. PPTX
PLASMODIUM. PPTX
 
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
 
AZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTXAZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTX
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
Abnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxAbnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptx
 

Reproducible bioinformatics pipelines with Docker and Anduril

  • 1. Reproducible Bioinformatics Pipelines with Docker & Anduril. Christian Frech, PhD, Bioinformatician at Children’s Cancer Research Institute, Vienna. CeMM Special Seminar, September 25th, 2015
  • 2. Why care about reproducible pipelines in bioinformatics?
      - For your (future) self
          - Quickly re-run an analysis with different parameters/tools
          - Best documentation of how results were produced
      - For others
          - Allow others to easily reproduce your findings (“reproducibility crisis”)*
          - Code re-use between projects and colleagues
      *) http://theconversation.com/science-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998
  • 3. Obstacles to computational reproducibility
      - Software/script not available (even upon request)
      - Black box: code (or even a virtual machine) available, but no documentation on how to run it
      - Dependency hell: software and documentation available, but (too) difficult to get running
      - Code rot: code breaks over time due to software updates
      - 404 Not Found: unstable URLs, e.g. links to lab homepages
      Go figure…
  • 4. Computational pipelines to the rescue
      - In bioinformatics, data analysis typically consists of a series of heterogeneous programs strung together via file-based inputs and outputs
          - Example: FASTQ -> alignment (BWA) -> variant calling (GATK) -> variant annotation (SnpEff) -> custom R script
      - Simple automation via (Bash/R/Python/Perl) scripting has its limitations
          - No error checking
          - No partial execution
          - No parallelization
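The three limitations listed on slide 4 can be illustrated with a minimal Python sketch. It is not how Anduril or any other framework implements them; the function names (`is_up_to_date`, `run_step`, `run_parallel`) are hypothetical, and the timestamp comparison is the same simple heuristic that make-style tools use.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def is_up_to_date(inputs, output):
    """True if output exists and is at least as new as every input."""
    if not os.path.exists(output):
        return False
    out_mtime = os.path.getmtime(output)
    return all(os.path.getmtime(path) <= out_mtime for path in inputs)


def run_step(command, inputs, output):
    """Run one step unless its output is current (partial execution)."""
    if is_up_to_date(inputs, output):
        return "skipped"
    # check=True raises CalledProcessError on a non-zero exit (error checking)
    subprocess.run(command, check=True)
    return "ran"


def run_parallel(steps, max_workers=4):
    """Run independent steps concurrently; each step is (command, inputs, output)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_step, *step) for step in steps]
        # result() re-raises any exception from a failed step
        return [future.result() for future in futures]
```

A plain shell script would need all three behaviours written by hand for every step, which is the overhead a pipeline framework is meant to remove.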
  • 5. No shortage of pipeline frameworks
      - Script-based
          - GNU Make, Snakemake, Bpipe, Ruffus, Drake, Rake, Nextflow, …
      - GUI-based
          - Galaxy, GenePattern, Chipster, Taverna, Pegasus, …
      - Various commercial solutions for more standardized workflows (e.g. RNA-seq)
      - Geared toward biologists without programming skills (“point-and-click”)
      See also https://www.biostars.org/p/79, https://www.biostars.org/p/91301/
  • 6. Personal wish list for pipeline framework
      - Script-based (maximum flexibility, minimum overhead)
      - Powerful scripting language
      - Cluster integration (preferably via Slurm)
      - Modular (allow code re-use between projects and colleagues)
      - Component library for frequent tasks (e.g. join two CSV files)
      - Reporting (HTML, PDF) to share results
      - Free & open-source
      - Bundle scripts/data with execution environment
  • 7. What’s wrong with good ol’ GNU make?
      PRO
      - Available on all Linux platforms
      - Stood the test of time (make dates back to the 1970s)
      - Rapid development (Bash scripting + target rules)
      - Parallel execution of independent targets (-j parameter)
      CON
      - No cluster support
      - Arcane syntax, cryptic pattern rules
      - Half-baked multi-output rules
      - No type checking (everything is a generic file)
      - Difficult to modularize (code re-use)
      - Rebuild not triggered by recipe change
      - No reporting
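One of the CON items, "rebuild not triggered by recipe change", can be made concrete with a small Python sketch. The idea shown here (storing a fingerprint of the command next to the output) is one common workaround, not something the slides describe; the helper names and the `.recipe` stamp suffix are hypothetical.

```python
import hashlib
import json
import os


def recipe_fingerprint(command):
    """Stable SHA-256 digest of a command given as a list of strings."""
    payload = json.dumps(command).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_rebuild(command, output, stamp_suffix=".recipe"):
    """True if the output is missing or was produced by a different command."""
    stamp = output + stamp_suffix
    if not (os.path.exists(output) and os.path.exists(stamp)):
        return True
    with open(stamp) as handle:
        return handle.read().strip() != recipe_fingerprint(command)


def record_recipe(command, output, stamp_suffix=".recipe"):
    """Store the fingerprint after a successful build."""
    with open(output + stamp_suffix, "w") as handle:
        handle.write(recipe_fingerprint(command))
```

Timestamps alone cannot detect a changed option such as a different mismatch threshold; comparing the stored fingerprint with the current command can.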
  • 9. Anduril
      - Developed since 2008 at Biomedicum Systems Biology Laboratory, Helsinki, Finland
          - http://research.med.helsinki.fi/gsb/hautaniemi/
      - Built for scientific data analysis with focus on bioinformatics
      - Proprietary workflow scripting language “Anduril script”
          - Possibility to embed native code (Bash/R/Python/Perl)
          - Version 2 will switch to Scala
      - Open source & free
      - Significo (http://www.significo.fi/) is a commercial spin-off offering Anduril consulting services
      - No widespread adoption (yet?)
  • 10. Anduril features
      - Script-based (maximum flexibility, less overhead)
      - Expressive scripting language
      - Cluster integration (preferably via Slurm)
      - Modular to allow code re-use (between projects and colleagues)
      - Ready-made component library for frequent analysis steps
      - Reporting (HTML, PDF) to share results
      - Free & open-source
      - Bundle scripts/data with execution environment [X]
  • 11. Example workflow: RNA-seq alignment with GSNAP
      Anduril script:
          inputBamDir = INPUT(path="/data/bam", recursive=false)
          inputBamFiles = Folder2Array(folder1 = inputBamDir,
                                       filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")
          alignedBams = record()
          for bam : std.iterArray(inputBamFiles) {
              gsnap = GSNAP (
                  reads   = INPUT(path=bam.file),
                  options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
                  @cpu    = 10,
                  @memory = 40000,
                  @name   = "gsnap_" + bam.key
              )
              alignedBams[bam.key] = gsnap.alignment
          }
      Execute with (distributed execution on cluster):
          $ anduril run workflow.and --exec-mode slurm
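The `filePattern` in the workflow above is an ordinary regular expression with one capture group. The Python sketch below shows what that pattern matches; it assumes the captured group is what ends up as `bam.key` in the loop, which the slide suggests but does not state. The file names in the usage example are made up.

```python
import re

# Same pattern as in the Anduril script on slide 11
FILE_PATTERN = re.compile(r"C57C3ACXX_CV_([^_]+)_.*[.]bam$")


def sample_keys(filenames):
    """Map the captured sample ID to the file name for every matching file."""
    keys = {}
    for name in filenames:
        match = FILE_PATTERN.search(name)
        if match:
            # group(1) is the text between "C57C3ACXX_CV_" and the next underscore
            keys[match.group(1)] = name
    return keys


# Hypothetical file names, for illustration only
files = [
    "C57C3ACXX_CV_S1_L001.bam",
    "C57C3ACXX_CV_S2_L001.bam",
    "C57C3ACXX_CV_S2_L001.bam.bai",  # index file: does not end in ".bam"
    "unrelated.bam",
]
print(sample_keys(files))
```

Because the pattern is anchored with `$`, index files such as `.bam.bai` are skipped, and files without the flowcell prefix never match.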
  • 12. Embedding native R code in Anduril script
      Convert UCSC to Ensembl chromosome names in a CSV file containing column ‘chrom’:
          ensembl = REvaluate(
              table1 = ucsc,
              script = StringInput(content=
                  '''
                  table.out <- table1
                  table.out$chrom <- gsub("^chr", "", table.out$chrom)
                  '''
              )
          )
      Also supports inlining of Bash, Python, Java, and Perl scripts
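For readers less familiar with R, the same transformation can be sketched in plain Python. This is an illustration of what the embedded `gsub("^chr", "", ...)` call does, not an Anduril component; the function name `ucsc_to_ensembl` and the comma delimiter are assumptions.

```python
import csv
import io
import re


def ucsc_to_ensembl(csv_text, column="chrom"):
    """Strip a leading 'chr' from one column of a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Same regular expression as the R gsub() call on the slide
        row[column] = re.sub(r"^chr", "", row[column])
        writer.writerow(row)
    return out.getvalue()


print(ucsc_to_ensembl("chrom,pos\nchr1,100\nchrX,200\n"))
```

Like the R snippet, this only removes the prefix; any names that differ in more than the prefix would need an explicit mapping.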
  • 13. Anduril features
      - Script-based (maximum flexibility, less overhead)
      - Expressive scripting language
      - Cluster integration (preferably via Slurm)
      - Modular to allow code re-use (between projects and colleagues)
      - Ready-made component library for frequent analysis steps
      - Reporting (HTML, PDF) to share results
      - Free & open-source
      - Bundle scripts/data with execution environment [?]
  • 14. Docker
      - “Lightweight” virtualization technology for Unix-based systems
      - Processes run in isolated namespaces (“containers”), but share the same kernel
      - Like VMs: containers are portable between systems -> reproducibility!
      - Unlike VMs: instant startup, no resource pre-allocation -> better hardware utilization
      [Figure: VM vs. container]
  • 15. How to bundle workflow with execution environment?
      - Solution 1 (diagram: a single container holding Anduril, the workflow, and Components 1-3)
          - Pro: single container, easy to maintain
          - Con: VM-like approach; huge, monolithic container, difficult to share (against Docker philosophy)
      - Solution 2 (diagram: workflow and Anduril, with Component 1 in Container A, Component 2 in Container B, Component 3 in Container C)
          - Pro: completely modularized, easy to re-use/share workflow components
          - Con: “container hell”?
  • 16. Hybrid solution
      [Figure labels: master container, workflow, Anduril, Container A, Components 1-3; “Docker inside Docker”]
      - Project- and user-specific components installed in master container
      - Shared components installed in common container (e.g. container “RNA-seq”)
      - Pro: workflow completely containerized (= portable); only shared components in common containers
      - Con: still (but greatly reduced) overhead for container maintenance
  • 17. Dockerized GSNAP in Anduril
          inputBamDir = INPUT(path="/data/bam", recursive=false)
          inputBamFiles = Folder2Array(folder1 = inputBamDir,
                                       filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")
          alignedBams = record()
          for bam : std.iterArray(inputBamFiles) {
              gsnap = GSNAP (
                  reads   = INPUT(path=bam.file),
                  options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
                  docker  = "cfrech/anduril-gsnap-2015-09-21",
                  @cpu    = 10,
                  @memory = 40000,
                  @name   = "gsnap_" + bam.key
              )
              alignedBams[bam.key] = gsnap.alignment
          }
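The slides do not show how the `docker` parameter is turned into an actual container invocation. As a rough illustration only, the Python sketch below assembles a `docker run` command line from an image name, tool arguments, and directory mounts. The helper `docker_command` is hypothetical, and it is an assumption that the image exposes a `gsnap` executable and that `/data/bam` is the only directory that needs mounting.

```python
import shlex


def docker_command(image, tool_args, mounts):
    """Build (but do not execute) a 'docker run' argument list."""
    cmd = ["docker", "run", "--rm"]  # --rm removes the container when it exits
    for host_dir, container_dir in mounts:
        # -v makes a host directory visible inside the container
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    cmd.append(image)
    cmd += tool_args
    return cmd


cmd = docker_command(
    image="cfrech/anduril-gsnap-2015-09-21",  # image name from the slide
    tool_args=["gsnap"] + shlex.split("--npaths=1 --max-mismatches=1 --novelsplicing=0"),
    mounts=[("/data/bam", "/data/bam")],  # assumed mount, same path inside and outside
)
print(shlex.join(cmd))
```

Mounting the data directory at the same path inside the container keeps the file paths in the workflow valid regardless of where a component runs.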
  • 18. So, Anduril is great… but
      - Proprietary scripting language
          - Biggest hurdle for widespread adoption IMO
          - Will likely improve with version 2 (which uses Scala)
      - Documentation opaque for beginners
          - WANTED: simple step-by-step guide to build your first Anduril workflow
      - High upfront investment to get going (because of the above)
      - In-lining Bash/R/Perl/Python should be simpler
          - Currently too much clutter when using “BashEvaluate” and the like
      - Coding in Anduril sometimes “feels heavy” compared to other frameworks (e.g. GNU Make)
          - Will improve with fluency in workflow scripting language
  • 20. RNA-seq case study, Step 1: Configure Anduril workflow
          title = "My project long title"
          shortName = "My project short title"
          authors = "Christian Frech"

          // analyses to run
          runNetworkAnalysis = true
          runMutationAnalysis = true
          runGSEA = true

          // constants
          PROJECT_BASE="/mnt/projects/myproject"
          gtf = INPUT(path=PROJECT_BASE+"/data/Homo_sapiens.GRCh37.75.etv6runx1.gtf.gz")
          referenceGenomeFasta = INPUT(path="/data/reference/human_g1k_v37.fasta")
          ...
      + description of samples, sample groups, and group comparisons in external CSV file
  • 21. RNA-seq case study, Step 2: Run Anduril workflow on cluster
          $ anduril run main.and --exec-mode slurm
  • 22. RNA-seq case study, Step 3: Go for lunch
  • 23. RNA-seq case study, Step 4: Study PDF report
  • 24. What follows are screenshots from this PDF report
  • 26. QC: Gene body coverage
  • 27. QC: Distribution of expression values per sample
  • 28. QC: Sample PCA & heatmap
  • 29. Volcano plot for each comparison
  • 30. Table report of DEGs for each comparison
  • 31. Expression values of top differentially expressed genes per comparison
  • 32. GO term enrichment for each comparison
  • 33. Interaction network of DEGs for each comparison
  • 35. GSEA heat map summarizing all comparisons
      - Rows = enriched gene sets
      - Columns = comparisons
      - Value = normalized enrichment score (NES)
      - Red = enriched for up-regulated genes
      - Blue = enriched for down-regulated genes
      - * = significant (FDR < 0.05)
      - ** = highly significant (FDR < 0.01)
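The legend of the GSEA heat map amounts to a simple rule per cell, sketched below in Python. The thresholds and colours are taken from the slide; the function names are hypothetical, the treatment of an NES of exactly zero is an assumption, and the actual report is not generated with this code.

```python
def significance_label(fdr):
    """Star annotation for one heat map cell, following the slide's legend."""
    if fdr < 0.01:
        return "**"  # highly significant
    if fdr < 0.05:
        return "*"   # significant
    return ""


def cell_colour(nes):
    """Red for gene sets enriched among up-regulated genes, blue for down-regulated."""
    if nes > 0:
        return "red"
    if nes < 0:
        return "blue"
    return "white"  # assumption: neutral colour for NES == 0


print(cell_colour(1.8), significance_label(0.003))
```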
  • 36. Future developments
      - Push new Anduril components to public repository (needs some refactoring, documentation, test cases)
      - Help on Anduril2 manuscript
      - Port custom Makefiles to Anduril (ongoing)
      - Cloud deployment of dockerized workflow
          - Couple Slurm to AWS EC2
          - Automatic spin-up of Docker-enabled AMIs serving as computing nodes
  • 37. In the (not so) distant future …
          $ docker pull cfrech/frech2015_et_al
          $ docker run cfrech/frech2015_et_al --use-cloud --max-nodes 300 --out output
          $ evince output/figure1.pdf
  • 38. Further reading
      - Discussion thread on Docker & Anduril: https://groups.google.com/forum/#!msg/anduril-dev/Et8-YG9O-Aw
  • 39. Acknowledgement
      - Marko Laakso (Significo)
      - Sirku Kaarinen (Significo)
      - Kristian Ovaska (Valuemotive)
      - Pekka Lehti (Valuemotive)
      - Ville Rantanen (University of Helsinki, Hautaniemi lab)
      - Nuno Andrade (CCRI)
      - Andreas Heitger (CCRI)