Why is Bioinformatics (well, really, “genomics”) a Good Fit for Spark?
Timothy Danford
AMPLab
A One-Slide Introduction to Genomics
Bioinformatics computation is batch processing and workflows
● Bioinformatics has a lot of “workflow engines”
○ Galaxy, Taverna, Firehose, Zamboni, Queue, Luigi, bPipe
○ bash scripts
○ even make, fer cryin’ out loud
○ a new one every day
● Bioinformatics software development is still largely a research activity
State-of-the-Art infrastructure: shared filesystems, handwritten parallelism
● Hand-written task creation
● File formats instead of APIs or data models
○ formats are poorly defined
○ contain optional or redundant fields
○ semantics are unclear
● Workflow engines can’t take advantage of common parallelism between stages
So, why Spark?
Most of Genomics is 1-D Geometry
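One way to read the "1-D geometry" claim: reads, genes, and variants can all be modeled as intervals along a chromosome, so many questions reduce to interval tests such as overlap. The sketch below is a minimal illustration, assuming 0-based half-open coordinates; the `Region` type and its field names are hypothetical and are not taken from the talk or from ADAM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A half-open interval [start, end) on a named contig (hypothetical type)."""
    contig: str
    start: int
    end: int

    def overlaps(self, other: "Region") -> bool:
        # Two intervals overlap when they share a contig and
        # each one starts before the other one ends.
        return (self.contig == other.contig
                and self.start < other.end
                and other.start < self.end)


read = Region("chr1", 100, 150)
gene = Region("chr1", 140, 900)
print(read.overlaps(gene))
```

With half-open coordinates, two regions that merely touch (one ends where the other starts) do not count as overlapping.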
The rest is iterative evaluation of 
probabilistic models!
Spark RDDs and Partitioners allow declarative parallelization for genomics
● Genomics computation is parallelized in a small, standard number of ways
○ by position
○ by sample
● Declarative, flexible partitioning schemes are useful
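The two partitioning schemes named above can be sketched as interchangeable key functions. This is a plain-Python illustration rather than Spark code: the record layout, the `BIN_SIZE` value, and the function names are all assumptions made for the example. In Spark, the same idea would be expressed by supplying a partitioner for a keyed RDD.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # assumed bin width in base pairs (illustrative value)


def by_position(record):
    # Records on the same contig and in the same fixed-width bin share a key.
    return (record["contig"], record["start"] // BIN_SIZE)


def by_sample(record):
    # All records from one sample share a key.
    return record["sample"]


def partition(records, key_fn):
    """Group records by the key that the chosen scheme assigns to them."""
    parts = defaultdict(list)
    for record in records:
        parts[key_fn(record)].append(record)
    return dict(parts)


records = [
    {"sample": "s1", "contig": "chr1", "start": 10},
    {"sample": "s2", "contig": "chr1", "start": 20},
    {"sample": "s1", "contig": "chr1", "start": 2_500_000},
]

print(sorted(partition(records, by_position)))
print(sorted(partition(records, by_sample)))
```

Swapping the key function changes the parallelization scheme without touching the grouping logic, which is the sense in which the partitioning is declarative.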
Spark can easily express genomics primitives: join by genomic overlap
1. Calculate disjoint regions based on left (blue) set
2. Partition both sets by disjoint regions
3. Merge-join within each partition
4. (Optional) aggregation across joined pairs
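The four steps above can be sketched on a single machine as follows. This is a plain-Python illustration under simplifying assumptions (one contig, half-open `(start, end)` tuples, partitions processed in a loop instead of in parallel); the function names are hypothetical and this is not ADAM's implementation.

```python
from collections import Counter


def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]


def disjoint_regions(left):
    """Step 1: merge overlapping left intervals into sorted, disjoint regions."""
    merged = []
    for start, end in sorted(left):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def partition_by_region(intervals, regions):
    """Step 2: assign each interval to every disjoint region it overlaps."""
    parts = {region: [] for region in regions}
    for interval in intervals:
        for region in regions:
            if overlaps(interval, region):
                parts[region].append(interval)
    return parts


def merge_join(left, right):
    """Step 3: sweep two start-sorted lists and emit overlapping pairs."""
    left, right = sorted(left), sorted(right)
    pairs, active, j = [], [], 0
    for l in left:
        # Admit right intervals that start before this left interval ends.
        while j < len(right) and right[j][0] < l[1]:
            active.append(right[j])
            j += 1
        # Left starts never decrease, so a right interval that ends at or
        # before this start cannot overlap any later left interval.
        active = [r for r in active if r[1] > l[0]]
        pairs.extend((l, r) for r in active if overlaps(l, r))
    return pairs


def overlap_join(left, right):
    regions = disjoint_regions(left)
    left_parts = partition_by_region(left, regions)
    right_parts = partition_by_region(right, regions)
    pairs = []
    for region in regions:
        pairs.extend(merge_join(left_parts[region], right_parts[region]))
    return pairs


left = [(0, 10), (5, 15), (30, 40)]
right = [(8, 12), (14, 31), (50, 60)]
pairs = overlap_join(left, right)

# Step 4 (optional): aggregate across joined pairs, here a count per left interval.
counts = Counter(l for l, _ in pairs)
print(pairs)
print(counts)
```

Because every left interval falls inside exactly one disjoint region, each joined pair is produced in exactly one partition, even when a right interval is copied into several partitions.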
ADAM is Genomics + Spark 
● A rewrite of core bioinformatics tools and algorithms in Spark 
● Combines three technologies
○ Spark 
○ Parquet 
○ Avro 
● Apache 2-licensed 
● Started at the AMPLab 
http://bdgenomics.org/
Avro and Parquet are just as critical to ADAM as Spark
● Avro to define data models
● Parquet for serialization format
● Still need to answer design questions
○ how wide are the schemas?
○ how much do we follow existing formats?
○ how do we carry through projections?
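To make the schema-width and projection questions concrete, here is a minimal sketch of an Avro-style record schema (Avro schemas are written as JSON) and a projection that narrows it to a subset of fields. The `Read` record and its fields are hypothetical and are not ADAM's actual schema; `project` is an illustrative helper, not an Avro or Parquet API.

```python
import json

# Hypothetical record schema in Avro's JSON layout (not ADAM's real schema).
READ_SCHEMA = {
    "type": "record",
    "name": "Read",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "sequence", "type": "string"},
        {"name": "mapq", "type": ["null", "int"], "default": None},
    ],
}


def project(schema, wanted):
    """Return a narrower schema that keeps only the requested fields."""
    fields = [f for f in schema["fields"] if f["name"] in wanted]
    missing = set(wanted) - {f["name"] for f in fields}
    if missing:
        raise KeyError(f"unknown fields: {sorted(missing)}")
    return {**schema, "fields": fields}


narrow = project(READ_SCHEMA, {"contig", "start", "end"})
print(json.dumps(narrow, indent=2))
```

With a columnar format such as Parquet, a narrower projection means fewer columns need to be read from disk, which is why the width of the schema and the handling of projections are linked design questions.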
Still need to convince bioinformaticians to 
rewrite their software! 
● A single piece of a single filtering stage for a somatic variant caller
● “11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order
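Stated purely in coordinates, an 11-base-pair window centered on a candidate position does not depend on any file format or sort order. A minimal sketch, assuming 0-based half-open coordinates; the function name, parameters, and the clamping behavior at contig ends are assumptions for illustration and are not taken from the cited paper.

```python
def centered_window(position, width=11, contig_length=None):
    """Return the half-open window [start, end) of `width` bases centered on `position`."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a center")
    half = width // 2
    start = max(0, position - half)          # clamp at the start of the contig
    end = position + half + 1
    if contig_length is not None:
        end = min(end, contig_length)        # clamp at the end of the contig
    return start, end


start, end = centered_window(100)
print(start, end, end - start)
```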
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
The Future: Distributed and Incremental?
● Today: 5k samples × 20 Gb / sample
● Tomorrow: 1M+ samples @ 200+ Gb / sample?
● More and more analysis is aggregative
○ joint variant calling
○ panels of normal samples
○ collective variant annotation
● And “data collection” will never be finished
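The jump between the "today" and "tomorrow" figures can be worked out directly. The sketch below keeps the slide's "Gb" unit as written and treats the "1m+" and "200+" figures as lower bounds, so the result is a minimum growth factor; the variable names are illustrative.

```python
# Figures from the slide, in the slide's own "Gb" unit.
today_samples, today_gb_per_sample = 5_000, 20
tomorrow_samples, tomorrow_gb_per_sample = 1_000_000, 200  # lower bounds

today_total = today_samples * today_gb_per_sample
tomorrow_total = tomorrow_samples * tomorrow_gb_per_sample
growth = tomorrow_total // today_total

print(today_total, tomorrow_total, growth)
```

That is 200 times more samples and 10 times more data per sample, or at least a 2,000-fold increase in total volume.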
Acknowledgements 
Matt Massie (AMPLab) 
Frank Nothaft (AMPLab) 
Carl Yeksigian (DataStax) 
Anthony Philippakis (Broad Institute) 
Jeff Hammerbacher (Cloudera / Mt. Sinai) 
Thank you! 
(questions?)
