DNA sequencing is producing a wave of data that will change how drugs are developed, how patients are diagnosed, and how we understand human biology. To fulfill this promise, however, the tools for interpretation and analysis must scale to match the quantity and diversity of "big data genomics."
ADAM is an open-source genomics processing engine, built using Spark, Apache Avro, and Parquet. This talk will discuss some of the advantages that the Spark platform brings to genomics, the benefits of using technologies like Parquet in conjunction with Spark, and the challenges of adapting new technologies for existing tools in bioinformatics.
These are slides for a talk given at the Apache Spark Meetup in Boston on October 20, 2014.
3. Bioinformatics computation is batch
processing and workflows
● Bioinformatics has a lot of
“workflow engines”
○ Galaxy, Taverna, Firehose, Zamboni,
Queue, Luigi, bPipe
○ bash scripts
○ even make, fer cryin’ out loud
○ a new one every day
● Bioinformatics software
development is still largely a
research activity
4. State-of-the-Art infrastructure:
shared filesystems, handwritten parallelism
● Hand-written task creation
● File formats instead of APIs or
data models
○ formats are poorly defined
○ contain optional or
redundant fields
○ semantics are unclear
● Workflow engines can’t take
advantage of common
parallelism between stages
9. The rest is iterative evaluation of
probabilistic models!
10. Spark RDDs and Partitioners allow
declarative parallelization for genomics
● Genomics computation
is parallelized in a small,
standard number of
ways
○ by position
○ by sample
● Declarative, flexible
partitioning schemes
are useful
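A minimal sketch of what "declarative" partitioning could look like, written in plain Python rather than against the Spark API so that it runs on its own. The record layout, the `BIN_SIZE` value, and the function names are assumptions for illustration; in Spark the key functions would play the role of a custom Partitioner.

```python
from collections import defaultdict

# Hypothetical records: (sample, contig, position). The layout is an assumption.
reads = [
    ("sampleA", "chr1", 150),
    ("sampleB", "chr1", 1_200_345),
    ("sampleA", "chr2", 999_999),
    ("sampleB", "chr2", 1_000_000),
]

BIN_SIZE = 1_000_000  # assumed bin width in base pairs


def position_key(record):
    """Key for position-based parallelism: (contig, bin index)."""
    _sample, contig, pos = record
    return (contig, pos // BIN_SIZE)


def sample_key(record):
    """Key for sample-based parallelism."""
    return record[0]


def partition(records, key_fn):
    """Group records by a declared key; the key function is the whole scheme."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    return dict(groups)


by_position = partition(reads, position_key)
by_sample = partition(reads, sample_key)
```

Switching between the two schemes means swapping one key function; the code that consumes the partitions does not change.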
11. Spark can easily express genomics primitives:
join by genomic overlap
1. Calculate disjoint
regions based on left
(blue) set
2. Partition both sets by
disjoint regions
3. Merge-join within each
partition
4. (Optional) aggregation
across joined pairs
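The four steps above can be sketched in plain Python on a single contig. This is an illustration under stated assumptions, not ADAM's implementation: intervals are half-open `(start, end)` tuples, all function names are hypothetical, and a sorted scan with early exit stands in for a full merge-join. In Spark, each disjoint region would act as a partition key.

```python
def disjoint_regions(left):
    """Step 1: merge overlapping left intervals into disjoint regions."""
    regions = []
    for start, end in sorted(left):
        if regions and start < regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def partition_by_regions(intervals, regions):
    """Step 2: assign each interval to every region it overlaps."""
    parts = {r: [] for r in regions}
    for iv in intervals:
        for r in regions:
            if overlaps(iv, r):
                parts[r].append(iv)
    return parts


def join_partition(left_part, right_part):
    """Step 3: join inside one partition, with both sides sorted by start."""
    right_sorted = sorted(right_part)
    pairs = []
    for l in sorted(left_part):
        for r in right_sorted:
            if r[0] >= l[1]:
                break  # sorted by start: no later right interval can overlap l
            if overlaps(l, r):
                pairs.append((l, r))
    return pairs


def overlap_join(left, right):
    regions = disjoint_regions(left)
    left_parts = partition_by_regions(left, regions)
    right_parts = partition_by_regions(right, regions)
    pairs = []
    for region in regions:
        pairs.extend(join_partition(left_parts[region], right_parts[region]))
    return pairs


left = [(0, 10), (5, 15), (30, 40)]
right = [(8, 12), (14, 32), (50, 60)]
pairs = overlap_join(left, right)

# Step 4 (optional): aggregate, here counting right-side hits per left interval.
hits_per_left = {}
for l, _r in pairs:
    hits_per_left[l] = hits_per_left.get(l, 0) + 1
```

Each left interval falls in exactly one region, so a right interval that spans two regions is copied to both partitions without producing duplicate pairs.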
12. ADAM is Genomics + Spark
● A rewrite of core bioinformatics tools and algorithms in Spark
● Combines three
technologies
○ Spark
○ Parquet
○ Avro
● Apache 2-licensed
● Started at the AMPLab
http://bdgenomics.org/
13. Avro and Parquet are just as critical to
ADAM as Spark
● Avro to define data models
● Parquet for serialization format
● Still need to answer design
questions
○ how wide are the schemas?
○ how much do we follow existing
formats?
○ how do we carry through projections?
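To make the schema-width and projection questions concrete, here is a small sketch. The schema uses Avro's JSON schema syntax, but the record name and fields are hypothetical and are not ADAM's actual schema; the `project` helper is a toy stand-in for the column projection a columnar format such as Parquet makes cheap.

```python
import json

# Hypothetical, deliberately narrow read schema in Avro's JSON schema syntax.
# Record and field names are illustrative only.
READ_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ExampleRead",
  "fields": [
    {"name": "contig",   "type": "string"},
    {"name": "start",    "type": "long"},
    {"name": "sequence", "type": "string"},
    {"name": "mapq",     "type": ["null", "int"], "default": null}
  ]
}
""")

# Toy column-oriented layout: one list per field.
columns = {
    "contig":   ["chr1", "chr1", "chr2"],
    "start":    [100, 250, 75],
    "sequence": ["ACGT", "GGCA", "TTAG"],
    "mapq":     [60, None, 30],
}


def project(columns, schema, wanted):
    """Return only the requested columns, rejecting names not in the schema."""
    known = {f["name"] for f in schema["fields"]}
    missing = [name for name in wanted if name not in known]
    if missing:
        raise KeyError(f"fields not in schema: {missing}")
    return {name: columns[name] for name in wanted}


# A position-only analysis never touches the sequence or mapq columns.
positions = project(columns, READ_SCHEMA, ["contig", "start"])
```

The wider the schema, the more a projection like this matters, since every stage of a pipeline has to declare which fields it actually needs.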
14. Still need to convince bioinformaticians to
rewrite their software!
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
15. Still need to convince bioinformaticians to
rewrite their software!
● A single piece of a
single filtering stage
for a somatic variant
caller
● “11-base-pair window
centered on a candidate
mutation” actually
turns out to be
optimized for a
particular file format
and sort order
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
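A sketch of why a window definition like this can quietly depend on sort order. It is not the published filter: the event positions are hypothetical, and the code only contrasts a window lookup that assumes position-sorted input with one that does not.

```python
from bisect import bisect_left, bisect_right

WINDOW = 11          # window width in base pairs, from the quoted text
HALF = WINDOW // 2   # 5 positions on each side of the candidate


def window_bounds(candidate_pos):
    """Inclusive bounds of the 11-bp window centered on a candidate position."""
    return candidate_pos - HALF, candidate_pos + HALF


def in_window_sorted(sorted_positions, candidate_pos):
    """Binary search; only correct if the input is sorted by position."""
    lo, hi = window_bounds(candidate_pos)
    first = bisect_left(sorted_positions, lo)
    last = bisect_right(sorted_positions, hi)
    return sorted_positions[first:last]


def in_window_unsorted(positions, candidate_pos):
    """Full scan; works for any order but touches every element."""
    lo, hi = window_bounds(candidate_pos)
    return [p for p in positions if lo <= p <= hi]


events = [120, 95, 101, 106, 94, 100]  # hypothetical event positions, unsorted
candidate = 100

scan_hits = in_window_unsorted(events, candidate)
sorted_hits = in_window_sorted(sorted(events), candidate)
```

Both functions find the same positions, but the fast one silently returns wrong answers on unsorted input, which is the kind of hidden assumption a rewrite on a new platform has to surface.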
16. The Future:
Distributed and Incremental?
● Today: 5k samples x 20 Gb / sample
● Tomorrow: 1m+ samples @ 200+ Gb / sample?
● More and more analysis is aggregative
○ joint variant calling,
○ panels of normal samples,
○ collective variant annotation
● And “data collection” will never be finished
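A quick back-of-the-envelope calculation for the two scenarios above. The slide does not say what "Gb" denotes, so the sketch keeps the unit as written and treats the "tomorrow" figures as lower bounds (the slide says 1m+ and 200+); the function name is illustrative.

```python
def total_gb(samples, gb_per_sample):
    """Total data volume in the slide's 'Gb' unit."""
    return samples * gb_per_sample


today = total_gb(5_000, 20)            # 100,000 Gb, i.e. 100 Tb
tomorrow = total_gb(1_000_000, 200)    # 200,000,000 Gb, i.e. 200 Pb (lower bound)
growth = tomorrow // today             # factor of 2,000

print(f"today: {today:,} Gb, tomorrow: {tomorrow:,} Gb, growth: {growth:,}x")
```

A jump of roughly three orders of magnitude is what motivates the "distributed and incremental" question in the slide title.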
17. Acknowledgements
Matt Massie (AMPLab)
Frank Nothaft (AMPLab)
Carl Yeksigian (DataStax)
Anthony Philippakis (Broad Institute)
Jeff Hammerbacher (Cloudera / Mt. Sinai)
Thank you!
(questions?)