Strata Big Data Science Talk on ADAM

ADAM: Fast, Scalable
Genome Analysis
Frank Austin Nothaft
AMPLab, University of California, Berkeley
with: Matt Massie, André Schumacher, Timothy Danford, Chris
Hartl, Jey Kottalam, Arun Aruha, Neal Sidhwaney, Michael
Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave
Patterson

https://github.com/bigdatagenomics
Wednesday, February 12, 14

Problem
• Whole genome ﬁles are large
• Biological systems are complex
• Population analysis requires petabytes of
data

• Analysis time is often a matter of life and
death


Shredded Book Analogy
Dickens accidentally shreds the ﬁrst printing of A Tale of Two Cities
Text printed on 5 long spools
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Slide credit to Michael Schatz http://schatzlab.cshl.edu/

It was the best of
It was the best
It was the
It was
It

times, it was the worst
of times, it was the

best of times, it was

the best of times, it

was the best of times,

worst of times, it was

the worst of times, it

was the worst of times,
it was the worst of

age of wisdom, it was
the age of wisdom, it

was the age of wisdom,
it was the age of

times, it was the age

the age of foolishness, …
was the age of foolishness,

it was the age of

wisdom, it was the age
of wisdom, it was the

foolishness, …
of foolishness, …

age of foolishness, …


It was the best of
It was the best
It was the
It was
It

times, it was the worst

best of times, it was

the best of times, it

was the best of times,

worst of times, it was

the worst of times, it

was the worst of times,
it was the worst of

age of wisdom, it was
the age of wisdom, it

was the age of wisdom,
it was the age of

times, it was the age

the age of foolishness, …
was the age of foolishness,

it was the age of

wisdom, it was the age
of wisdom, it was the

foolishness, …
of foolishness, …

age of foolishness, …

• How can he reconstruct the text?
– 5 copies x 138, 656 words / 5 words per fragment = 138k fragments
– The short fragments from every copy are mixed together
– Some fragments are identical

What is ADAM?
• File formats: columnar ﬁle format that

allows efﬁcient parallel access to genomes

• API: interface for transforming, analyzing,
and querying genomic data

• CLI: a handy toolkit for quickly processing
genomes


SNAP
ADAM

SiRen CAGE
Avocado

MLBase
GraphX

The Broad Institute “Best Practices” Pipeline

Design Goals
• Develop processing pipeline that enables
efficient, scalable use of cluster/cloud

• Provide data format that has efficient

parallel/distributed access across platforms

• Enhance semantics of data and allow more
flexible data access patterns


Whole Genome
Data Sizes
Input

Pipeline
Stage

Output

SNAP

1GB Fasta
Alignment
150GB Fastq

250GB BAM

ADAM

250GB BAM Preprocessing

200GB
ADAM

Avocado

200GB
ADAM

10MB ADAM

Variant
Calling

Variants found at about 1 in 1,000 loci

ADAM Stack
RDD

‣Transform records using Apache Spark
‣Query with SQL using Shark
‣Graph processing with GraphX
‣Machine learning using MLBase

‣Schema-driven records w/ Apache Avro
Record/Split ‣Store and retrieve records using Parquet
‣Read BAM Files using Hadoop-BAM
File/Block
Physical

‣Hadoop Distributed Filesystem
‣Local Filesystem
‣Commodity Hardware
‣Cloud Systems - Amazon, GCE, Azure

Commits

Implementation

•
•
•
•

Work accelerated at end of September

•

Proof of concept implementations at The Broad
Institute, Duke, Harvard, UCSC

•

Global Alliance looking at ADAM as potential standard


15K lines of Scala code
100% Apache-licensed open-source
Code contributions from Mt. Sinai, GenomeBridge, The
Broad Institute

AWS EC2 Performance
NA12878 Whole Genome
234GB as BAM, 229 GB as ADAM
Sort

Mark Duplicates

1064
1222

Picard

536

ADAM 1 node

32

ADAM 32 nodes

20
28
1

10

53x
100

Minutes (log scale)

33x

70

ADAM 100 nodes

2x

1000

10000

Mt. Sinai Performance
NA12878 Whole Genome
234GB as BAM, 229 GB as ADAM
Sort

Mark Duplicates

1064
1222

Picard

29
50
17
34
11
15

ADAM 16 nodes

ADAM 32 nodes

ADAM 64 nodes

8

ADAM 82 nodes
1

10

62x
96x
133x

14
100
Minutes (log scale)


36x

1000

10000

Scalability

HG00096 à 16GB
NA12878 à 234GB

Acknowledgements
• UC Berkeley: Matt Massie, André
Schumacher, Jey Kottalam, Christos
Kozanitis

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney,
Michael Linderman, Jeff Hammerbacher

• GenomeBridge: Timothy Danford, Carl
Yeksigian


Acknowledgements
This research is supported in part by NSF CISE
Expeditions award CCF-1139158 and DARPA
XData Award FA8750-12-2-0331, and gifts from
Amazon Web Services, Google, SAP, Apple, Inc.,
Cisco, Clearstory Data, Cloudera, Ericsson,
Facebook, GameOnTalis, General Electric,
Hortonworks, Huawei, Intel, Microsoft, NetApp,
Oracle, Samsung, Splunk,VMware, WANdisco and
Yahoo!.


“Before seeing your data, I would not
think sorting could be done that fast.”
- research scientist at the Broad Institute


Call for contributions
• As an open source project, we welcome
contributions

• We maintain a list of open enhancements at
our Github issue tracker

• Enhancements tagged with “Pick me up!”
don’t require a genomics background

• Github: https://github.com/bigdatagenomics/

Strata Big Data Science Talk on ADAM

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Strata Big Data Science Talk on ADAM

Similar to Strata Big Data Science Talk on ADAM (19)

Recently uploaded

Recently uploaded (20)

Strata Big Data Science Talk on ADAM