High Profile Call Girls Jaipur Vani 8445551418 Independent Escort Service Jaipur
Strata Big Data Science Talk on ADAM
1. ADAM: Fast, Scalable
Genome Analysis
Frank Austin Nothaft
AMPLab, University of California, Berkeley
with: Matt Massie, André Schumacher, Timothy Danford, Chris
Hartl, Jey Kottalam, Arun Aruha, Neal Sidhwaney, Michael
Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave
Patterson
https://github.com/bigdatagenomics
Wednesday, February 12, 14
2. Problem
• Whole genome files are large
• Biological systems are complex
• Population analysis requires petabytes of
data
• Analysis time is often a matter of life and
death
Wednesday, February 12, 14
3. Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Slide credit to Michael Schatz http://schatzlab.cshl.edu/
Wednesday, February 12, 14
4. Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
It was the best of
It was the best
It was the
It was
It
times, it was the worst
of times, it was the
best of times, it was
the best of times, it
was the best of times,
of times, it was the
worst of times, it was
the worst of times, it
was the worst of times,
it was the worst of
age of wisdom, it was
the age of wisdom, it
was the age of wisdom,
it was the age of
times, it was the age
the age of foolishness, …
was the age of foolishness,
it was the age of
wisdom, it was the age
of wisdom, it was the
foolishness, …
of foolishness, …
age of foolishness, …
Slide credit to Michael Schatz http://schatzlab.cshl.edu/
Wednesday, February 12, 14
5. Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
It was the best of
It was the best
It was the
It was
It
times, it was the worst
of times, it was the
best of times, it was
the best of times, it
was the best of times,
of times, it was the
worst of times, it was
the worst of times, it
was the worst of times,
it was the worst of
age of wisdom, it was
the age of wisdom, it
was the age of wisdom,
it was the age of
times, it was the age
the age of foolishness, …
was the age of foolishness,
it was the age of
wisdom, it was the age
of wisdom, it was the
foolishness, …
of foolishness, …
age of foolishness, …
• How can he reconstruct the text?
– 5 copies x 138, 656 words / 5 words per fragment = 138k fragments
– The short fragments from every copy are mixed together
– Some fragments are identical
Slide credit to Michael Schatz http://schatzlab.cshl.edu/
Wednesday, February 12, 14
6. Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
It was the best of
It was the best
It was the
It was
It
times, it was the worst
of times, it was the
best of times, it was
the best of times, it
was the best of times,
of times, it was the
worst of times, it was
the worst of times, it
was the worst of times,
it was the worst of
age of wisdom, it was
the age of wisdom, it
was the age of wisdom,
it was the age of
times, it was the age
the age of foolishness, …
was the age of foolishness,
it was the age of
wisdom, it was the age
of wisdom, it was the
foolishness, …
of foolishness, …
age of foolishness, …
• How can he reconstruct the text?
– 5 copies x 138, 656 words / 5 words per fragment = 138k fragments
– The short fragments from every copy are mixed together
– Some fragments are identical
Slide credit to Michael Schatz http://schatzlab.cshl.edu/
Wednesday, February 12, 14
7. What is ADAM?
• File formats: columnar file format that
allows efficient parallel access to genomes
• API: interface for transforming, analyzing,
and querying genomic data
• CLI: a handy toolkit for quickly processing
genomes
Wednesday, February 12, 14
9. Design Goals
• Develop processing pipeline that enables
efficient, scalable use of cluster/cloud
• Provide data format that has efficient
parallel/distributed access across platforms
• Enhance semantics of data and allow more
flexible data access patterns
Wednesday, February 12, 14
10. Whole Genome
Data Sizes
Input
Pipeline
Stage
Output
SNAP
1GB Fasta
Alignment
150GB Fastq
250GB BAM
ADAM
250GB BAM Preprocessing
200GB
ADAM
Avocado
200GB
ADAM
10MB ADAM
Variant
Calling
Variants found at about 1 in 1,000 loci
Wednesday, February 12, 14
11. ADAM Stack
RDD
‣Transform records using Apache Spark
‣Query with SQL using Shark
‣Graph processing with GraphX
‣Machine learning using MLBase
‣Schema-driven records w/ Apache Avro
Record/Split ‣Store and retrieve records using Parquet
‣Read BAM Files using Hadoop-BAM
File/Block
Physical
Wednesday, February 12, 14
‣Hadoop Distributed Filesystem
‣Local Filesystem
‣Commodity Hardware
‣Cloud Systems - Amazon, GCE, Azure
12. Commits
Implementation
•
•
•
•
Work accelerated at end of September
•
Proof of concept implementations at The Broad
Institute, Duke, Harvard, UCSC
•
Global Alliance looking at ADAM as potential standard
Wednesday, February 12, 14
15K lines of Scala code
100% Apache-licensed open-source
Code contributions from Mt. Sinai, GenomeBridge, The
Broad Institute
13. AWS EC2 Performance
NA12878 Whole Genome
234GB as BAM, 229 GB as ADAM
Sort
Mark Duplicates
1064
1222
Picard
536
ADAM 1 node
32
ADAM 32 nodes
20
28
1
10
53x
100
Minutes (log scale)
Wednesday, February 12, 14
33x
70
ADAM 100 nodes
2x
1000
10000
14. Mt. Sinai Performance
NA12878 Whole Genome
234GB as BAM, 229 GB as ADAM
Sort
Mark Duplicates
1064
1222
Picard
29
50
17
34
11
15
ADAM 16 nodes
ADAM 32 nodes
ADAM 64 nodes
8
ADAM 82 nodes
1
10
62x
96x
133x
14
100
Minutes (log scale)
Wednesday, February 12, 14
36x
1000
10000
16. Acknowledgements
• UC Berkeley: Matt Massie, André
Schumacher, Jey Kottalam, Christos
Kozanitis
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney,
Michael Linderman, Jeff Hammerbacher
• GenomeBridge: Timothy Danford, Carl
Yeksigian
Wednesday, February 12, 14
17. Acknowledgements
This research is supported in part by NSF CISE
Expeditions award CCF-1139158 and DARPA
XData Award FA8750-12-2-0331, and gifts from
Amazon Web Services, Google, SAP, Apple, Inc.,
Cisco, Clearstory Data, Cloudera, Ericsson,
Facebook, GameOnTalis, General Electric,
Hortonworks, Huawei, Intel, Microsoft, NetApp,
Oracle, Samsung, Splunk,VMware, WANdisco and
Yahoo!.
Wednesday, February 12, 14
18. “Before seeing your data, I would not
think sorting could be done that fast.”
- research scientist at the Broad Institute
Wednesday, February 12, 14
19. Call for contributions
• As an open source project, we welcome
contributions
• We maintain a list of open enhancements at
our Github issue tracker
• Enhancements tagged with “Pick me up!”
don’t require a genomics background
• Github: https://github.com/bigdatagenomics/
Wednesday, February 12, 14