3. The Sequencing Abstraction
It was the best of times, it was the worst of times…
It was the
the best of
best of times
times, it was
was the worst
the worst of
worst of times
• Humans have 46 chromosomes, and each chromosome looks like a long string
• We get randomly distributed substrings, and want to reassemble the original, whole string
Metaphor borrowed from Michael Schatz
4. Genomics = Big Data
• Sequencing run produces >100 GB of raw data
• Want to process 1,000s of samples at once to improve statistical power
• Current pipelines take about a week to run and are
not horizontally scalable
6. What’s our goal?
• Human genome is 3.3B letters long, but our reads
are only 50-250 letters long
• Sequence of the average human genome is known
• Insight: each human genome differs from the reference at only ~1 in 1,000 positions, so we can align short reads to the "average" genome and compute the diff (see the toy sketch below)
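A minimal toy sketch of this idea in Scala (purely illustrative; not ADAM's alignment code, and real aligners use indexed data structures rather than a linear scan): slide each read along the reference, pick the offset with the fewest mismatches, and report the positions where the read disagrees.

// Toy "align then diff": find each read's best offset on the reference and
// report the positions where the read disagrees with it. Illustrative only.
object ToyAlign {
  // Best alignment offset = the offset with the fewest mismatches.
  def align(reference: String, read: String): Int =
    (0 to reference.length - read.length).minBy { offset =>
      read.indices.count(i => reference(offset + i) != read(i))
    }

  // (position, reference base, read base) for every mismatch at that offset.
  def diff(reference: String, read: String, offset: Int): Seq[(Int, Char, Char)] =
    read.indices
      .filter(i => reference(offset + i) != read(i))
      .map(i => (offset + i, reference(offset + i), read(i)))

  def main(args: Array[String]): Unit = {
    val reference = "It was the best of times, it was the worst of times"
    val read      = "the worst of"   // a sampled substring, i.e. a "short read"
    val offset    = align(reference, read)
    println(s"read aligns at offset $offset, diffs: ${diff(reference, read, offset)}")
  }
}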
7. Align Reads
It was the best of times, it was the worst of times…
best of times
was the worst
It was the
the best of
times, it was
the worst of
worst of times
14. Align Reads
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
15. Assemble Reads
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
16. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was
the worst of
worst of times
best of times
was the worst
17. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was
was the worst
the worst of
worst of times
18. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was
the worst
the worst of
worst of times
19. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was the worst
of
worst of times
20. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was the worst of times
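A toy greedy assembler in Scala, sketching how the overlapping reads above get stitched back into the original sentence (an illustration only: it assumes error-free reads with exact overlaps, whereas real assemblers use overlap or de Bruijn graphs).

// Toy greedy assembly: repeatedly merge the pair of fragments with the longest
// exact suffix/prefix overlap. Assumes error-free, exactly overlapping reads.
object ToyAssemble {
  // Length of the longest suffix of `a` that is also a prefix of `b`.
  def overlap(a: String, b: String): Int =
    (math.min(a.length, b.length) to 1 by -1)
      .find(k => a.endsWith(b.take(k)))
      .getOrElse(0)

  def assemble(reads: List[String]): String = reads match {
    case Nil           => ""
    case single :: Nil => single
    case _ =>
      // Brute-force search for the best-overlapping pair, then merge it.
      val pairs = for (a <- reads; b <- reads if a != b) yield (a, b, overlap(a, b))
      val (a, b, k) = pairs.maxBy(_._3)
      assemble((a + b.drop(k)) :: reads.filterNot(r => r == a || r == b))
  }

  def main(args: Array[String]): Unit = {
    val reads = List("It was the", "the best of", "best of times", "times, it was",
                     "was the worst", "the worst of", "worst of times")
    // With these reads, greedy merging reconstructs the original sentence.
    println(assemble(reads))
  }
}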
21. Overall Pipeline Structure
From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
22. Overall Pipeline Structure
End to end, the pipeline takes ~120 hours
The stages ADAM targets account for ~100 of those hours
From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
24. Key Observations
• Current genomics pipelines are I/O limited
• Most genomics algorithms can be formulated as either data-parallel or graph-parallel computations (see the sketch below)
• Genomics is heavy on iteration/pipelining; the data access pattern is write once, read many times
• High-coverage, whole-genome data (>220 GB) will become the main dataset for human genetics
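As a sketch of what a data-parallel formulation looks like, here is a toy duplicate marker written against Spark RDDs: group reads by alignment position and keep the best-scoring read in each group (a simplification for illustration; ADAM's actual duplicate marking is more involved, and the Read case class here is hypothetical).

import org.apache.spark.rdd.RDD

// Toy data-parallel formulation: duplicate marking as a group-by on alignment
// position, keeping the highest-scoring read per position. Illustrative only.
object ToyMarkDuplicates {
  case class Read(contig: String, start: Long, score: Int, name: String,
                  duplicate: Boolean = false)

  def markDuplicates(reads: RDD[Read]): RDD[Read] =
    reads
      .groupBy(r => (r.contig, r.start))            // all reads at the same position
      .flatMap { case (_, group) =>
        val best = group.maxBy(_.score)             // keep the best-scoring read
        group.map(r => r.copy(duplicate = r.name != best.name))
      }
}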
25. ADAM Principles
• Use schema as “narrow waist”
• Columnar data representation + in-memory computing eliminates the disk bandwidth bottleneck
• Minimize data movement: send code to data (see the Spark sketch below)
Stack figure, top to bottom: Application (Transformations); Presentation (Enriched Models); Evidence Access (MapReduce/DBMS); Schema (Data Models); Materialized Data (Columnar Storage); Data Distribution (Parallel FS/Sharding); Physical Storage (Disk)
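A minimal sketch of "send code to data" against this stack, using plain Spark SQL over a Parquet file of AlignmentRecords (the file path is hypothetical, the column names follow the AlignmentRecord schema on slide 27, and this is an illustration rather than ADAM's actual API).

import org.apache.spark.sql.SparkSession

// Sketch: with the schema as the narrow waist, the aggregation below ships to
// wherever the columnar data lives, and only three columns are ever read.
object CoverageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coverage-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical path; assumes the nested Contig record has a contigName field.
    val reads = spark.read.parquet("hdfs:///data/sample1.alignments.parquet")

    reads
      .select($"contig.contigName".as("contig"), $"readMapped", $"duplicateRead")
      .where($"readMapped" && !$"duplicateRead")   // mapped, non-duplicate reads only
      .groupBy($"contig")
      .count()                                     // mapped reads per contig
      .show()

    spark.stop()
  }
}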
26. Data Independence
• Many current genomics systems require data to be
stored and processed in sorted order
• This is an abstraction inversion!
• A narrow waist at the schema forces processing to be abstracted from the data layout, and the data layout to be abstracted from the disk
• Do tricks at the processing level (fast coordinate-system joins) to provide the necessary programming abstractions (see the region-join sketch below)
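One such trick, sketched below as a simplified broadcast region join in Scala/Spark: match reads to target regions by coordinate overlap without requiring either side to be stored in sorted order (a stand-in for ADAM's coordinate-system joins, not the actual implementation; the Region and Read case classes are hypothetical).

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Simplified coordinate-system ("region") join: broadcast a small set of target
// regions and keep every read that overlaps one of them. Illustrative only.
object ToyRegionJoin {
  case class Region(contig: String, start: Long, end: Long)
  case class Read(contig: String, start: Long, end: Long, name: String)

  def overlaps(region: Region, read: Read): Boolean =
    region.contig == read.contig && region.start < read.end && read.start < region.end

  def regionJoin(sc: SparkContext, targets: Seq[Region], reads: RDD[Read]): RDD[(Region, Read)] = {
    val broadcastTargets = sc.broadcast(targets)    // small side ships to every executor
    reads.flatMap { read =>
      broadcastTargets.value.collect {
        case region if overlaps(region, read) => (region, read)
      }
    }
  }
}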
27. Data Format
• Genomics algorithms frequently
access global metadata
• Schema is fully denormalized,
allows O(1) access to metadata
• Make all fields nullable to allow for
arbitrary column projections
• Avro enables literate
programming
record AlignmentRecord {
union { null, Contig } contig = null;
union { null, long } start = null;
union { null, long } end = null;
union { null, int } mapq = null;
union { null, string } readName = null;
union { null, string } sequence = null;
union { null, string } mateReference = null;
union { null, long } mateAlignmentStart = null;
union { null, string } cigar = null;
union { null, string } qual = null;
union { null, string } recordGroupName = null;
union { int, null } basesTrimmedFromStart = 0;
union { int, null } basesTrimmedFromEnd = 0;
union { boolean, null } readPaired = false;
union { boolean, null } properPair = false;
union { boolean, null } readMapped = false;
union { boolean, null } mateMapped = false;
union { boolean, null } firstOfPair = false;
union { boolean, null } secondOfPair = false;
union { boolean, null } failedVendorQualityChecks = false;
union { boolean, null } duplicateRead = false;
union { boolean, null } readNegativeStrand = false;
union { boolean, null } mateNegativeStrand = false;
union { boolean, null } primaryAlignment = false;
union { boolean, null } secondaryAlignment = false;
union { boolean, null } supplementaryAlignment = false;
union { null, string } mismatchingPositions = null;
union { null, string } origQual = null;
union { null, string } attributes = null;
union { null, string } recordGroupSequencingCenter = null;
union { null, string } recordGroupDescription = null;
union { null, long } recordGroupRunDateEpoch = null;
union { null, string } recordGroupFlowOrder = null;
union { null, string } recordGroupKeySequence = null;
union { null, string } recordGroupLibrary = null;
union { null, int } recordGroupPredictedMedianInsertSize = null;
union { null, string } recordGroupPlatform = null;
union { null, string } recordGroupPlatformUnit = null;
union { null, string } recordGroupSample = null;
union { null, Contig } mateContig = null;
}
28. Parquet
• ASF Incubator project, based on
Google Dremel
• http://www.parquet.io
• High-performance columnar store with support for projections and push-down predicates (see the sketch below)
• 3 layers of parallelism:
• File/row group
• Column chunk
• Page
Image from Parquet format definition: https://github.com/Parquet/parquet-format
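A sketch of what projections and push-down predicates look like from Spark, against the AlignmentRecord schema above (the file path is hypothetical; Spark's Parquet reader prunes unselected columns and can skip row groups whose statistics cannot satisfy the predicate).

import org.apache.spark.sql.SparkSession

// Projection + predicate pushdown: only three of the ~35 columns are read, and
// the mapq predicate is pushed into the Parquet scan.
object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()
    import spark.implicits._

    spark.read
      .parquet("hdfs:///data/sample1.alignments.parquet")   // hypothetical path
      .select($"contig.contigName", $"start", $"mapq")       // projection
      .where($"mapq" >= 30)                                  // pushed-down predicate
      .show(10)

    spark.stop()
  }
}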
29. Access to Remote Data
• For genomics, we often have a very large dataset of which we only want to analyze a small part
• This dataset might be stored in S3 or an equivalent object store
• Minimize data movement by allowing Parquet to support predicate pushdown/projections into S3
• Work is in progress at https://github.com/bigdatagenomics/adam/tree/multi-loader
30. Performance
• Reduced pipeline time from 100 hours to ~1 hour
• Linear speedup through 128 nodes when processing 234 GB of data
• For flagstat, columnar projection leads to a 5× speedup
31. ADAM Status
• Apache 2 licensed OSS
• 25 contributors across 10 institutions
• Pushing for production 1.0 release towards end of year
• Working with GA4GH to use concepts from ADAM to
improve broader genomics data management techniques
32. Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos
Kozanitis, Dave Patterson, Anthony Joseph
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael
Linderman, Jeff Hammerbacher
• GenomeBridge: Timothy Danford, Carl Yeksigian
• The Broad Institute: Chris Hartl
• Cloudera: Uri Laserson
• Microsoft Research: Jeremy Elson, Ravi Pandya
• And other open source contributors, including Michael Heuer, Neil
Ferguson, Andy Petrella, Xavier Tordoir!