3. Pipeline Issues Today:
Time and Scale
• The time to go from reads to answers is
too long
• Processing thousands of BAM files for
statistical analysis doesn’t scale
Saturday, November 2, 13
4. ADAM:
Speed and Scale
• Read BAM once, perform transformations
(e.g. sort, mark duplicates, BQSR) in
distributed memory, write the analysis-ready ADAM file once
• Use a distributed filesystem (HDFS), a fast
execution system (Spark) and columnar
data formats (Parquet) to scale
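The "read once, transform in memory, write once" idea can be sketched on a single machine as below. This is a minimal illustration, not ADAM's implementation: the `Read` record, the function names and the duplicate-marking rule (keep the highest-mapq read per reference/start/strand) are all assumptions made for the example, and ADAM itself runs such steps distributed on Spark.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Read:
    # Hypothetical, trimmed-down read record used only for this sketch.
    name: str
    reference_id: int
    start: int
    negative_strand: bool
    mapq: int
    duplicate: bool = False


def sort_by_reference(reads: List[Read]) -> List[Read]:
    # Order reads by reference, then by alignment start.
    return sorted(reads, key=lambda r: (r.reference_id, r.start))


def mark_duplicates(reads: List[Read]) -> List[Read]:
    # Naive stand-in for duplicate marking: among reads sharing
    # (reference, start, strand), the highest-mapq read stays unmarked.
    best = {}
    for r in reads:
        key = (r.reference_id, r.start, r.negative_strand)
        if key not in best or r.mapq > best[key].mapq:
            best[key] = r
    marked = []
    for r in reads:
        key = (r.reference_id, r.start, r.negative_strand)
        marked.append(replace(r, duplicate=best[key] is not r))
    return marked


def run_pipeline(reads: Iterable[Read],
                 steps: List[Callable[[List[Read]], List[Read]]]) -> List[Read]:
    data = list(reads)      # the input is materialized in memory once
    for step in steps:      # every transformation works on in-memory data
        data = step(data)
    return data             # the caller writes this result out once


reads = [
    Read("a", 1, 100, False, 30),
    Read("b", 0, 50, False, 20),
    Read("c", 1, 100, False, 60),
]
result = run_pipeline(reads, [sort_by_reference, mark_duplicates])
for r in result:
    print(r.name, r.reference_id, r.start, r.duplicate)
```

The point of the design is that intermediate results never go back to disk between steps, which is what a chain of separate BAM-in, BAM-out tools would do.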
5. Unlocking Genomic Data
[Diagram: query and execution engines (Shark (SQL), Hadoop M/R, Spark, Impala (SQL)) layered over ADAM files, which sit on the Hadoop Distributed File System (HDFS) or a local filesystem; a BAM label also appears in the figure.]
6. record ADAMRecord {
    union { null, string } referenceName = null;
    union { null, int } referenceId = null;
    union { null, long } start = null;
    union { null, int } mapq = null;
    union { null, string } readName = null;
    union { null, string } sequence = null;
    union { null, string } mateReference = null;
    union { null, long } mateAlignmentStart = null;
    union { null, string } cigar = null;
    union { null, string } qual = null;
    union { null, string } recordGroupId = null;

    union { boolean, null } readPaired = false;
    union { boolean, null } properPair = false;
    union { boolean, null } readMapped = false;
    union { boolean, null } mateMapped = false;
    union { boolean, null } readNegativeStrand = false;
    union { boolean, null } mateNegativeStrand = false;
    union { boolean, null } firstOfPair = false;
    union { boolean, null } secondOfPair = false;
    union { boolean, null } primaryAlignment = false;
    union { boolean, null } failedVendorQualityChecks = false;
    union { boolean, null } duplicateRead = false;

    union { null, string } mismatchingPositions = null;
    union { null, string } attributes = null;

    union { null, string } recordGroupSequencingCenter = null;
    union { null, string } recordGroupDescription = null;
    union { null, long } recordGroupRunDateEpoch = null;
    union { null, string } recordGroupFlowOrder = null;
    union { null, string } recordGroupKeySequence = null;
    union { null, string } recordGroupLibrary = null;
    union { null, int } recordGroupPredictedMedianInsertSize = null;
    union { null, string } recordGroupPlatform = null;
    union { null, string } recordGroupPlatformUnit = null;
    union { null, string } recordGroupSample = null;

    union { null, int } mateReferenceId = null;
}
http://avro.apache.org/
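The eleven boolean fields in the record line up with the bits of the SAM/BAM FLAG field. The sketch below shows one plausible mapping, using the bit values from the SAM specification. The function name `decode_sam_flag` is hypothetical, and the slides do not say how ADAM's converter actually fills these fields, so the mapping itself is an assumption (in particular, reporting `mateMapped` only for paired reads).

```python
def decode_sam_flag(flag: int) -> dict:
    # Bit values follow the SAM specification's FLAG definition.
    paired = bool(flag & 0x1)
    return {
        "readPaired": paired,
        "properPair": bool(flag & 0x2),
        "readMapped": not (flag & 0x4),          # 0x4 means "read unmapped"
        # Assumption: only report a mapped mate when the read is paired.
        "mateMapped": paired and not (flag & 0x8),
        "readNegativeStrand": bool(flag & 0x10),
        "mateNegativeStrand": bool(flag & 0x20),
        "firstOfPair": bool(flag & 0x40),
        "secondOfPair": bool(flag & 0x80),
        "primaryAlignment": not (flag & 0x100),  # 0x100 means "secondary"
        "failedVendorQualityChecks": bool(flag & 0x200),
        "duplicateRead": bool(flag & 0x400),
    }


# Flag 99 = 0x1 + 0x2 + 0x20 + 0x40: paired, proper pair,
# mate on the negative strand, first read of the pair.
for name, value in decode_sam_flag(99).items():
    print(f"{name} = {value}")
```

Storing each flag as its own named field, rather than one packed integer, lets a columnar reader load only the flags a query needs.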
10. Low-Coverage BAM
Experiment
• 14 GB low-coverage BAM with 145M reads
• 10-node EC2 cluster (m2.4xlarge)
• Reduced to 13 GB with ADAM
• Conversion/upload to HDFS: 22 minutes
• Sorted in 7 minutes
11. High-Coverage BAM
Experiment
• Input: 237 GB NA12878 high-coverage,
PCR-free, whole-genome BAM
• Conversion took 4 hours on an EC2 m2.4xlarge
(8 CPUs, 68.4 GB memory)
• Output size: 237 GB BAM reduced to
212 GB ADAM
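The figures from the two experiments can be turned into rough rates with a few lines of arithmetic. This is a back-of-the-envelope sketch only: the inputs are the rounded numbers quoted on the slides, the helper `size_reduction` is a name chosen here, and the 22-minute low-coverage figure is left out because it mixes conversion with the HDFS upload.

```python
def size_reduction(before_gb: float, after_gb: float) -> float:
    # Fraction of the original size saved by converting to ADAM.
    return 1.0 - after_gb / before_gb


low_cov_saving = size_reduction(14, 13)      # low-coverage BAM
high_cov_saving = size_reduction(237, 212)   # high-coverage NA12878 BAM

# Sort throughput on the 10-node cluster: 145M reads in 7 minutes.
sort_reads_per_second = 145_000_000 / (7 * 60)

# High-coverage conversion: 237 GB of BAM in 4 hours.
conversion_gb_per_hour = 237 / 4

print(f"low-coverage size reduction:  {low_cov_saving:.1%}")
print(f"high-coverage size reduction: {high_cov_saving:.1%}")
print(f"sort throughput:              {sort_reads_per_second:,.0f} reads/s")
print(f"conversion throughput:        {conversion_gb_per_hour:.2f} GB/h")
```

On these numbers the ADAM files come out roughly 7% and 10.5% smaller than the BAM inputs, and the sort processes about 345,000 reads per second across the cluster.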
12. Current Features
• Convert BAM to ADAM (read-oriented)
• Sort an ADAM file by reference
• Generate ADAMPileups
• Print mpileup output
• Very soon ADAM will be able to mark duplicates
(initial benchmarks look good)
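To make the step from read-oriented records to pileups concrete, here is a simplified sketch of how per-position base columns can be built from `start`, `cigar` and `sequence`. It is not ADAM's ADAMPileup code: the function `pileup`, the tuple input and the single-reference assumption are made up for illustration, and deleted or inserted bases are skipped rather than reported.

```python
import re
from collections import defaultdict

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def pileup(reads):
    """Build {reference position: [bases]} from (start, cigar, sequence)
    tuples that all align to the same reference."""
    columns = defaultdict(list)
    for start, cigar, seq in reads:
        ref_pos, read_pos = start, 0
        for length, op in CIGAR_OP.findall(cigar):
            n = int(length)
            if op in "M=X":            # aligned bases: consume read and reference
                for i in range(n):
                    columns[ref_pos + i].append(seq[read_pos + i])
                ref_pos += n
                read_pos += n
            elif op in "IS":           # insertion / soft clip: consume read only
                read_pos += n
            elif op in "DN":           # deletion / skip: consume reference only
                ref_pos += n
            # H and P consume neither the read nor the reference
    return dict(columns)


reads = [
    (100, "4M", "ACGT"),
    (102, "2M1I2M", "GTAAC"),
]
for pos, bases in sorted(pileup(reads).items()):
    print(pos, "".join(bases))
```

Each output line is one reference position with the bases of every read covering it, which is the information an mpileup-style printer needs.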
13. In progress...
• Frank is working on a distributed variant caller
(https://github.com/fnothaft/avocado), local realignment, adam2bam
• Chris Hartl is integrating ADAM with GATK
(https://github.com/chartl/GAParquet): DiagnoseTargets, adding new
VCF formats to ADAM, BQSR
• Christos Kozanitis has been working on Shark and Impala
integration for ad-hoc SQL read queries
• Collaborations with Mt. Sinai, GenomeBridge and the Broad
Institute, which are interested in using ADAM