ADAM

https://github.com/massie/adam
Matt Massie
University of California, Berkeley
massie@berkeley.edu

Saturday, November 2, 2013

SAM, BAM, ADAM

Sequence Alignment Map (SAM)
Binary Alignment Map (BAM)
Avro Data Alignment Map (ADAM)


Pipeline Issues Today: Time and Scale
• The time to go from reads to answers is too long
• Processing thousands of BAM files for statistical analysis doesn’t scale


ADAM: Speed and Scale
• Read BAM once, perform transformations (e.g. sort, mark duplicates, BQSR) in distributed memory, and write the analysis-ready ADAM file once
• Use a distributed filesystem (HDFS), a fast execution system (Spark) and a columnar data format (Parquet) to scale
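The "read once, transform in memory, write once" idea can be illustrated on a single machine with a minimal sketch. This is not ADAM's Spark code: the `Read` record, the `sort_reads`, `mark_duplicates` and `pipeline` helpers, and the duplicate rule (same reference, start and strand, keeping the highest `mapq`) are all simplifying assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical, heavily reduced read record; the real ADAMRecord has many more fields.
Read = namedtuple("Read", "name reference start negative_strand mapq")


def sort_reads(reads):
    """Order reads by reference name, then by alignment start."""
    return sorted(reads, key=lambda r: (r.reference, r.start))


def mark_duplicates(reads):
    """Return (read, is_duplicate) pairs.

    Simplification: reads sharing reference, start and strand count as
    duplicates of each other, and the one with the highest mapq is kept.
    """
    best = {}
    for r in reads:
        key = (r.reference, r.start, r.negative_strand)
        if key not in best or r.mapq > best[key].mapq:
            best[key] = r
    return [
        (r, best[(r.reference, r.start, r.negative_strand)] is not r)
        for r in reads
    ]


def pipeline(reads):
    """Chain the transformations in memory; only the final result is returned."""
    return mark_duplicates(sort_reads(reads))


reads = [
    Read("r3", "chrom20", 300, False, 60),
    Read("r1", "chrom20", 100, False, 30),
    Read("r2", "chrom20", 100, False, 60),
]
result = pipeline(reads)
for read, is_dup in result:
    print(read.name, read.start, "duplicate" if is_dup else "")
```

The point of the sketch is that no intermediate file is written between the sort and the duplicate-marking step; in a distributed setting the same chaining happens across the cluster's memory.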


Unlocking Genomic Data
[Diagram: the query and execution engines Shark (SQL), Hadoop M/R, Spark and Impala (SQL) sit on top of ADAM files stored in the Hadoop Distributed File System (HDFS) or on a local filesystem, with BAM shown as the source format.]

record ADAMRecord {
    union { null, string } referenceName = null;
    union { null, int } referenceId = null;
    union { null, long } start = null;
    union { null, int } mapq = null;
    union { null, string } readName = null;
    union { null, string } sequence = null;
    union { null, string } mateReference = null;
    union { null, long } mateAlignmentStart = null;
    union { null, string } cigar = null;
    union { null, string } qual = null;
    union { null, string } recordGroupId = null;

    union { boolean, null } readPaired = false;
    union { boolean, null } properPair = false;
    union { boolean, null } readMapped = false;
    union { boolean, null } mateMapped = false;
    union { boolean, null } readNegativeStrand = false;
    union { boolean, null } mateNegativeStrand = false;
    union { boolean, null } firstOfPair = false;
    union { boolean, null } secondOfPair = false;
    union { boolean, null } primaryAlignment = false;
    union { boolean, null } failedVendorQualityChecks = false;
    union { boolean, null } duplicateRead = false;

    union { null, string } mismatchingPositions = null;
    union { null, string } attributes = null;

    union { null, string } recordGroupSequencingCenter = null;
    union { null, string } recordGroupDescription = null;
    union { null, long } recordGroupRunDateEpoch = null;
    union { null, string } recordGroupFlowOrder = null;
    union { null, string } recordGroupKeySequence = null;
    union { null, string } recordGroupLibrary = null;
    union { null, int } recordGroupPredictedMedianInsertSize = null;
    union { null, string } recordGroupPlatform = null;
    union { null, string } recordGroupPlatformUnit = null;
    union { null, string } recordGroupSample = null;

    union { null, int } mateReferenceId = null;
}

http://avro.apache.org/
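The eleven boolean fields of the record line up with the bits of the SAM/BAM FLAG integer. As a rough sketch of how a converter might unpack them, the hypothetical `decode_flags` helper below uses the bit masks from the SAM specification; the exact rules ADAM applies (for example how `mateMapped` is set for unpaired reads) are an assumption here.

```python
def decode_flags(flag):
    """Unpack a SAM FLAG integer into ADAMRecord-style boolean fields.

    Bit masks follow the SAM specification. Treating an unpaired read as
    mateMapped=False is an assumption of this sketch.
    """
    paired = bool(flag & 0x1)
    return {
        "readPaired": paired,
        "properPair": bool(flag & 0x2),
        "readMapped": not (flag & 0x4),            # 0x4 means "read unmapped"
        "mateMapped": paired and not (flag & 0x8),  # 0x8 means "mate unmapped"
        "readNegativeStrand": bool(flag & 0x10),
        "mateNegativeStrand": bool(flag & 0x20),
        "firstOfPair": bool(flag & 0x40),
        "secondOfPair": bool(flag & 0x80),
        "primaryAlignment": not (flag & 0x100),    # 0x100 means "secondary alignment"
        "failedVendorQualityChecks": bool(flag & 0x200),
        "duplicateRead": bool(flag & 0x400),
    }


# FLAG 99 = 0x1 + 0x2 + 0x20 + 0x40: paired, proper pair, mate on reverse strand, first of pair
fields = decode_flags(99)
for name, value in fields.items():
    print(f"{name} = {value}")
```

Storing each flag as its own column, rather than as one packed integer, lets a query such as "all duplicate reads" touch only the `duplicateRead` column.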


Parquet
http://parquet.io

Column-oriented layout
Row-oriented layout

https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Genomic Data Example

referenceName  sequence  cigar
chrom20        TCGA      4M
chrom20        GAAT      4M1D
chrom20        CCGAT     5M

Column Oriented
chrom20 chrom20 chrom20 | TCGA GAAT CCGAT | 4M 4M1D 5M

Row Oriented
chrom20 TCGA 4M | chrom20 GAAT 4M1D | chrom20 CCGAT 5M
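The two layouts can be reproduced in a few lines. The sketch below flattens the three example records both ways and then run-length encodes the reference column to hint at why a columnar format compresses repetitive genomic fields well. The `run_length_encode` helper is a hypothetical illustration of the idea, not Parquet's actual encoder.

```python
from itertools import groupby

# The three records from the example: (reference name, sequence, CIGAR).
rows = [
    ("chrom20", "TCGA", "4M"),
    ("chrom20", "GAAT", "4M1D"),
    ("chrom20", "CCGAT", "5M"),
]

# Row-oriented: all values of one record are stored next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented: all values of one field are stored next to each other.
columns = [list(col) for col in zip(*rows)]
column_layout = [value for col in columns for value in col]


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


print(row_layout)
print(column_layout)
print(run_length_encode(columns[0]))  # [('chrom20', 3)]
```

In the row layout the three `chrom20` values are separated by other fields, so the same run-length trick would find nothing to collapse.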

http://spark.incubator.apache.org/


Low-Coverage BAM Experiment
• 14 GB low-coverage BAM with 145M reads
• 10-node EC2 cluster of m2.4xlarge instances
• Reduced to 13 GB with ADAM
• Conversion and upload to HDFS took 22 minutes
• Sorted in 7 minutes

High-Coverage BAM Experiment
• Input: 237 GB NA12878 high-coverage, PCR-free, whole-genome BAM
• Conversion took 4 hours on an EC2 m2.4xlarge (8 CPUs, 68.4 GB memory)
• Output size: 237 GB BAM reduced to 212 GB ADAM
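For a quick sense of scale, the figures from the two experiments can be turned into relative size reductions and rough throughput numbers. The inputs below are the rounded values from the slides, so the derived rates are approximations only; the helper name `percent_reduction` is made up for this sketch.

```python
def percent_reduction(before_gb, after_gb):
    """Relative size reduction in percent."""
    return 100.0 * (before_gb - after_gb) / before_gb


# Low-coverage experiment: 14 GB BAM -> 13 GB ADAM, 145M reads sorted in 7 minutes.
low_reduction = percent_reduction(14, 13)
sort_reads_per_second = 145_000_000 / (7 * 60)

# High-coverage experiment: 237 GB BAM -> 212 GB ADAM, converted in 4 hours.
high_reduction = percent_reduction(237, 212)
conversion_gb_per_hour = 237 / 4

print(f"low-coverage size reduction:  {low_reduction:.2f}%")
print(f"high-coverage size reduction: {high_reduction:.2f}%")
print(f"sort throughput:              {sort_reads_per_second:,.0f} reads/s")
print(f"conversion throughput:        {conversion_gb_per_hour:.2f} GB/hour")
```

Under these numbers the size reduction is roughly 7% for the low-coverage file and roughly 10.5% for the high-coverage file.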


Current Features
• Convert BAM to ADAM (read-oriented)
• Sort an ADAM file by reference
• Generate ADAMPileups
• Print mpileup output
• Very soon ADAM will be able to mark duplicates (initial benchmarks look good)
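To make the pileup feature concrete, here is a minimal sketch of turning read-oriented records into per-position columns. It is not ADAM's ADAMPileup implementation: the `pileup` function, the start positions, and the use of `*` for deleted bases are assumptions for illustration, while the CIGAR operator semantics follow the SAM specification.

```python
import re
from collections import defaultdict

# One CIGAR element: a length followed by an operator.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def pileup(reads):
    """Map each reference position to the list of bases observed there.

    reads: iterable of (start, sequence, cigar) tuples.
    """
    columns = defaultdict(list)
    for start, sequence, cigar in reads:
        ref_pos, read_pos = start, 0
        for length, op in CIGAR_OP.findall(cigar):
            length = int(length)
            if op in "M=X":    # consumes both read and reference
                for i in range(length):
                    columns[ref_pos + i].append(sequence[read_pos + i])
                ref_pos += length
                read_pos += length
            elif op == "D":    # deletion: consumes reference only
                for i in range(length):
                    columns[ref_pos + i].append("*")
                ref_pos += length
            elif op == "N":    # skipped region: consumes reference, no entry
                ref_pos += length
            elif op in "IS":   # insertion / soft clip: consumes read only
                read_pos += length
            # H and P consume neither read nor reference
    return dict(columns)


# Sequences and CIGARs from the genomic data example; start positions are made up.
piles = pileup([
    (100, "TCGA", "4M"),
    (102, "GAAT", "4M1D"),
    (104, "CCGAT", "5M"),
])
for position in sorted(piles):
    print(position, "".join(piles[position]))
```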

In progress...
• Frank is working on a distributed variant caller (https://github.com/fnothaft/avocado), local realignment, and adam2bam
• Chris Hartl is integrating ADAM with GATK (https://github.com/chartl/GAParquet): DiagnoseTargets, adding new VCF formats to ADAM, BQSR
• Christos Kozanitis has been working on Shark and Impala integration for ad-hoc SQL read queries
• Collaborations with Mt. Sinai, GenomeBridge and the Broad Institute, who are interested in using ADAM

