1. The Genome Analysis Toolkit
A MapReduce framework for analyzing next-generation
DNA sequencing data
Matt Hanna and Mark DePristo
Genome Sequencing and Analysis Group
Medical and Population Genetics Program
Broad Institute of Harvard and MIT
2. The Genome Analysis Toolkit
Agenda
• GATK Overview and Concepts
• GATK Workflow
• Example: A Simple Bayesian Genotyper
3. GATK: Overview and Concepts
Motivation
[Figure: coverage in xMHC region of JPT individuals]
• Dataset size greatly increases analysis complexity.
• Implementation issues can prematurely terminate
long-running jobs or introduce subtle bugs.
4. GATK: Overview
Simplifying the process of writing analysis tools for resequencing data
• The framework is designed to support most common paradigms of analysis algorithms
– Provides structured access to reads in BAM format, reference context, as well as reference-associated metadata
• General-purpose
– Optimized for ease of use and completeness of functionality within scope
• Efficient
– Engineering investment on performance of critical data structures and manipulation routines
• Convenient
– Structured plug-in model makes developing in Java against the framework relatively pain-free
5. GATK: Overview
The MapReduce design philosophy
Data elements: a, b, c, d, e
Map: X = f(x) yields A, B, C, D, E. Operations are independent of each other.
Reduce: R = r(A, R(B, …, E)), with r(x, y, …, z). The result depends on all sites.
Result is:
Map: function f applied to each element of the list
Reduce: function r recursively reduced over each f(…)
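The map/reduce split can be sketched in plain Java. This is a minimal illustration and not GATK code: the class name `MapReduceSketch`, the helper `mapReduce`, and the choice of f (square) and r (sum) are assumptions made for the example.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

final class MapReduceSketch {
    // Applies f to every element independently, then folds the mapped
    // values into a single result with r, starting from an initial value.
    static <T, M, R> R mapReduce(List<T> elements, Function<T, M> f,
                                 R initial, BiFunction<M, R, R> r) {
        R result = initial;
        for (T element : elements) {
            M mapped = f.apply(element);      // map: uses this element only
            result = r.apply(mapped, result); // reduce: accumulates over all elements
        }
        return result;
    }

    public static void main(String[] args) {
        // f squares each element, r sums the squares.
        long sumOfSquares = mapReduce(List.of(1, 2, 3, 4, 5),
                                      (Integer x) -> x * x,
                                      0L,
                                      (Integer m, Long acc) -> acc + m);
        System.out.println(sumOfSquares);
    }
}
```

Because f never looks at more than one element, the map calls can be scheduled in any order; only r needs to see the values together.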
6. GATK: Overview
Rapid development of efficient and robust analysis tools
The Genome Analysis Toolkit (GATK) provides the boilerplate infrastructure code required to perform any NGS analysis.
• Traversal engine: provided by framework
• Analysis tool: implemented by user
7. GATK: Workflow
Introduction
• GATK Overview and Concepts
• GATK Workflow
– An example of one of the GATK's most common workflows
– Data access pattern: by locus
– Inputs: reads, reference, dbSNP
• Example: A Simple Bayesian Genotyper
8. GATK: Workflow
The sharding system: dividing data into processor-sized pieces
Reads
Reference
dbSNP
• Divides data into small chunks that can be
processed independently
• Handles extraction of subsets of data
• Groups small intervals together to avoid
repetitive decompression
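The first bullet, dividing data into independently processable chunks, might look like the sketch below. It is an assumption-laden illustration: `ShardingSketch`, `Shard`, and the fixed shard size are hypothetical and do not reflect the GATK's actual sharding classes, and the grouping of small intervals is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardingSketch {
    /** A closed interval [start, stop] on one contig (hypothetical, not a GATK class). */
    static final class Shard {
        final String contig;
        final long start;
        final long stop;

        Shard(String contig, long start, long stop) {
            this.contig = contig;
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return contig + ":" + start + "-" + stop;
        }
    }

    /** Splits [start, stop] into consecutive shards of at most shardSize bases. */
    static List<Shard> shard(String contig, long start, long stop, long shardSize) {
        if (shardSize <= 0) {
            throw new IllegalArgumentException("shardSize must be positive");
        }
        List<Shard> shards = new ArrayList<>();
        for (long s = start; s <= stop; s += shardSize) {
            // The last shard is truncated at the end of the interval.
            shards.add(new Shard(contig, s, Math.min(s + shardSize - 1, stop)));
        }
        return shards;
    }

    public static void main(String[] args) {
        for (Shard s : shard("chr1", 1, 2500, 1000)) {
            System.out.println(s);
        }
    }
}
```

Each shard covers a disjoint range, so the reads, reference bases, and dbSNP records overlapping it can be loaded and processed without consulting any other shard.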
9. GATK: Workflow
Traversal engines: preparing data for processing
Builds data structures easily consumed by the
analysis
10. GATK: Workflow
Interaction between sharding system and traversal engines
• Datasets are split into shards, which can be processed sequentially or in parallel
• When processing sequentially, the reduce value of each shard is used to
bootstrap the next shard.
• When processing in parallel, the result of each shard is computed independently
and then “tree-reduced” together.
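The two modes can be compared with a small sketch. It assumes, purely for illustration, that each shard is an array of per-locus map results (1 for a called locus, 0 otherwise) and that the reduce step is addition; `ShardReduceSketch` and its methods are hypothetical names. The partial results are computed in a plain loop here, where a real engine would compute them on separate threads.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardReduceSketch {
    /** Sequential mode: the reduce value leaving one shard seeds the next shard. */
    static long sequential(List<int[]> shards) {
        long sum = 0L;
        for (int[] shard : shards) {
            for (int value : shard) {
                sum = value + sum;
            }
        }
        return sum;
    }

    /** Parallel mode: reduce each shard on its own, then combine results pairwise. */
    static long treeReduce(List<int[]> shards) {
        List<Long> partials = new ArrayList<>();
        for (int[] shard : shards) {
            long partial = 0L; // every shard starts from the initial reduce value
            for (int value : shard) {
                partial += value;
            }
            partials.add(partial);
        }
        // Combine neighbours level by level until one value remains.
        while (partials.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < partials.size(); i += 2) {
                if (i + 1 < partials.size()) {
                    next.add(partials.get(i) + partials.get(i + 1));
                } else {
                    next.add(partials.get(i)); // odd element carries over unchanged
                }
            }
            partials = next;
        }
        return partials.isEmpty() ? 0L : partials.get(0);
    }

    public static void main(String[] args) {
        List<int[]> shards = List.of(new int[]{1, 1, 0}, new int[]{1},
                                     new int[]{0, 1, 1}, new int[]{1, 1});
        System.out.println(sequential(shards) + " " + treeReduce(shards));
    }
}
```

Both modes agree only when the combining step is associative, as addition is; that is the property a tree reduction relies on.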
11. GATK: Workflow
Walkers: Analyses written by end-users
[Figure: at a single locus, the dbSNP and exon annotations, the reference base, and the pileup of read bases are passed to the analysis tool]
• Walkers (analyses) can easily be written by end users. The GATK is
distributed with a significant library of walkers.
• Only the reads, reference, and reference metadata applicable to a single-base
location are presented to the analysis tool.
• The GATK provides tools to filter the pileup automatically or on demand.
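Filtering a pileup on demand can be pictured with the following sketch. The types here are hypothetical stand-ins written for this example (`PileupFilterSketch`, `Element`, and both filter methods); they are not the GATK's pileup classes, and the thresholds are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

final class PileupFilterSketch {
    /** One read's contribution at a locus (hypothetical, not a GATK class). */
    static final class Element {
        final char base;
        final int baseQuality;
        final int mappingQuality;

        Element(char base, int baseQuality, int mappingQuality) {
            this.base = base;
            this.baseQuality = baseQuality;
            this.mappingQuality = mappingQuality;
        }
    }

    /** Returns a new pileup without reads whose mapping quality is zero. */
    static List<Element> withoutMappingQualityZero(List<Element> pileup) {
        List<Element> kept = new ArrayList<>();
        for (Element e : pileup) {
            if (e.mappingQuality > 0) {
                kept.add(e);
            }
        }
        return kept;
    }

    /** Returns a new pileup keeping only bases at or above a minimum base quality. */
    static List<Element> withMinBaseQuality(List<Element> pileup, int minBaseQuality) {
        List<Element> kept = new ArrayList<>();
        for (Element e : pileup) {
            if (e.baseQuality >= minBaseQuality) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Element> pileup = List.of(new Element('A', 30, 60),
                                       new Element('C', 30, 0),
                                       new Element('A', 5, 60));
        List<Element> filtered = withMinBaseQuality(withoutMappingQualityZero(pileup), 10);
        System.out.println(filtered.size());
    }
}
```

Each filter returns a new list and leaves its input untouched, so a walker can chain only the filters it needs.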
12. GATK: Workflow
Other data access patterns
Traversal Type       Description
Reads                Call map per read, along with the reference and
                     reference-ordered metadata spanning that read.
Duplicates           Call map for each set of duplicate reads.
Read pair (naïve)    Call map for each read and its mate (naïve: requires
                     the input BAM to be sorted in query name order).
Straightforward (but not necessarily easy) to add any new
access pattern involving streaming data.
13. GATK: Additional features
Additional inputs and outputs
Reference metadata
• Support for additional input data that is sorted in reference
order can easily be added to the GATK.
• Input types can be added by creating two new classes: a
feature (data access object) and a codec (parser).
• New file formats are indexed automatically.
• New data types are autodiscovered via a classpath search.
• Joint initiative with IGV.
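The feature/codec pairing can be sketched as two small classes. Everything below is hypothetical: the four-column tab-delimited format, `IntervalFeature`, and `IntervalCodec` are invented for illustration and do not implement the framework's real feature or codec interfaces.

```java
final class FeatureCodecSketch {
    /** Feature: a data access object for one record tied to a reference interval. */
    static final class IntervalFeature {
        final String contig;
        final int start;
        final int end;
        final String name;

        IntervalFeature(String contig, int start, int end, String name) {
            this.contig = contig;
            this.start = start;
            this.end = end;
            this.name = name;
        }
    }

    /** Codec: a parser that turns one line of the file into a feature. */
    static final class IntervalCodec {
        IntervalFeature decode(String line) {
            String[] fields = line.split("\t");
            if (fields.length < 4) {
                throw new IllegalArgumentException(
                    "Expected 4 tab-delimited fields but found " + fields.length);
            }
            return new IntervalFeature(fields[0],
                                       Integer.parseInt(fields[1]),
                                       Integer.parseInt(fields[2]),
                                       fields[3]);
        }
    }

    public static void main(String[] args) {
        IntervalFeature f = new IntervalCodec().decode("chr6\t100\t200\texon1");
        System.out.println(f.contig + ":" + f.start + "-" + f.end + " " + f.name);
    }
}
```

Keeping parsing in the codec and data access in the feature means a walker only ever sees typed fields, never raw lines.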
Additional I/O
• Analysis parameters can be added to a walker by annotating a
field in the walker with an @Argument annotation.
• Command-line argument types can become very sophisticated.
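How an annotated field might be populated from the command line can be sketched with standard Java reflection. This is not the GATK's argument parser: the `Arg` annotation, `DemoWalker`, and `bind` are hypothetical, and only `double` fields are handled.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;

final class ArgumentSketch {
    /** Hypothetical stand-in for an argument annotation. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Arg {
        String fullName();
        boolean required() default false;
    }

    /** Hypothetical walker with one tunable parameter and a default value. */
    static final class DemoWalker {
        @Arg(fullName = "log_odds_score")
        double lodScore = 3.0;
    }

    /** Copies parsed command-line values into the annotated fields of target. */
    static void bind(Object target, Map<String, String> commandLine) {
        for (Field field : target.getClass().getDeclaredFields()) {
            Arg arg = field.getAnnotation(Arg.class);
            if (arg == null) {
                continue; // not an argument field
            }
            String raw = commandLine.get(arg.fullName());
            if (raw == null) {
                if (arg.required()) {
                    throw new IllegalArgumentException("Missing argument: " + arg.fullName());
                }
                continue; // optional and absent: keep the field's default
            }
            if (field.getType() != double.class) {
                throw new IllegalArgumentException("Unsupported type: " + field.getType());
            }
            try {
                field.setAccessible(true);
                field.setDouble(target, Double.parseDouble(raw));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        DemoWalker walker = new DemoWalker();
        bind(walker, Map.of("log_odds_score", "5.5"));
        System.out.println(walker.lodScore);
    }
}
```

The walker author declares a field and an annotation; discovery, parsing, and assignment stay in one place outside the analysis code.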
14. Walkers: Example
A simple Bayesian genotyper
• GATK Overview and Concepts
• GATK Workflow
• Example: A Simple Bayesian Genotyper
– A functional genotyper in under 150 lines of code
– A minimal example: calls are much lower in quality than the UnifiedGenotyper
15. Walkers: Example
A simple Bayesian genotyper: the model
Bayesian model:
$$L(G \mid D) = P(G)\,P(D \mid G), \qquad P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G)$$
Here L(G | D) is the likelihood for the genotype, P(G) is the prior for the genotype, P(D | G) is the likelihood of the data given the genotype, and the product over good bases is the independent base model.
• Likelihood of data computed using pileup of bases and associated quality scores at given locus
• Only "good bases" are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS
• L(G|D) computed for all 10 genotypes
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper
for a more complete approach
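The independent base model can be worked through in a few lines of standalone Java. This sketch makes several assumptions of its own: a flat prior (all genotypes start at log10 likelihood 0), no "good base" filtering, and hypothetical names (`GenotypeLikelihoodSketch`, `log10Likelihoods`, `best`). Each base with error rate epsilon contributes 1 - epsilon when it matches an allele and epsilon / 3 otherwise, averaged over the two alleles.

```java
import java.nio.charset.StandardCharsets;

final class GenotypeLikelihoodSketch {
    // The 10 unordered diploid genotypes over A, C, G, T.
    static final String[] GENOTYPES =
        {"AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"};

    /** Returns log10 P(D | G) for each genotype, assuming independent bases. */
    static double[] log10Likelihoods(byte[] bases, byte[] quals) {
        double[] result = new double[GENOTYPES.length];
        for (int g = 0; g < GENOTYPES.length; g++) {
            for (int i = 0; i < bases.length; i++) {
                // Convert the Phred-scaled quality to an error probability.
                double epsilon = Math.pow(10, quals[i] / -10.0);
                double p = 0;
                for (char allele : GENOTYPES[g].toCharArray()) {
                    p += allele == bases[i] ? 1 - epsilon : epsilon / 3;
                }
                result[g] += Math.log10(p / 2); // average over the two alleles
            }
        }
        return result;
    }

    /** Returns the genotype with the highest log10 likelihood. */
    static String best(double[] log10Likelihoods) {
        int bestIndex = 0;
        for (int g = 1; g < log10Likelihoods.length; g++) {
            if (log10Likelihoods[g] > log10Likelihoods[bestIndex]) {
                bestIndex = g;
            }
        }
        return GENOTYPES[bestIndex];
    }

    public static void main(String[] args) {
        byte[] bases = "AACC".getBytes(StandardCharsets.US_ASCII);
        byte[] quals = {30, 30, 30, 30};
        System.out.println(best(log10Likelihoods(bases, quals)));
    }
}
```

With two A and two C bases at Q30, the heterozygous genotype AC scores far above either homozygote, because each mismatching base costs a homozygous genotype roughly 3.5 log10 units.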
16. Walkers: Example
A simple Bayesian genotyper
• Walker specifies the data access pattern and
declares command-line arguments.
• Inheritance defines traversal type.
• Annotation defines command-line argument.
public class GATKPaperGenotyper extends LocusWalker<Integer, Long> {
    @Argument(fullName = "log_odds_score",
              shortName = "LOD",
              doc = "The LOD threshold",
              required = false)
    private double LODScore = 3.0;
17. Walkers: Example
A simple Bayesian genotyper
• Walker prepares the input dataset.
• ReadBackedPileup utility can be used to filter pileup on
demand.
    public Integer map(RefMetaDataTracker tracker,
                       ReferenceContext ref,
                       AlignmentContext context) {
        double likelihoods[] =
            DiploidGenotypePriors.getReferencePolarizedPrior(
                ref.getBase(),
                DiploidGenotypePriors.HUMAN_HETEROZYGOSITY,
                0.01);
        // get the bases and qualities from the pileup
        ReadBackedPileup pileup = context.getBasePileup().
            getPileupWithoutMappingQualityZeroReads();
        byte bases[] = pileup.getBases();
        byte quals[] = pileup.getQuals();
        …
18. Walkers: Example
A simple Bayesian genotyper
• Calculate the likelihood for each possible genotype.
• Determine the best of the calculated genotypes.
        for (GENOTYPE genotype : GENOTYPE.values())
            for (int index = 0; index < bases.length; index++) {
                // our epsilon is the de-Phred scored base quality
                double epsilon = Math.pow(10, quals[index] / -10.0);
                byte pileupBase = bases[index];
                double p = 0;
                for (char r : genotype.toString().toCharArray())
                    p += r == pileupBase ? 1 - epsilon : epsilon / 3;
                likelihoods[genotype.ordinal()] += Math.log10(p /
                    genotype.length());
            }
        Integer sortedList[] = MathUtils.sortPermutation(likelihoods);
19. Walkers: Example
A simple Bayesian genotyper
• Conditionally output the results.
• Use reduce to calculate number of genotypes called.
• Writing to provided output stream is guaranteed to be
thread-safe.
…
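        if (lod > LODScore) {
            // tab-separated: location, genotype, LOD score, reference base
            out.printf("%s\t%s\t%.4f\t%c%n", context.getLocation(),
                       selectedGenotype, lod, (char) ref.getBase());
            return 1;   // one genotype called at this locus
        }
        return 0;       // no call at this locus
    }
    // end of map() function

    // sum the per-locus call counts
    public Long reduce(Integer value, Long sum) {
        return value + sum;
    }

    // the final reduce value is a Long, matching the walker's reduce type
    public void onTraversalDone(Long result) {
        out.printf("Simple Genotyper genotyped %d loci.", result);
    }
}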
20. Walkers: Threading performance
A simple Bayesian genotyper
GATK performance improves nearly linearly as processors are added
21. Genome Analysis Toolkit
1000 Genomes Project
• Supports any BAM-compatible aligner
• All of these tools have been developed in the GATK
• They are memory and CPU efficient, cluster friendly, and easily parallelized
• They are now publicly available and are being used at many sites around the world
Pipeline: Initial alignment → MSA realignment → Q-score recalibration → Base error modeling → Genotyping → SNP filtering
More info: http://www.broadinstitute.org/gsa/wiki/
Support: http://www.getsatisfaction.com/gsa/
22. Acknowledgments
Genome sequencing and analysis group (MPG): Kiran Garimella (Analysis Lead), Michael Melgar, Chris Hartl, Sherman Jia, Eric Banks (Development lead), Ryan Poplin, Guillermo del Angel, Aaron McKenna, Khalid Shakir, Brett Thomas, Corin Boyko
Broad postdocs, staff, and faculty: Anthony Philippakis, Vineeta Agarwala, Manny Rivas, Jared Maguire, Carrie Sougnez, David Jaffe, Nick Patterson, Steve Schaffner, Shamil Sunyaev, Paul de Bakker
1000 Genomes project (in general but notably): Matt Hurles, Philip Awadalla, Richard Durbin, Goncalo Abecasis, Richard Gibbs, Gabor Marth, Thomas Keane, Gil McVean, Gerton Lunter, Heng Li
Genome Sequencing Platform (in general but notably): Lauren Ambrogio, Illumina Production Team, Tim Fennell, Kathleen Tibbetts, Alec Wysoker, Ben Weisburd, Toby Bloom
Copy number group: Bob Handsaker, Jim Nemesh, Josh Korn, Steve McCarroll
Cancer genome analysis: Kristian Cibulskis, Andrey Sivachenko, Gad Getz
Integrative Genomics Viewer (IGV): Jim Robinson, Jesse Whitworth, Helga Thorvaldsdottir
MPG directorship: Stacey Gabriel, David Altshuler, Mark Daly