Highly Sensitive Cloud-Based Read Mapping with CloudBurst
1. CloudBurst
• CloudBurst: Highly Sensitive Short Read
Mapping with MapReduce
• New parallel read-mapping algorithm
optimized for mapping NGS data to the
human genome and other reference
genomes
• Applications include SNP discovery,
genotyping, and personal genomics
2. CloudBurst
• CloudBurst is modeled after the short-read mapping
program RMAP
• Reports either all alignments or the unambiguous
best alignment for each read with any number of
mismatches or differences
• This level of sensitivity could be prohibitively
time-consuming, but CloudBurst uses the open-source
Hadoop implementation of MapReduce to
parallelize execution across multiple compute
nodes.
3. CloudBurst
• Running time
– scales linearly with the number of reads mapped
– shows near-linear speedup as the number of
processors increases
• CloudBurst reduces the running time from
hours to mere minutes for typical jobs
that map millions of short reads to
the human genome.
4. Algorithm Overview
• CloudBurst uses seed-and-extend algorithms to
map reads to a reference genome.
• Seed
– For a read of length r to align with at most k differences,
the alignment must contain a region of length s = r/(k+1),
called a seed, that exactly matches the reference.
• Extend
– CloudBurst attempts to extend each seed match into an
end-to-end alignment with at most k mismatches or
differences
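
The seed-and-extend idea on this slide can be sketched in a few lines of Python. This is a minimal, single-machine illustration and not CloudBurst's implementation. It assumes the seed length is rounded down to an integer, it counts mismatches only (no insertions or deletions), and the function names (seed_length, read_seeds, seed_and_extend) are hypothetical.

```python
from typing import List, Tuple


def seed_length(read_len: int, k: int) -> int:
    """Seed length s = floor(r / (k + 1)) for read length r and k allowed differences."""
    s = read_len // (k + 1)
    if k < 0 or s < 1:
        raise ValueError("need k >= 0 and a read at least k + 1 bases long")
    return s


def read_seeds(read: str, k: int) -> List[Tuple[int, str]]:
    """Split the read into non-overlapping length-s seeds as (offset, seed) pairs."""
    s = seed_length(len(read), k)
    return [(i, read[i:i + s]) for i in range(0, len(read) - s + 1, s)]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_and_extend(read: str, ref: str, k: int) -> List[int]:
    """Reference start positions where the read aligns end-to-end with <= k mismatches."""
    hits = set()
    for offset, seed in read_seeds(read, k):
        # Seed step: find every exact occurrence of this seed in the reference.
        start = ref.find(seed)
        while start != -1:
            aln_start = start - offset  # where the whole read would begin
            if 0 <= aln_start <= len(ref) - len(read):
                # Extend step: compare the full read against the reference window.
                window = ref[aln_start:aln_start + len(read)]
                if hamming(read, window) <= k:
                    hits.add(aln_start)
            start = ref.find(seed, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    ref = "TTGACCGATAGGCTAACGTT"
    read = "CCGATCGGCTAA"  # ref[4:16] with one substituted base
    print(read_seeds(read, 2))
    print(seed_and_extend(read, ref, 2))
```

Because the read is cut into at least k+1 non-overlapping seeds, k mismatches cannot touch every seed, so at least one seed still matches the reference exactly. Several seeds may lead to the same alignment, which is why the sketch collects start positions in a set.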
5. Algorithm Overview
• CloudBurst uses the Hadoop implementation of
MapReduce to catalog and extend the seeds
• Map phase emits
– all length-s k-mers from the reference sequences
– all non-overlapping length-s k-mers from the reads
• Shuffle phase
– read and reference k-mers that share the same sequence are brought together
• Reduce phase
– the seeds are extended into end-to-end alignments
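
The three phases above can be imitated in one process to show how the data flows. The following is a toy simulation in plain Python, not Hadoop code and not CloudBurst's actual implementation. The record layout, the function names, and the choice to let the reducer look up full sequences in dictionaries are all simplifying assumptions, and only mismatches are counted.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout for this sketch: (source, sequence id, seed offset)
Record = Tuple[str, str, int]


def map_reference(ref_id: str, ref: str, s: int) -> Iterator[Tuple[str, Record]]:
    """Map: emit every (overlapping) length-s k-mer of a reference sequence."""
    for i in range(len(ref) - s + 1):
        yield ref[i:i + s], ("ref", ref_id, i)


def map_read(read_id: str, read: str, s: int) -> Iterator[Tuple[str, Record]]:
    """Map: emit the non-overlapping length-s k-mers of a read."""
    for i in range(0, len(read) - s + 1, s):
        yield read[i:i + s], ("read", read_id, i)


def shuffle(pairs: Iterable[Tuple[str, Record]]) -> Dict[str, List[Record]]:
    """Shuffle: group all records that share the same k-mer key."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values, refs, reads, k):
    """Reduce: extend each read/reference pair sharing a seed to an end-to-end alignment."""
    ref_hits = [v for v in values if v[0] == "ref"]
    read_hits = [v for v in values if v[0] == "read"]
    for _, ref_id, ref_pos in ref_hits:
        for _, read_id, read_off in read_hits:
            read, ref = reads[read_id], refs[ref_id]
            start = ref_pos - read_off
            if start < 0 or start + len(read) > len(ref):
                continue  # the read would hang off the end of the reference
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= k:
                yield read_id, ref_id, start, mismatches


def run(refs: Dict[str, str], reads: Dict[str, str], read_len: int, k: int):
    """Run map, shuffle, and reduce in sequence; return sorted unique alignments."""
    s = read_len // (k + 1)
    if s < 1:
        raise ValueError("reads must be at least k + 1 bases long")
    pairs: List[Tuple[str, Record]] = []
    for ref_id, ref in refs.items():
        pairs.extend(map_reference(ref_id, ref, s))
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read, s))
    alignments = set()  # the same alignment can be reached from several seeds
    for values in shuffle(pairs).values():
        alignments.update(reduce_seed(values, refs, reads, k))
    return sorted(alignments)


if __name__ == "__main__":
    refs = {"chr": "TTGACCGATAGGCTAACGTT"}
    reads = {"r1": "CCGATCGGCTAA", "r2": "GGGGGGGGGGGG"}
    for alignment in run(refs, reads, read_len=12, k=2):
        print(alignment)
```

In a real distributed job each reducer sees only the records for its own keys, so the sequence needed for extension would have to travel with each emitted seed rather than be looked up in shared dictionaries as done here.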