This document discusses profiling the AllPaths-LG genome assembler to optimize its performance. It analyzed the CPU and memory usage of each program step on various systems. The profiling identified seven routines that use the most CPU time and I/O. Modules like FindErrors, AlignReads, and CommonPather were prioritized for optimization to reduce assembly time. Future work will involve more detailed profiling and exploring code optimizations for the most resource-intensive modules.
Parallel Benchmarking and Performance Profiling of de novo Genome Assembly Algorithms
1. Implementation
AllPathsLG was profiled on the following unpaired data sets using AllPathsLG-46513, with the memusage script by Liu Yongchao (University of Mainz) on BioU and job accounting scripts on Blacklight.
Abstract
Next Generation Sequencers (NGS) provide high throughput by parallelizing the sequencing process, producing millions of sequences in a relatively short amount of time. Because NGS is still relatively new, the methods used to assemble its data have not been fully explored from an optimization perspective. One such assembler is ALLPATHS-LG, whose profiling for optimization is the focus of this poster.
To carry out the profiling, the CPU and memory usage of each step of the program was analyzed using profilers. The profiling highlighted which steps took the most time and, where possible, each step was optimized accordingly. To maximize the efficiency and throughput of the program as a whole, the steps with the most I/O, memory use, and CPU time were given the highest priority, with the goal of reducing the time needed for sequence assembly.
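As a rough illustration of per-step measurement, the sketch below runs one command as a child process and records its wall-clock time, CPU time, and peak resident memory. It is not the memusage script or the Blacklight job accounting used for this poster; the function name `profile_step` and the demo command are placeholders chosen for this example, and it assumes a Unix-like system where Python's `resource` module is available.

```python
import resource
import subprocess
import sys
import time


def profile_step(name, command):
    """Run one pipeline step as a child process and measure its cost."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # CPU time (user + system) consumed by children during this call.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "step": name,
        "elapsed_s": elapsed,
        "cpu_s": cpu,
        # High-water mark over all children waited for so far, not only
        # this step (kilobytes on Linux, bytes on macOS).
        "max_rss_children": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Placeholder command standing in for a real assembler module.
    print(profile_step("demo_step", [sys.executable, "-c", "sum(range(10**6))"]))
```

Because `ru_maxrss` is a running maximum across children, a per-step memory figure would need each step to be profiled from a fresh parent process, or sampled from `/proc/<pid>/status` while the step runs.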
Background
NGS data output has increased at a rate that outpaces Moore's law, more than doubling each year since the technology was introduced. In 2007, a single sequencing run could produce a maximum of around one gigabase (Gb) of data. By 2011, that figure had nearly reached a terabase (Tb) per run, nearly a 1000× increase in four years. With the ability to rapidly generate large volumes of sequencing data, NGS enables researchers to move quickly from an idea to full data sets in a matter of hours or days. Researchers can now sequence more than five human genomes in a single run, producing data in roughly one week, for a reagent cost of less than $5,000 per genome. Optimizing the sequence alignment code will help cut both time and cost.
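The growth figures quoted above can be turned into an annual rate with a few lines of arithmetic. The sketch below uses the rounded values from the text (about 1 Gb per run in 2007, about 1 Tb per run in 2011) and, as an assumption for comparison, takes Moore's law to mean a doubling every two years.

```python
import math


def growth_summary(start_gb, end_gb, years, moore_doubling_years=2.0):
    """Summarize compound growth between two per-run output figures."""
    fold_increase = end_gb / start_gb
    annual_factor = fold_increase ** (1.0 / years)
    doubling_time_years = math.log(2) / math.log(annual_factor)
    # Assumption: Moore's law modeled as one doubling every two years.
    moore_fold = 2 ** (years / moore_doubling_years)
    return fold_increase, annual_factor, doubling_time_years, moore_fold


fold, annual, doubling, moore = growth_summary(1.0, 1000.0, 2011 - 2007)
print(f"fold increase over the period: {fold:.0f}x")
print(f"implied annual growth factor:  {annual:.2f}x")    # about 5.62x per year
print(f"implied doubling time:         {doubling:.2f} years")  # about 0.40 years
print(f"Moore's-law growth, same span: {moore:.0f}x")     # 4x
```

Under these rounded inputs, sequencing output grows by a factor of roughly 5.6 per year, against a factor of 4 over the whole four-year span for a two-year doubling.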
Analysis
Profiling the code on BioU and Blacklight identified seven routines that consume large amounts of CPU time, as shown in the graphs. These modules also have the most I/O associated with them, which makes them good candidates for optimization.
To get the most out of the optimization effort, factors such as elapsed time, memory used, and I/O all have to be taken into account. Modules such as FindErrors, AlignReads, and CommonPather are good candidates for optimization.
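One simple way to weigh elapsed time, memory, and I/O together is to normalize each metric by its maximum across modules and sum the results. The sketch below shows that idea only; it is not the selection method used for this poster, and the module names and numbers in the example are invented placeholders, not measurements.

```python
def rank_modules(metrics, weights=None):
    """Rank modules by a weighted sum of max-normalized metrics.

    metrics: {module_name: {metric_name: value}}
    weights: optional {metric_name: weight}; defaults to equal weights.
    """
    names = sorted({key for values in metrics.values() for key in values})
    weights = weights or {key: 1.0 for key in names}
    # Largest value per metric; fall back to 1.0 to avoid dividing by zero.
    maxima = {
        key: max(values.get(key, 0.0) for values in metrics.values()) or 1.0
        for key in names
    }
    scores = {
        module: sum(
            weights.get(key, 0.0) * values.get(key, 0.0) / maxima[key]
            for key in names
        )
        for module, values in metrics.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values for illustration only.
example = {
    "ModuleA": {"elapsed_s": 100.0, "memory_mb": 2000.0, "io_mb": 500.0},
    "ModuleB": {"elapsed_s": 50.0, "memory_mb": 4000.0, "io_mb": 1000.0},
    "ModuleC": {"elapsed_s": 10.0, "memory_mb": 400.0, "io_mb": 100.0},
}
for module, score in rank_modules(example):
    print(f"{module}: {score:.2f}")
```

Equal weights treat the three metrics as equally important; passing a `weights` dictionary lets one metric, such as elapsed time, dominate the ranking.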
Acknowledgements
This research was supported by NIH Grants T36-GM-095335 and 2-P41-RR06.
Alexander J. Ropelewski
Dr. Bienvenido Velez
Pittsburgh Supercomputing Center
References
Sante Gnerre, Iain MacCallum, Dariusz Przybylski, Filipe J. Ribeiro, Joshua N. Burton, Bruce J. Walker, Ted Sharpe, Giles Hall, Terrance P. Shea, Sean Sykes, Aaron M. Berlin, Daniel Aird, Maura Costello, Riza Daza, Louise Williams, Robert Nicol, Andreas Gnirke, Chad Nusbaum, Eric S. Lander, and David B. Jaffe. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. PNAS [Online] 2010.
Gperftools. https://code.google.com/p/gperftools/wiki/GooglePerformanceTools. June 13, 2013.
Parallel Benchmarking and Performance Profiling of de novo Genome Assembly Algorithms
Appropriate for NGS Data
Jan Salomon1, Alex Ropelewski2; Bienvenido Velez3
1Electrical and Computer Engineering Department, University of Puerto Rico, Mayaguez
2Pittsburgh Supercomputing Center, Pittsburgh, PA
BioU Results
Species | Number of Fragment Reads | Fragment Read Length | Number of Jump Reads | Jump Read Length
Bifidobacterium bifidum NCIMB 41171 | 1096991 | 101 | 1193262 | 93
Neisseria gonorrhoeae FA19 | 1748810 | 101 | 902879 | 101
Coprobacillus sp. D6 | 1271918 | 101 | 1775443 | 101
Enterococcus casseliflavus 899205 | 1588485 | 101 | 1265671 | 101
Eubacterium sp. 3_1_31 | 826347 | 93 | 828826 | 93
[Figure: Combined Elapsed Time Per Step. Bar chart of Time Taken (seconds, axis 0 to 3000) per AllPathsLG module. Modules shown: PostPatcher, TagCircularScaffolds, KPatch, UnibaseCopyNumber3, CleanCorrectedReads, CleanAssembly, CloseUnipathGaps, RebuildAssemblyFiles, FixLocal, CommonPather, FindErrors, Other (<97), UnipathPatcher, LocalizeReadsLG, AlignReads.]
[Figure: Combined VMRSS (MB). Bar chart of Memory Used (MB, axis 0 to 50000) per AllPathsLG module. Modules shown: ShaveUnipathGraph, UnipathPatcher, CloseUnipathGaps, RemoveDodgyReads, FixLocal, CleanCorrectedReads, MergeNeighborhoods1, AlignReads, FindErrors, UnibaseCopyNumber3, LocalizeReadsLG, Other (<3.8MB), CommonPather.]
Blacklight Results
Blacklight I/O Profiling Results
[Figure: Logical I/O Reads. Bar chart of Size (MB, axis 0 to 500000) per AllPathsLG module. Modules shown (names truncated in the chart): SamplePairedRea, UnipathPatcher, FixSomeIndels, MergeNeighborho, FixLocal, RemoveHighCNAli, ShaveUnipathGra, KPatch, AlignReads, CleanCorrectedR, UnibaseCopyNumb, RecoverUnipaths, LocalizeReadsLG, FindErrors, CommonPather, Other (<54000).]
[Figure: Logical I/O Written. Bar chart of Size (MB, axis 0 to 800000) per AllPathsLG module. Modules shown (names truncated in the chart): SamplePairedRea, RemoveHighCNAli, FixSomeIndels, KPatch, MergeNeighborho, UnipathPatcher, ShaveUnipathGra, AlignReads, CleanCorrectedR, RecoverUnipaths, FixLocal, LocalizeReadsLG, FindErrors, UnibaseCopyNumb, CommonPather, Other (<83000).]
Command Name | Characters Read | Characters Written
AlignReads | 110550.85 | 110970.454
CleanCorrectedR | 113703.52 | 114114.204
CommonPather | 369853.39 | 372239.993
FindErrors | 224893.88 | 225584.826
FixLocal | 66690.56 | 69286.932
FixSomeIndels | 59103.36 | 60129.073
KPatch | 109716.77 | 109789.758
LocalizeReadsLG | 163704.31 | 164067.357
MergeNeighborho | 69287.47 | 69671.636
Other | 881481.07 | 889387.44
RecoverUnipaths | 150995.8 | 151131.303
RemoveHighCNAli | 73932.16 | 74184.158
SamplePairedRea | 53789.79 | 54281.473
ShaveUnipathGra | 83679.33 | 83973.843
UnibaseCopyNumb | 122950.22 | 124257.897
UnipathPatcher | 53471.03 | 54533.504
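The characters read and written per command in the table above can be summed to rank commands by total I/O. The sketch below copies the table values as they appear, keeps the truncated command names, and leaves out the aggregate "Other" row by default; the function name `rank_by_total_io` is a placeholder for this example.

```python
# (characters read, characters written) per command, copied from the table.
IO_TABLE = {
    "AlignReads": (110550.85, 110970.454),
    "CleanCorrectedR": (113703.52, 114114.204),
    "CommonPather": (369853.39, 372239.993),
    "FindErrors": (224893.88, 225584.826),
    "FixLocal": (66690.56, 69286.932),
    "FixSomeIndels": (59103.36, 60129.073),
    "KPatch": (109716.77, 109789.758),
    "LocalizeReadsLG": (163704.31, 164067.357),
    "MergeNeighborho": (69287.47, 69671.636),
    "Other": (881481.07, 889387.44),
    "RecoverUnipaths": (150995.8, 151131.303),
    "RemoveHighCNAli": (73932.16, 74184.158),
    "SamplePairedRea": (53789.79, 54281.473),
    "ShaveUnipathGra": (83679.33, 83973.843),
    "UnibaseCopyNumb": (122950.22, 124257.897),
    "UnipathPatcher": (53471.03, 54533.504),
}


def rank_by_total_io(table, exclude=("Other",)):
    """Return (command, read + written) pairs, largest total first."""
    totals = {
        name: read + written
        for name, (read, written) in table.items()
        if name not in exclude
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for name, total in rank_by_total_io(IO_TABLE)[:5]:
    print(f"{name:<16} {total:>12.3f}")
```

On these values the three largest named commands are CommonPather, FindErrors, and LocalizeReadsLG, in that order.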
Future Work
Future work will involve profiling at a finer level of detail than the coarse-grained method described in this poster, as well as exploring code optimizations for the most resource-intensive modules.
[Figure: Elapsed Time. Bar chart of Time Taken (seconds, axis 0 to 1600) per AllPathsLG module. Modules shown: FixSomeIndels, SamplePairedReadStats, RemoveDodgyReads, ValidateAllPathsInputs, MakeScaffoldsLG, LocalizeReadsLG, RemoveDodgyReads, UnipathPatcher, CloseUnipathGaps, CleanCorrectedReads, UnibaseCopyNumber3, CommonPather, AlignReads, FixLocal, FindErrors, Other (<110).]
[Figure: Memory Used. Bar chart of Memory Used (MB, axis 0 to 50000) per AllPathsLG module. Modules shown: RemoveHighCNAligns, SamplePairedReadStats, ErrorCorrectJump, FixSomeIndels, UnipathPatcher, ShaveUnipathGraph, FixLocal, RemoveDodgyReads, CloseUnipathGaps, AlignReads, CleanCorrectedReads, FindErrors, UnibaseCopyNumber3, LocalizeReadsLG, Other (<2457), CommonPather.]
[Figure: Percentage of Time Taken of Top 7 Modules (this chart appears twice on the poster with identical labels). Bar chart of Time Taken (percentage, axis 0% to 100%) per data set. Data sets: bifido, clap19, copro, entero, eubac. Modules in the legend: AlignReads, CleanCorrectedReads, CloseUnipathGaps, CommonPather, FindErrors, UnibaseCopyNumber3, UnipathPatcher.]