Talk at dnGASP workshop, April 5, 2011
1. Combining "overlap-layout-consensus" and de Bruijn graph
approaches for de novo genome assembly
Alexey Sergushichev, Anton Alexandrov, Sergey Kazakov,
Sergey Melnikov, Vladislav Isenbaev, Fedor Tsarev
St. Petersburg State University of IT, Mechanics and Optics, Russia
In collaboration with:
Egor Prokhortchouk and Ekaterina Khrameeva
Genoanalytica, Moscow, Russia
Sequence Mapping and Assembly Assessment Project
dnGASP workshop
Barcelona, April 5th, 2011
2. Introduction
• Imagine you have two computers:
– 24 core (Intel Xeon 2.40GHz), 24 GB RAM
– 24 core (AMD Opteron 6174 2.2GHz), 64 GB
RAM
• …But you don’t know about the second
one ☺
• You are to assemble the genome from
dnGASP contest
4. Error Correction: Read Truncation
• Scan each part of each paired-end (PE) read from end until the
first base with quality less than 90%
• Truncate each part of each read at that position
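The truncation rule above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that qualities are Phred scores, that "quality less than 90%" means a base-call accuracy below 0.90, and that the scan runs from the start of each read part (the slide does not say which end is meant). The name `truncate_read` is hypothetical.

```python
def truncate_read(bases, phred_scores, min_accuracy=0.90):
    """Cut a read part just before its first low-quality base.

    Assumption: phred_scores[i] is the Phred quality Q of bases[i],
    so the base-call accuracy is 1 - 10 ** (-Q / 10).
    Assumption: the scan starts at the first base of the read part.
    """
    for i, q in enumerate(phred_scores):
        accuracy = 1.0 - 10.0 ** (-q / 10.0)
        if accuracy < min_accuracy:
            # Drop the low-quality base and everything after it.
            return bases[:i]
    # No low-quality base found: keep the whole read part.
    return bases
```

For a paired-end read, the function would be applied to each of the two parts independently.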
5. Error Correction: Frequency Analysis
• Consider all 30-character substrings (30-mers) of
reads and their reverse complements
• Count the number of occurrences of each
of these substrings
– Occurs rarely – contains error (is untrusted)
– Occurs frequently – is trusted
• Threshold for each case chosen manually
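A small sketch of the frequency analysis described above, using plain strings and an in-memory counter. It is only an illustration: the function names and the `threshold` parameter are hypothetical, and a real run on billions of 30-mers would need the compact encoding and bucketing discussed on the following slides.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(reads, k=30):
    """Count every k-mer of every read together with its reverse complement."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[kmer] += 1
            counts[reverse_complement(kmer)] += 1
    return counts


def trusted_kmers(counts, threshold):
    """k-mers seen at least `threshold` times are trusted; the rest are untrusted."""
    return {kmer for kmer, n in counts.items() if n >= threshold}
```

With `k=3` and reads `["AAC", "AAC", "AAG"]`, a threshold of 2 trusts `AAC` and its reverse complement `GTT`, while `AAG` and `CTT` stay untrusted.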
7. Error Correction: Buckets
• Memory:
– Each substring stored as a 64-bit integer
– Number of occurrences – 32-bit integer
– ~6·10^9 distinct 30-mers in all PE-reads – 72 GB
• Split 30-mers into buckets according to their
prefixes
• Prefix of length k → 4^k buckets
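The memory estimate and the bucketing rule can be checked with a short sketch. The 2-bit-per-base encoding is an assumption (the slide only says each substring fits in a 64-bit integer), and the function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per base (30 bases use 60 bits)."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def bucket_index(kmer, prefix_len):
    """Bucket number = the encoded prefix, i.e. the top 2 * prefix_len bits."""
    return encode_kmer(kmer) >> (2 * (len(kmer) - prefix_len))


def num_buckets(prefix_len):
    """A prefix of length k can take 4 ** k values."""
    return 4 ** prefix_len


def table_bytes(distinct_kmers, key_bytes=8, count_bytes=4):
    """Memory for one 64-bit key plus one 32-bit counter per distinct k-mer."""
    return distinct_kmers * (key_bytes + count_bytes)
```

With about 6·10^9 distinct 30-mers, `table_bytes` gives 6·10^9 · 12 bytes = 72·10^9 bytes, which matches the 72 GB figure and exceeds the RAM of either machine; splitting by prefix lets each bucket be processed on its own.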
8. Error Correction
• Processing each bucket separately
• Consider some untrusted 30-mer
– Try to change one base in it: (30-k)·3 ways
– If only one resulting 30-mer is trusted, fix the corresponding read
• To fix an error in the prefix we could load 3k more buckets into
RAM, or...
• Load nothing and consider the reverse complement of the 30-mer instead
(an error in the prefix becomes an error in the suffix)
A G T A C A T
A T G T A C T
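The single-substitution search and the reverse-complement trick can be sketched as below, using the 7-base example from the slide (AGTACAT and its reverse complement ATGTACT). This is an illustrative sketch with hypothetical names; `trusted` stands for the set of trusted k-mers, and positions inside the bucket prefix are deliberately left untouched.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def correct_kmer(kmer, trusted, prefix_len):
    """Try all (len(kmer) - prefix_len) * 3 single-base changes outside the prefix.

    Returns the corrected k-mer if exactly one candidate is trusted,
    otherwise None (no candidate, or an ambiguous fix).
    """
    candidates = set()
    for pos in range(prefix_len, len(kmer)):
        for base in "ACGT":
            if base != kmer[pos]:
                candidate = kmer[:pos] + base + kmer[pos + 1:]
                if candidate in trusted:
                    candidates.add(candidate)
    if len(candidates) == 1:
        return candidates.pop()
    return None


# Trusted k-mers come with their reverse complements (see frequency analysis).
trusted = {"AGTACAT", "ATGTACT"}

# The erroneous k-mer has its error at index 1, inside a prefix of length 2,
# so it cannot be fixed within its own bucket.
bad = "ACTACAT"
fix_direct = correct_kmer(bad, trusted, prefix_len=2)

# In the reverse complement the same error sits at index 5, outside the prefix.
fix_rc = correct_kmer(reverse_complement(bad), trusted, prefix_len=2)
```

Here `fix_direct` is `None`, while `fix_rc` is `ATGTACT`, whose reverse complement is the intended `AGTACAT`.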
9. Error Correction: Results
• Used machine with 24 cores and 24 GB
RAM for 24 hours
• Number of distinct 30-mers:
– Before: 6 533 327 606
– After: 3 911 459 530 (~40% less)
• Number of trusted 30-mers:
– Before: 3 070 814 230
– After: 3 369 674 264 (~10% more)
11. Quasi-contigs Assembly
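[Figure: a paired-end read with two parts labeled 114 and an insert labeled ~500 is restored "…to this": a continuous sequence AGCT...]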
• Construct a de Bruijn graph from the reads
• Find paths between the vertices corresponding to the
ends of reads with a brute-force algorithm
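A toy sketch of this step: build a de Bruijn graph whose vertices are k-mers and whose edges are trusted (k+1)-mers, then enumerate paths between two end k-mers by exhaustive depth-first search within a length window. The function names are hypothetical, strings stand in for the packed integers a real implementation would likely use, and how the start and end k-mers are taken from the two read parts is not specified on the slide.

```python
from collections import defaultdict


def build_graph(trusted_edges):
    """Each trusted (k+1)-mer is an edge from its first k bases to its last k bases."""
    graph = defaultdict(list)
    for edge in trusted_edges:
        graph[edge[:-1]].append(edge[-1])
    return graph


def find_paths(graph, start, end, min_len, max_len):
    """Brute-force search: return every sequence that begins with `start`,
    ends with `end`, follows graph edges, and has length in [min_len, max_len]."""
    k = len(start)
    results = []
    stack = [start]
    while stack:
        seq = stack.pop()
        if len(seq) >= min_len and seq[-k:] == end:
            results.append(seq)
        if len(seq) < max_len:
            # Extend by every base that continues the current last k-mer.
            for base in graph.get(seq[-k:], ()):
                stack.append(seq + base)
    return results


# Toy example with k = 3: the edges are the 4-mers of ACGTTGCA.
edges = ["ACGT", "CGTT", "GTTG", "TTGC", "TGCA"]
graph = build_graph(edges)
paths = find_paths(graph, "ACG", "GCA", min_len=5, max_len=10)
```

Exactly one path means the insert is restored unambiguously; several paths or none correspond to the "many ways" and "no way" cases. The search is exponential in the worst case, which is why the length bound matters.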
12. T-Services Company
• Overall performance of cluster over 20 Tflops,
based on:
– 2 x AMD Opteron 6174 «Magny-Cours»
2.2 GHz, 64 GB RAM DDR3 1333 MHz
– 2 x Intel Xeon E5410 2.33 GHz, 16 GB RAM
DDR2 667 MHz
– 2 x Intel Xeon E5450 3.0 GHz, 16 GB RAM
DDR2 667 MHz
• Provided exclusive access to node with 64 GB of
RAM
13. Quasi-Contigs Assembly
Parameters
• Used machine with 24 cores and 64 GB of
RAM for 20 hours
• Vertices – 30-mers
• Edges – trusted 31-mers
• Minimal length of quasi-contig – 334
• Maximal length of quasi-contig – 550
14. Quasi-Contigs Assembly Results
• 67% of inserts restored to quasi-contigs
• Remaining inserts:
– ~27% – many ways to restore
– ~6% – no way to restore
17. Overall Results
              n       mean   N50     max       Sum
Newbler: A    401257  3694    7379   6279498   1.482e9
ABySS: A      422207  4635   12580   6279661   1.492e9
ABySS: B      417403  4808   22788   6279463   1.516e9
ABySS: C      526028  3647   14170   6279463   1.522e9
ABySS: D      580217  3275    8070   6279463   1.525e9
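For readers unfamiliar with the N50 column, the sketch below computes it under the common definition: the largest length L such that sequences of length at least L cover at least half of the total. The slides do not state which variant of the definition was used, so this is an assumption, and the function name is hypothetical.

```python
def n50(lengths):
    """Return N50: walk the lengths from longest to shortest and stop
    once the running sum reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```

For lengths 2, 3, 4, 5, 6 (total 20), the two longest sequences already cover 11 of 20 bases, so N50 is 5.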
18. Work in Progress
• Develop a software module to replace
Newbler (contig assembly from quasi-contigs)
• Develop a software module to replace
ABySS for scaffold assembly
• Improve quality of quasi-contigs assembly
• Reduce RAM requirements