Combining "overlap-layout-consensus" and de Bruijn graph
        approaches for de novo genome assembly
     Alexey Sergushichev, Anton Alexandrov, Sergey Kazakov,
        Sergey Melnikov, Vladislav Isenbaev, Fedor Tsarev
St. Petersburg State University of IT, Mechanics and Optics, Russia

                      In collaboration with:
          Egor Prokhortchouk and Ekaterina Khrameeva
                 Genoanalytica, Moscow, Russia

      Sequence Mapping and Assembly Assessment Project
                     dnGASP workshop
                 Barcelona, April 5th, 2011
Introduction
• Imagine you have two computers:
  – 24 cores (Intel Xeon 2.40 GHz), 24 GB RAM
  – 24 cores (AMD Opteron 6174 2.2 GHz), 64 GB RAM
• …but you don’t know about the second one ☺
• Your task: assemble the genome from the dnGASP contest

Algorithm
[Figure: diagram of the overall assembly algorithm]

Error Correction: Read Truncation
• Scan each mate of each paired-end (PE) read from the end until the first base with quality below 90%
• Truncate each mate at that position
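
One possible reading of the truncation rule, as a minimal sketch. The slide does not state the quality encoding, and its wording leaves the scan direction open to interpretation, so this assumes Phred+33 quality strings, treats "quality below 90%" as Phred Q < 10 (error probability above 10%), and cuts each mate at the first such base counted from its start. `truncate_read`, `PHRED_OFFSET` and `MIN_PHRED` are illustrative names, not taken from the talk.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) quality encoding
MIN_PHRED = 10     # Q10 means a 10% error probability, i.e. 90% accuracy


def truncate_read(seq, qual, min_phred=MIN_PHRED):
    """Cut a read at the first base whose Phred score is below min_phred.

    Returns the kept prefix of the sequence and of its quality string.
    """
    for i, ch in enumerate(qual):
        if ord(ch) - PHRED_OFFSET < min_phred:
            return seq[:i], qual[:i]
    return seq, qual


if __name__ == "__main__":
    # '#' encodes Q2, so the read is cut in front of it.
    print(truncate_read("ACGTACGT", "IIII#III"))
```

Each mate of a pair would be passed through the function separately; data with a different quality offset only needs a different `PHRED_OFFSET`.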
Error Correction: Frequency Analysis
• Consider all 30-character substrings (30-mers) of the reads and of their reverse complements
• Count the occurrences of each 30-mer
  – Occurs rarely – likely contains an error (is untrusted)
  – Occurs frequently – is trusted
• The threshold separating the two cases was chosen manually
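
The counting step can be sketched as follows, assuming reads are plain strings over A, C, G, T and that all counts fit in memory, which is only realistic for toy inputs. `count_kmers` and `trusted_kmers` are hypothetical helper names, and the threshold is a parameter because the talk chose it by hand.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k=30):
    """Count every k-mer of every read and of its reverse complement."""
    counts = Counter()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    return counts


def trusted_kmers(counts, threshold):
    """k-mers seen at least `threshold` times are treated as trusted."""
    return {kmer for kmer, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    toy_reads = ["AAC"] * 4 + ["GGA"]
    counts = count_kmers(toy_reads, k=3)
    print(sorted(trusted_kmers(counts, threshold=4)))
```

Counting both strands means a k-mer and its reverse complement always receive the same count, so trust decisions do not depend on which strand a read came from.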
Error Correction: Distribution Curve
[Figure: distribution of 30-mer occurrence counts; x-axis ticks from 1 to 47, y-axis ticks from 0 to 3 000 000 000]
• < 4 occurrences – untrusted
• All other 30-mers – trusted
Error Correction: Buckets
• Memory:
  – Each 30-mer is stored as a 64-bit integer
  – Its number of occurrences is a 32-bit integer
  – ~6·10^9 distinct 30-mers in all PE reads – 72 GB
• Split the 30-mers into buckets according to their prefixes
• A prefix of length k → 4^k buckets
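
The memory estimate and the bucketing rule can be checked with a few lines. This sketch assumes a 2-bit-per-base encoding (A=0, C=1, G=2, T=3), a common way to fit a 30-mer into 60 of the 64 available bits; the talk does not say which encoding was used, and the function names are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit base encoding


def encode(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def bucket_of(kmer, prefix_len):
    """Bucket index = encoded prefix, so there are 4**prefix_len buckets."""
    return encode(kmer[:prefix_len])


def table_bytes(distinct_kmers):
    """8 bytes for the k-mer plus 4 bytes for its counter."""
    return distinct_kmers * (8 + 4)


if __name__ == "__main__":
    print(table_bytes(6 * 10**9))            # bytes for ~6·10^9 30-mers
    print(4 ** 3)                            # buckets for a prefix of length 3
    print(bucket_of("ACGT" + "A" * 26, 2))   # bucket of a 30-mer starting with AC
```

With 12 bytes per entry, 6·10^9 entries come to 72·10^9 bytes, which matches the 72 GB on the slide and explains why the table had to be split before it could be processed on a 24 GB machine.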
Error Correction
• Process each bucket separately
• Consider an untrusted 30-mer from the bucket
   – Try changing one base outside its prefix: (30-k)·3 ways
   – If exactly one resulting 30-mer is trusted, fix the corresponding read
• To fix an error in the prefix we could load 3k more buckets into RAM, or...
• ...load nothing and consider the reverse complement of the 30-mer instead: an error in the prefix becomes an error in the suffix

   Example: A G T A C A T  →  reverse complement  →  A T G T A C T
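
A minimal sketch of the correction rule, under simplifying assumptions: a single in-memory set stands in for the per-bucket tables, and it contains the trusted k-mers of both strands. The helper names are hypothetical. The second function shows why the reverse complement helps: a substitution in the prefix of a k-mer sits in the suffix of its reverse complement, where it can be fixed without touching another bucket.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def suffix_variants(kmer, prefix_len):
    """Yield every k-mer that differs from `kmer` by one base outside the prefix."""
    for i in range(prefix_len, len(kmer)):
        for base in "ACGT":
            if base != kmer[i]:
                yield kmer[:i] + base + kmer[i + 1:]


def correct(kmer, trusted, prefix_len):
    """Return the unique trusted variant, or None if there are zero or several."""
    hits = [v for v in suffix_variants(kmer, prefix_len) if v in trusted]
    return hits[0] if len(hits) == 1 else None


def correct_either_strand(kmer, trusted, prefix_len):
    """Try the k-mer itself, then its reverse complement for prefix errors."""
    fixed = correct(kmer, trusted, prefix_len)
    if fixed is not None:
        return fixed
    rc_fixed = correct(reverse_complement(kmer), trusted, prefix_len)
    return reverse_complement(rc_fixed) if rc_fixed is not None else None


if __name__ == "__main__":
    trusted = {"AGTACAT", "ATGTACT"}  # a 7-mer and its reverse complement
    print(correct_either_strand("CGTACAT", trusted, prefix_len=2))
```

In the toy run the error is in the first base, so the direct search over suffix positions finds nothing, while the reverse complement `ATGTACG` is one suffix substitution away from the trusted `ATGTACT`.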
Error Correction: Results
• Ran on the machine with 24 cores and 24 GB RAM for 24 hours
• Number of distinct 30-mers:
  – Before: 6 533 327 606
  – After: 3 911 459 530 (~40% fewer)
• Number of trusted 30-mers:
  – Before: 3 070 814 230
  – After: 3 369 674 264 (~10% more)
Quasi-Contigs Assembly
• Input: a set of paired-end (PE) reads
• Goal: fill the gap between the two ends of each pair

            From this picture…
Quasi-Contigs Assembly
                  …to this
[Figure: read ends labelled 114 and 114, restored sequence AGCT..., length ~500]

• Construct a de Bruijn graph from the reads
• Find paths between the vertices corresponding to the ends of each read pair, using a brute-force algorithm
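
The search can be sketched as a length-bounded depth-first enumeration. This is an illustration under stated assumptions, not the authors' implementation: vertices are k-mers, each trusted (k+1)-mer is an edge, the start and end vertices are taken as given, and the search stops after `limit` paths because two are enough to call a pair ambiguous. All names are hypothetical.

```python
from collections import defaultdict


def build_graph(trusted_edges):
    """Map each k-mer vertex to the bases that extend it along a trusted (k+1)-mer."""
    graph = defaultdict(list)
    for edge in trusted_edges:
        graph[edge[:-1]].append(edge[-1])
    return graph


def find_paths(graph, start, end, min_len, max_len, limit=2):
    """Enumerate sequences that begin with `start`, end with `end`,
    and have a length between min_len and max_len (at most `limit` of them)."""
    k = len(start)
    found = []
    stack = [start]
    while stack and len(found) < limit:
        seq = stack.pop()
        if len(seq) >= min_len and seq.endswith(end):
            found.append(seq)
        if len(seq) < max_len:
            for base in graph.get(seq[-k:], ()):
                stack.append(seq + base)
    return found


if __name__ == "__main__":
    edges = ["ACGT", "CGTT", "GTTG", "TTGC", "TGCA"]
    graph = build_graph(edges)
    print(find_paths(graph, "ACG", "GCA", min_len=3, max_len=10))
```

Exactly one returned path gives a quasi-contig; several paths or none leave the pair unresolved. The length bounds keep the brute-force search finite even when the graph contains cycles.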
T-Services Company
• Overall cluster performance above 20 TFLOPS, based on:
   – 2 x AMD Opteron 6174 «Magny-Cours» 2.2 GHz, 64 GB RAM DDR3 1333 MHz
   – 2 x Intel Xeon E5410 2.33 GHz, 16 GB RAM DDR2 667 MHz
   – 2 x Intel Xeon E5450 3.0 GHz, 16 GB RAM DDR2 667 MHz
• Provided exclusive access to a node with 64 GB of RAM
Quasi-Contigs Assembly Parameters
• Ran on the machine with 24 cores and 64 GB of RAM for 20 hours
• Vertices – 30-mers
• Edges – trusted 31-mers
• Minimum quasi-contig length – 334
• Maximum quasi-contig length – 550
Quasi-Contigs Assembly Results
• 67% of inserts were restored to quasi-contigs; the rest were not:
  – ~27% – many ways to restore
  – ~6% – no way to restore
Quasi-Contigs Assembly Results
[Figure: length distributions; x-axis ticks from 264 to 544, y-axis ticks from 0 to 1.4E-02]
• Pink – insert lengths
• Blue – quasi-contig lengths
Contigs & Scaffolds Assembly
• Contig assembly
  – Newbler
  – Used quasi-contigs from 24 of the 88 files
  – 60 hours
• Scaffold assembly
  – ABySS
  – 40 hours per library
Overall Results
(n = number of sequences; mean, N50, max and sum are lengths in bases)

Assembly      n        mean    N50      max        sum
Newbler: A    401257   3694    7379     6279498    1.482e9
ABySS: A      422207   4635    12580    6279661    1.492e9
ABySS: B      417403   4808    22788    6279463    1.516e9
ABySS: C      526028   3647    14170    6279463    1.522e9
ABySS: D      580217   3275    8070     6279463    1.525e9
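
For readers unfamiliar with the N50 column, here is a short sketch of the usual definition: the largest length L such that sequences of length at least L cover at least half of the total. The talk does not state how its statistics were computed, so this is the common convention rather than the authors' script, and `n50` is an illustrative name.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

The comparison uses `2 * running >= total` so that it stays in integer arithmetic.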
Work in Progress
• Develop a software module to replace Newbler (contig assembly from quasi-contigs)
• Develop a software module to replace ABySS for scaffold assembly
• Improve the quality of quasi-contig assembly
• Reduce RAM requirements

Questions?



