Fast algorithms for large scale
   genome alignment and
   comparison

                                                             Davide Eynard
                                                       eynard@elet.polimi.it

                         Dipartimento di Elettronica e Informazione
                                               Politecnico di Milano

                                          2007/05/28

Algorithms for Computational Molecular Biology
The article(s)

        A.L. Delcher, S. Kasif, R.D. Fleischmann, J.
         Peterson, O. White, S.L. Salzberg: “Alignment of
         whole genomes”, 1999
        A.L. Delcher, A. Philippy, J. Carlton, S.L.
         Salzberg: “Fast algorithms for large-scale
         genome alignment and comparison”, 2002
        S. Kurtz, A. Philippy, A.L. Delcher, M. Smoot, M.
         Shumway, C. Antonescu, S.L. Salzberg:
         “Versatile and open software for comparing large
         genomes”, 2004



p. 2    2007/05/28          ACMB
The problem

        When the genome sequence of two closely
         related organisms becomes available, one of the
         first questions researchers want to ask is how the
         two genomes align
        Aligning (very) long sequences
          • Single gene sequences may be as long as tens of
            thousands of nucleotides
          • Whole genomes are usually millions of nucleotides
            or larger!




The challenge

        Naïve
          • O(n²) space and time
        Hashing
          • faster, but still partly O(n²)
        Dynamic Programming
          • O(n) space, takes more time
        MUMmer
          • Suffix trees: O(n) space and time
          • LIS: O(k log k) where k is the number of MUMs




The algorithm

       1) Perform a Maximal Unique Match (MUM)
         decomposition of the two genomes
       2) Sort the matches found in the MUM decomposition,
         and extract the LIS (Longest Increasing
         Subsequence) of matches that occur in the same
         order in both genomes
       3) Close the gaps in the alignment, performing
         local identification of large inserts, repeats, small
         mutated regions, tandem repeats and SNPs
       4) Output the alignment
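
Step 1 can be illustrated with a brute-force sketch. This is not the suffix-tree method MUMmer uses; it only shows what counts as a MUM: a match that cannot be extended on either side and whose text occurs exactly once in each sequence. The names `naive_mums`, `count_occurrences` and the `min_len` threshold are assumptions made for this example, and the quadratic scan is only usable on toy inputs.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # A pair that can be extended to the left is not left-maximal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right for as long as the two sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            word = a[i:i + k]
            # Keep the match only if its text is unique in both sequences.
            if (k >= min_len
                    and count_occurrences(a, word) == 1
                    and count_occurrences(b, word) == 1):
                mums.append((i, j, k))
    return mums


print(naive_mums("GATTACA", "CATTAGA"))  # [(1, 1, 4)], the shared "ATTA"
```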



MUM: the suffix tree




Longest Increasing Subsequence
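
A minimal sketch of the O(k log k) LIS step, assuming the MUMs have already been sorted by their position in the first genome so that only their positions in the second genome need to be examined. The function name and the example values are illustrative. MUMmer's own variant also has to deal with match lengths and overlaps, which this sketch ignores.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tails = []        # tails[r]: index of the smallest last element of a run of length r + 1
    tail_values = []  # values[tails[r]], kept sorted so that bisect can be used
    prev = [-1] * len(values)  # back-pointers used to rebuild the subsequence
    for i, v in enumerate(values):
        r = bisect_left(tail_values, v)
        if r == len(tails):
            tails.append(i)
            tail_values.append(v)
        else:
            tails[r] = i
            tail_values[r] = v
        prev[i] = tails[r - 1] if r > 0 else -1
    # Walk the back-pointers from the end of the longest run.
    result = []
    i = tails[-1] if tails else -1
    while i != -1:
        result.append(i)
        i = prev[i]
    return result[::-1]


# Positions in genome B of MUMs that are listed in genome A order.
positions_in_b = [10, 50, 20, 30, 70, 40]
print(longest_increasing_subsequence(positions_in_b))  # [0, 2, 3, 5]
```

Each MUM costs one binary search, which is where the k log k bound comes from.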




Closing the gaps




MUMmer v2.0

        Relaxes the uniqueness constraint
        Faster, takes less space
        Algorithmic improvements
          • memory
          • streaming query
          • new module to cluster matches
        Able to align not only simple DNA sequences, but
         also human chromosomes
        Able to align incomplete genomes and protein
         sequences



Time-space improvements

         The amount of memory used in the suffix tree
          has been reduced
           • from at most 37 bytes/bp to at most 20 bytes/bp
         Speed has increased
           • E. coli vs. V. cholerae: from 74 s and 293 MB to 27 s
               and 100 MB
         Suffix tree is used to store only one sequence,
          while the second one (query) is streamed against
          the suffix tree
           • once the suffix tree has been built, multiple queries
             can be streamed
           • quick way to find the next match
           • matches are maximal on the right hand side
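
The idea of indexing one sequence once and streaming queries against it can be sketched as follows. Note the substitution: a naively built suffix array stands in for the suffix tree to keep the example short, so this shows the query pattern rather than MUMmer's data structure. All function names and the `min_len` threshold are assumptions for the example.

```python
def build_suffix_array(ref):
    """Sort all suffix start positions (naive, fine for short strings)."""
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def common_prefix_len(s, i, t, j):
    """Length of the common prefix of s[i:] and t[j:]."""
    k = 0
    while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
        k += 1
    return k


def longest_match_at(ref, sa, query, q):
    """Longest prefix of query[q:] that occurs in ref, as (ref_pos, length)."""
    target = query[q:]
    lo, hi = 0, len(sa)
    while lo < hi:  # binary search for the insertion point of target
        mid = (lo + hi) // 2
        if ref[sa[mid]:] < target:
            lo = mid + 1
        else:
            hi = mid
    best = (-1, 0)
    # The best match is one of the two neighbours of the insertion point.
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            k = common_prefix_len(ref, sa[idx], query, q)
            if k > best[1]:
                best = (sa[idx], k)
    return best


def stream_query(ref, sa, query, min_len=3):
    """Yield (ref_pos, query_pos, length) matches that are maximal on the right."""
    for q in range(len(query)):
        pos, length = longest_match_at(ref, sa, query, q)
        if length >= min_len:
            yield pos, q, length


ref = "GATTACA"
sa = build_suffix_array(ref)  # built once, reused for every query
for query in ("CATTAGA", "TTACAT"):
    print(query, list(stream_query(ref, sa, query)))
```

Only the reference is held in the index; each query is read position by position, so memory depends on the reference alone.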
Streaming queries




Clustering of matches

         Old version computed a single longest alignment
          between the sequences
         New version works as follows:
           • first, the system outputs a series of separate,
             independent alignment regions
           • clustering is performed by finding pairs of matches
             that are sufficiently close
           • finally, a LIS computation is done within each
             component to yield the most consistent sequence
             of matches in the cluster
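
The clustering step can be sketched with a union-find over the matches. The slide does not define "sufficiently close", so the criterion below (start positions within `max_gap` in both sequences) and the names `cluster_matches` and `max_gap` are assumptions for illustration. Each resulting cluster would then go through its own LIS computation.

```python
def cluster_matches(matches, max_gap=50):
    """Group (pos_a, pos_b, length) matches into connected components."""
    parent = list(range(len(matches)))

    def find(x):
        # Find the component representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Join every pair of matches that is close in both sequences.
    for i in range(len(matches)):
        for j in range(i + 1, len(matches)):
            a1, b1, _ = matches[i]
            a2, b2, _ = matches[j]
            if abs(a1 - a2) <= max_gap and abs(b1 - b2) <= max_gap:
                parent[find(i)] = find(j)

    clusters = {}
    for i, match in enumerate(matches):
        clusters.setdefault(find(i), []).append(match)
    return [sorted(cluster) for cluster in clusters.values()]


matches = [(0, 0, 10), (30, 32, 8), (500, 100, 12), (520, 125, 9), (60, 61, 5)]
for cluster in sorted(cluster_matches(matches)):
    print(cluster)
```

Because closeness is applied transitively, (0, 0, 10) and (60, 61, 5) end up in the same cluster through (30, 32, 8) even though they are more than `max_gap` apart.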




Alignment of incomplete genomes

         In a typical whole-genome shotgun sequencing project,
          the genome is broken up into millions of pieces
           • If the reads are generated at random, then >99%
             of a genome will be covered by sequencing
             enough reads to cover the genome eight times
           • The result of assembly is usually a collection of
             large, unordered DNA sequences called contigs
         NUCmer (nucleotide MUMmer) is a multiple-contig
          alignment program that uses MUMmer 2
          as its core alignment engine
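
The ">99% at eight-fold coverage" figure can be checked with the usual Poisson approximation, under which a base sequenced at average depth c is left uncovered with probability e^(-c). This is an idealised model (uniformly random reads, no edge effects), and the function name is an assumption for the example.

```python
import math


def expected_covered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by at least one read."""
    return 1.0 - math.exp(-coverage)


for c in (1, 2, 4, 8):
    print(f"{c}x coverage: {expected_covered_fraction(c):.4%} of the genome")
```

At 8x the estimate is about 99.97%, consistent with the claim on the slide.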




Alignment of incomplete genomes

        1)NUCmer input: two multi-fasta files representing
          partial or complete assemblies
        2)Create a map of all contig positions within each
          file
        3)Concatenate files separately and run MUMmer to
          find exact matches
        4)Map matches to separate contigs
        5)MUMs are clustered together if they are
          separated by no more than a user-specified
          distance
        6)Dynamic programming is used to align
          sequences between the MUMs
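
Steps 2 to 4 amount to simple bookkeeping, sketched below: record where each contig starts in the concatenated string, then map a match position back to a contig name and a local offset. The `#` separator and the function names are assumptions for this example; a real implementation also has to make sure that no match runs across a separator.

```python
from bisect import bisect_right


def concatenate_contigs(contigs):
    """contigs: list of (name, sequence). Returns (joined, starts, names)."""
    starts, names, parts, offset = [], [], [], 0
    for name, seq in contigs:
        starts.append(offset)   # global start of this contig
        names.append(name)
        parts.append(seq)
        offset += len(seq) + 1  # +1 for the separator character
    return "#".join(parts), starts, names


def to_contig_coords(pos, starts, names):
    """Map a position in the concatenated string to (contig name, local offset)."""
    idx = bisect_right(starts, pos) - 1  # last contig starting at or before pos
    return names[idx], pos - starts[idx]


contigs = [("c1", "ACGT"), ("c2", "GGA"), ("c3", "TTTT")]
joined, starts, names = concatenate_contigs(contigs)
print(joined)                               # ACGT#GGA#TTTT
print(to_contig_coords(6, starts, names))   # ('c2', 1)
```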

NUCmer




PROmer

        1)Given two multi-fasta files, PROmer translates the
          DNA to amino acids
        2)An index is created that maps all protein
          sequences and lengths to the source DNA
        3)Pseudo-proteomes (amino acid sequences) are
          passed to MUMmer
        4)The index is used to translate the matches back
          to the original DNA input
        5)Clustering step
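
Steps 1, 2 and 4 can be sketched as a translation plus an index back to DNA coordinates. The slide does not say how many reading frames are translated; the sketch assumes all six (three per strand) and uses the standard genetic code. The function names and the frame numbering (1 to 3 forward, -1 to -3 reverse) are choices made for this example only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return {frame: protein} for frames 1..3 (forward) and -1..-3 (reverse)."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def protein_to_dna_pos(frame, aa_index, dna_len):
    """0-based forward-strand start of the codon behind amino acid aa_index."""
    offset = abs(frame) - 1
    if frame > 0:
        return offset + 3 * aa_index
    # Reverse frames are read from the far end of the forward strand.
    return dna_len - (offset + 3 * aa_index) - 3


frames = six_frame_translation("ATGGCCTAA")
print(frames[1], frames[-1])          # MA* LGH
print(protein_to_dna_pos(-1, 0, 9))   # 6, i.e. forward bases 6..8 ("TAA")
```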




MUMmer v3.0

         New improvements in code
           • slightly faster than 2.0, 25% less memory
         More modular and configurable
           • possibility to build hybrid systems
         Ability to run a multi-contig query against a multi-
          contig reference
         Non-unique maximal matches
         Speed-up of the NUCmer and PROmer modules
          (approx. 10-fold)
         Graphical viewers



Graphical interfaces




Graphical interfaces




Graphical interfaces




That's All, Folks



                          Thank you!
                     Questions are welcome




