Pairwise sequence alignment

                    Dr Avril Coghlan
                   alc@sanger.ac.uk

Note: this talk contains animations which can only be seen by
downloading and using ‘View Slide show’ in Powerpoint
Sequence comparison
• How can we compare the human & Drosophila
  melanogaster Eyeless protein sequences?
  One method is a dotplot
• A dotplot is a graphical (visual) approach
  Regions of local similarity between the 2 sequences appear as diagonal
       lines of coloured cells (‘dots’)
  [Figure: dotplot of fruitfly Eyeless against human Eyeless; window-size = 10, threshold = 5]
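A dotplot of this kind can be sketched in a few lines of Python. This is only an illustration, not the tool used to draw the figure. It assumes a cell (i, j) is coloured when the two windows starting at positions i and j share at least `threshold` identical letters. The function name `dotplot` and the small demo parameters are hypothetical.

```python
def dotplot(seq_a, seq_b, window=10, threshold=5):
    """Return the set of (i, j) cells that would be coloured in a dotplot.

    Assumption: a cell is coloured when the windows of length `window`
    starting at seq_a[i] and seq_b[j] have at least `threshold`
    identical letters in the same window positions.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= threshold:
                dots.add((i, j))
    return dots


# Small demo on two short sequences, using a window of 3 and requiring
# all 3 letters to match (much smaller than the slide's 10 and 5).
cells = dotplot("QKGSYPVRSTC", "QKGSGPVRSTC", window=3, threshold=3)
for i, j in sorted(cells):
    print(i, j)
```

With these two sequences the coloured cells all lie on the main diagonal, with a break around the one position where the sequences differ.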
Sequence alignment
• A second method for comparing sequences is a
  sequence alignment
• An alignment is an arrangement in columns of 2
  sequences, highlighting their similarity
  The sequences are padded with gaps (dashes) so that wherever
  possible, alignment columns contain identical letters from the two
  sequences involved
  An insertion or deletion is represented by ‘–’ (a gap)
  The symbol “|” is used to represent matches
  e.g. here is an alignment for amino acid sequences
  “QKGSYPVRSTC” & “QKGSGPVRSTC”:

            Q K G S Y P V R S T C
            | | | |   | | | | | |
            Q K G S G P V R S T C
            1 2 3 4 5 6 7 8 9 10 11

  This alignment has 11 columns
  There are 10 matches
  There is 1 mismatch
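The column counts quoted for this alignment can be checked with a short Python sketch. The function name `column_counts` and the use of ‘-’ as the gap character are assumptions made for illustration.

```python
def column_counts(row1, row2, gap="-"):
    """Count columns, matches, mismatches and gap columns in an alignment.

    row1 and row2 are the two rows of the alignment, written without
    spaces and padded with the gap character to the same length.
    """
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            gaps += 1          # insertion or deletion
        elif a == b:
            matches += 1       # would be marked with "|"
        else:
            mismatches += 1    # substitution
    return {
        "columns": len(row1),
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
    }


print(column_counts("QKGSYPVRSTC", "QKGSGPVRSTC"))
```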
Sequence alignment
• An alignment of the human and fruitfly
  (Drosophila melanogaster) Eyeless proteins:
What does an alignment mean?
• An alignment tells you what mutations
  occurred in the sequences since the sequences
  shared a common ancestor
  e.g. an alignment of the human & fruitfly Eyeless suggests:
  (i) there were probably deletion(s) at the start of the human
  Eyeless, or insertion(s) at the start of fruitfly Eyeless

  (ii) there was probably a G→N substitution in human Eyeless, or an N→G
         substitution in fruitfly Eyeless (see arrow)
How do we make an alignment?
• Given two or more sequences, what is the best way
  to align them to each other?
  We want the alignment columns to contain identical letters
• Comparison of similar sequences of similar length is
  straightforward
  eg. for amino acid sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”, we
       line up the identical letters in columns:

               Q K G S Y P V R S T C            sequence 1
               | | | |   | | | | | |
               Q K G S G P V R S T C            sequence 2

  The alignment implies that one mutation occurred since the two
  sequences shared a common ancestor
  That is, the alignment implies there was a G→Y substitution in
  sequence 1 or a Y→G substitution in sequence 2
Problem
• Are there other possible plausible alignments for
  sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”?
Answer
• Are there other possible plausible alignments for
  sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”?
  There are many other possible alignments, e.g.:

  Q K G S Y - P V R S T C
  | | |       | | | | | |
  Q K G - S G P V R S T C

  Q K G S - Y P V R S T C
  | | | |       | | | | |
  Q K G S G P - V R S T C

  Q K G - - - - - S Y P V R S T C
  | | |           |           | |
  Q K G S G P V R S - - - - - T C

  Q K - G S Y P V R S T C
  | |                   |
  Q K G S G P V R S T - C

  etc. etc. etc. . . .
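Each of these really is an alignment of the same two sequences: taking the gaps out of each row gives back the original sequence. A minimal Python check, assuming ‘-’ is the gap character (the function name `is_alignment_of` is hypothetical):

```python
def is_alignment_of(row1, row2, seq1, seq2, gap="-"):
    """Return True if (row1, row2) is a valid alignment of seq1 and seq2.

    The checks are: both rows have the same number of columns, no column
    is a gap in both rows, and removing the gaps recovers each sequence.
    """
    if len(row1) != len(row2):
        return False
    if any(a == gap and b == gap for a, b in zip(row1, row2)):
        return False
    return row1.replace(gap, "") == seq1 and row2.replace(gap, "") == seq2


seq1, seq2 = "QKGSYPVRSTC", "QKGSGPVRSTC"
alternatives = [
    ("QKGSY-PVRSTC", "QKG-SGPVRSTC"),
    ("QKGS-YPVRSTC", "QKGSGP-VRSTC"),
    ("QKG-----SYPVRSTC", "QKGSGPVRS-----TC"),
    ("QK-GSYPVRSTC", "QKGSGPVRST-C"),
]
for row1, row2 in alternatives:
    print(row1, row2, is_alignment_of(row1, row2, seq1, seq2))
```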
Number of possible pairwise alignments
• There are lots of different possible alignments for
  two sequences that are both of length n
  The number of possible alignments of 2 seqs of length n letters (amino
  acids/nucleotides) is $\binom{2n}{n}$ (“2n choose n”)
  $\binom{2n}{n}$ can be calculated as
      $\binom{2n}{n} = \frac{(2n)!}{n! \cdot n!}$
  where n! (‘n factorial’) = n * (n - 1) * (n - 2) * (n - 3) * ... * 3 * 2 * 1
• For example, for “QKGSYPVRSTC” &
  “QKGSGPVRSTC”, n (length) = 11 letters
  The number of possible alignments of these two sequences is
      $\binom{2 \cdot 11}{11} = \binom{22}{11} = \frac{22!}{11! \cdot 11!} = \frac{22!}{39916800 \cdot 39916800}$
  = 1.124001e+21/1.593351e+15 = 705,432 possible alignments
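The same calculation can be reproduced in Python. The function name `count_alignments` is hypothetical. Integer division is used so that the result stays exact instead of going through floating point.

```python
from math import comb, factorial


def count_alignments(n):
    """Number of possible alignments of two sequences of length n,
    using the '2n choose n' formula (2n)! / (n! * n!)."""
    return factorial(2 * n) // (factorial(n) * factorial(n))


print(factorial(11))          # 11!
print(count_alignments(11))   # 22! / (11! * 11!)
print(comb(22, 11))           # the same value from the built-in binomial
```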
Number of possible pairwise alignments
• Even for relatively short sequences, $\binom{2n}{n}$ is large, so
  there are lots of possible alignments
  e.g. for two sequences that are both 11 letters long, there are
  705,432 possible alignments
• In fact, the number of possible alignments, $\binom{2n}{n}$,
  increases exponentially with the sequence length (n)
  i.e. $\binom{2n}{n}$ grows roughly like $2^{2n}$
  (by Stirling's approximation, $\binom{2n}{n} \approx 2^{2n} / \sqrt{\pi n}$)

  [Figure: plot of the number of possible alignments (y-axis) against the
  length of the sequences, n (x-axis). For two sequences 17 letters long
  (n = 17), there are 2.3 billion possible alignments]
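The growth can be tabulated with a short Python sketch. The comparison column uses Stirling's approximation, 2^(2n) / sqrt(pi * n), which is a standard result rather than something stated on the slide. The function name `growth_table` is hypothetical.

```python
from math import comb, pi, sqrt


def growth_table(lengths):
    """For each n, return (n, exact '2n choose n', Stirling estimate)."""
    rows = []
    for n in lengths:
        exact = comb(2 * n, n)
        estimate = 4 ** n / sqrt(pi * n)   # 4**n is 2**(2n)
        rows.append((n, exact, estimate))
    return rows


for n, exact, estimate in growth_table([5, 11, 17, 20]):
    print(f"n={n:2d}  exact={exact:>15,d}  estimate={estimate:>18,.0f}")
```

For n = 17 the exact count is 2,333,606,220, which is the 2.3 billion quoted for the plot.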
• Many of the possible alignments for 2 seqs are
  implausible as they imply many mutations occurred
  (but mutations are known to be rare)
  e.g. for amino acid sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”, the
        alignment made by lining up the identical letters in columns
        implies only one mutation:

  Q K G S Y P V R S T C
  | | | |   | | | | | |
  Q K G S G P V R S T C

  This alignment implies that 1 G→Y or Y→G substitution occurred

  Many of the alternative alignments for these two sequences imply
  that many more mutations occurred, e.g.:

  Q K G S Y - P V R S T C
  | | |       | | | | | |
  Q K G - S G P V R S T C

  This alignment implies that 1 S→Y or Y→S substitution occurred;
  that 1 insertion of S or deletion of S occurred;
  and that 1 deletion of G or insertion of G occurred
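One rough way to compare candidate alignments is to count the mutations each one implies and prefer the one with the fewest. The sketch below assumes every mismatch column is one substitution and every gap column is one insertion or deletion, which matches how the single-letter gaps are counted above. A run of several gaps could instead be treated as a single event, so this is a simplification. The function name `implied_mutations` is hypothetical.

```python
def implied_mutations(row1, row2, gap="-"):
    """Return (substitutions, indels) implied by an alignment.

    Assumption: each mismatch column is one substitution and each gap
    column is one insertion or deletion.
    """
    substitutions = indels = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            indels += 1
        elif a != b:
            substitutions += 1
    return substitutions, indels


candidates = [
    ("QKGSYPVRSTC", "QKGSGPVRSTC"),
    ("QKGSY-PVRSTC", "QKG-SGPVRSTC"),
    ("QK-GSYPVRSTC", "QKGSGPVRST-C"),
]
# The most plausible candidate is the one implying the fewest mutations.
best = min(candidates, key=lambda rows: sum(implied_mutations(*rows)))
for rows in candidates:
    print(rows, implied_mutations(*rows))
print("fewest implied mutations:", best)
```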
Further Reading
•   Chapter 3 in Introduction to Computational Genomics by Cristianini & Hahn
•   Practical on pairwise alignment in R in the Little Book of R for
    Bioinformatics:
    https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html

Editor's Notes

  1. Made
  2. Made alignment of human.fa and fly.fa using Needleman-Wunsch with default parameters at: http://emboss.bioinformatics.nl/cgi-bin/emboss/needle (EMBOSS needle)
     Human Eyeless (PAX6) from: http://www.treefam.org/cgi-bin/TFseq.pl?id=ENST00000379111.1
     D. melanogaster Eyeless from: http://www.treefam.org/cgi-bin/TFseq.pl?id=FBtr0100396.5
     Viewed in Jalview, and saved as humanfly_needlemanwunsch.png
  3. Made
  4. Made
  5. In R: factorial(22)/( (factorial(11)) * (factorial(11)) )
  6. N.B. (2n choose n) = the binomial coefficient = the number of ways that n things can be 'chosen' from a set of 2n things = ((2n)!)/((n!)*(n!)). This can be shown to be approximately proportional to 2^(2*n) (Deonier, Tavare & Waterman book page 158-9). Graph made using Wolfram Alpha at http://www.wolframalpha.com/ by typing “plot 2n choose n from 1 to 20”.
  7. Made