MSA & rooted/unrooted tree

 "Phylogenetics" is the study or estimation of the evolutionary history that
underlies biological diversity.
 The results of a phylogenetic analysis are usually presented as a collection of nodes
and branches, that is, a tree.
 In such a tree, taxa that are closely related in an evolutionary sense appear close to
each other, and taxa that are distantly related sit on different (far) branches of
the tree.
 Phylogenetic trees are also important for multiple sequence alignment.

 Trees may be rooted or unrooted.
 Rooted trees reflect the most basal ancestor of the tree in question.
 Unrooted trees do not imply a known ancestral root.
 There are competing techniques for rooting a tree; one of the most common is
the use of an "outgroup".
 An outgroup is a species that has unambiguously separated early from the other
species being considered.
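The outgroup idea can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function name `root_with_outgroup`, the adjacency-dictionary representation, and the taxa in the example are all assumptions made for this sketch.

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to `outgroup`.

    `adjacency` maps every node to the list of its neighbours.
    The rooted tree is returned as nested tuples; a leaf is its own name.
    """
    # A leaf of an unrooted tree has exactly one neighbour.
    (neighbour,) = adjacency[outgroup]

    def subtree(node, parent):
        # Everything reachable from `node` without going back to `parent`.
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node
        return tuple(subtree(child, node) for child in children)

    # The root is placed on the outgroup branch: one side is the outgroup,
    # the other side is the rest of the tree.
    return (outgroup, subtree(neighbour, outgroup))


# Hypothetical unrooted tree; "x" and "y" are unnamed internal nodes.
unrooted = {
    "human": ["x"], "chimp": ["x"],
    "mouse": ["y"], "chicken": ["y"],
    "x": ["human", "chimp", "y"],
    "y": ["x", "mouse", "chicken"],
}
rooted = root_with_outgroup(unrooted, "chicken")
print(rooted)
```

Choosing a different outgroup changes only where the root is placed, not the branching pattern of the unrooted tree.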

 Multiple sequence alignment can be viewed as an extension of pairwise sequence
alignment, but the complexity of the computation grows exponentially with the
number of sequences.
 MSA applies to both nucleotide and amino acid sequences.
 It is one of the most essential tools in molecular biology and has been in use since 1987.
 MSA can help reveal biological facts about proteins, such as their
secondary/tertiary structure.
 MSA supports phylogenetic analysis of the sequences, from which
evolutionary trees are constructed.

 Exhaustive search: extension of DP (dynamic programming) to multiple dimensions.
 Progressive alignment: compute a tree of the sequences, based on hierarchical
clustering, and then merge the closest first, greedily. E.g. ClustalW
 Block-based global alignment: find highly conserved regions and then grow the
alignment around these regions. E.g. BLAST
 Iterative search: based on genetic algorithm search.
• Local alignments
 Profile analysis
 Block analysis
 Pattern searching and/or statistical methods

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

 Alignment of 2 sequences is represented as a
2-row matrix
 In a similar way, we represent alignment of 3
sequences as a 3-row matrix
A T _ G C G _
A _ C G T _ A
A T C A C _ A
 Score: the more conserved columns, the better the alignment
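The column-based score can be illustrated with a short Python sketch. It assumes that "conserved" means every row carries the same non-gap character in that column, which is one possible reading of the slide; the function name `conserved_columns` is made up for this example.

```python
def conserved_columns(rows, gap="_"):
    """Count columns in which all rows hold the same non-gap character."""
    count = 0
    for column in zip(*rows):  # zip(*rows) walks the matrix column by column
        if len(set(column)) == 1 and column[0] != gap:
            count += 1
    return count


# The 3-row alignment matrix shown above, one string per row.
alignment = ["AT_GCG_", "A_CGT_A", "ATCAC_A"]
print(conserved_columns(alignment))
```

Under this strict definition only the first column (A, A, A) counts; softer scores, such as sum-of-pairs, also reward partially conserved columns.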

 Align 3 sequences: ATGC, AATC, ATGC
x coordinate:  0  1  1  2  3  4
                 A  -  T  G  C
y coordinate:  0  1  2  3  3  4
                 A  A  T  -  C
z coordinate:  0  0  1  2  3  4
                 -  A  T  G  C
• Resulting path in (x,y,z) space:
(0,0,0) (1,1,0) (1,2,1) (2,3,2) (3,3,3) (4,4,4)

In 2-D, 3 edges in each unit square: C(i,j) is reached from
C(i-1,j-1), C(i-1,j), C(i,j-1)
In 3-D, 7 edges in each unit cube: C(i,j,k) is reached from
C(i-1,j-1,k-1), C(i-1,j,k-1), C(i-1,j-1,k), C(i,j-1,k),
C(i-1,j,k), C(i,j-1,k-1), C(i,j,k-1)
Enumerate all possibilities and choose the best one
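A minimal Python sketch of the 3-D recurrence follows. The scoring is an assumption of this example, not something the slide specifies: each column is scored by sum-of-pairs with match = +1, mismatch or gap = -1, and a gap paired with a gap = 0. The names `MOVES`, `pair_score`, `column_score` and `align3_score` are hypothetical.

```python
from itertools import product

# The 7 ways to enter a cell: every non-empty choice of sequences that advance.
MOVES = [move for move in product((0, 1), repeat=3) if any(move)]


def pair_score(a, b):
    """Assumed scoring: match +1, mismatch or gap -1, gap against gap 0."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def column_score(column):
    """Sum-of-pairs score of one 3-row alignment column."""
    x, y, z = column
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)


def align3_score(s, t, u):
    """Best sum-of-pairs score over all alignments of three sequences."""
    n, m, p = len(s), len(t), len(u)
    worst = float("-inf")
    C = [[[worst] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    C[0][0][0] = 0
    for i, j, k in product(range(n + 1), range(m + 1), range(p + 1)):
        for di, dj, dk in MOVES:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # this predecessor lies outside the grid
            column = (
                s[pi] if di else "-",
                t[pj] if dj else "-",
                u[pk] if dk else "-",
            )
            candidate = C[pi][pj][pk] + column_score(column)
            if candidate > C[i][j][k]:
                C[i][j][k] = candidate
    return C[n][m][p]


print(align3_score("ATGC", "AATC", "ATGC"))
```

The table has (n+1)(m+1)(p+1) cells and each cell examines up to 7 predecessors, which is why this approach is only practical for short sequences.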

 For three sequences of length n, the run time is proportional to the
number of edges in the 3-D grid, i.e. $7n^3$.
 For a k-way alignment, build a k-dimensional Manhattan graph with
 $n^k$ nodes
 Most nodes have $2^k - 1$ incoming edges
 Runtime: $O(2^k n^k)$
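The growth of the grid can be tabulated with a short Python sketch. It uses the slide's approximation of n nodes per dimension and treats every node as having the full number of incoming edges, so the edge count is an upper bound; the function name `grid_size` is an assumption of this example.

```python
def grid_size(n, k):
    """Approximate size of the k-dimensional alignment grid.

    Returns (nodes, edge upper bound) for k sequences of length n,
    taking n**k nodes and at most 2**k - 1 incoming edges per node.
    """
    nodes = n ** k
    edges_upper_bound = (2 ** k - 1) * nodes
    return nodes, edges_upper_bound


for k in (2, 3, 4, 10):
    nodes, edges = grid_size(100, k)
    print(f"k={k:2d}  nodes={nodes:.1e}  edges<={edges:.1e}")
```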
 Consider 2 protein sequences of 100 amino acids in length.
 If it takes $100^2$ ($10^4$) seconds to exhaustively align these sequences, then,
with the run time growing as $n^k$, it will take $10^6$ seconds to align 3 sequences,
$10^8$ to align 4 sequences, etc.
 It will take ~$10^{40}$ seconds to align 20 sequences. One year is ~$3 \times 10^7$ seconds. The
age of the visible universe is ~$0.4 \times 10^{18}$ seconds.

 The greedy method follows the problem-solving heuristic of making the locally
optimal choice at each stage in the hope of finding a global optimum.
 At each stage, the closest pair is merged into a profile, which reduces an
alignment of k sequences to an alignment of k-1 sequences/profiles.
k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG
k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

• Consider these 4 sequences
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC

• There are $\binom{4}{2} = 6$ possible pairwise alignments
Match = +1
Mismatch/gap = -1
s2 GTCTGA
s4 GTCAGC (score = 2)
s1 GAT-TCA
s2 G-TCTGA (score = 1)
s1 GAT-TCA
s3 GATAT-T (score = 1)
s1 GATTCA--
s4 G-T-CAGC (score = 0)
s2 G-TCTGA
s3 GATAT-T (score = -1)
s3 GAT-ATT
s4 G-TCAGC (score = -1)
s2 and s4 are closest; combine:
s2   GTCTGA
s4   GTCAGC
s2,4 GTCt/aGa/c (profile)
new set of 3 sequences:
s1   GATTCA
s3   GATATT
s2,4 GTCt/aGa/c
s1   GATTCA
s3   GATATT
s2,4 GTCt/aGa/c
s1   GATTC--A
s2,4 G-T-CTGA (score = 0)
s3   GATATT-
s2,4 G-TCTGA (score = -1)
s1 and s2,4 are closest; combine:
s1     GATTC--A
s2,4   G-T-CTGA
s1,2,4 Ga/-Tt/-ct/-g/-A (profile)
new set of 2 sequences:
s3     GATATT
s1,2,4 Ga/-Tt/-ct/-g/-A
s3     GATAT-T--
s1,2,4 GAT-TCTGA (score = 1)
Final alignment:
s1,2,3,4 GATa/-Tc/-Tg/-a/-
 Computationally complex
 If the MSA includes matches, mismatches and gaps and also
accounts for the degree of variation, then it can be applied
to only a few sequences
 Difficult to score
 Multiple comparisons are necessary in each column of the MSA for a
cumulative score
 Placement of gaps and scoring of substitutions are more difficult
 Difficulty increases with diversity
 Relatively easy for a set of closely related sequences
 Identifying the correct ancestry relationships for a set
of distantly related sequences is more challenging
 Even more difficult if some members are more alike than
others

 EMBL-EBI
 http://www.ebi.ac.uk/clustalw/
 BCM Search Launcher: Multiple Alignment
 http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
 Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)
 http://www.ibc.wustl.edu/service/msa/
web.warwick.ac.uk/telri/Bioinfo/
http://science.marshall.edu/murraye/
http://www.cs.iastate.edu/~cs544/Lectures/
