3. • The mutation distance: the minimal number of nucleotides that would need to be altered in order for the gene for one protein to code for the other.
• Example: ACTGAT and TCTATC, aligned with gaps:
A C T G A T -
T C T - A T C
3
4. The construction of the tree
• Assume three proteins, A, B and C, and their mutation distances:
      B    C
 A   24   28
 B        32
• There are two questions:
1. Which pair does one join together first?
2. What are the lengths of edges a, b, and c?
5. Which pair does one join together first?
• Simply choose the pair with the smallest mutation distance: here A and B (24).
      B    C
 A   24   28
 B        32
[Figure: tree with leaves A, B and C]
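The rule on this slide can be written as a short sketch. The dictionary layout and the function name `closest_pair` are illustrative assumptions; only the three distances come from the slide.

```python
# Pairwise mutation distances from the slide (A-B, A-C, B-C).
distances = {("A", "B"): 24, ("A", "C"): 28, ("B", "C"): 32}


def closest_pair(dist):
    """Return the pair of taxa with the smallest pairwise distance."""
    return min(dist, key=dist.get)


print(closest_pair(distances))
```

With the distances above, the pair joined first is A and B.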
6. What are the lengths of legs a, b, and c?
      B    C
 A   24   28
 B        32
[Figure: tree with leaves A, B and C and legs a, b and c]
a+b=24    a=10
a+c=28    b=14
b+c=32    c=18
7. • i. a+b=24   ii. a+c=28   iii. b+c=32
• From i: a=24-b. Put this value of a in ii:
• 24-b+c=28; c-b=28-24; c-b=4; c=4+b
• Put this value of c in iii: b+4+b=32; 2b+4=32; 2b=32-4
• b=28/2=14
• Now put the value of b in i: a=24-14=10; then c=4+b=4+14=18
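The substitution steps above amount to solving three linear equations, which has a closed form. The following is a minimal sketch, assuming a three-leaf tree whose legs a, b and c lead to A, B and C; the function name is hypothetical.

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Solve a+b=d_ab, a+c=d_ac, b+c=d_bc for the three leg lengths."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


a, b, c = three_taxon_branch_lengths(24, 28, 32)
print(a, b, c)
```

For the distances on the slide this gives a=10, b=14 and c=18, matching the hand calculation.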
8. • Note that this analysis assumes that there are no multiple substitutions.
• A multiple substitution occurs when a single site undergoes two or more changes (e.g. the ancestral sequence … ATGT … gives rise to two modern sequences: … AGGT … and … ACGT …).
9. Phylogenetic Tree Terminology
[Figure: tree with terminal nodes A, B, C, D and E]
• Terminal nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
• Branches or lineages
• Internal nodes or divergence points: represent hypothetical ancestors of the taxa
• Ancestral node or ROOT of the tree
Based on lectures by C-B Stewart, and by
Tal Pupko
10. Phylogenetic trees diagram the evolutionary
relationships between the taxa
Taxon B
Taxon C
Taxon A
Taxon D
Taxon E
((A,(B,C)),(D,E)) = The above phylogeny
as nested parentheses
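The nested-parentheses notation can be read mechanically. Below is a minimal sketch, assuming a string with no spaces, no branch lengths and no trailing semicolon; `parse_nested` and `leaf_sets` are hypothetical helper names, not part of any library.

```python
def parse_nested(text):
    """Parse a string such as '((A,(B,C)),(D,E))' into nested tuples."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]  # a leaf (taxon name)

    return parse_node()


def leaf_sets(node):
    """Return (leaves under node, list of leaf sets for every internal node)."""
    if isinstance(node, str):
        return frozenset([node]), []
    leaves, found = frozenset(), []
    for child in node:
        child_leaves, child_found = leaf_sets(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


tree = parse_nested("((A,(B,C)),(D,E))")
_, groups = leaf_sets(tree)
for group in groups:
    print(sorted(group))
```

Each printed group is the set of taxa below one internal node of the tree.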
11. Clades
[Figure: the same tree of Taxon A to Taxon E, with the clades (A,(B,C)) and (D,E) marked]
((A,(B,C)),(D,E))
• B and C are more closely related to each other than either is to A.
• A, B, and C form a clade that is a sister group to the clade composed of D and E.
• If the tree has a time scale, then D and E are the most closely related.
13. • Nature acts conservatively, i.e., it does not
develop a new kind of biology for every life
form but continuously changes and adapts a
proven general concept.
• Novel functionalities do not appear because a
new gene has suddenly arisen but are
developed and modified during evolution.
• Thus, alleles of a gene found in a population
arise from a common ancestral gene: they are
HOMOLOGOUS.
14. Homology is not a measure of
similarity; rather, it means that sequences
have a shared evolutionary history
and, therefore, possess a common
ancestral sequence
(Tatusov et al. 1997).
• An all-or-none phenomenon
15. Orthologs
• Homologous proteins from different species
that possess the same function
(e.g., corresponding kinases in a signal
transduction pathway in humans and mice)
are called orthologs.
Paralogs
• Homologous proteins that have different
functions in the same species (e.g., two
kinases in different signal transduction
pathways of humans) are termed paralogs.
16. • A visual representation of orthologs (and
some other commonly confused
terms, paralogs and homologs)
17. Orthologs: "genes that have diverged after a speciation event...
[that] tend to have similar function" (Fulton et al. 2006).
Thus, orthologs are genes whose encoded proteins fulfill
similar roles in different species.
18. • Homology is not quantifiable: either two sequences are homologous or they are not.
• The similarity and identity of two sequences, however, ARE quantifiable.
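19. Identity
• The ratio of the number of identical amino acids or nucleotides relative to the total number of amino acids or nucleotides.
• Example: two sequences of 20 nucleotides with four differences have 16 identical positions, so the identity is 16/20 = 0.8 and the evolutionary distance is 4/20 = 0.2.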
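As a small illustration of the ratio, here is a sketch that counts identical and differing positions in two aligned sequences. The two 20-nucleotide sequences are made up for the example (they differ at four positions), and the function names are hypothetical.

```python
def identity(seq1, seq2):
    """Fraction of aligned positions that carry the same residue."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(seq1, seq2))
    return matches / len(seq1)


def difference_fraction(seq1, seq2):
    """Differences per site (the uncorrected distance)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(x != y for x, y in zip(seq1, seq2))
    return differences / len(seq1)


# Hypothetical example sequences: 20 nucleotides, four differences.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACTTACGAACGTTCGTACCT"
print(identity(seq_a, seq_b), difference_fraction(seq_a, seq_b))
```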
20. Similarity
• Unlike identity, similarity is not as simple to
calculate. Before similarity can be
determined, it must first be defined how similar
the building blocks of sequences are to each
other.
• This is done with the help of similarity matrices
(also known as substitution or scoring matrices), which
specify the probability at which a
sequence transforms into another sequence
over time.
• These probabilities depend on the time and the mutational rate
of nucleotides.
21. • For nucleotide sequences the simplest solution
is an identity matrix ( Fig. 4.2a).
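A minimal sketch of scoring with an identity matrix follows. It assumes a score of 1 for identical nucleotides, 0 otherwise, and 0 for a gap; these values and the name `alignment_score` are illustrative assumptions, not taken from Fig. 4.2a. The two candidate alignments reuse the ACTGAT / TCTATC example from the mutation-distance slide.

```python
NUCLEOTIDES = "ACGT"

# Identity matrix: 1 on the diagonal, 0 everywhere else (assumed values).
IDENTITY_MATRIX = {
    (x, y): (1 if x == y else 0) for x in NUCLEOTIDES for y in NUCLEOTIDES
}


def alignment_score(row1, row2, matrix, gap_score=0):
    """Sum the matrix scores over the columns of an alignment ('-' is a gap)."""
    score = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            score += gap_score
        else:
            score += matrix[(x, y)]
    return score


ungapped = alignment_score("ACTGAT", "TCTATC", IDENTITY_MATRIX)
gapped = alignment_score("ACTGAT-", "TCT-ATC", IDENTITY_MATRIX)
print(ungapped, gapped)
```

Under these assumed scores the gapped alignment scores higher, so it would be preferred.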
22. • For protein sequences, an identity matrix is not
sufficient to describe biological and evolutionary
processes.
• Amino acids are not exchanged with the same
probability as might be conceived theoretically.
• YOU CAN RECALL THE SYNONYMOUS AND
NON-SYNONYMOUS MUTATIONS
23. • For example, an exchange of aspartic acid for glutamic acid is frequently observed;
• an exchange of aspartic acid for tryptophan is seen rarely.
24. • A second reason why the exchange of aspartic acid for glutamic acid occurs more often is that both have similar properties.
• In contrast, aspartic acid and tryptophan are chemically different
– the hydrophobic tryptophan is frequently found in the center of proteins, whereas the hydrophilic aspartic acid occurs more often at the surface.
25. • Amino acid substitution matrices, therefore,
describe the probability at which amino acids
are exchanged in the course of evolution.
• The most commonly used amino acid scoring
matrices are the
PAM (Point Accepted Mutation; Dayhoff et al. 1978) and
BLOSUM (Blocks Substitution Matrix; Henikoff and Henikoff 1992) groups.
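To make the idea concrete, here is a sketch of a lookup in a small excerpt of a substitution matrix. The six scores are quoted from the standard BLOSUM62 table for aspartic acid (D), glutamic acid (E) and tryptophan (W) and should be checked against the published matrix before use; the excerpt layout and the function name are assumptions made for the example.

```python
# Excerpt of BLOSUM62 for D, E and W only (symmetric; one triangle stored).
BLOSUM62_EXCERPT = {
    ("D", "D"): 6,
    ("E", "E"): 5,
    ("W", "W"): 11,
    ("D", "E"): 2,
    ("D", "W"): -4,
    ("E", "W"): -3,
}


def substitution_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up the score for exchanging residue a with residue b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # the matrix is symmetric


print(substitution_score("D", "E"), substitution_score("D", "W"))
```

The positive score for D/E and the negative score for D/W mirror the frequent and rare exchanges described on the slides.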
26. Amino acid     3-letter  1-letter  Properties
Tryptophan     Trp       W         Hydrophobic
Aspartic acid  Asp       D         Hydrophilic, electrically charged (negative)
Glutamic acid  Glu       E         Hydrophilic, electrically charged (negative)
27. NUCLEOTIDE AND AMINO ACID
SEQUENCES ARE
EVOLUTIONARILY DIFFERENT
SO,
WE NEED DIFFERENT CRITERIA AND
MATRICES TO ANALYZE THEM
28. • ( Fig. 4.2 a)
• For nucleotide sequences the simplest solution
is an identity matrix
29. ( Fig. 4.2 b) For Amino Acid Seqs
We need Similarity Matrices
Score: 65 Score: 19
30. Calculation of a global alignment of
two similar protein sequences.
31. Calculation of a global alignment of two similar protein
Sequences
35. Outgroup to root a
phylogenetic tree
• The tree of
human, chimpanzee, gorilla
and orangutan genes is rooted
with a baboon gene because
• we know from the fossil record
that baboons split away from the
primate lineage before the time of
the common ancestor of the other
four species
• Let’s See Members of this
Group
36. Outgroup
[Figure: tree of Chimp, Human, Gorilla and Orangutan (scale bar 0.01), and tree of Chimp, Human, Gorilla, Orangutan and Baboon (scale bar 0.02)]
37. Outgroup
Kiwi
Ostrich
Swan
Ring-necked pheasant
Silver pheasant
Song sparrow
Parrot
Lizard
38. The Design of the phylogenetic TREE does not
change the evolutionary distance among the
various taxa represented.
Kiwi
Struthio camelus
Swan
song sparrow
Ring-necked pheasant
Silver pheasant
Parrot
39. The Design of the phylogenetic TREE does not
change the evolutionary distance among the
various taxa represented.
Kiwi
Struthio camelus
Swan
song sparrow
Ring-necked pheasant
Silver pheasant
Parrot
47. Examples of what can be inferred
from phylogenetic trees
(DNA, protein)
1. Which species are the closest living
relatives of modern humans?
2. Did the infamous Florida Dentist
infect his patients with HIV?
3. What is the relation between HIV
and SIV
48. Relatives of modern humans?
[Figure: two alternative trees]
• Molecular view (mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization); tips: Humans, Chimpanzees, Bonobos, Gorillas, Orangutans; time axis 14 to 0 MYA
• The pre-molecular view; tips: Gorillas, Chimpanzees, Bonobos, Orangutans, Humans; time axis 15-30 to 0 MYA
49.
50. 2. Did the Florida Dentist infect his patients with HIV?
Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:
[Figure: tree with tips, top to bottom: DENTIST, Patient C, Patient A, Patient G, Patient B, Patient E, Patient A, DENTIST, Local control 2, Local control 3, Patient F, Local control 9, Local control 35, Local control 3, Patient D]
• Yes: the HIV sequences from patients A, B, C, E and G fall within the clade of HIV sequences found in the dentist.
• No: Patient F and Patient D.
From Ou et al. (1992) and Page & Holmes (1998)
51. 3. Relating Human HIV to Simian SIV
retroviruses
human immunodeficiency virus
1 (HIV-1), pathogenic
SIVs are not pathogenic in their
normal hosts
52.
53. CD4 proteins on
surface
Phospholipid
membrane
Matrix
Capsid
Viral RNA
Viral enzymes:
- Reverse transcriptase
- Integrase
- Protease
The structure of HIV
IMAGE FROM: Medical Art Service, Munich / Wellcome Images.
54. HIV’s replication cycle
[Figure labels, in order around the cycle]
• HIV attaches to CD4 receptors on T-cell
• Viral core of enzymes and RNA injected into cell
• DNA transcribed from viral RNA
• Double-stranded DNA produced
• Viral integrase: DNA integrates with host chromosome
• Transcription: viral RNA and viral proteins
• Viral protease cuts up proteins
• New virus assembled
• New virus leaves cell
55. Retrovirus genomes accumulate mutations
relatively quickly
• Reverse transcriptase lacks an efficient proofreading activity, so it tends to make
errors when it carries out RNA-dependent
DNA synthesis.
• The molecular clock therefore runs rapidly in
retroviruses.
56. • Genomes that diverged quite recently display
sufficient nucleotide dissimilarity for a
phylogenetic analysis to be carried out.
• Even though the period of interest is less than 100 years, HIV and SIV genomes
contain sufficient data for their relationships to be inferred.
57. •The starting point for this
phylogenetic analysis is RNA extracted
from virus particles.
RT-PCR
58. RT-PCR
Reverse transcription polymerase chain
reaction (RT-PCR) is a variant of polymerase chain
reaction (PCR). It is a laboratory
technique commonly used in molecular biology
where an RNA strand is reverse transcribed into
its DNA complement (complementary
DNA, or cDNA) using the enzyme reverse
transcriptase, and the resulting cDNA is amplified
using PCR.
59.
60. • This tree has a number of
interesting features. First
it shows that different
samples of HIV-1 have
slightly different
sequences, the samples as
a whole forming a tight
cluster, almost a star-like
pattern, that radiates from
one end of the unrooted
tree.
• This star-like topology implies
that the global AIDS epidemic
began with a very small number of
viruses, perhaps just one, which have spread and
diversified since entering the human population.
• The closest relative to HIV-1 among primates is
the SIV of chimpanzees, the implication being that
• this virus jumped across the species barrier
between chimps and humans and initiated
the AIDS epidemic.
62. • However, this epidemic did not
begin immediately: a relatively long
uninterrupted branch links the
center of the HIV-1 radiation with
the internal node leading to the
relevant SIV sequence, suggesting
that after transmission to
humans, HIV-1 underwent a latent
period when it remained restricted
to a small part of the global human
population, presumably in
Africa, before beginning its rapid
spread to other parts of the world.
63. • Other primate SIVs are less closely
related to HIV-1, but
one, the SIV from sooty
mangabey, clusters in the tree with
the second human
immunodeficiency virus, HIV-2.
• It appears that HIV-2 was
transferred to the human
population independently of HIV-1,
and from a different simian
host. HIV-2 is also able to
cause AIDS, but has not, as
yet, become globally epidemic.
1. a+b=24   2. a+c=28   3. b+c=32
From 1: a=24-b. Put this in 2: 24-b+c=28; c-b=28-24; c-b=4; c=4+b.
Put this value of c in 3: b+4+b=32; 2b+4=32; 2b=32-4; b=28/2=14.
Now put the value of b in 1: a=24-14=10; then c=4+14=18.
the evolutionary distance is expressed as the number of nucleotide differences per nucleotide site for each sequence pair. For example, sequences 1 and 2 are 20 nucleotides in length and have four differences, corresponding to an evolutionary difference of 4/20 = 0.2. Note that this analysis assumes that there are no multiple substitutions (also called multiple hits). Multiple substitution occurs when a single site undergoes two or more changes (e.g. the ancestral sequence … ATGT … gives rise to two modern sequences: … AGGT … and … ACGT …). There is only one nucleotide difference between the two modern sequences, but there have been two nucleotide substitutions. If this multiple hit is not recognized then the evolutionary distance between the two modern sequences will be significantly underestimated. To avoid this problem, distance matrices for phylogenetic analysis are usually constructed using mathematical methods that include statistical devices for estimating the amount of multiple substitution that has occurred.
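The note above does not say which statistical method it has in mind. One widely used correction for multiple hits is the Jukes-Cantor model, shown here as a hedged sketch: it assumes all substitutions are equally likely, and the function names are illustrative.

```python
import math


def p_distance(differences, sites):
    """Observed differences per site (no correction for multiple hits)."""
    return differences / sites


def jukes_cantor(p):
    """Jukes-Cantor corrected distance, d = -3/4 * ln(1 - 4p/3)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for this correction")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance(4, 20)  # the 4/20 example from the text
d = jukes_cantor(p)
print(round(p, 3), round(d, 3))
```

For p = 0.2 the corrected distance is slightly larger (about 0.233), reflecting substitutions hidden by multiple hits.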
The lengths of the branches indicate the degree of difference between the genes represented by the nodes.
therefore, one may transfer functional information from one protein to another if both possess a certain degree of similarity. However, this process must be carried out critically, as similar proteins may yet perform different functions, despite, for example, having arisen from a common ancestor.
Homology is not quantifiable: either two sequences are homologous or they are not. The identity or similarity of two sequences is, however, quantifiable.
Orthologs can be defined as "genes that have diverged after a speciation event... [that] tend to have similar function" (Fulton et al. 2006). Thus, orthologs are genes whose encoded proteins fulfill similar roles in different species. The importance of orthologs is quite simply seen when imagining a hypothetical comparison of two genes, A and B, that encode proteins with similar functions in two different species (human and chimp, for example). If one compares the two protein sequences encoded by the orthologs, the truly critical parts of the gene will be conserved. What has remained constant can probably be interpreted as crucial to the functioning; what has changed, minor.
Two alpha chains plus two beta chains constitute HbA, which in normal adult life comprises about 97% of the total hemoglobin; alpha chains combine with delta chains to constitute HbA-2, which with HbF (fetal hemoglobin) makes up the remaining 3% of adult hemoglobin. Alpha thalassemias result from deletions of each of the alpha genes as well as deletions of both HBA2 and HBA1.
LOCUS CONTROL REGION (LCR)
Many thalassemias result from mutations in the coding regions of the globin genes, but a few were shown to map to a 12-kb region upstream of the β-globin gene cluster, the region now called the LCR. The ability of mutations in the LCR to cause thalassemia is a clear indication that disruption of the LCR results in a loss of globin gene expression.
Unlike identity, similarity is not as simple to calculate. Before similarity can be determined, it must first be defined how similar the building blocks of sequences are to each other. This is done with the help of similarity matrices that are also known as substitution or scoring matrices. Similarity matrices specify the probability at which a sequence transforms into another sequence over time. These probabilities depend on the time and the mutational rate of nucleotides.
Here, one assumes that the four nucleotides do not show any similarity to one another, and therefore, only identical nucleotides are factored into the similarity scoring.
One reason for this is the triplet-based genetic code (see Chap. 2). For an exchange of aspartic acid to glutamic acid to occur, only a mutation of the last nucleotide in the triplet codon is required. In contrast, a complete mutation of the whole triplet has to occur in order to exchange aspartic acid for tryptophan (GAT/GAC to TGG).
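The codon argument can be checked by counting differing positions between codons. A minimal sketch, using the standard DNA codons for the three amino acids (GAT/GAC for Asp and TGG for Trp as in the text; GAA/GAG for Glu from the standard genetic code); the helper names are hypothetical.

```python
ASP_CODONS = ["GAT", "GAC"]
GLU_CODONS = ["GAA", "GAG"]
TRP_CODONS = ["TGG"]


def codon_changes(codon1, codon2):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon1, codon2))


def min_changes(codons_from, codons_to):
    """Smallest number of nucleotide changes between any pair of codons."""
    return min(codon_changes(a, b) for a in codons_from for b in codons_to)


print(min_changes(ASP_CODONS, GLU_CODONS))  # Asp to Glu
print(min_changes(ASP_CODONS, TRP_CODONS))  # Asp to Trp
```

Asp to Glu needs a single change at the third codon position, whereas Asp to Trp needs all three positions to change.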
An exchange of aspartic acid for tryptophan, therefore, could greatly alter the tertiary structure of a protein and consequently its function. Such striking amino acid exchanges accompanied by a loss of function rarely happen.
Fig. 4.2. Use of the BLOSUM62 matrix for the construction of an optimal amino acid alignment. Two potential alignments for each are represented, whereby the optimal alignment is shown in green.
THERE IS ASYMMETRY BETWEEN NUCLEIC ACIDS AND THEIR PRODUCTS i.e. PROTEINS
IT IS THE SIMPLEST SOLUTION, BUT NOT THE ONLY OR THE MOST CORRECT ONE. Here, one assumes that the four nucleotides do not show any similarity to one another, and therefore, only identical nucleotides are factored into the similarity scoring.
Fig. 4.2. Scoring matrices allow the computation of optimal alignments. (a) Use of an identity matrix for the construction of an optimal nucleotide alignment. (b) Use of the BLOSUM62 matrix for the construction of an optimal amino acid alignment. Two potential alignments for each are represented, whereby the optimal alignment is shown in green.
Sometimes, interest may focus solely on aligning the most similar stretches within two sequences: a local alignment. With this approach, protein domains and motifs (e.g., ATP binding sites, DNA binding domains, N-glycosylation sites) can be identified. In principle, a local alignment is calculated in the same way as a global alignment, using a substitution matrix and the introduction and extension of gaps.
Fig. 4.4. Calculation of a global alignment of two similar protein sequences. (a) Both sequences are compared in a two-dimensional matrix and the similarity of the amino acids is determined using similarity matrices. Each alignment can be described as a path through the two-dimensional matrix, starting with the highest-scoring amino acid pair at the N-terminus. (b) By adding the values, corresponding scores for the different paths are obtained. The alignment with the highest score is considered optimal (shown in red). (c) The optimal alignment is obtained by the introduction of a gap and contains 10 amino acids, of which seven are identical. Using the BLOSUM62 similarity matrix and a gap penalty of 1.0, a score of 31.0 is achieved.
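Finding the highest-scoring path through the two-dimensional matrix is usually done by dynamic programming (the Needleman-Wunsch approach). The sketch below is a simplified illustration, not the calculation behind Fig. 4.4: it assumes a plain match/mismatch score instead of BLOSUM62 and a linear gap penalty, and all parameter values are assumptions.

```python
def global_alignment_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of s1 and s2 by dynamic programming."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: a prefix aligned against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell keeps the best of: align the two residues, or open a gap
    # in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # diagonal step
                score[i - 1][j] + gap,       # gap in s2
                score[i][j - 1] + gap,       # gap in s1
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

Replacing the match/mismatch rule with a lookup in a substitution matrix would bring the sketch closer to the protein case described in the caption.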
What are your findings? You were given an option of 25 amino acids and asked to prepare a suitable matrix.
The use of an outgroup to root a phylogenetic tree
The tree of human, chimpanzee, gorilla and orangutan genes is rooted with a baboon gene because we know from the fossil record that baboons split away from the primate lineage before the time of the common ancestor of the other four species. For more information on phylogenetic analysis of humans and other primates see
The design and angles of the phylogenetic tree do not change the evolutionary distance among the various taxa represented. Naeem
Is This tree Rooted?
This Tree is Rooted ?
Fig. 4.6. Phylogenetic tree of dopamine receptor sequences. The evolutionary relationship between the sequences is reflected by the length of the branches. Dopamine receptor sequences of invertebrates (Dm, Drosophila melanogaster; Ag, Anopheles gambiae; Am, Apis mellifera) are compared with those of humans (Hs, Homo sapiens). Three clear clusters are formed. As a control, the phylogenetically distant sequence of the Dm histamine receptor was not found in any of the clusters.
Orthologous
Refers to homologous genes located in the genomes of different organisms.
The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA. Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either are to gorillas.
Figure 16.14 Different interpretations of the evolutionary relationships between humans, chimpanzees and gorillas
See the text for details. Abbreviation: Myr, million years.
Abstract
Human immunodeficiency virus type 1 (HIV-1) transmission from infected patients to health-care workers has been well documented, but transmission from an infected health-care worker to a patient has not been reported. After identification of an acquired immunodeficiency syndrome (AIDS) patient who had no known risk factors for HIV infection but who had undergone an invasive procedure performed by a dentist with AIDS, six other patients of this dentist were found to be HIV-infected. Molecular biologic studies were conducted to complement the epidemiologic investigation. Portions of the HIV proviral envelope gene from each of the seven patients, the dentist, and 35 HIV-infected persons from the local geographic area were amplified by polymerase chain reaction and sequenced. Three separate comparative genetic analyses (genetic distance measurements, phylogenetic tree analysis, and amino acid signature pattern analysis) showed that the viruses from the dentist and five dental patients were closely related. These data, together with the epidemiologic investigation, indicated that these patients became infected with HIV while receiving care from a dentist with AIDS.
LIFE CYCLE OF HIV, A RETROVIRUS
Retrovirus genomes accumulate mutations relatively quickly because reverse transcriptase, the enzyme that copies the RNA genome contained in the virus particle into the DNA version that integrates into the host genome (see Section 2.4.2), lacks an efficient proofreading activity (Section 13.2.2) and so tends to make errors when it carries out RNA-dependent DNA synthesis.
This means that the molecular clock runs rapidly in retroviruses, and genomes that diverged quite recently display sufficient nucleotide dissimilarity for a phylogenetic analysis to be carried out. Even though the evolutionary period we are interested in is less than 100 years, HIV and SIV genomes contain sufficient data for their relationships to be inferred by phylogenetic analysis.
In molecular biology, real-time polymerase chain reaction, also called quantitative real time polymerase chain reaction (Q-PCR/qPCR/qrt-PCR) or kinetic polymerase chain reaction (KPCR), is a laboratory technique based on the PCR, which is used to amplify and simultaneously quantify a targeted DNA molecule. For one or more specific sequences in a DNA sample, real-time PCR enables both detection and quantification. The quantity can be either an absolute number of copies or a relative amount when normalized to DNA input or additional normalizing genes.
The procedure follows the general principle of polymerase chain reaction; its key feature is that the amplified DNA is detected as the reaction progresses in real time. This is a new approach compared to standard PCR, where the product of the reaction is detected at its end. Two common methods for detection of products in real-time PCR are: (1) non-specific fluorescent dyes that intercalate with any double-stranded DNA, and (2) sequence-specific DNA probes consisting of oligonucleotides that are labeled with a fluorescent reporter which permits detection only after hybridization of the probe with its complementary DNA target.
Frequently, real-time PCR is combined with reverse transcription to quantify messenger RNA and non-coding RNA in cells or tissues.
Figure 16.15 The phylogenetic tree reconstructed from HIV and SIV genome sequences
The AIDS epidemic is due to the HIV-1M type of immunodeficiency virus. ZR59 is positioned near the root of the star-like pattern formed by genomes of this type. Based on Wain-Hobson (1998).
These simian immunodeficiency viruses (SIVs) are not pathogenic in their normal hosts.