This document discusses comparative genomics and gene family analysis. It describes clustering proteins into families to study evolution, orthology, paralogy, gene duplication and loss. Gene family analysis allows detection of errors in gene structure annotation by comparing sequences. Phylogenetic trees classify gene families and reveal functional divergence. Resources for orthology analysis include Ensembl, OrthoMCLDB and YGOB. The goal of hands-on analysis is to characterize the talin 2 gene family.
3. Applications of clustering the proteome(s)
Gene families form the basis for evolutionary (or phylogenetic) analyses, including:
Detection of orthologs and paralogs
Gene duplication, family expansions, pseudogene formation and gene loss
Species taxonomies
Horizontal Gene Transfer (HGT)
Evolution of gene structure
• Introns
• Protein domain organisation & (re)arrangements
Base composition and codon usage
4. I. Structural annotation: genome-wide versus family-wise
Rationale for family-wise annotation
Since every gene has different (sequence) characteristics and different genes evolve at different rates, using these characteristics to determine homologous gene models will improve the overall quality of the structural annotation.
Properties:
Slow, nearly manual procedure
High-quality gene models revealing novel biological findings
5. Workflow of the family-wise annotation procedure
[Workflow diagram; steps as labeled in the figure]
Collecting experimental representatives
MSA of experimental representatives
HMMbuild: family HMM profile
BLAST / HMMsearch against the species X proteome: putative homologs
Correction of gene models (EST/cDNA, protein motifs, ab initio gene prediction)
Classification using phylogenetic trees
Detailed characterization
HMMER: http://hmmer.janelia.org/
7. BLAST / HMMsearch
1. Use a multiple sequence alignment to create an HMM profile
2. Use the HMM profile to search for similar proteins
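The two steps above can be sketched as a pair of HMMER command lines assembled from Python. This is a minimal sketch, assuming HMMER 3 is installed and on the PATH; the file names and the E-value cutoff of 1e-5 are hypothetical placeholders, not values taken from the slides.

```python
import shutil
import subprocess


def hmmbuild_cmd(msa_path, hmm_path):
    """Step 1: build a profile HMM from a multiple sequence alignment."""
    # hmmbuild takes the output profile first, then the alignment file.
    return ["hmmbuild", hmm_path, msa_path]


def hmmsearch_cmd(hmm_path, proteome_path, tblout_path, evalue=1e-5):
    """Step 2: search a proteome (FASTA) with the profile HMM."""
    # --tblout writes a per-sequence hit table; -E sets the reporting
    # E-value threshold (the default here is an illustrative assumption).
    return ["hmmsearch", "--tblout", tblout_path, "-E", str(evalue),
            hmm_path, proteome_path]


def run_family_search(msa_path, hmm_path, proteome_path, tblout_path):
    """Run both steps; raises if HMMER is not available."""
    for tool in ("hmmbuild", "hmmsearch"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    subprocess.run(hmmbuild_cmd(msa_path, hmm_path), check=True)
    subprocess.run(hmmsearch_cmd(hmm_path, proteome_path, tblout_path),
                   check=True)
```

A call such as `run_family_search("representatives.aln", "family.hmm", "speciesX_proteome.fasta", "hits.tbl")` would then leave the putative homologs in `hits.tbl` (all four file names are hypothetical).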
8. Representatives + putative homologs
[Figure: multiple sequence alignment shown in the BioEdit Sequence Editor]
The suffix finalcds indicates a gene model that was corrected relative to the original gene model generated by ab initio gene prediction.
Multiple sequence alignments assist in the detection and correction of errors in the structural annotation (here, a missed exon).
9. Representatives + putative homologs
The suffix finalcds indicates a gene model that was corrected relative to the original gene model generated by ab initio gene prediction.
Multiple sequence alignments assist in the detection of errors in the structural annotation (here, a false first exon).
10. Examples of family-specific protein motifs
B-type cyclins have HxKF signature
Cyclin destruction boxes (B1-type cyclin R-[AV]LGDIGN)
11. Examples of family-specific protein motifs
[Figure labels: Arabidopsis, Rice]
D-type cyclins contain LxCxE Rb-binding motif
Low conservation of phylogenetic signal at primary sequence level
General rules are rarely general: exceptions (i.e. missing protein
motifs) are frequent and might indicate functional divergence
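The motifs named on the two motif slides can be scanned for with regular expressions. The following is a minimal sketch: it assumes that "x" stands for any residue and that the hyphen in R-[AV]LGDIGN is only a separator. The two test sequences are invented for illustration and are not real cyclins.

```python
import re

# Motif patterns transcribed from the slides; "x" becomes "." (any residue).
MOTIFS = {
    "B-type HxKF signature": re.compile(r"H.KF"),
    "B1-type destruction box": re.compile(r"R[AV]LGDIGN"),
    "D-type LxCxE Rb-binding motif": re.compile(r"L.C.E"),
}


def find_motifs(sequence):
    """Return {motif name: [1-based start positions]} for motifs present.

    Matches are non-overlapping; motifs without a hit are omitted, so a
    missing key flags a sequence that lacks the expected family motif.
    """
    hits = {}
    for name, pattern in MOTIFS.items():
        positions = [m.start() + 1 for m in pattern.finditer(sequence)]
        if positions:
            hits[name] = positions
    return hits


# Hypothetical toy sequences, for illustration only.
print(find_motifs("MSRALGDIGNLVHAKFQ"))
print(find_motifs("MELLCCEGT"))
```

Because exceptions are frequent, an absent motif is better treated as a prompt for closer inspection (annotation error or functional divergence) than as proof that a protein falls outside the family.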
12. Classification using phylogenetic tree construction
A- and B-type cyclins are mitotic cyclins
D-type cyclins are G1-specific
H-type cyclins regulate the activity of CDK-activating kinases
• The complexity of the cyclin gene family appears to be higher in plants than in mammals
• Whether there is functional redundancy within A- and B-type cyclins or different regulation (and expression) of some cyclin subclasses remains to be analyzed
15. II. Orthology & paralogy
A major goal of sequence analysis is evolutionary
reconstruction. It is critical to distinguish between two
principal types of homologous relationships, which differ
in their evolutionary history and functional implications.
Orthologs are homologous genes that evolved through speciation, i.e. evolutionary counterparts derived from a single ancestral gene in the last common ancestor of the two species being compared.
Paralogs are homologous genes that evolved through duplication within the same (perhaps ancestral) genome.
These definitions were first introduced by Fitch (1970).
16. Orthology & paralogy inference
[Figure: organism phylogeny (species tree) of species A, B and C alongside gene phylogenies a) and b) for genes a1, a2, b1, b2, c1 and c2; annotations mark the gene duplication, speciation, inparalogs and outparalogs.]
17. In- and outparalogy
Sonnhammer & Koonin: Orthology, paralogy and proposed classification for paralog subtypes
18. Tree reconciliation
Tree reconciliation is the automatic detection of speciation and duplication events using a species tree and a gene family tree.
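One common way to do this is the LCA (last common ancestor) mapping: each gene-tree node is mapped to the lowest species-tree node that contains all of its species, and a node is called a duplication when it maps to the same species-tree node as one of its children. The sketch below is a minimal illustration under several assumptions: trees are binary nested tuples, the gene-tree topology is invented to match the gene names used on the slides, and the species of a gene is read from the first letter of its name.

```python
def leaf_paths(tree, path=(), out=None):
    """Map each species leaf to its path of child indices from the root."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = path
    else:
        for i, child in enumerate(tree):
            leaf_paths(child, path + (i,), out)
    return out


def lca(p, q):
    """Last common ancestor of two nodes = longest common path prefix."""
    common = []
    for a, b in zip(p, q):
        if a != b:
            break
        common.append(a)
    return tuple(common)


def leaves(tree):
    """List the leaf names below a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def reconcile(gene_tree, species_tree, species_of):
    """Label every internal gene-tree node as speciation or duplication."""
    paths = leaf_paths(species_tree)
    events = []

    def visit(node):
        if isinstance(node, str):
            return paths[species_of(node)]
        left, right = node  # binary gene tree assumed
        ml, mr = visit(left), visit(right)
        m = lca(ml, mr)
        # Duplication if the node maps to the same place as a child.
        kind = "duplication" if m in (ml, mr) else "speciation"
        events.append((tuple(leaves(node)), kind))
        return m

    visit(gene_tree)
    return events


# Hypothetical example: one duplication before the A/B/C radiation.
SPECIES_TREE = (("A", "B"), "C")
GENE_TREE = ((("a1", "b1"), "c1"), (("a2", "b2"), "c2"))
for genes, kind in reconcile(GENE_TREE, SPECIES_TREE, lambda g: g[0].upper()):
    print(kind, genes)
```

In this toy case only the root is labeled a duplication, so a1 and a2 are paralogs, while a1, b1 and c1 are mutually orthologous. Production tools additionally handle gene loss, non-binary trees and uncertainty in the gene tree, none of which is covered here.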
21. Interpreting the output of an all-against-all similarity search
Metrics for sequence similarity:
• E-value, bit score or percent identity
• Alignment coverage
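The metrics above can be combined into a simple hit filter. The sketch below assumes the search output is in the 12-column BLAST tabular format (`-outfmt 6`); the slides do not specify a format. Query lengths are supplied separately because the default columns do not include them. The thresholds and all example rows are hypothetical.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(tabular_text, query_lengths, max_evalue=1e-5,
                min_coverage=0.5):
    """Keep non-self hits passing an E-value and a query-coverage cutoff."""
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(COLUMNS, row))
        if hit["qseqid"] == hit["sseqid"]:
            continue  # self hits carry no information in all-vs-all
        aligned = int(hit["qend"]) - int(hit["qstart"]) + 1
        coverage = aligned / query_lengths[hit["qseqid"]]
        if float(hit["evalue"]) <= max_evalue and coverage >= min_coverage:
            kept.append((hit["qseqid"], hit["sseqid"],
                         float(hit["bitscore"]), round(coverage, 2)))
    return kept


# Invented example rows: a self hit, a good hit, a short local hit
# and a hit with a weak E-value.
EXAMPLE = "\n".join([
    "p1\tp1\t100.0\t200\t0\t0\t1\t200\t1\t200\t1e-120\t400",
    "p1\tp2\t62.5\t180\t60\t2\t11\t190\t5\t184\t3e-60\t210",
    "p1\tp3\t35.0\t40\t26\t0\t100\t139\t7\t46\t2e-7\t48.1",
    "p2\tp4\t28.0\t150\t100\t4\t1\t150\t1\t150\t0.5\t30.2",
])
LENGTHS = {"p1": 200, "p2": 190}
print(filter_hits(EXAMPLE, LENGTHS))  # → [('p1', 'p2', 210.0, 0.9)]
```

Filtering on coverage as well as E-value guards against grouping proteins that share only a single domain, since a short local match can still have a significant E-value.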
22. Clustering of similar sequences
Proteins = vertices ~ nodes
Sequence similarity relationship = edges
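With proteins as nodes and similarity relationships as edges, the simplest possible clustering takes each connected component of the graph as one family (single-linkage clustering). The sketch below uses a union-find structure for this. It is an illustrative baseline only: tools such as OrthoMCL apply more elaborate graph clustering (Markov clustering), and the protein names are hypothetical.

```python
def cluster(proteins, edges):
    """Group proteins into connected components of the similarity graph."""
    parent = {p: p for p in proteins}

    def find(p):
        # Walk up to the component representative, halving paths on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the two components

    families = {}
    for p in proteins:
        families.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in families.values())


# Hypothetical proteins and similarity edges.
PROTEINS = ["p1", "p2", "p3", "p4", "p5", "p6"]
EDGES = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
print(cluster(PROTEINS, EDGES))
# → [['p1', 'p2', 'p3'], ['p4', 'p5'], ['p6']]
```

Note that p1 and p3 end up in the same family without a direct edge between them. This chaining is the main weakness of single linkage: one spurious edge, for example from a shared domain, can merge two unrelated families.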