2. Some important terminology:
Orthologs are genes in different species that evolved from a
common ancestral gene by speciation. Orthologs normally retain the
same function in the course of evolution, so identifying them
is critical for reliable prediction of gene function in newly sequenced
genomes.
Paralogs are genes related by duplication within a genome.
Whereas orthologs usually keep the ancestral function,
paralogs tend to evolve new functions, even if these are related to
the original one.
Speciation is the origin of a new species capable of making a living
in a way different from that of the species from which it arose. As part
of this process the new species also acquires some barrier to genetic
exchange with the parent species.
4. COGs
• Cluster of orthologous genes.
• Clusters of Orthologous Groups (COGs) are groups of three or more orthologous
genes, meaning they are direct evolutionary counterparts and are
considered to be part of an 'ancient conserved domain'. A COG is defined
as three or more proteins from the genomes of distant species that are
more similar to each other than to any other protein within the individual
genomes.
• COGs can be used to predict the function of homologous proteins in
poorly studied species and can also be used to track the evolutionary
divergence from a common ancestor, hence providing a powerful tool for
functional annotation of uncharacterized proteins.
• Important in comparative genomics studies
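The "more similar to each other than to any other protein" criterion can be illustrated with reciprocal best hits. The sketch below is a simplified illustration, not the NCBI procedure: the protein names, genome labels and best-hit table are invented toy data, and `is_reciprocal_best_hit` and `find_triangles` are hypothetical helper names. It reports every triple of proteins from three different genomes in which each pair is a reciprocal best hit.

```python
from itertools import combinations

# Toy data (invented): which genome each protein belongs to.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}

# Toy best-hit table (invented): best_hit[(protein, target_genome)] is the
# most similar protein found in target_genome.
best_hit = {
    ("a1", "B"): "b1", ("a1", "C"): "c1",
    ("b1", "A"): "a1", ("b1", "C"): "c1",
    ("c1", "A"): "a1", ("c1", "B"): "b1",
    ("a2", "B"): "b2", ("a2", "C"): "c1",
    ("b2", "A"): "a2", ("b2", "C"): "c1",
}


def is_reciprocal_best_hit(p, q, genome_of, best_hit):
    """True if p's best hit in q's genome is q, and vice versa."""
    return (best_hit.get((p, genome_of[q])) == q
            and best_hit.get((q, genome_of[p])) == p)


def find_triangles(genome_of, best_hit):
    """Return triples of proteins from three distinct genomes whose
    three pairs are all reciprocal best hits."""
    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[p] for p in trio}) < 3:
            continue  # need three different genomes
        if all(is_reciprocal_best_hit(p, q, genome_of, best_hit)
               for p, q in combinations(trio, 2)):
            triangles.append(trio)
    return triangles


triangles = find_triangles(genome_of, best_hit)
print(triangles)
```

In this toy table, a2 and b2 are reciprocal best hits of each other, but c1 prefers a1 and b1, so only a1, b1 and c1 form a complete triangle.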
5. Application of COG
• The most straightforward application of the COGs is for the prediction of
functions of individual proteins or protein sets, including those from newly
completed genomes.
NCBI provides a COG database that consists of 4,873 COGs covering over
136,000 proteins from the genomes of 50 bacteria, 13 archaea and 3 unicellular
eukaryotes. This database uses completely sequenced genomes to classify
proteins using the orthology concept.
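Function prediction for a protein from a new genome can be sketched as a vote among its best hits: if its best hits in several reference genomes belong to the same COG, the protein is tentatively placed in that COG and inherits its annotation. This is a minimal sketch under stated assumptions: the COG labels, protein names, function text and the `min_genomes=2` threshold are all hypothetical, and `assign_cog` is an illustrative helper, not part of any NCBI tool.

```python
from collections import Counter

# Toy data (invented): COG membership of reference proteins and a
# placeholder functional label for each COG.
cog_of = {"a1": "COG_X", "b1": "COG_X", "c9": "COG_Y"}
cog_function = {"COG_X": "example function X", "COG_Y": "example function Y"}


def assign_cog(best_hits, cog_of, min_genomes=2):
    """Assign a query protein to the COG supported by its best hits.

    best_hits maps a reference genome to the query's best hit in it.
    Returns the COG backed by at least min_genomes genomes, else None.
    """
    votes = Counter()
    for hit in best_hits.values():
        cog = cog_of.get(hit)
        if cog is not None:
            votes[cog] += 1
    if not votes:
        return None
    cog, count = votes.most_common(1)[0]
    return cog if count >= min_genomes else None


# Best hits of a hypothetical new protein in three reference genomes.
query_hits = {"A": "a1", "B": "b1", "C": "c9"}
predicted_cog = assign_cog(query_hits, cog_of)
predicted_function = cog_function.get(predicted_cog)
print(predicted_cog, predicted_function)
```

A protein supported by only one genome is left unassigned, which keeps a single spurious hit from propagating an annotation.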
6. What are some questions that comparative
genomics can address?
• How has the organism evolved?
• What differentiates species?
• Which non-coding regions are important?
• Which genes are required for organisms to survive in a
certain environment?
7. What is Comparative Genomics?
It is the comparison of one genome to another.
Genomics: DNA (gene)
↓ Transcription
Transcriptomics: RNA
↓ Translation
Proteomics: protein
↓ Enzymatic reaction
Metabolomics: metabolite
(Diagram label: Functional Genomics)
8. Difference is in Scale and Direction
Other "omics"
– Scale: one or several genes compared against all other known genes.
– Direction: use the genome to inform us about the entire organism.
Comparative genomics
– Scale: entire genome compared to other entire genomes.
– Direction: use information from many genomes to learn more about the individual genes.
9. Comparative genomics
• Discover what lies hidden in genomic
sequence by comparing sequence
information.
• Main areas
– Whole genome alignment
– Gene prediction
– Regulatory element prediction
– Phylogenomics
– Pharmacogenetics
10. Comparative Genomics
Comparative genomics is a powerful tool for identifying the features and dissecting
the functions of genomes. The approach rests on the principle that selection acting on a
gene or regulatory region constrains the evolution of its sequence, so functional elements
tend to stand out as conserved. Comparison with other
genomes has become an integral part of the analysis of the human genome sequence
and is one of the most effective methods for identifying genes (Batzoglou et al.,
2000; Roest Crollius et al., 2000).
Comparative genomics is a field of biological research in which the genomic features of
different organisms are compared. The genomic features may include the DNA sequence,
genes, gene order, regulatory sequences, and other genomic structural landmarks.
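The idea that selection constrains sequence evolution can be made concrete by scoring each column of a multiple alignment for conservation. The following is a minimal sketch on an invented toy alignment; `column_conservation` is a hypothetical helper, and the score used here (share of sequences carrying the most common residue, with gaps counted against the column) is only one of many possible measures.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, for each alignment column, the fraction of sequences that
    carry the most common non-gap residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Invented toy alignment of the same region from three genomes.
toy_alignment = ["ATGCA", "ATGTA", "ATG-A"]
scores = column_conservation(toy_alignment)
print([round(s, 2) for s in scores])
```

Columns that score near 1.0 across distant genomes are candidates for being under selective constraint; the fourth column of the toy alignment varies and scores lower.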
13. Figure: Distribution and clustering of orthologous genes of Tulsi genome to other related plant
genomes. a. Distribution of gene families among five plant genomes: Ocimum tenuiflorum (Ote,
green), Arabidopsis thaliana (Ath, black rectangle), Oryza sativa (Osa, red), Solanum
lycopersicum (Sly, blue) and Mimulus guttatus (Mgu, black circle). The numbers in the Venn
diagram represent shared and unique gene families across these 5 species obtained by
OrthoMCL.
b. Horizontal stacked bar plot of all the genes in 23 different genomes. This figure shows
ortholog group distribution in all 23 plant species including Tulsi. Each row represents a plant
species - Physcomitrella patens (Ppa), Selaginella moellendorffii (Smo), Oryza sativa (Osa),
Setaria italica (Sit), Zea mays (Zma), Sorghum bicolor (Sbi), Aquilegia caerulea (Aca), Ocimum
tenuiflorum (Ote), Mimulus guttatus (Mgu), Solanum lycopersicum (Sly), Solanum tuberosum
(Stu), Vitis vinifera (Vvi), Eucalyptus grandis (Egr), Citrus sinensis (Csi), Theobroma cacao (Tca),
Carica papaya (Cpa), Brassica rapa (Bra), Arabidopsis thaliana (Ath), Fragaria vesca (Fve), Prunus
persica (Ppe), Glycine max (Gma), Medicago truncatula (Mtr), Populus trichocarpa (Ptr). The bar
graph represents ortholog protein groups for that species subdivided into 22 categories
depending on the degree of sharing with the other 22 plant species; e.g., category 2 represents
the number of orthologous groups that have representatives from the species of interest and
from one more species out of the 23 species selected for the study.
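The categories in panel b can be computed by counting, for a species of interest, how many of its ortholog groups are shared with how many species in total. The sketch below uses invented toy groups over four of the species codes from the caption; `sharing_profile` is a hypothetical helper and does not reproduce the OrthoMCL output behind the figure.

```python
from collections import Counter

# Toy ortholog groups (invented): group id -> species represented in it.
groups = {
    "g1": {"Ote"},
    "g2": {"Ote", "Mgu"},
    "g3": {"Ote", "Mgu", "Sly"},
    "g4": {"Mgu", "Sly"},
    "g5": {"Ote", "Ath"},
}


def sharing_profile(groups, species):
    """Map category k to the number of ortholog groups that contain
    `species` and have representatives from k species in total."""
    profile = Counter()
    for members in groups.values():
        if species in members:
            profile[len(members)] += 1
    return dict(sorted(profile.items()))


ote_profile = sharing_profile(groups, "Ote")
print(ote_profile)
```

Here category 1 counts groups unique to the species, and category 2 counts groups shared with exactly one other species, matching the definition given in the caption.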
14. Background: Shortly after multiple
genome sequences of bacteria, archaea
and unicellular eukaryotes became
available, an attempt at an evolutionary
classification of genes was implemented in
the Clusters of Orthologous Groups of
proteins (COGs). Rapid accumulation of
genome sequences creates
opportunities for refining COGs but also
represents a challenge because of error
amplification.
Conclusion: The arCOGs provide a
convenient, flexible framework for
functional annotation of archaeal
genomes, comparative genomics and
evolutionary reconstructions. Genomic
reconstructions suggest that the last
common ancestor of archaea might
have been (nearly) as advanced as the
modern archaeal hyperthermophiles.
For more info:
ftp://ftp.ncbi.nih.gov/pub/koonin/arCOGs/.
16. MBGD Database
MBGD is a database for comparative analysis of completely sequenced microbial
genomes, the number of which is now growing rapidly. The aim of MBGD is to
facilitate comparative genomics from various points of view such as ortholog
identification, paralog clustering, motif analysis and gene order comparison.
17. Conclusion
The study of Clusters of Orthologous
Genes plays a vital role in
comparative genomics studies.
18. References and links
• NCBI COGs database
• Chapter 22 of the NCBI handbook: The Clusters of Orthologous Groups (COGs)
Database: Phylogenetic Classification of Proteins from Complete Genomes. NCBI
Bookshelf ID: NBK21101.
• NCBI News Letter: Protein Families and Genome Evolution. Published Feb 1998.
• http://homepage.usask.ca/~ctl271/857/def_homolog.shtml
• http://biologydirect.biomedcentral.com/articles/10.1186/1745-6150-2-33
• Nucleic Acids Res. 2015 Jan;43(Database issue):D261-9. doi: 10.1093/nar/gku1223.
Epub 2014 Nov 26.
• http://www.ncbi.nlm.nih.gov/pubmed/25428365
Homology is the relationship between biological structures or sequences that are derived from a common ancestor.
Two things are homologous if they bear the same relationship to one another, such as a certain bone in various forms of the "hand".