2. Comparative genomics involves a comprehensive
systematic comparison of genome sequences.
It begins with powerful computer programs that
identify homologous regions within the genomes
under comparison.
Sets of homologous sequences are then grouped
with their sequences aligned at the base-pair level
in an attempt to define whole genome sequence
alignments.
• Discover what lies hidden in genomic sequences
by comparing sequence information.
3. By comparing the human genome with the genomes of
different organisms, researchers can better understand the
structure and function of human genes and thereby develop
new strategies in the battle against human disease.
In addition, comparative genomics provides a powerful
new tool for studying evolutionary changes among
organisms, helping to identify the genes that are conserved
among species along with the genes that give each
organism its own unique characteristics.
4. SOME QUESTIONS THAT COMPARATIVE GENOMICS CAN
ADDRESS
How has the organism evolved?
What differentiates species?
Which non-coding regions are important?
Which genes are required for organisms to survive in a
certain environment?
5. PHYLOGENETIC DISTANCE
The information that can be gained by
comparing genomes depends largely on
the phylogenetic distance between
them.
Phylogenetic distance is a measure of the
degree of separation between two organisms or
genomes on an evolutionary scale, usually
expressed as the number of accumulated
sequence changes, number of years, or
number of generations.
The greater the distance, the lower the sequence
similarity and the fewer the shared genomic features.
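As a rough illustration of distance "expressed as the number of accumulated sequence changes", the sketch below computes the proportion of differing sites (p-distance) between two aligned sequences and applies the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3). The function names are illustrative, and the Jukes-Cantor model (equal base frequencies and substitution rates) is an assumption chosen for simplicity, not a method named in these slides.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip alignment columns that contain a gap
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: the corrected distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = p_distance("ACGTACGTAC", "ACGTACGTAA")
    print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the raw proportion, because it accounts for multiple changes at the same site.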
6. Comparisons of Genomes at Different Phylogenetic Distances Are
Appropriate to Address Different Questions
7. Broad insights about types of genes can be gleaned by genomic comparisons at very long
phylogenetic distances, e.g., greater than 1 billion years since their separation.
For example, comparing the genomes of yeast, worms, and flies reveals that these
eukaryotes encode many of the same proteins, and the non-redundant protein sets of flies
and worms are about the same size, being only twice that of yeast.
The more complex developmental biology of flies and worms is reflected in the greater
number of signaling pathways in these two species than in yeast.
Over such very large distances, the order of genes and the sequences regulating their
expression are generally not conserved.
At moderate phylogenetic distances (roughly 70–100 million years of divergence), both
functional and nonfunctional DNA are found within the conserved DNA.
In these cases, the functional sequences show a signature of purifying (negative)
selection: they have changed less than the
nonfunctional or neutral DNA (Jukes and Kimura 1984).
8. COMMONLY USED TOOLS
UCSC Browser: This site contains the reference sequence and working draft
assemblies for a large collection of genomes.
Ensembl: The Ensembl project produces genome databases for vertebrates and
other eukaryotic species, and makes this information freely available online.
MapView: The Map Viewer provides a wide variety of genome mapping and
sequencing data.
VISTA: a comprehensive suite of programs and databases for comparative
analysis of genomic sequences. It was built to visualize the results of
comparative analysis based on DNA alignments, and its presentation of
comparative data suits both small- and large-scale data sets.
BlueJay Genome Browser: a stand-alone visualization tool for the multi-scale
viewing of annotated genomes and other genomic elements.
10. HOW ARE GENOMES COMPARED?
Chromosome level
Number of genes
Genome size
Content (sequence)
Location (map position)
Gene order
Gene cluster (genes that are part of a known metabolic pathway are found
to exist as a group)
Translocation: movement of a genomic segment from one position to another
12. Different ways of comparison
Whole genome
– Genome alignments
– Synteny (gene order conservation)
– Anomalous regions
Gene-centric
– Gene families and unique genes
– Gene clustering by function
– Gene sequence variations: codon usage, SNPs, indels, pseudogenes
13. GENOME ALIGNMENT
Alignment of DNA sequences is the core process in
comparative genomics.
An alignment is a mapping of the nucleotides in one
sequence onto the nucleotides in the other sequence,
with gaps introduced into one or the other sequence to
increase the number of positions with matching
nucleotides.
Several powerful alignment algorithms have been
developed to align two or more sequences.
Popular alignment programs such as BLAST and FASTA, or the multiple alignment program ClustalW,
are essentially optimized for the alignment
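The definition of an alignment given above (a mapping between nucleotides, with gaps introduced to increase the number of matching positions) can be sketched with a small global aligner. This is a minimal Needleman-Wunsch dynamic-programming example, not the heuristic used by BLAST, FASTA, or ClustalW; the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align two short sequences; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table needs time and memory proportional to the product of the two sequence lengths, so this exact approach is practical only for short sequences.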
15. GENOME ALIGNMENT
Human PKLR gene region compared to the macaque, dog, mouse, chicken,
and zebrafish genomes.
Numbers on the vertical axis represent the proportion of identical nucleotides
in a 100-bp window for a point on the plot. Numbers on the horizontal axis
indicate the nucleotide position from the beginning of the 12-kilobase human
genomic sequence. Peaks shaded in blue correspond to the PKLR coding regions.
Peaks shaded in light blue correspond to PKLR mRNA untranslated regions. Peaks
shaded in red correspond to conserved non-coding regions (CNSs), defined as
areas where the average identity is > 75%. Alignment was generated using the
sequence comparison tool VISTA (http://pipeline.lbl.gov).
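The quantity plotted in such a figure, the proportion of identical nucleotides in a 100-bp window, can be approximated as follows. This is a simplified sketch that slides a window over the columns of a pairwise alignment; it is not VISTA's actual implementation, and the function names are illustrative. The 75% cutoff is the one quoted in the caption.

```python
def windowed_identity(aln_a: str, aln_b: str, window: int = 100, step: int = 1):
    """Return (window_start, identity) for each window of alignment columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 where both sequences carry the same nucleotide, 0 otherwise (gaps never match)
    matches = [
        1 if x == y and x != "-" else 0
        for x, y in zip(aln_a.upper(), aln_b.upper())
    ]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        identity = sum(matches[start:start + window]) / window
        profile.append((start, identity))
    return profile


def conserved_windows(profile, threshold: float = 0.75):
    """Window starts whose identity exceeds the threshold."""
    return [start for start, identity in profile if identity > threshold]


if __name__ == "__main__":
    human = "ACGT" * 50
    other = human[:100] + "T" * 100   # second half mostly diverged
    prof = windowed_identity(human, other)
    print(prof[0], prof[100], len(conserved_windows(prof)))
```

Whether a conserved window is coding, untranslated, or non-coding still has to come from annotation; the identity profile only shows where the sequences have been preserved.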
16. Notice the high degree of sequence similarity between human and macaque
(two primates) in both PKLR exons (blue) as well as introns (red) and
untranslated regions (light blue) of the gene.
In contrast, the chicken and zebrafish alignments with human only show
similarity to sequences in the coding exons; the rest of the sequence has
diverged to a point where it can no longer be reliably aligned with the human
DNA sequence.
Using such computer-based analysis to zero in on the genomic features that
have been preserved in multiple organisms over millions of years, researchers
are able to locate the signals that represent the location of genes, as well as
sequences that may regulate gene expression.
Indeed, many of the functional parts of the human genome have been
discovered or verified by this type of sequence comparison (Lander et al. 2001),
and it is now a standard component of the analysis of every new genome
sequence.
17. Comparison of overall nucleotide statistics
• Overall nucleotide statistics, such as
– Genome size,
– Overall (G+C) content,
– Regions of different (G+C) content,
– Genome signatures such as codon usage biases,
– Amino acid usage biases, and the ratio of observed to expected dinucleotide frequencies
Together, these present a global view of the similarities and differences between the genomes.
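Two of these statistics are simple enough to sketch directly: overall (G+C) content and the dinucleotide odds ratio, taken here as the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies. The sketch counts a single strand only, which is a simplifying assumption; published genome-signature analyses often combine both strands.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases that are G or C."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / total


def dinucleotide_odds_ratios(seq: str) -> dict:
    """Observed / expected frequency for every dinucleotide present in seq."""
    seq = seq.upper()
    mono = Counter(base for base in seq if base in "ACGT")
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in "ACGT" and seq[i + 1] in "ACGT"
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    ratios = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        ratios[pair] = (count / n_di) / expected
    return ratios


if __name__ == "__main__":
    genome_fragment = "ATGCGCGATATATCGCGGCTA"
    print(gc_content(genome_fragment))
    print(dinucleotide_odds_ratios(genome_fragment))
```

Ratios well below or above 1 mark dinucleotides that are under- or over-represented relative to base composition alone.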
18. SYNTENY
Refers to regions of two genomes that show considerable similarity in terms of
sequence and
conservation of the order of genes,
and that are therefore likely to be related by common descent.
By mapping syntenic regions in corresponding genomes, genome rearrangement
events such as fission, translocation, inversion, and transposition can be identified.
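A toy version of this mapping: given orthologous gene pairs expressed as (gene index in genome A, gene index in genome B), consecutive genes in A whose partners are also consecutive in B form a syntenic block, and a reversed run suggests an inversion. The input format, the function name, and the strict adjacency requirement are all assumptions made for illustration; real synteny tools tolerate missing genes and small local rearrangements.

```python
def syntenic_blocks(pairs, min_genes: int = 2):
    """Group ortholog pairs into runs with conserved (or inverted) gene order.

    pairs: iterable of (index_in_genome_a, index_in_genome_b).
    Returns a list of (genes_in_block, orientation), orientation "+" or "-".
    """
    ordered = sorted(pairs)
    if not ordered:
        return []
    blocks = []
    current = [ordered[0]]
    direction = 0  # 0 = not yet known, +1 = same order, -1 = inverted
    for prev, nxt in zip(ordered, ordered[1:]):
        step_a = nxt[0] - prev[0]
        step_b = nxt[1] - prev[1]
        if step_a == 1 and step_b in (1, -1) and direction in (0, step_b):
            current.append(nxt)
            direction = step_b
        else:
            # Gene order is interrupted here: close the current block.
            blocks.append((current, direction))
            current = [nxt]
            direction = 0
    blocks.append((current, direction))
    return [
        (genes, "+" if d >= 0 else "-")
        for genes, d in blocks
        if len(genes) >= min_genes
    ]


if __name__ == "__main__":
    orthologs = [(0, 0), (1, 1), (2, 2), (3, 7), (4, 6), (5, 5), (6, 3)]
    for genes, orientation in syntenic_blocks(orthologs):
        print(orientation, genes)
```

The positions where one block ends and the next begins are the candidate breakpoints between syntenic regions.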
20. Analysis Of Breakpoints
Once syntenic regions are detected, one can obtain the breakpoints (a.k.a. syntenic boundaries)
between syntenic regions.
Analysis of various genomic features of the breakpoints, such as G+C content, gene density,
and the density of various DNA repeats, provides understanding of the evolution of
genomes.
For instance, Mural et al. observed a sharp discontinuity of features around some syntenic
boundaries but not others.
They hypothesized that syntenic boundaries that do not show sharp transitions in these
various features may provide evidence for conservation of the ancestral pattern in the
lineage.
21. GENE-CENTRIC COMPARISON
Homologs:
Genes that have the same ancestor; in general retain the same function
Orthologs:
Homologs from different species (arise from speciation)
Paralogs:
Homologs from the same species (arise from duplication)
Duplication before speciation (ancient duplication): Out-paralogs; may not
have the same function
Duplication after speciation (recent duplication): In-paralogs; likely to have
the same function
22. GENE CLUSTERS
In prokaryotes, groups of functionally related genes tend to be
located in close proximity to each other, and often in specific order,
as exemplified by operons.
Although gene order conservation beyond the level of operons is
much less prevalent, conservation of clusters and gene order can be
important indicators of function.
Several approaches have been used to determine functionally
related “clusters” of genes.
Overbeek et al. use the constructs of a “pair of close bidirectional
best hits” (PCBBH) and “pairs of close homologs” (PCHs) to
represent pairs of genes that are closely conserved between two
species and likely to be functionally related.
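The idea behind a pair of close bidirectional best hits can be sketched as follows: two genes that lie close together in one genome, whose bidirectional best hits also lie close together in the other genome. The code is a simplified reading of that construct rather than Overbeek et al.'s implementation. The Gene record, the precomputed best-hit dictionaries, and the 300-bp same-strand definition of "close" are assumptions for illustration.

```python
from itertools import combinations
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str


def are_close(g1: Gene, g2: Gene, max_gap: int = 300) -> bool:
    """Same strand and separated by at most max_gap bases (assumed criterion)."""
    if g1.strand != g2.strand:
        return False
    first, second = sorted((g1, g2), key=lambda g: g.start)
    return second.start - first.end <= max_gap


def bidirectional_best_hits(best_a_to_b: dict, best_b_to_a: dict) -> set:
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    return {(a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


def close_bbh_pairs(genes_a, genes_b, best_a_to_b, best_b_to_a, max_gap=300):
    """Pairs of BBHs whose members are close in both genomes."""
    bbh = dict(bidirectional_best_hits(best_a_to_b, best_b_to_a))
    result = []
    for xa, xb in combinations(sorted(bbh), 2):
        ya, yb = bbh[xa], bbh[xb]
        if are_close(genes_a[xa], genes_a[xb], max_gap) and are_close(
            genes_b[ya], genes_b[yb], max_gap
        ):
            result.append(((xa, ya), (xb, yb)))
    return result


if __name__ == "__main__":
    genes_a = {
        "a1": Gene("a1", 0, 900, "+"),
        "a2": Gene("a2", 950, 1800, "+"),
        "a3": Gene("a3", 50000, 51000, "+"),
    }
    genes_b = {
        "b1": Gene("b1", 100, 1000, "-"),
        "b2": Gene("b2", 1100, 2000, "-"),
        "b3": Gene("b3", 9000, 9900, "+"),
    }
    hits_ab = {"a1": "b1", "a2": "b2", "a3": "b3"}
    hits_ba = {"b1": "a1", "b2": "a2", "b3": "a1"}
    print(close_bbh_pairs(genes_a, genes_b, hits_ab, hits_ba))
```

In practice the best-hit dictionaries would come from an all-against-all similarity search between the two protein sets.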
23. COGs
Clusters of Orthologous Groups (of proteins).
Groups of three or more orthologous genes,
meaning they are direct evolutionary counterparts and are considered to be part of an 'ancient conserved domain'.
A COG is defined as three or more proteins from the genomes of distant species that are more similar to each other than
to any other protein within the individual genomes.
COGs can be used to predict the function of homologous proteins in poorly studied species and can also be used to track
evolutionary divergence from a common ancestor,
hence providing a powerful tool for functional annotation of uncharacterized proteins.
Important in comparative genomics studies.
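The "three or more proteins from distant species that are more similar to each other than to any other protein" criterion can be pictured as a triangle of mutual best hits spanning three genomes. The sketch below only finds such triangles in a list of best-hit edges. It is a simplification for illustration, not the actual COG construction procedure, and the genome and protein labels are hypothetical.

```python
from itertools import combinations


def find_triangles(best_hit_edges):
    """Find triples of proteins from three different genomes that are all linked.

    best_hit_edges: iterable of ((genome, protein), (genome, protein)) pairs,
    each representing a mutual best hit between proteins of two genomes.
    """
    neighbours = {}
    for u, v in best_hit_edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    triangles = set()
    # Brute force over all triples: fine for a toy example, too slow for real data.
    for u, v, w in combinations(sorted(neighbours), 3):
        linked = v in neighbours[u] and w in neighbours[u] and w in neighbours[v]
        three_genomes = len({u[0], v[0], w[0]}) == 3
        if linked and three_genomes:
            triangles.add((u, v, w))
    return triangles


if __name__ == "__main__":
    edges = [
        (("Eco", "p1"), ("Bsu", "q1")),
        (("Bsu", "q1"), ("Mja", "r1")),
        (("Eco", "p1"), ("Mja", "r1")),
        (("Eco", "p2"), ("Bsu", "q2")),  # no third partner, so no triangle
    ]
    print(find_triangles(edges))
```

A protein with no characterized function that falls into such a group can then tentatively inherit the annotation of the other members.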
24. Application of COG
The most straightforward application of the COGs is the prediction of functions of individual
proteins or protein sets, including those from newly completed genomes.
COG database
NCBI provides a COG database that consists of 4,873 COGs comprising over 13,600
proteins from the genomes of 50 bacteria, 13 archaea and 3 unicellular eukaryotes. This
database uses completely sequenced genomes to classify proteins using the orthology concept.
25. MBGD
MBGD is a database for comparative analysis
of completely sequenced microbial genomes,
the number of which is now growing rapidly.
The aim of MBGD is to facilitate comparative
genomics from various points of view, such as
ortholog identification, paralog clustering,
motif analysis and gene order comparisons.
26. COMPARATIVE ANALYSIS OF CODING
REGIONS
Comparative analysis of coding regions typically involves the identification of gene-coding regions,
comparison of gene content, and comparison of protein content.
Recently there have also been a number of algorithms developed that
use comparative genomics to aid function prediction of genes.
The analysis and comparison of the coding regions starts with, and is
very dependent upon, the gene identification algorithm that is used to
infer what portions of the genomic sequence actively code for genes.
27. A combination of multiple gene identification approaches is often used in large-scale analysis to
improve the overall accuracy.
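As a minimal example of one gene identification approach, the sketch below scans all six reading frames for open reading frames (an ATG followed in frame by a stop codon). It assumes the standard genetic code and ignores introns, alternative start codons, and every statistical signal that real gene finders use, so it illustrates the idea rather than a usable predictor. The min_codons default is arbitrary.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30):
    """Return (strand, start, end) for each ORF with at least min_codons codons.

    Coordinates are 0-based, end-exclusive, and for the "-" strand they refer
    to positions on the reverse-complemented sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print(find_orfs(toy, min_codons=3))
```

Combining such a scan with homology evidence from related genomes is one way the accuracy of the final gene set can be improved.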
28. COMPARATIVE ANALYSIS OF NON CODING
REGIONS
Noncoding regions of the genome, which may comprise as much as 97%
of the genome length, as in the human genome, have gained a lot of
attention in recent years because of their predicted role in the regulation of
transcription, DNA replication, and other biological functions.
However, identification of regulatory elements from the noncoding
portion of a genome remains a challenge.
Comparative genomics has been used to greatly aid the identification of
regulatory segments by comparing the genomic noncoding DNA
sequences from diverse species to identify conserved regions.
This approach is based on the presumption that selective pressure
causes regulatory elements to evolve at a slower rate than
nonregulatory sequences in the noncoding regions.
29. ANALYSIS OF MUTATIONS
Search and display of mutations within multiple alignments, with
discrimination between intergenic, synonymous, non-synonymous
and Indel mutations.
Additional filtering based on SNP quality scores.
Display colors based on mutation type or quality; sorting based on
position, gene, NA change, AA change, quality
Direct clustering based upon mutations or export of mutation list
for further analysis.
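The discrimination between synonymous, non-synonymous, and indel mutations can be sketched for a pair of aligned, in-frame coding sequences. The example assumes the standard genetic code and unambiguous bases, treats any codon containing a gap character as an indel, and does not handle intergenic positions. The function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    if "-" in ref_codon or "-" in alt_codon:
        return "indel"
    if ref_codon == alt_codon:
        return "identical"
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"


def classify_mutations(ref_cds: str, alt_cds: str):
    """Compare two aligned coding sequences codon by codon.

    Returns (codon_index, ref_codon, alt_codon, kind) for every changed codon.
    """
    ref_cds, alt_cds = ref_cds.upper(), alt_cds.upper()
    if len(ref_cds) != len(alt_cds):
        raise ValueError("sequences must be aligned to the same length")
    calls = []
    for i in range(0, len(ref_cds) - 2, 3):
        ref_codon, alt_codon = ref_cds[i:i + 3], alt_cds[i:i + 3]
        kind = classify_codon_change(ref_codon, alt_codon)
        if kind != "identical":
            calls.append((i // 3, ref_codon, alt_codon, kind))
    return calls


if __name__ == "__main__":
    for call in classify_mutations("ATGCTTGAA", "ATGCTCGAT"):
        print(call)
```

The resulting list can be sorted or filtered by position or mutation type, in the spirit of the display options listed above.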
30. PSEUDOGENES
Nonfunctional protein-coding genes
Mutations introduce “sequence problems” (frameshifts, stop in frame, absence of stop)
“Normal” bacterial genomes have 1-5% of pseudogenes [Liu et al]
Pseudogenes can give interesting clues to evolutionary pathways
High fractions of pseudogenes suggest a “genome degradation” process
May be cause or effect of niche restriction
Examples
Mycobacterium leprae: 36% (~1,100 genes)
Leifsonia xyli subsp. xyli: 13% (~300 genes)
Pseudogenes are usually absent from predicted protein sets, so they do not show up in BLAST searches against protein databases
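The "sequence problems" listed on this slide can be checked mechanically on a candidate coding sequence. The sketch below is a crude first-pass screen, not a pseudogene predictor: it flags a length that is not a multiple of three (a possible frameshift), a stop codon inside the reading frame, and a missing terminal stop. It assumes the standard stop codons and a sequence that begins at the annotated start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def sequence_problems(cds: str) -> list:
    """Return a list of problems found in a candidate coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        # A length that breaks the codon grid suggests an insertion or deletion.
        problems.append("frameshift")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in-frame stop")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop")
    return problems


if __name__ == "__main__":
    for candidate in ("ATGGCTTAA", "ATGTAAGCTTAA", "ATGGCTGC"):
        print(candidate, sequence_problems(candidate))
```

Confirming a pseudogene would normally also require comparison with an intact ortholog from a related genome.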
31. APPLICATIONS
Gene identification
Comparative genomics can aid gene identification: it can recognize real
genes based on their patterns of nucleotide conservation across evolutionary time. With the
availability of genome-wide alignments across the genomes compared, the different ways by
which sequences change in known genes and in intergenic regions can be analyzed. The
alignments of known genes will reveal the conservation of the reading frame of protein
translation.
Regulatory motif discovery
Regulatory motifs are short DNA sequences, about 6 to 15 bp long, that are used to control the
expression of genes, dictating the conditions under which a gene will be turned on or off. Each
motif is typically recognized by a specific DNA-binding protein called a transcription factor (TF).
A transcription factor binds precise sites in the promoter region of target genes in a sequence-
specific way, but this contact can tolerate some degree of sequence variation. Comparative
genomics provides a powerful way to distinguish regulatory motifs from non-functional patterns
based on their conservation.
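A small sketch of conservation-based motif filtering: scan each promoter for sites that match a motif with a limited number of mismatches (reflecting the tolerated sequence variation), then keep only positions where a site occurs in every species. The sketch assumes the promoters are already aligned without gaps so that positions are comparable. The motif, the species labels, and the one-mismatch tolerance are arbitrary illustrative choices.

```python
def motif_sites(seq: str, motif: str, max_mismatches: int = 1) -> list:
    """Start positions where seq matches motif with at most max_mismatches."""
    seq, motif = seq.upper(), motif.upper()
    k = len(motif)
    sites = []
    for i in range(len(seq) - k + 1):
        mismatches = sum(1 for a, b in zip(seq[i:i + k], motif) if a != b)
        if mismatches <= max_mismatches:
            sites.append(i)
    return sites


def conserved_sites(aligned_promoters: dict, motif: str, max_mismatches: int = 1) -> list:
    """Positions carrying a motif match in every aligned promoter."""
    site_sets = [
        set(motif_sites(seq, motif, max_mismatches))
        for seq in aligned_promoters.values()
    ]
    if not site_sets:
        return []
    return sorted(set.intersection(*site_sets))


if __name__ == "__main__":
    promoters = {
        "species_1": "AAAATGACTCAGGGG",
        "species_2": "CCAATGACTCTGGCG",
        "species_3": "AAGATGACTCACCTT",
    }
    print(conserved_sites(promoters, "TGACTCA"))
```

A match found in only one genome is more likely to be a chance occurrence, while one preserved at the same position across species is a stronger candidate for a functional site.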
32. APPLICATIONS
Comparative genomics has wide applications in the fields of molecular
medicine and molecular evolution. The most significant application of
comparative genomics in molecular medicine is the identification of drug
targets for many infectious diseases. For example, comparative analyses of
fungal genomes have led to the identification of many putative targets for
novel antifungals. This discovery can aid target-based drug design to cure
fungal diseases in humans.
Comparative genomics also helps in the clustering of regulatory sites, which
can help in the recognition of unknown regulatory regions in other genomes.
The regulation of metabolic pathways can also be recognized by means of
comparative genomics.
Agriculture is a field that reaps the benefits of comparative genomics.
Identifying the loci of advantageous genes is a key step in breeding crops
that are optimized for greater yield, cost-efficiency, quality, and disease
resistance.