6. Genome annotation
•Annotation : Obtaining biological information from
unprocessed sequence data
•Structural annotation : Identification of genes and
other important sequence elements
•Functional annotation : The determination of the functional
roles of genes in the organism
7. Genome annotation
•Raw genomic sequence can be annotated by:
i. Comparison with databases of previously cloned genes and ESTs
ii. Gene prediction based on consensus features such as
Promoters
Splice sites
Polyadenylation sites and
ORFs
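A minimal sketch of how consensus features might be located by simple pattern matching. The patterns are illustrative assumptions (the canonical AATAAA polyadenylation signal, the GT/AG dinucleotides at intron ends, and the ATG start codon). The function name scan_signals is hypothetical, and real gene finders use weighted models rather than exact matches.

```python
import re

# Illustrative consensus patterns (assumptions for this sketch, DNA alphabet).
SIGNALS = {
    "start_codon": "ATG",
    "splice_donor": "GT",       # 5' end of most introns
    "splice_acceptor": "AG",    # 3' end of most introns
    "polyA_signal": "AATAAA",   # canonical polyadenylation signal
}

def scan_signals(seq):
    """Return {signal name: [0-based start positions]} on the given strand."""
    seq = seq.upper()
    hits = {}
    for name, motif in SIGNALS.items():
        # The lookahead lets overlapping occurrences be reported as well.
        hits[name] = [m.start() for m in re.finditer(f"(?={motif})", seq)]
    return hits

if __name__ == "__main__":
    for name, positions in scan_signals("CCATGGTAAGTTTCAGGAATAAACC").items():
        print(name, positions)
```

Short motifs such as GT and AG occur very often by chance, which is one reason signal matching alone is not enough to call a gene.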
8. Gene identification
Gene finding in eukaryotes is difficult because genes make up
only a small fraction of the genome
Genome              Fraction of genome made up of genes
Bacterial genome    80-85%
Yeast               70%
Fruit fly           25%
Human genome        3-5%
In the human genome,
Typical exon = 150 bp
Intron = Several kb
Complete gene = Hundreds of kb
9. ORF prediction
•Three reading frames are possible on each strand of a
DNA sequence, so a “six-frame translation” reads all of them
- The result is 6 potential protein sequences
- The longest frame uninterrupted by a stop codon is usually
taken to be the correct one
•Finding the end of an ORF is easier than finding its beginning
The beginning can be found using,
- Start codon
- Kozak sequence (CCGCCAUGG) flanking the start codon
- CpG islands
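The six-frame translation described on this slide can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and a plain DNA string. The function names are hypothetical. For simplicity it reports the longest stop-free stretch and does not check for a start codon or Kozak context.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame_translation(seq):
    """Translate all six frames; keys are (strand, offset), '*' marks a stop."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = (s[i:i + 3] for i in range(offset, len(s) - 2, 3))
            # Codons containing ambiguous bases are translated as 'X'.
            frames[(strand, offset)] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames

def longest_open_stretch(frames):
    """Return (frame, peptide) for the longest stretch without a stop codon."""
    best = (None, "")
    for frame, protein in frames.items():
        for segment in protein.split("*"):
            if len(segment) > len(best[1]):
                best = (frame, segment)
    return best

if __name__ == "__main__":
    frames = six_frame_translation("ATGGCCAAATAG")
    for frame, protein in frames.items():
        print(frame, protein)
    print(longest_open_stretch(frames))
```

On very short sequences the longest stop-free stretch can easily fall in the wrong frame, so the "longest ORF" rule is only a heuristic and works best on long sequences.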
10. Software programs for gene identification
•Advantage : Speed – annotation can be carried out
concurrently with sequencing itself.
•Disadvantage : Accuracy
•Two strategies used are:
- Homology searching
- ab initio prediction
15. ab initio prediction
Programs grouped by type of algorithm,
GRAIL – Based on neural networks
- Predicts exons, genes, promoters, polyA sites, CpG islands,
EST similarities, repetitive elements
GeneFinder – Rule-based system
GENSCAN, GENIE, HMMGene, GeneMarkHMM, FGENEH
– Hidden Markov model
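To give a feel for the hidden Markov model approach, here is a toy two-state HMM decoded with the Viterbi algorithm. The states, probabilities and function name are all made-up assumptions for illustration. Programs such as GENSCAN use far richer models with states for exons, introns and signals.

```python
import math

# Toy model: every number below is an illustrative assumption.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
}

def viterbi(seq):
    """Return the most probable state path for a non-empty ACGT string."""
    seq = seq.upper()
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

if __name__ == "__main__":
    print(viterbi("ATATATATATGCGCGCGCGC"))
```

The transition probabilities make switching states costly, so the model labels whole regions rather than flipping state at every base.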
17. ab initio prediction
1. Feature-dependent methods,
Features of eukaryotic genes that are recognized include,
- Control signals such as the TATA box, cap site, Kozak consensus
and polyadenylation sites
HEXON and MZEF are gene prediction programs that predict
only a single feature, the exon.
2. A few programs depend on differences in base composition
between coding and non-coding DNA
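One simple way to look at base composition is GC content in sliding windows, sketched below. The window and step sizes and the function names are arbitrary assumptions. Composition-based programs typically use more detailed statistics, such as codon or hexamer frequencies.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def windowed_gc(seq, window=100, step=50):
    """Return [(window start, GC fraction)] for full windows along seq."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

if __name__ == "__main__":
    for start, gc in windowed_gc("AAAAGGGG", window=4, step=4):
        print(start, gc)
```

Windows whose composition departs sharply from the genome average are candidates for closer inspection, not confirmed genes.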
18. ab initio prediction
Accuracy problem – Algorithms are not 100% accurate
Errors include
- Incorrect calling of exon boundaries
- Missed exons
- Failure to detect entire genes
Solution:
Running several different programs on the same genome and
comparing their predictions
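One way to combine the output of several programs is a per-base vote, sketched here. The input format (half-open start/end intervals per program), the min_votes threshold and the function name are assumptions for illustration, not the method of any particular pipeline.

```python
def consensus_exons(predictions, min_votes=2):
    """Keep regions predicted as exon by at least min_votes programs.

    predictions: one list per program of (start, end) half-open intervals.
    """
    length = max((end for exons in predictions for _, end in exons), default=0)
    votes = [0] * length
    for exons in predictions:
        covered = set()
        for start, end in exons:
            covered.update(range(start, end))
        for pos in covered:          # each program votes once per position
            votes[pos] += 1
    # Collapse consecutive supported positions back into intervals.
    merged, run_start = [], None
    for pos, count in enumerate(votes):
        if count >= min_votes and run_start is None:
            run_start = pos
        elif count < min_votes and run_start is not None:
            merged.append((run_start, pos))
            run_start = None
    if run_start is not None:
        merged.append((run_start, length))
    return merged

if __name__ == "__main__":
    program_a = [(10, 20), (50, 60)]
    program_b = [(12, 22)]
    program_c = [(11, 19), (55, 65)]
    print(consensus_exons([program_a, program_b, program_c]))
```

A stricter threshold reduces false exons but increases missed ones, so the choice of min_votes trades one kind of error for the other.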
19. Homology searching
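•Finding genes in long sequences by looking for matches with
sequences that are known to be transcribed, e.g. a cDNA, an EST or
a known gene
Programs used are based on BLAST (Basic Local Alignment Search
Tool),
BLASTN (nucleotide query against nucleotide database)
BLASTX (translated nucleotide query against protein database)
BLASTP (protein query against protein database) etc.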
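The idea of finding a known transcribed sequence inside a longer genomic sequence can be illustrated with a local alignment. Note that this sketch uses Smith-Waterman dynamic programming, not BLAST itself, which relies on a faster seed-and-extend heuristic. The scoring values and function name are assumptions.

```python
def smith_waterman(query, subject, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, (query end, subject end)).

    End positions are 1-based indices of the last aligned characters.
    """
    rows, cols = len(query) + 1, len(subject) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if query[i - 1] == subject[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos

if __name__ == "__main__":
    est = "GATTACA"                 # stand-in for a short EST
    genomic = "TTTGATTACATTT"       # stand-in for a genomic region
    print(smith_waterman(est, genomic))
```

The full dynamic programming table grows with the product of the two sequence lengths, which is why database searches over whole genomes rely on heuristics such as BLAST.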
20. Homology searching or ab initio?
•Algorithms that take similarity data into account are better at
gene prediction – Reese et al. (2000), Fortna et al. (2001)
The latest gene prediction algorithms combine similarity data with
ab initio methods
Examples : Grail/Exp,
GenieEST,
GenomeScan
tRNAscan-SE : For tRNA identification