GENE PREDICTION
VIJAY
JRF
GIT, Bengaluru
•Automated sequencing of genomes requires automated gene assignment
•Includes detection of open reading frames (ORFs)
•Identification of introns and exons
•Gene prediction is a very difficult problem in pattern recognition
•Coding regions generally do not have conserved sequences
•Much progress has been made with prokaryotic gene prediction
•Eukaryotic genes are more difficult to predict correctly
Ab initio methods
•Predict genes on given sequence alone
•Uses gene signals
•Start/stop codon
•Intron splice sites
•Transcription factor binding sites, ribosomal binding sites
•Poly-A sites
•Codon structure demands a length that is a multiple of three nucleotides
•Gene content
•Nucleotide composition – use HMMs
Homology-based methods
•Matches to known genes
•Matches to cDNA
Consensus based
•Uses output from more than one program
Prokaryotic gene structure
•ATG (GTG or TTG less frequent) is start codon
•Ribosome binding site (Shine-Dalgarno sequence)
complementary to 16S rRNA of ribosome
•AGGAGGT
•Stop codon (TAA, TAG or TGA)
•Transcription termination site (ρ-independent termination)
•Stem-loop secondary structure followed by string of Ts
•Translate sequence into 6 reading frames
•In random sequence a stop codon occurs about once every 20 codons
•Look for frames longer than 30 codons (normally 50-60 codons)
•Presence of start codon and Shine-Dalgarno sequence
•Translate putative ORF into protein, and search databases
•Non-randomness of 3rd base of codon, more frequently G/C
•Plotting wobble base GC% can identify ORFs
•Nucleotides at the 3rd codon position also tend to repeat, so repetition patterns give a clue to gene location
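The ORF scan described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`find_orfs`, `gc3`, `reverse_complement`) are hypothetical, the standard stop codons TAA/TAG/TGA are assumed, and the Shine-Dalgarno check and database search are left out.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a production gene finder).
# All names below are hypothetical helpers introduced for this example.
STOPS = {"TAA", "TAG", "TGA"}    # standard stop codons (assumed)
STARTS = {"ATG", "GTG", "TTG"}   # ATG, with the less frequent GTG and TTG
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc3(coding):
    """Fraction of G or C at the third (wobble) position of each codon."""
    third = coding[2::3]
    return sum(base in "GC" for base in third) / len(third) if third else 0.0


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for ORFs of at least min_codons codons.

    Returns tuples (strand, frame, start, end, gc3). Coordinates are 0-based
    and end-exclusive on the scanned strand; end includes the stop codon.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon in STARTS:
                    orf_start = i                      # first start codon opens the ORF
                elif orf_start is not None and codon in STOPS:
                    n_codons = (i - orf_start) // 3    # codons before the stop
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, orf_start, i + 3,
                                     gc3(s[orf_start:i])))
                    orf_start = None
    return orfs
```

A call such as `find_orfs(genome_sequence, min_codons=50)` would apply the stricter 50-codon cutoff; the reported wobble-base GC fraction can then be compared with the genome-wide GC content.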
Markov chains and HMMs
• In a kth-order Markov chain, each position depends on the k previous positions
• The higher the order of a Markov model to describe a gene, the more non-randomness the model includes
• Genes described in codons or hexamers
• HMMs trained with known genes
• Codon pairs are often found, thus 6-nucleotide patterns often occur in ORFs – 5th-order Markov chain
• 5th-order HMM gives very accurate gene predictions
• Problem may be that in short genes there are not enough hexamers
• Interpolated Markov Model (IMM) samples Markov chains of different lengths. Weighting scheme places less weight on rare k-mers
• Final probability is the combined probability of all weighted k-mers
• Typical and atypical genes
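To make the idea of a kth-order Markov chain concrete, here is a minimal Python sketch. It assumes a fixed order with add-one smoothing rather than the interpolated weighting used by IMMs, and the class and function names (`MarkovChain`, `log_odds`) are hypothetical. A sequence is scored by the log-odds between a model trained on known genes and a model trained on non-coding background.

```python
import math
from collections import defaultdict


class MarkovChain:
    """Fixed-order Markov chain over A/C/G/T with add-one smoothing (sketch)."""

    def __init__(self, order):
        self.order = order
        # counts[context][next_base] = number of times next_base followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        k = self.order
        for seq in sequences:
            for i in range(k, len(seq)):
                self.counts[seq[i - k:i]][seq[i]] += 1

    def log_prob(self, seq):
        """Log-probability of seq given the model (first k bases are not scored)."""
        k = self.order
        total = 0.0
        for i in range(k, len(seq)):
            context = self.counts.get(seq[i - k:i], {})
            seen = sum(context.values())
            # add-one smoothing over the four possible next bases
            total += math.log((context.get(seq[i], 0) + 1) / (seen + 4))
        return total


def log_odds(seq, coding_model, background_model):
    """Positive values mean seq looks more like the coding training set."""
    return coding_model.log_prob(seq) - background_model.log_prob(seq)
```

With `MarkovChain(5)` each base is conditioned on the preceding five, which corresponds to the hexamer statistics mentioned above. Short training sets leave most hexamer contexts unseen, which is the limitation that interpolated models are meant to address.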
GeneMark (http://exon.gatech.edu/genemark/)
Trained on complete microbial genomes
Most closely related organism used for predictions
Glimmer (Gene Locator and Interpolated Markov ModelER)
(http://www.cbcb.umd.edu/software/glimmer/)
FGENESB (http://linux1.softberry.com/)
5th-order HMM
Trained with bacterial sequences
Linear discriminant analysis (LDA)
RBSFinder (ftp://ftp.tigr.org)
Takes output from Glimmer and searches for S-D sequences close to start sites
Performance evaluation
•Sensitivity Sn = TP/(TP+FN)
•Specificity Sp = TP/(TP+FP)
•Correlation coefficient CC = (TP·TN − FP·FN) / √[(TP+FP)(TN+FN)(TP+FN)(TN+FP)]
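As a worked example of these measures, the sketch below computes Sn, Sp and CC from raw counts. The function name `prediction_metrics` is hypothetical, and CC is computed with the standard four-factor (Matthews) denominator, which is an assumption about the intended formula.

```python
import math


def prediction_metrics(tp, tn, fp, fn):
    """Return (Sn, Sp, CC) for counts of true/false positives and negatives.

    Sp follows the gene-prediction convention TP/(TP+FP).
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Hypothetical counts for illustration only: 90 coding nucleotides found,
# 20 missed, 10 non-coding nucleotides wrongly called coding.
sn, sp, cc = prediction_metrics(tp=90, tn=880, fp=10, fn=20)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```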
Gene prediction in Eukaryotes
Low gene density (3% in humans)
Space between genes is very large, with multiply repeated sequences and transposable elements
Eukaryotic genes are split (introns/exons)
Transcript is capped (methylation of 5’ residue)
Splicing in spliceosome
Alternative splicing
Polyadenylation (~250 As added) downstream of CAATAAA(T/C) consensus box
Major issue: identification of splice sites
GT-AG rule (GTAAGT/Y12NCAG 5’/3’ intron splice junctions)
Codon use frequencies
ATG start codon
Kozak sequence (CCGCCATGG)
•Ab initio programs
•Gene signals
•Start/stop
•Putative splice signals
•Consensus sequences
•Poly-A sites
•Gene content
•Coding statistics
•Non-random nucleotide distributions
•Hexamer frequencies
•HMMs
Discriminant analysis
•Plot 2D graph of coding length versus 3’ splice site
•Place diagonal line (LDA) that separates true coding from non-coding sequences based on learnt knowledge
•QDA fits quadratic curve
•FGENES uses LDA
•MZEF (Michael Zhang’s Exon Finder) uses QDA
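The separating line placed by LDA can be illustrated with a small two-feature example. The sketch below implements Fisher's linear discriminant in plain Python. The feature pairs (coding length, 3’ splice-site score) are made-up toy values, and the function names (`fit_lda`, `is_coding`) are hypothetical. It is not the procedure used inside FGENES.

```python
def mean2(points):
    """Mean of a list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_lda(coding, noncoding):
    """Fit a two-class Fisher LDA in 2D; return (weights, threshold)."""
    m1, m0 = mean2(coding), mean2(noncoding)
    # pooled within-class covariance (symmetric 2x2 matrix)
    sxx = sxy = syy = 0.0
    for points, m in ((coding, m1), (noncoding, m0)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    dof = len(coding) + len(noncoding) - 2
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix
    d0, d1 = m1[0] - m0[0], m1[1] - m0[1]
    w = ((syy * d0 - sxy * d1) / det, (-sxy * d0 + sxx * d1) / det)
    # threshold at the midpoint of the class means (equal priors assumed)
    threshold = w[0] * (m1[0] + m0[0]) / 2 + w[1] * (m1[1] + m0[1]) / 2
    return w, threshold


def is_coding(point, w, threshold):
    """True if the point falls on the coding side of the discriminant line."""
    return w[0] * point[0] + w[1] * point[1] > threshold


# Toy training data (hypothetical): (coding length in nt, 3' splice-site score)
CODING = [(150, 8.0), (200, 9.0), (180, 7.5), (220, 8.5)]
NONCODING = [(60, 3.0), (90, 4.0), (50, 2.5), (80, 3.5)]
w, threshold = fit_lda(CODING, NONCODING)
print(is_coding((190, 8.0), w, threshold), is_coding((70, 3.0), w, threshold))
```

QDA differs in that each class keeps its own covariance matrix, which turns the straight boundary into a quadratic curve.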
Neural Nets
•A series of input, hidden and output layers
•Gene structure information is fed to input layer, and is separated into several classes
•Hexamer frequencies
•Splice sites
•GC composition
•Weights are calculated in the hidden layer to generate output of exon
•When input layer is challenged with new sequence, the rules that were generated to output exon are applied to new sequence
HMMs
•GENSCAN (http://genes.mit.edu/GENSCAN.html)
•5th-order HMM
•Combines hexamer frequencies with coding signals
•Initiation codons
•TATA boxes
•CAP site
•Poly-A
•Parameter sets trained for vertebrate, Arabidopsis and maize sequences
•Extensively used in human genome project
•HMMgene (http://www.cbs.dtu.dk/services/HMMgene)
•Identifies subregions of exons from cDNA or protein matches
•Locks such regions and uses HMM extension into neighboring regions
Homology based programs
•Uses translations to search for ESTs, cDNAs and proteins in databases
•GenomeScan (http://genes.mit.edu/genomescan.html)
•Combines GENSCAN with BLASTX
•EST2Genome (http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html)
•Compares EST and cDNA to user sequence
•TwinScan
•Similar to GenomeScan
Consensus-based programs
•Uses several different programs to generate lists of predicted exons
•Only common predicted exons are retained
•GeneComber (http://www.bioinformatics.ubc.ca/gencombver/index.php)
•Combines HMMgene with GENSCAN
•DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
•Combines FGENESH, GENSCAN and HMMgene
Accuracy
            Nucleotide level     Exon level
Program     Sn    Sp    CC       Sn    Sp    (Sn+Sp)/2  ME    WE
FGENES      0.86  0.88  0.83     0.67  0.67  0.67       0.12  0.09
GeneMark    0.87  0.89  0.83     0.53  0.54  0.54       0.13  0.11
Genie       0.91  0.90  0.88     0.71  0.70  0.71       0.19  0.11
GENSCAN     0.95  0.90  0.91     0.71  0.70  0.70       0.08  0.09
HMMgene     0.93  0.93  0.91     0.76  0.77  0.76       0.12  0.07
Morgan      0.75  0.74  0.74     0.46  0.41  0.43       0.20  0.28
MZEF        0.70  0.73  0.66     0.58  0.59  0.59       0.32  0.23
ME = missed exons, WE = wrong exons
Chapter 9
Promoter and regulatory element prediction
•Promoters are short regions upstream of transcription start site
•Contain short (6-8 nt) transcription factor recognition sites
•Extremely laborious to define by experiment
•Sequence is not translated into protein, so no homology matching is possible
•Each promoter is unique with a unique combination of factor binding sites – thus no consensus promoter
Prokaryotic gene
[Figure: prokaryotic gene with a TF bound to its TF site, the -35 and -10 boxes, polymerase, and the downstream ORF]
•σ70 factor binds to the -35 and -10 boxes and recruits the full polymerase enzyme
•-35 box consensus sequence: TTGACA
•-10 box consensus sequence: TATAAT
•Transcription factors activate or repress transcription
•Bind to regulatory elements
•DNA loops to allow long-distance interactions
Eukaryotic gene structure
[Figure: eukaryotic gene with upstream TF sites, TATA box, Inr and Pol II]
Polymerase I, II and III
Basal transcription factors (TFIID, TFIIA, TFIIB, etc.)
TATA box (TATA(A/T)A(A/T))
“Housekeeping” genes often do not contain TATA boxes
Initiator site (Inr) (C/T)(C/T)CA(C/T)(C/T) coincides with transcription start
Many TF sites
Activation/repression
Ab initio methods
•Promoter signals
•TATA boxes
•Hexamer frequencies
•Consensus sequence matching
•PSSM (position-specific scoring matrix)
•Numerous false positives (FPs)
•HMMs incorporate neighboring information
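Scoring with a PSSM can be sketched as follows. This is an illustrative sketch: the aligned sites are invented variants of the -10 box consensus TATAAT, a uniform background of 0.25 per base is assumed, and the function names (`build_pssm`, `score`, `scan`) are hypothetical.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per column) from equal-length aligned sites."""
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pssm


def score(pssm, window):
    """Sum of per-position log-odds scores for a window as long as the PSSM."""
    return sum(column[base] for column, base in zip(pssm, window))


def scan(pssm, seq, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(pssm, seq[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits


# Hypothetical aligned binding sites, for illustration only
SITES = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT", "TATAAT"]
pssm = build_pssm(SITES)
print(scan(pssm, "GGCGCTATAATGCGC", threshold=0.0))
```

Lowering the threshold returns more candidate sites, which illustrates why short motifs scanned against long sequences produce numerous false positives.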
Promoter prediction in prokaryotes
•Find operon
•Upstream of first gene is promoter
•Wang rules (distance between genes, no ρ-independent termination, number of genomes that display linkage)
•BPROM (http://www.softberry.com)
•Based on arbitrary setting of operon gene distances
•200 bp upstream of first gene
•Many FPs
•FindTerm (http://sun1.softberry.com)
•Searches for ρ-independent termination signals
Prediction in eukaryotes
• Searching for consensus sequences in databases (TRANSFAC)
• Increase specificity by searching for CpG islands
• High density of transcription factor binding sites
• CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html)
• CG% in moving window
• Eponine (http://servlet.sanger.ac.uk:8080/eponine/ )
• Matches TATA box, CCAAT box, CpG island to PSSM
• Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html)
• Detects high concentrations of TF sites
• FirstEF (http://rulai.cshl.org/tools/FirstEF/)
• QDA of first exon boundary
• McPromoter (http://genes.mit.edu/McPromoter.html)
• Neural net of DNA bendability, TATA box, initiator box
• Trained for Drosophila and human sequences
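The moving-window CG% calculation mentioned for CpG island detection can be sketched like this. The thresholds (GC fraction of at least 0.5, observed/expected CpG ratio of at least 0.6, 200 nt window) are commonly quoted CpG island criteria and are assumptions here, not the parameters of CpGProD. The function names are hypothetical.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA window."""
    n = len(window)
    c, g = window.count("C"), window.count("G")
    cpg = window.count("CG")                 # observed CpG dinucleotides
    gc_fraction = (c + g) / n
    # expected CpG count under independence is c * g / n
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def cpg_windows(seq, window=200, step=50, min_gc=0.5, min_obs_exp=0.6):
    """Slide a window along seq and report windows that look CpG-island-like.

    Returns tuples (start, end, gc_fraction, obs_exp) with 0-based,
    end-exclusive coordinates.
    """
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc >= min_gc and ratio >= min_obs_exp:
            hits.append((start, start + window, gc, ratio))
    return hits
```

Overlapping hits would normally be merged into a single candidate island before looking for promoter signals nearby.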
Phylogenetic footprinting technique
•Identify conserved regulatory sites
•Human-chimpanzee too close
•Human-fish too distant
•Human-mouse appropriate
•ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite)
•Aligns two sequences by global alignment algorithm
•Identifies conserved regions and compares them to the TRANSFAC database
•High-scoring hits returned as positives
•rVISTA (http://rvista.dcode.org)
•Identifies TRANSFAC sites in two orthologous sequences
•Aligns sequences with local alignment algorithm
•Highest identity regions returned as hits
•Bayes aligner (http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl)
•Aligns two sequences with Bayesian algorithm
•Even weakly conserved regions identified
Expression-profiling based method
Microarray analyses allow identification of co-regulated genes
Assume that promoters contain similar regulatory sites
Find such sites by EM and Gibbs sampling using iteration of PSSM
Co-expressed genes may be regulated at higher levels
MEME (http://meme.sdsc.edu/meme/website/meme-intro.html)
AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl)
Gibbs sampling algorithm

Gene prediction methods vijay

  • 2. •Automated sequencing of genomes requires automated gene assignment •Includes detection of open reading frames (ORFs) •Identification of the introns and exons •Gene prediction is a very difficult problem in pattern recognition •Coding regions generally do not have conserved sequences •Much progress has been made with prokaryotic gene prediction •Eukaryotic genes are more difficult to predict correctly
  • 3. Ab initio methods •Predict genes from the given sequence alone •Use gene signals •Start/stop codons •Intron splice sites •Transcription factor binding sites •Ribosomal binding sites •Poly-A sites •Codons demand a multiple of three nucleotides •Gene content •Nucleotide composition – use HMMs Homology-based methods •Matches to known genes •Matches to cDNA Consensus-based methods •Use output from more than one program
  • 4. Prokaryotic gene structure •ATG (GTG or TTG less frequent) is start codon •Ribosome binding site (Shine-Dalgarno sequence) complementary to 16S rRNA of ribosome •AGGAGGT •TAA, TAG or TGA stop codon •Transcription termination site (ρ-independent termination) •Stem-loop secondary structure followed by string of Ts
  • 5. •Translate sequence into 6 reading frames •A stop codon occurs randomly about every 20 codons •Look for a frame longer than 30 codons (normally 50-60 codons) •Presence of start codon and Shine-Dalgarno sequence •Translate putative ORF into protein, and search databases •Non-randomness of 3rd base of codon, more frequently G/C •Plotting wobble base GC% can identify ORFs •3rd base also repeats, thus repetition gives a clue to gene location
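The six-frame ORF scan described on slide 5 can be sketched as follows. This is a minimal illustration, not the method of any tool named in these slides: the function names, the 30-codon default, and the choice to report only ORFs closed by a stop codon are assumptions made for the example.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a published tool).
START_CODONS = {"ATG", "GTG", "TTG"}   # prokaryotic start codons from slide 4
STOP_CODONS = {"TAA", "TAG", "TGA"}    # stop codons of the standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end, n_codons) for each closed ORF.

    start/end are 0-based, end-exclusive offsets on the scanned strand.
    n_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None:
                    # Open an ORF at the first start codon after a stop.
                    if codon in START_CODONS:
                        orf_start = i
                elif codon in STOP_CODONS:
                    n_codons = (i - orf_start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, orf_start, i + 3, n_codons))
                    orf_start = None
    return orfs
```

Candidates returned by `find_orfs` would still need the follow-up checks listed on the slide (Shine-Dalgarno sequence, database search of the translated product) before being called genes.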
  • 6. Markov chains and HMMs • Order k: each position depends on the k previous positions • The higher the order of a Markov model used to describe a gene, the more non-randomness the model includes • Genes described in codons or hexamers • HMMs trained with known genes • Codon pairs are often found, thus 6-nucleotide patterns often occur in ORFs – 5th-order Markov chain • 5th-order HMM gives very accurate gene predictions • Problem may be that in short genes there are not enough hexamers • Interpolated Markov Model (IMM) samples Markov chains of different lengths. Weighting scheme places less weight on rare k-mers • Final probability is the probability of all weighted k-mers • Typical and atypical genes
  • 7. •GeneMark (http://exon.gatech.edu/genemark/) •Trained on complete microbial genomes •Most closely related organism used for predictions •Glimmer (Gene Locator and Interpolated Markov ModelER) (http://www.cbcb.umd.edu/software/glimmer/) •FGENESB (http://linux1.softberry.com/) •5th-order HMM •Trained with bacterial sequences •Linear discriminant analysis (LDA) •RBSFinder (ftp://ftp.tigr.org) •Takes output from Glimmer and searches for S-D sequences close to start sites
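To make the idea of a k-th order Markov chain on slide 6 concrete, here is a small sketch of a fixed-order chain that is trained on example sequences and then used to score a new sequence. It is a plain Markov chain with add-one smoothing, not an HMM and not an interpolated model such as the one Glimmer uses; the class name, the smoothing, and the log-odds helper are assumptions made for illustration.

```python
import math
from collections import defaultdict


class MarkovChain:
    """Fixed k-th order Markov chain over A/C/G/T with add-one smoothing."""

    def __init__(self, order=5):
        self.order = order
        # counts[context][next_base] = times next_base followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        k = self.order
        for seq in sequences:
            seq = seq.upper()
            for i in range(k, len(seq)):
                self.counts[seq[i - k:i]][seq[i]] += 1

    def log_prob(self, seq):
        """Natural-log probability of seq, conditioned on its first k bases."""
        k = self.order
        seq = seq.upper()
        total = 0.0
        for i in range(k, len(seq)):
            ctx = self.counts.get(seq[i - k:i], {})
            n = sum(ctx.values())
            # Add-one smoothing so unseen (k+1)-mers keep nonzero probability.
            total += math.log((ctx.get(seq[i], 0) + 1) / (n + 4))
        return total


def log_odds(seq, coding, noncoding):
    """Positive values mean seq looks more like the coding training set."""
    return coding.log_prob(seq) - noncoding.log_prob(seq)
```

With `order=5` each base is conditioned on the previous five, so the model is effectively counting hexamers, which is why short training sets leave many contexts unseen.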
  • 8.
  • 9. Performance evaluation •Sensitivity Sn = TP/(TP+FN) •Specificity Sp = TP/(TP+FP) •Correlation coefficient CC = (TP·TN − FP·FN) / √[(TP+FP)(TN+FN)(TP+FN)(TN+FP)]
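As a worked example of the three measures on slide 9, the sketch below computes Sn, Sp and CC from raw counts. The function name is an assumption, the CC line uses the standard four-factor denominator, and the counts in the usage note are invented for illustration.

```python
import math


def prediction_accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, CC) from true/false positive/negative counts.

    Sp follows the gene-prediction convention on the slide, TP/(TP+FP).
    Assumes every denominator is nonzero.
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom
    return sn, sp, cc
```

For example, `prediction_accuracy(90, 880, 10, 20)` gives Sn = 90/110 ≈ 0.82, Sp = 0.90 and CC ≈ 0.84.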
  • 10. Gene prediction in eukaryotes •Low gene density (3% in humans) •Space between genes very large, with multiply repeated sequences and transposable elements •Eukaryotic genes are split (introns/exons) •Transcript is capped (methylation of 5’ residue) •Splicing in spliceosome •Alternative splicing •Polyadenylation (~250 As added) downstream of CAATAAA(T/C) consensus box •Major issue is identification of splicing sites •GT-AG rule (GTAAGT/Y12NCAG 5’/3’ intron splice junctions) •Codon use frequencies •ATG start codon •Kozak sequence (CCGCCATGG)
  • 11. •Ab initio programs •Gene signals •Start/stop •Putative splice signals •Consensus sequences •Poly-A sites •Gene content •Coding statistics •Non-random nucleotide distributions •Hexamer frequencies •HMMs
  • 12. Discriminant analysis •Plot 2D graph of coding length versus 3’ splice site •Place diagonal line (LDA) that separates true coding from non-coding sequences based on learnt knowledge •QDA fits quadratic curve •FGENES uses LDA •MZEF (Michael Zhang’s Exon Finder) uses QDA
  • 13. Neural Nets •A series of input, hidden and output layers •Gene structure information is fed to the input layer, and is separated into several classes •Hexamer frequencies •Splice sites •GC composition •Weights are calculated in the hidden layer to generate the exon output •When the input layer is challenged with a new sequence, the rules that were generated to output exons are applied to the new sequence
  • 14. HMMs •GenScan (http://genes.mit.edu/GENSCAN.html) •5th-order HMM •Combines hexamer frequencies with coding signals •Initiation codons •TATA boxes •CAP site •Poly-A •Trained on vertebrate, Arabidopsis and maize data •Extensively used in human genome project •HMMgene (http://www.cbs.dtu.dk/services/HMMgene) •Identifies subregions of exons from cDNA or proteins •Locks such regions and uses HMM extension into neighboring regions
  • 15.
  • 16.
  • 17. Homology-based programs •Use translations to search for ESTs, cDNAs and proteins in databases •GenomeScan (http://genes.mit.edu/genomescan.html) •Combines GENSCAN with BLASTX •EST2Genome (http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html) •Compares EST and cDNA to user sequence •TwinScan •Similar to GenomeScan
  • 18.
  • 19. Consensus-based programs •Use several different programs to generate lists of predicted exons •Only commonly predicted exons are retained •GeneComber (http://www.bioinformatics.ubc.ca/gencombver/index.php) •Combines HMMgene with GenScan •DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi) •Combines FGENESH, GENSCAN and HMMgene
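The "retain only common exons" step on slide 19 can be sketched as a simple vote over exon coordinates. This is an assumed simplification: it treats two predictions as the same exon only when start and end match exactly, and it does not reproduce how GeneComber or DIGIT actually combine their inputs. The program labels in the usage note are placeholders.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=None):
    """Keep exons (start, end) predicted by at least min_votes programs.

    predictions maps a program label to its list of (start, end) exons.
    By default an exon must be predicted by every program.
    """
    if min_votes is None:
        min_votes = len(predictions)
    votes = Counter()
    for exons in predictions.values():
        votes.update(set(exons))  # each program votes once per exon
    return sorted(exon for exon, n in votes.items() if n >= min_votes)
```

For instance, with three hypothetical predictors `{"prog_a": [(100, 200), (300, 400)], "prog_b": [(100, 200), (310, 400)], "prog_c": [(100, 200), (300, 400), (500, 600)]}`, the strict consensus keeps only `(100, 200)`, while `min_votes=2` also keeps `(300, 400)`.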
  • 20. Accuracy (ME = missed exons, WE = wrong exons)
               Nucleotide level    Exon level
    Program    Sn    Sp    CC      Sn    Sp    (Sn+Sp)/2  ME    WE
    FGENES     0.86  0.88  0.83    0.67  0.67  0.67       0.12  0.09
    GeneMark   0.87  0.89  0.83    0.53  0.54  0.54       0.13  0.11
    Genie      0.91  0.90  0.88    0.71  0.70  0.71       0.19  0.11
    GENSCAN    0.95  0.90  0.91    0.71  0.70  0.70       0.08  0.09
    HMMgene    0.93  0.93  0.91    0.76  0.77  0.76       0.12  0.07
    Morgan     0.75  0.74  0.74    0.46  0.41  0.43       0.20  0.28
    MZEF       0.70  0.73  0.66    0.58  0.59  0.59       0.32  0.23
  • 21. Chapter 9 Promoter and regulatory element prediction
  • 22. •Promoters are short regions upstream of the transcription start site •Contain short (6-8 nt) transcription factor recognition sites •Extremely laborious to define by experiment •Sequence is not translated into protein, so no homology matching is possible •Each promoter is unique, with a unique combination of factor binding sites – thus no consensus promoter
  • 23. Prokaryotic gene [diagram labels: TF, TF site, -35 box, -10 box, polymerase, ORF] •σ70 factor binds to the -35 and -10 boxes and recruits the full polymerase enzyme •-35 box consensus sequence: TTGACA •-10 box consensus sequence: TATAAT •Transcription factors activate or repress transcription •Bind to regulatory elements •DNA loops to allow long-distance interactions
  • 24. Eukaryotic gene structure [diagram labels: TF site, TF site, TATA, Inr, Pol II] •Polymerase I, II and III •Basal transcription factors (TFIID, TFIIA, TFIIB, etc.) •TATA box (TATA(A/T)A(A/T)) •“Housekeeping” genes often do not contain TATA boxes •Initiator site (Inr) (C/T)(C/T)CA(C/T)(C/T) coincides with transcription start •Many TF sites •Activation/repression
  • 25. Ab initio methods •Promoter signals •TATA boxes •Hexamer frequencies •Consensus sequence matching •PSSM •Numerous FPs •HMMs incorporate neighboring information
  • 26. Promoter prediction in prokaryotes •Find operon •Upstream of first gene is promoter •Wang rules (distance between genes, no ρ-independent termination, number of genomes that display linkage) •BPROM (http://www.softberry.com) •Based on arbitrary setting of operon gene distances •200 bp upstream of first gene •Many FPs •FindTerm (http://sun1.softberry.com) •Searches for ρ-independent termination signals
  • 27. Prediction in eukaryotes • Searching for consensus sequences in databases (TransFac) • Increase specificity by searching for CpG islands • High density of transcription factor binding sites • CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html) • CG% in moving window • Eponine (http://servlet.sanger.ac.uk:8080/eponine/ ) • Matches TATA box, CCAAT box, CpG island to PSSM • Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html) • Detects high concentrations of TF sites • FirstEF (http://rulai.cshl.org/tools/FirstEF/) • QDA of first exon boundary • McPromoter (http://genes.mit.edu/McPromoter.html) • Neural net of DNA bendability, TATA box, initiator box • Trained for Drosophila and human sequences
  • 28. Phylogenetic footprinting technique •Identify conserved regulatory sites •Human-chimpanzee too close •Human-fish too distant •Human-mouse appropriate •ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite) •Aligns two sequences by global alignment algorithm •Identifies conserved regions and compares them to TRANSFAC database •High-scoring hits returned as positives •rVISTA (http://rvista.dcode.org) •Identifies TRANSFAC sites in two orthologous sequences •Aligns sequences with local alignment algorithm •Highest-identity regions returned as hits •Bayes aligner (http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl) •Aligns two sequences with Bayesian algorithm •Even weakly conserved regions identified
  • 29. Expression-profiling-based method •Microarray analyses allow identification of co-regulated genes •Assume that promoters contain similar regulatory sites •Find such sites by EM and Gibbs sampling using iteration of PSSM •Co-expressed genes may be regulated at higher levels •MEME (http://meme.sdsc.edu/meme/website/meme-intro.html) •AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl) •Gibbs sampling algorithm
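PSSM matching from slide 25 can be illustrated with a short sketch: build a log-odds matrix from aligned example sites, then slide it along a sequence and report windows above a threshold. The pseudocount, the uniform 0.25 background, the function names, and the example sites in the usage note are all assumptions; real promoter tools use curated matrices and calibrated thresholds.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PSSM from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pssm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pssm


def scan(seq, pssm, threshold):
    """Return (offset, window, score) for every window scoring >= threshold."""
    seq = seq.upper()
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        # Bases outside A/C/G/T (e.g. N) contribute a neutral score of 0.
        score = sum(pssm[j].get(base, 0.0) for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window, round(score, 2)))
    return hits
```

Using five made-up -10-box-like sites, `scan("GGCGTATAATGCC", build_pssm(["TATAAT", "TATAAT", "TACAAT", "TATACT", "GATAAT"]), 5.0)` reports a single hit at offset 4. Lowering the threshold quickly adds spurious windows, which is the "numerous FPs" problem the slide mentions.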
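The "CG% in moving window" idea on slide 27 can be sketched as below. The thresholds (200 bp window, GC fraction of at least 0.5, observed/expected CpG ratio of at least 0.6) are the commonly quoted CpG-island criteria and are used here as assumptions; they are not taken from CpGProD, whose exact procedure is not described in these slides.

```python
def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Return (start, gc_fraction, cpg_obs_exp) for windows passing both cutoffs."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        c, g = w.count("C"), w.count("G")
        gc = (c + g) / window
        # Expected CpG count if C and G were placed independently.
        expected = c * g / window
        obs_exp = w.count("CG") / expected if expected else 0.0
        if gc >= min_gc and obs_exp >= min_obs_exp:
            hits.append((start, round(gc, 3), round(obs_exp, 3)))
    return hits
```

Overlapping windows that pass would normally be merged into a single island before being compared with predicted first exons or TF-site clusters.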