2. Bioinformatics
Applies computer technology to molecular biology
Develops algorithms and methods to manage and analyze biological data
Effective methods are needed to compare and align biological
sequences and to discover sequential patterns
Types of data
DNA: helix-shaped molecule made of two complementary strands
of nucleotides: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
Proteins: composed of 20 amino acids
Produced from DNA by three transformations: transcription, splicing and translation
Gene: sequence of hundreds of individual nucleotides arranged in a
particular order
Genome: complete set of genes of an organism
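The DNA-to-protein path named above can be illustrated with a toy sketch. It assumes a simplified view: transcription is modelled as replacing T with U on the coding strand, splicing is skipped entirely, and the codon table is deliberately partial (only the four codons used here). The function names are hypothetical.

```python
# Partial codon table: only the codons needed for this toy input (an assumption
# made for brevity; the real genetic code has 64 codons).
CODON_TABLE = {"AUG": "M", "GCU": "A", "UGG": "W", "UAA": "*"}


def transcribe(dna: str) -> str:
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read the mRNA three bases at a time until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCTTGGTAA")
protein = translate(mrna)
```

Splicing (removal of introns) would sit between the two steps; it is left out because the toy input has no introns.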
3. Alignment of Biological Sequences
Alignment – given two or more input biological sequences, identify similar
sequences with long conserved sub-sequences
Pair-wise Sequence alignment
Multiple Sequence Alignment
In nucleotides – two symbols align if they are identical
In amino acids – two symbols align if they are identical, or if one can be derived from the other by a likely substitution
Local alignment vs. global alignment
Substitution matrix – represents the probabilities of substitutions
An alignment score can be calculated from the substitution scores and gap penalties
Need for alignment
Two sequences are homologous if they share the same ancestor
Degree of similarity – helps to determine degree of homology
Helps to construct evolution tree or phylogenetic tree
4. Pairwise Alignment
Substitution matrix (excerpt):
     A   E   G   H   W
A    5  -1   0  -2  -3
E   -1   6  -3   0  -3
H   -2   0  -2  10  -3
P   -1  -1  -2  -2  -4
W   -3  -3  -3  -3  15
Gap penalty: -8
Gap extension: -8
Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE
Score: (-2) + (-8) + 5 + (-8) + (-8) + 15 + (-8) + 10 + 6 + (-8) + 6 = 0
Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE
Score: (-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
Full 20 x 20 substitution matrices (e.g. BLOSUM, PAM) are available; they are symmetric, so they are often given in triangular form
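The scoring of the two example alignments above can be checked with a short sketch. It assumes a linear gap cost (every gap column costs -8, since gap penalty and gap extension are equal on the slide) and hard-codes only the matrix excerpt shown; the names `sub_score` and `alignment_score` are hypothetical.

```python
GAP = -8  # gap penalty = gap extension = -8, so each gap column costs -8

# Excerpt of the substitution matrix from the slide, stored once per unordered pair.
SCORES = {
    ("A", "A"): 5, ("A", "E"): -1, ("A", "G"): 0, ("A", "H"): -2, ("A", "W"): -3,
    ("E", "E"): 6, ("E", "G"): -3, ("E", "H"): 0, ("E", "W"): -3,
    ("H", "G"): -2, ("H", "H"): 10, ("H", "W"): -3,
    ("P", "A"): -1, ("P", "E"): -1, ("P", "G"): -2, ("P", "H"): -2, ("P", "W"): -4,
    ("W", "G"): -3, ("W", "W"): 15,
}


def sub_score(a: str, b: str) -> int:
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def alignment_score(top: str, bottom: str) -> int:
    """Sum column scores: GAP for a column with '-', substitution score otherwise."""
    assert len(top) == len(bottom), "aligned rows must have equal length"
    total = 0
    for a, b in zip(top, bottom):
        total += GAP if "-" in (a, b) else sub_score(a, b)
    return total


score_1 = alignment_score("HEAGAWGHE-E", "P-A--W-HEAE")
score_2 = alignment_score("HEAGAWGHE-E", "--P-AW-HEAE")
```

With these inputs the first alignment sums to 0 and the second to 1, so the second is the better of the two under this scoring.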
5. Pairwise Alignment
Needleman-Wunsch Algorithm
Smith-Waterman Algorithm
Build up the optimal alignment from optimal alignments of subsequences
Use dynamic programming
O(n^2) time complexity
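A minimal sketch of the dynamic-programming idea, in the global (Needleman-Wunsch) form. The scoring scheme (match +1, mismatch -1, linear gap -2) is an assumption chosen for illustration, not taken from the slides, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: each cell depends on its diagonal, upper and left neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Traceback from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
```

The two nested loops over the table are where the O(n^2) cost comes from. Smith-Waterman follows the same pattern but clamps cell values at zero and starts the traceback from the highest-scoring cell, which gives a local alignment.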
Dot matrix plot
Uses boolean matrices to represent alignments that can be detected visually
O(n^2) time complexity
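A dot matrix can be sketched in a few lines. This version marks a cell only on exact symbol identity (real tools usually add windowing and thresholds, omitted here as a simplification); `dot_matrix` and `render` are hypothetical names.

```python
def dot_matrix(a: str, b: str):
    """Boolean matrix: cell [i][j] is True when a[i] == b[j]."""
    return [[x == y for y in b] for x in a]


def render(a: str, b: str) -> str:
    """Text rendering: '*' for a match, '.' otherwise; b labels the columns."""
    rows = ["  " + b]
    for symbol, row in zip(a, dot_matrix(a, b)):
        rows.append(symbol + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(rows)


plot = render("GATTACA", "GATTACA")
```

Similar regions show up as diagonal runs of `*` when the rendered plot is printed, which is what makes them detectable by eye. Filling every cell is the O(n^2) step.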
Heuristic Algorithms
BLAST – Basic Local Alignment Search Tool
FASTA – FAST-All, a fast alignment tool
First locate high-scoring short stretches and extend them
6. BLAST Local Alignment Algorithm
Finds regions of local similarity between bio-sequences
Matches nucleotide / protein sequences to sequence databases and
calculates statistical significance of matches
Breaks the sequences to be compared into sequences of fragments (words)
and seeks matches between words
DNA – word size – 11 bases
Amino Acids – 3 amino acids
Creates a hash table of matching words
Moves from exact matches to neighborhood words
Hashing makes the word lookup fast – roughly O(n)
Variants: MEGABLAST (long alignments), Discontiguous MEGABLAST
(gapped alignments; similar but not identical sequences), BLASTN (adjustable word size),
BLASTP…
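The word-hashing step described above can be sketched as follows. This is not BLAST itself: it handles exact word matches only, extends a seed while symbols are identical (real BLAST scores the extension and stops on a score drop-off), and leaves out neighborhood words and statistical significance. All function names and the toy sequences are assumptions.

```python
from collections import defaultdict


def build_word_index(subject: str, w: int):
    """Hash table: every word of length w in the subject -> its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def find_seeds(query: str, subject: str, w: int = 11):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    index = build_word_index(subject, w)
    seeds = []
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            seeds.append((q, s))
    return seeds


def extend_ungapped(query, subject, q, s, w):
    """Grow a seed left and right while symbols are identical (simplified extension).

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while q - left - 1 >= 0 and s - left - 1 >= 0 \
            and query[q - left - 1] == subject[s - left - 1]:
        left += 1
    right = 0
    while q + w + right < len(query) and s + w + right < len(subject) \
            and query[q + w + right] == subject[s + w + right]:
        right += 1
    return q - left, s - left, w + left + right


# Toy sequences with a short word size so the example stays readable.
toy_query, toy_subject = "TTGACCGTAA", "CCGGACCGTTT"
toy_seeds = find_seeds(toy_query, toy_subject, w=4)
toy_hit = extend_ungapped(toy_query, toy_subject, *toy_seeds[0], 4)
```

Each query word costs one hash lookup, so the seeding pass is linear in the query length once the index is built.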
7. Multiple Sequence Alignment Methods
Goal – To find common patterns among all considered sequences
Applications
To build gene / protein families
Identify amino acids which are essential sites for structure and function
More complex than pairwise alignment
Multi-dimensional alignment / Approximate alignment
Methods
Series of pairwise alignments
Feng-Doolittle alignment
Computes all possible pairwise alignments by dynamic programming
Constructs a guide tree by clustering, then aligns the sequences progressively along the tree
Produces a multiple sequence alignment
Hidden Markov Models
8. HMM for Biological Sequence Analysis
Finding CpG Islands
Methylation – the C in a CpG pair is often methylated, and a methylated C tends to mutate to T
As a result, CpG pairs occur less often than expected
Methylation is suppressed around the start regions of genes
Areas with a high concentration of CpG – CpG islands
Given a short sequence, is it from a CpG island?
Given a long sequence, can all CpG islands in it be found?
9. Markov Chain
Probability of a symbol depends only on previous symbol
Markov Chain model – states and transitions (probability)
Probability of a sequence x = x1x2…xL
\[
\Pr(x) = \Pr(x_L \mid x_{L-1}) \, \Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1) \, \Pr(x_1)
       = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})
\]
Markov model can be used for classification
- To distinguish CpG islands from other regions, use the
training data to construct two models, + (islands) and - (non-islands).
Classify a given sequence x by comparing P(x|+) and P(x|-)
- Probability values are estimated from the training
sequences
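The two-model classification above can be sketched with a log-odds score. The training sequences below are invented toy data, not real CpG annotations; the pseudocount of 1 and the choice to ignore the first-symbol probability Pr(x1) (treated as equal in both models) are assumptions, as are the function names.

```python
import math

ALPHABET = "ACGT"


def train_transitions(sequences, pseudocount=1.0):
    """Estimate Pr(cur | prev) from counts of adjacent symbol pairs."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs


def log_odds(x, plus, minus):
    """log P(x|+) - log P(x|-), summed over transitions; > 0 favours the + model."""
    return sum(math.log(plus[p][c] / minus[p][c]) for p, c in zip(x, x[1:]))


# Toy training data (assumption): CG-rich strings for '+', AT-rich strings for '-'.
plus_model = train_transitions(["CGCGCGGCGC", "GCGCGCCGCG"])
minus_model = train_transitions(["ATTATAATTA", "TAATATTAAT"])

island_like = log_odds("CGCGCG", plus_model, minus_model)
background_like = log_odds("ATATAT", plus_model, minus_model)
```

A positive score classifies the sequence as coming from the + model, a negative score as coming from the - model.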
10. Hidden Markov Model
Used to find all CpG islands in a long DNA Sequence
Merge the two Markov chains (+ and -) and add transition probabilities between the two
chains
Hidden Markov Model: states, transitions, emission probabilities (probability
of producing a symbol at a state)
Hidden because the states visited in generating a sequence are not known
11. Hidden Markov Models
Tasks
Evaluation: given a sequence x, determine the probability P(x) –
Forward algorithm
Decoding: given a sequence, determine the most probable path
through the model – Viterbi algorithm
Learning: given a model and training sequences, find the model
parameters – Baum-Welch algorithm
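The evaluation and decoding tasks can be sketched on a small model. This is a simplification of the CpG model described earlier: it uses just two hidden states, + (island) and - (background), each with its own emission probabilities, rather than the merged chains. All numeric parameters are invented for illustration. Learning (Baum-Welch) is not shown.

```python
import math

STATES = ("+", "-")
# Toy parameters (assumptions, not estimated from data).
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(x: str) -> float:
    """Evaluation: P(x), summing over all state paths (plain probabilities,
    so only suitable for short sequences)."""
    f = {s: START[s] * EMIT[s][x[0]] for s in STATES}
    for sym in x[1:]:
        f = {s: EMIT[s][sym] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())


def viterbi(x: str) -> str:
    """Decoding: most probable state path, computed in log space."""
    v = {s: math.log(START[s]) + math.log(EMIT[s][x[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t+1
    for sym in x[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] + math.log(TRANS[r][s]))
            ptr[s] = best
            nxt[s] = v[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][sym])
        back.append(ptr)
        v = nxt
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


path = viterbi("ATATATATCGCGCGCG")
```

Stretches of + in the decoded path mark the positions this toy model assigns to an island, which is how all islands in a long sequence could be located in one pass.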