5.4 mining sequence patterns in biological data

Mining Sequence Patterns in Biological Data

Bioinformatics
- Applies computer technology to molecular biology
- Develops algorithms and methods to manage and analyze biological data
- Effective methods are needed to compare and align biological sequences and to discover sequential patterns
- Types of data
  - DNA: helix-shaped molecule made of two paired strands of nucleotides: Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
  - Proteins: chains built from 20 kinds of amino acids
    - Produced from DNA through three transformations: transcription, splicing, and translation
  - Gene: sequence of hundreds of individual nucleotides arranged in a particular order
  - Genome: complete set of genes of an organism

Alignment of Biological Sequences
- Alignment: given two or more input biological sequences, identify similar sequences with long conserved subsequences
  - Pairwise sequence alignment (two sequences)
  - Multiple sequence alignment (more than two sequences)
- Nucleotides: two symbols align if they are identical
- Amino acids: two symbols align if they are identical or if one can be derived from the other by a likely substitution
- Local alignment vs. global alignment: align only the similar regions, or align the sequences end to end
- Substitution matrix: represents the probabilities of substitutions between symbols
- An alignment score is calculated by summing substitution scores and gap penalties
- Need for alignment
  - Two sequences are homologous if they share a common ancestor
  - The degree of similarity helps determine the degree of homology
  - Helps construct an evolutionary (phylogenetic) tree

Pairwise Alignment
Substitution scores for a subset of amino acids:

      A    E    G    H    W
A     5   -1    0   -2   -3
E    -1    6   -3    0   -3
H    -2    0   -2   10   -3
P    -1   -1   -2   -2   -4
W    -3   -3   -3   -3   15

Gap penalty: -8
Gap extension: -8

Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE
Score: (-2) + (-8) + 5 + (-8) + (-8) + 15 + (-8) + 10 + 6 + (-8) + 6 = 0

Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE
Score: (-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
- Full 20 x 20 substitution matrices are available (stored as triangular matrices, since the scores are symmetric)
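The score of a given alignment is the sum of its column scores. The sketch below recomputes the two scores from this slide; the names `ROWS`, `COLUMNS`, `sub_score`, and `alignment_score` are illustrative, and it assumes every gap column costs -8 (gap penalty and gap extension are equal here).

```python
# Substitution scores copied from the slide: one list per row, columns A E G H W.
COLUMNS = "AEGHW"
ROWS = {
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}
GAP = -8  # assumption: opening and extending a gap both cost -8, as on the slide


def sub_score(a, b):
    """Look up the score for aligning a with b; the table is symmetric,
    so try (row a, column b) first and (row b, column a) otherwise."""
    if a in ROWS and b in COLUMNS:
        return ROWS[a][COLUMNS.index(b)]
    return ROWS[b][COLUMNS.index(a)]


def alignment_score(top, bottom):
    """Sum the column scores of two equal-length aligned strings ('-' is a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        total += GAP if "-" in (a, b) else sub_score(a, b)
    return total


score_1 = alignment_score("HEAGAWGHE-E", "P-A--W-HEAE")
score_2 = alignment_score("HEAGAWGHE-E", "--P-AW-HEAE")
print(score_1, score_2)
```

Only the letter pairs present in the slide's partial table can be scored; a full 20 x 20 matrix would remove that limit.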

Pairwise Alignment
- Needleman-Wunsch algorithm (global alignment)
- Smith-Waterman algorithm (local alignment)
- Both build up an optimal alignment from optimal alignments of subsequences
  - Use dynamic programming
  - O(n^2) time complexity
- Dot matrix plot
  - Uses Boolean matrices to represent alignments that can be detected visually
  - O(n^2) time complexity
- Heuristic algorithms
  - BLAST: Basic Local Alignment Search Tool
  - FASTA: Fast Alignment Tool
  - First locate high-scoring short stretches, then extend them
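The dynamic programming idea can be sketched as follows. This is a minimal score-only version (no traceback), assuming the partial substitution table and the linear gap cost of -8 from the earlier Pairwise Alignment slide; the function name `align_score` and its `local` flag are illustrative. With `local=False` it follows the Needleman-Wunsch recurrence, with `local=True` the Smith-Waterman recurrence.

```python
# Partial substitution table from the Pairwise Alignment slide (columns A E G H W).
COLUMNS = "AEGHW"
ROWS = {
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}
GAP = -8  # assumption: linear gap cost, as on the slide


def align_score(x, y, local=False):
    """Best alignment score of x (table columns) against y (table rows).

    Fills an (len(y)+1) x (len(x)+1) table, so time and space are O(n^2).
    """
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for j in range(1, n + 1):
            F[0][j] = j * GAP
        for i in range(1, m + 1):
            F[i][0] = i * GAP
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = F[i - 1][j - 1] + ROWS[y[i - 1]][COLUMNS.index(x[j - 1])]
            cell = max(match, F[i - 1][j] + GAP, F[i][j - 1] + GAP)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            F[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both sequences end to end.
    # Local: highest-scoring cell anywhere in the table.
    return best if local else F[m][n]


global_score = align_score("HEAGAWGHEE", "PAWHEAE")
local_score = align_score("HEAGAWGHEE", "PAWHEAE", local=True)
print(global_score, local_score)
```

For these two sequences the global score is 1, the same as the better of the two alignments scored on the Pairwise Alignment slide, and the local score is 28 (AWGHE against AW-HE).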

BLAST Local Alignment Algorithm
- Finds regions of local similarity between biosequences
- Matches nucleotide or protein sequences against sequence databases and calculates the statistical significance of the matches
- Breaks the sequences to be compared into short fragments (words) and seeks matches between words
  - DNA: word size of 11 bases
  - Proteins: word size of 3 amino acids
- Creates a hash table of matching words
- Moves from exact matches to neighborhood words
- Because of hashing, runs in O(n) time
- Variants: MEGABLAST (long alignments), Discontiguous MEGABLAST (gapped alignments; similar but not identical sequences), BLASTN (adjustable word size), BLASTP…
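The word-matching step can be illustrated with a small sketch. This is not BLAST itself: it only indexes exact words in a hash table and extends a hit while characters match exactly, leaving out neighborhood words, scored extension, and significance statistics. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_word_index(subject, w):
    """Hash table mapping each length-w word of subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = build_word_index(subject, w)
    seeds = []
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, w):
    """Extend a seed left and right without gaps while characters are identical.

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while q - left - 1 >= 0 and s - left - 1 >= 0 \
            and query[q - left - 1] == subject[s - left - 1]:
        left += 1
    right = w
    while q + right < len(query) and s + right < len(subject) \
            and query[q + right] == subject[s + right]:
        right += 1
    return q - left, s - left, left + right


# Toy data (hypothetical); a word size of 3 keeps the example small.
query = "GATTACA"
subject = "CCGATTACAGG"
seeds = find_seeds(query, subject, w=3)
hit = extend_seed(query, subject, *seeds[0], 3)
print(seeds, hit)
```

Building the index takes one pass over the subject, and each query word is then a single hash lookup, which is where the linear running time comes from.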

Multiple Sequence Alignment Methods
- Goal: find common patterns among all considered sequences
- Applications
  - Build gene / protein families
  - Identify amino acids that are essential sites for structure and function
- More complex than pairwise alignment
  - Multi-dimensional alignment / approximate alignment
- Methods
  - Series of pairwise alignments
    - Feng-Doolittle alignment
      - Computes all possible pairwise alignments by dynamic programming
      - Constructs a guide tree by clustering
      - Builds the multiple sequence alignment progressively, following the guide tree
  - Hidden Markov models

HMM for Biological Sequence Analysis
- Finding CpG islands
  - Methylation tends to convert the C in a CpG pair (a C followed by a G) to T
  - As a result, CpG pairs are rare in the genome
  - Methylation is suppressed around the start regions of genes
  - Areas with a high concentration of CpG pairs are called CpG islands
- Given a short sequence, is it from a CpG island?
- Given a long sequence, can all of its CpG islands be found?

Markov Chain
- The probability of a symbol depends only on the previous symbol
- Markov chain model: states and transitions (with probabilities)
- Probability of a sequence x = x_1 x_2 … x_L:
$$\Pr(x) = \Pr(x_L \mid x_{L-1})\,\Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1)\,\Pr(x_1) = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$$

Markov models can be used for classification
- To distinguish CpG islands from other regions, use the training data to construct two models, + (CpG island) and - (non-island), and classify a given sequence by comparing P(x|+) and P(x|-)
- The probability values are estimated from the training sequences
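A minimal sketch of this classification scheme follows. The training sequences are made-up toy data, and three choices are assumptions not stated on the slide: add-one pseudocounts, a uniform probability for the first symbol, and deciding by the sign of the log-odds log P(x|+) - log P(x|-).

```python
import math

ALPHABET = "ACGT"


def train_markov_chain(sequences, pseudocount=1.0):
    """Estimate transition probabilities Pr(b | a) by counting adjacent pairs.

    The pseudocount (an assumption) keeps unseen transitions from getting
    probability zero.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_prob(seq, trans):
    """log Pr(x) = log Pr(x_1) + sum over i >= 2 of log Pr(x_i | x_{i-1}).

    Pr(x_1) is taken as uniform over the alphabet (an assumption).
    """
    lp = math.log(1.0 / len(ALPHABET))
    for a, b in zip(seq, seq[1:]):
        lp += math.log(trans[a][b])
    return lp


def log_odds(seq, plus, minus):
    """Positive values favor the + (CpG island) model, negative the - model."""
    return log_prob(seq, plus) - log_prob(seq, minus)


# Hypothetical toy training data, far too small for real use.
plus_model = train_markov_chain(["CGCGCGGCGC", "GCGCGCCGCG"])
minus_model = train_markov_chain(["ATTATAAATT", "TTAATATTAA"])

island_like = log_odds("CGCGCG", plus_model, minus_model)
background_like = log_odds("ATATAT", plus_model, minus_model)
print(island_like, background_like)
```

Working with logarithms avoids numerical underflow when the product over a long sequence becomes very small.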

Hidden Markov Model
- Used to find all CpG islands in a long DNA sequence
- Merge the two Markov chains and add transition probabilities between the states of the two chains
- Hidden Markov model: states, transitions, and emission probabilities (the probability of producing a symbol at a state)
- Hidden because the states visited in generating a sequence are not known
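Recovering the hidden states can be sketched with the Viterbi algorithm on a deliberately simplified model. Instead of the merged chains described above, this toy HMM has just two states, + (island) and - (background), each emitting nucleotides; all probabilities are invented for illustration, and the sequence is assumed to be non-empty.

```python
import math

# Hypothetical two-state HMM; every number here is made up for illustration.
STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of '+' and '-'."""
    # v[s] = log probability of the best path that ends in state s.
    v = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in seq[1:]:
        pointers, nxt = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            nxt[s] = v[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][symbol])
        back.append(pointers)
        v = nxt
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


path = viterbi("ATATCGCGCG")
print(path)
```

Runs of + in the decoded path mark the candidate CpG islands in the input sequence.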

Hidden Markov Models
- Tasks
  - Evaluation: given a sequence x, determine the probability P(x) that the model generates it (Forward algorithm)
  - Decoding: given a sequence, determine the most probable path through the model (Viterbi algorithm)
  - Learning: given a model and training sequences, estimate the model parameters (Baum-Welch algorithm)
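The evaluation task can be sketched with the Forward algorithm. The two-state model below (+ for island, - for background) and all of its probabilities are hypothetical, chosen only to make the recurrence concrete; the sequence is assumed to be non-empty and the model has no explicit end state.

```python
import itertools

# Hypothetical two-state HMM; every number here is made up for illustration.
STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def forward(seq):
    """P(x): total probability of seq, summed over all hidden state paths."""
    # f[s] = probability of emitting the prefix seen so far and ending in state s.
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        f = {
            s: EMIT[s][symbol] * sum(f[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(f.values())


p_c = forward("C")
# Sanity check: the probabilities of all sequences of one length sum to 1.
total = sum(forward("".join(p)) for p in itertools.product("ACGT", repeat=3))
print(p_c, total)
```

The Forward recurrence has the same shape as Viterbi but sums over predecessor states where Viterbi takes the maximum. For long sequences a practical version would rescale the values or work in log space to avoid underflow.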
