Mining Sequence Patterns
in Biological data
Bioinformatics
 Applies Computer Technology in Molecular biology
 Develops algorithms and methods to manage and analyze biological
data
 Effective methods are needed to compare and align biological
sequences and discover sequential patterns
 Type of data
 DNA: helix-shaped molecule whose constituents are two parallel strands
of nucleotides : Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
 Proteins: Composed of 20 amino acids
 Produced from DNA using 3 operations or transformations: transcription, splicing and translation
 Gene : Sequence of hundreds of individual nucleotides arranged in a
particular order
 Genome : Complete set of genes of an organism
Alignment of Biological Sequences
 Alignment – given two or more input biological sequences, identify similar
sequences with long conserved sub-sequences
 Pair-wise Sequence alignment
 Multiple Sequence Alignment
 In nucleotides – two symbols align if they are identical
 In amino acids – two symbols align if they are identical or if one can be derived from the other
 Local Alignment Vs Global Alignment
 Substitution matrix – represents the probability of substituting one symbol for another
 Alignment score can be calculated
 Need for alignment
 Two sequences are homologous if they share the same ancestor
 Degree of similarity – helps to determine degree of homology
 Helps to construct evolution tree or phylogenetic tree
Pairwise Alignment
Substitution scores (rows: symbol in the second sequence; columns: symbol in the first sequence):

     A    E    G    H    W
A    5   -1    0   -2   -3
E   -1    6   -3    0   -3
H   -2    0   -2   10   -3
P   -1   -1   -2   -2   -4
W   -3   -3   -3   -3   15

Gap penalty: -8
Gap extension: -8

Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE
Score: (-2) + (-8) + 5 + (-8) + (-8) + 15 + (-8) + 10 + 6 + (-8) + 6 = 0

Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE
Score: (-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

Full 20 x 20 (triangular) substitution matrices are available.
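The two scores can be checked mechanically. The following is a minimal Python sketch, assuming the five-row table on this slide is read with rows as symbols of the second sequence and columns as symbols of the first; the names `SCORES`, `GAP` and `alignment_score` are illustrative and not part of the original deck.

```python
# Substitution scores copied from the slide: SCORES[row][column].
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}
GAP = -8  # the slide uses the same value for gap opening and extension


def alignment_score(top, bottom):
    """Sum the column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP              # any column containing a gap
        else:
            total += SCORES[b][a]     # row = second sequence, column = first
    return total


print(alignment_score("HEAGAWGHE-E", "P-A--W-HEAE"))  # 0
print(alignment_score("HEAGAWGHE-E", "--P-AW-HEAE"))  # 1
```

Under this scoring scheme the second alignment is preferred because it has the higher total.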
Pairwise Alignment
 Needleman-Wunsch Algorithm
 Smith-Waterman Algorithm
 Build up an optimal alignment from optimal alignments of shorter subsequences
 Use Dynamic Programming
 O(n^2) Time Complexity
 Dot matrix plot
 Uses boolean matrices to represent alignments that can be detected visually
 O(n^2) Time Complexity
 Heuristic Algorithms
 BLAST – Basic Local Alignment Search Tool
 FASTA – FAST-All (fast alignment tool)
 First locate high-scoring short stretches and extend them
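The dynamic-programming idea behind Needleman-Wunsch (global alignment) can be sketched as follows. This is an illustrative Python version, not the deck's own code: the match, mismatch and gap values are assumed defaults chosen for the example, and a linear gap penalty is used. The nested loops over both sequences are where the O(n^2) cost comes from.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming (scores are assumed values)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)   # 1
print(top)     # ACGT
print(bottom)  # A-GT
```

Smith-Waterman (local alignment) uses the same recurrence but additionally clamps each cell at zero and starts the traceback from the highest-scoring cell.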
BLAST Local Alignment Algorithm
 Finds regions of local similarity between bio-sequences
 Matches nucleotide / protein sequences to sequence databases and
calculates statistical significance of matches
 Breaks the sequences to be compared into sequences of fragments (words)
and seeks matches between words
 DNA – word size – 11 bases
 Amino Acids – 3 amino acids
 Creates a hash table of matching words
 Moves from exact matches to neighborhood words
 Due to hashing, the search runs in O(n) time
 Variants : MEGABLAST (long alignments), Discontinuous MEGABLAST
(gapped alignments- similar not identical), BLASTN (Adjustable word size),
BLASTP…
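The word-hashing step can be illustrated with a short Python sketch. It covers only the exact-match seeding stage; neighborhood words, seed extension and significance statistics are omitted. The function names `build_word_index` and `find_seeds` are hypothetical and do not correspond to any BLAST API.

```python
from collections import defaultdict


def build_word_index(sequence, word_size):
    """Hash table mapping each word to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where an exact word match starts."""
    index = build_word_index(query, word_size)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        word = subject[j:j + word_size]
        for i in index.get(word, ()):   # constant-time lookup per word
            seeds.append((i, j))
    return seeds


# A small word size is used here only to keep the toy example readable.
print(find_seeds("ACGTACGT", "TTACGTAA", word_size=4))
# [(3, 1), (0, 2), (4, 2), (1, 3)]
```

Each subject word is looked up once in the hash table, which is why this stage scales roughly linearly with sequence length.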
Multiple Sequence Alignment Methods
 Goal – To find common patterns among all considered sequences
 Applications
 To build gene / protein families
 Identify amino acids which are essential sites for structure and function
 More complex than Pair wise alignment
 Multi-dimensional alignment / Approximate alignment
 Methods
 Series of pair-wise alignments
 Feng-Doolittle alignment
 Computes all possible pair wise alignments by dynamic programming
 Constructs a Guide tree – by clustering and progressive alignment
 Multiple Sequence alignment
 Hidden Markov Models
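The guide-tree step of a progressive method can be sketched as below. This is an assumption-laden Python illustration: it uses simple average-linkage clustering as a stand-in for whatever clustering a particular tool applies, and the pairwise distances are made-up values that would in practice be derived from pairwise alignment scores. The name `guide_tree` is hypothetical.

```python
from itertools import combinations


def guide_tree(names, distance):
    """Build a nested-tuple tree by repeatedly joining the closest clusters.

    distance maps frozenset({a, b}) to a pairwise distance (assumed input).
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(distance[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical distances between three sequences.
d = {
    frozenset({"s1", "s2"}): 0.1,
    frozenset({"s1", "s3"}): 0.5,
    frozenset({"s2", "s3"}): 0.6,
}
print(guide_tree(["s1", "s2", "s3"], d))  # ('s3', ('s1', 's2'))
```

The resulting tree gives the order in which sequences (or groups of already aligned sequences) are merged during progressive alignment: the closest pair first, then the remaining sequences.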
HMM for Biological Sequence Analysis
 Finding CpG Islands
 Methylation process – a methylated C in a CpG dinucleotide tends to mutate to T
 CpG occurrence – rare
 Methylation is suppressed around start regions of genes
 Areas with high concentration – CpG Islands
 Given a short sequence, is it from a CpG island?
 Given a long sequence, can all CpG islands be found?
Markov Chain
 Probability of a symbol depends only on previous symbol
 Markov Chain model – states and transitions (probability)
 Probability of a sequence x = x1x2…xL
$$\Pr(x) = \Pr(x_L \mid x_{L-1}) \, \Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1) \, \Pr(x_1) = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$$
Markov model can be used for classification
- To distinguish CpG islands from other regions, construct two models from the training data: + (CpG island) and - (non-island). Classify a given sequence x by comparing P(x|+) and P(x|-)
- Probability values are estimated from training sequences
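The classification idea can be sketched in Python as follows. The training sequences are toy data invented for this example, the add-one pseudocount is an assumption, and the initial probability Pr(x1) is taken to be equal in both models so that it cancels in the ratio. The names `train_transitions` and `log_odds` are illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_transitions(sequences, pseudocount=1.0):
    """Estimate Pr(cur | prev) from counts of adjacent symbol pairs."""
    counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[(prev, cur)] += 1
    probs = {}
    for prev in ALPHABET:
        row_total = sum(counts[(prev, c)] + pseudocount for c in ALPHABET)
        for cur in ALPHABET:
            probs[(prev, cur)] = (counts[(prev, cur)] + pseudocount) / row_total
    return probs


def log_odds(seq, plus, minus):
    """log(P(x|+) / P(x|-)), ignoring Pr(x1), which is assumed equal in both."""
    return sum(math.log(plus[(p, c)] / minus[(p, c)])
               for p, c in zip(seq, seq[1:]))


# Toy training data (hypothetical): CpG-rich versus AT-rich sequences.
plus_model = train_transitions(["CGCGCGCG", "GCGCGC"])
minus_model = train_transitions(["ATATTAAT", "TTAATA"])

print(log_odds("CGCG", plus_model, minus_model) > 0)  # True: looks like an island
print(log_odds("ATAT", plus_model, minus_model) > 0)  # False: looks like background
```

A positive log-odds score favors the + model and a negative score favors the - model.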
Hidden Markov Model
 Used to find all CpG islands in a long DNA Sequence
 Merge two Markov chains and add transition probabilities between the two
states
 Hidden Markov Model: states, transitions, emission probabilities (probability
of producing a symbol at a state)
 Hidden because the states visited in generating a sequence are not known
Hidden Markov Models
 Tasks
 Evaluation: Given a sequence x determine probability P(x) –
Forward Algorithm
 Decoding: Given a sequence, determine most probable path
through the model – Viterbi Algorithm
 Learning: Given a model and training sequences, find the model
parameters – Baum-Welch Algorithm
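The evaluation and decoding tasks can be sketched on a deliberately simplified HMM. The Python example below assumes only two hidden states, "+" (island) and "-" (background), each emitting nucleotides; all start, transition and emission probabilities are made-up values for illustration, and input sequences are assumed to be non-empty.

```python
import math

# Hypothetical two-state model; every number here is an assumed value.
STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(seq):
    """Evaluation: P(x), summed over all hidden state paths."""
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        f = {s: EMIT[s][symbol] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())


def viterbi(seq):
    """Decoding: most probable state path, computed in log space."""
    v = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in seq[1:]:
        pointers, nxt = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] + math.log(TRANS[r][s]))
            pointers[s] = best
            nxt[s] = (v[best] + math.log(TRANS[best][s])
                      + math.log(EMIT[s][symbol]))
        back.append(pointers)
        v = nxt
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("CGCGCGCG"))  # ++++++++
print(viterbi("ATATATAT"))  # --------
```

Baum-Welch (learning) reuses the forward computation together with a matching backward pass to re-estimate the parameters; it is not shown here.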