Analysis of Dynamic Programming Algorithms in Bioinformatics – A Big Data Perspective
Vineetha V and Dr. Achuthsankar S. Nair
Department of Computational Biology & Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala
Contact
Vineetha V
Dept. of Computational Biology and Bioinformatics, University of Kerala
Email: vineevishnu@gmail.com
Phone: 9446175215
DP Algorithms in Bioinformatics
In bioinformatics, DP algorithms are used mainly in problems such as:
• Pairwise sequence alignment
• Multiple sequence alignment
• Gene prediction
Approximation: polynomial time, with a performance guarantee.
An approximation algorithm runs in polynomial time but is not guaranteed to find an optimal solution. Instead, for a minimization problem, it guarantees:
(objective value of the solution found) ≤ α · (optimal objective value)
where α ≥ 1 is the approximation ratio; the closer α is to 1, the better.
Center Star Method (approximation ratio 2, assuming D satisfies the triangle inequality)
Let D(X, Y) be the optimal pairwise alignment distance between X and Y.
S_c ∈ S is the center string of S = {S_1, …, S_k} if it minimizes
$\sum_{i=1}^{k} D(S_c, S_i)$
Steps:
- Identify the center string S_c of S.
- Use the pairwise alignments of S_c with each S_i to build a multiple alignment.
Running time, for k strings each of length n:
- Computing the optimal distance D(S_i, S_j) for all i, j: O(k^2 n^2)
- Choosing the center string S_c: O(k^2)
- Generating the k pairwise alignments with S_c: O(k n^2)
- Inserting spaces into S_c so that all k pairwise alignments are satisfied simultaneously: O(k^2 n)
- Total running time: O(k^2 n^2)
The result is a near-optimal solution in polynomial time, with a guarantee that it is not too far from the optimal solution.
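The first step of the center star method (choosing the center string) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes unit-cost edit distance as the distance D, and the function names `edit_distance` and `center_string` are hypothetical. The second step, merging the pairwise alignments into one multiple alignment, is not shown.

```python
def edit_distance(x, y):
    """Unit-cost edit distance between x and y (assumed stand-in for D)."""
    prev = list(range(len(y) + 1))          # DP row for the empty prefix of x
    for i, cx in enumerate(x, start=1):
        cur = [i]                           # distance from x[:i] to the empty string
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # delete cx
                           cur[j - 1] + 1,              # insert cy
                           prev[j - 1] + (cx != cy)))   # match or substitute
        prev = cur
    return prev[-1]


def center_string(seqs):
    """Return (index, total distance) of the string minimizing the sum of distances."""
    k = len(seqs)
    dist = [[0] * k for _ in range(k)]
    for i in range(k):                      # all pairs: O(k^2) distance computations
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = edit_distance(seqs[i], seqs[j])
    totals = [sum(row) for row in dist]
    c = min(range(k), key=totals.__getitem__)
    return c, totals[c]
```

For example, `center_string(["ACGT", "ACGA", "TCGT", "ACG"])` picks index 0 ("ACGT"), whose distances to the other three strings sum to 3.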
Big Data Challenges
Heuristics
No performance guarantee, but effective in practice; polynomial time.
Progressive Alignment – Compute pairwise distance scores for all pairs of sequences, build a guide tree from them, and align the sequences following the guide tree. The root node represents a complete multiple alignment of the input sequences.
ClustalW – for S = {S_1, S_2, …, S_k}, each of length n:
- Optimal alignment of every pair S_i, S_j: O(k^2 n^2)
- Building the guide tree from the distance matrix: O(k^3)
- Alignments along the guide tree using profile-profile alignment: O(k^2 n + k n^2)
- Total time: O(k^2 n^2 + k^3)
Additional heuristics, such as weighting sequences by branch length and adjusting the guide tree on the fly, are also applied.
Iterative Alignment – Start from an initial multiple alignment (which can come from a progressive method) and then apply modifications to improve it.
MUSCLE – Uses the alignment to compute more accurate pairwise distances, then refines the multiple alignment using tree-dependent restricted partitioning: an edge of the guide tree is deleted, the alignments of the two resulting subtrees are re-combined, and the new alignment is kept if it is better.
Other iterative methods – PRRP, MAFFT.
Iterative algorithms offer improved alignment accuracy at the expense of computation time.
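The guide-tree step of progressive alignment described above can be sketched as a simple agglomerative clustering of the pairwise distance matrix. This is an illustrative sketch under stated assumptions: it uses average-linkage (UPGMA-style) clustering as a stand-in, whereas ClustalW itself builds its tree by neighbour-joining, and the function name `guide_tree` is hypothetical. The naive loop performs k - 1 merges with an O(k^2) scan each, which matches the O(k^3) figure quoted for this step.

```python
def guide_tree(dist):
    """Build a guide tree from a symmetric k x k distance matrix.

    Leaves are sequence indices; internal nodes are 2-tuples (left, right).
    Average-linkage clustering is an assumption made for this sketch.
    """
    k = len(dist)
    clusters = {i: (i, 1) for i in range(k)}            # id -> (subtree, leaf count)
    d = {(i, j): dist[i][j] for i in range(k) for j in range(i + 1, k)}
    next_id = k
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                        # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:                              # size-weighted average distance
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # drop every distance that involves a or b, then add the new cluster's row
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree
```

The nested tuples give the order in which sequences (and then profiles) would be aligned: the innermost pairs first, the root last.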
State of the Art
Alignment quality and total run time
Conclusion & Future Work
• There is a trade-off between solution optimality and computation time.
• Fixed-parameter approaches appear promising for finding tractable special cases of these problems.
• Employing more than one program, based on different alignment techniques, might yield a better result.
• Parallelization is also recommended from a practical standpoint: it reduces execution (wall-clock) time, although the total amount of computation remains the same.
Data deluge: Enhanced sequencing technologies produce sequence data on an unparalleled scale, and alignment solutions need to scale to handle this huge volume of input data.
Curse of dimensionality: The computational complexity of the DP algorithm increases exponentially with the dimensionality of the state, which makes it impractical in large-scale applications.
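To make the exponential growth concrete, the sketch below evaluates the cost estimate n_1 · n_2 · … · n_k · 2^k for a full k-dimensional alignment DP (one cell per combination of positions, with a number of predecessor moves per cell bounded by 2^k). The function name `msa_dp_operations` and the sequence length of 100 are assumptions chosen for illustration only.

```python
import math


def msa_dp_operations(lengths):
    """Estimated elementary steps for the full DP over sequences of the given lengths."""
    k = len(lengths)
    return math.prod(lengths) * 2 ** k      # n_1 * ... * n_k cells, 2^k work per cell


# Growth for k sequences of length 100 each.
for k in range(2, 7):
    print(k, msa_dp_operations([100] * k))
```

Each additional sequence of length 100 multiplies the estimate by 200, which is why the full DP stops being practical after only a handful of sequences.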
• MSA – MSA with the sum-of-pairs score is NP-complete.
• Multiple tree alignment is MAX SNP-hard.
• DP on the full k-dimensional box of volume n_1 × n_2 × n_3 × … × n_k takes O(n_1 · n_2 · n_3 · … · n_k · 2^k) time.
• Running time is very slow even for k = 3, and totally infeasible for k ≥ 6.
• Pairwise Sequence Alignment – Most versions of pairwise sequence alignment, for sequences of lengths m and n, have a time complexity of O(mn) and a space complexity of O(n) when only the optimal score is needed.
• Special cases such as the longest common subsequence (LCS) of two strings are solvable in polynomial time.
• Gene Prediction – In general, the gene prediction problem is NP-hard.
• Polynomial-time algorithms exist for several special cases.
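The O(mn) time and O(n) space bounds for the pairwise case can be illustrated with the LCS length of two strings. The sketch below keeps only two DP rows at a time; the function name `lcs_length` is hypothetical, and the code returns the length only, since recovering the subsequence itself would need either the full table or a divide-and-conquer scheme.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y.

    Time O(len(x) * len(y)); space O(min(len(x), len(y))).
    """
    if len(y) > len(x):
        x, y = y, x                         # keep the DP row over the shorter string
    prev = [0] * (len(y) + 1)               # row for the empty prefix of x
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur.append(prev[j - 1] + 1)             # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))    # skip a character of x or of y
        prev = cur
    return prev[-1]
```

For instance, `lcs_length("AGGTAB", "GXTXAYB")` is 4, corresponding to the common subsequence "GTAB".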
Fixed Parameter Complexity
• Fine-grained complexity analysis of NP-hard problems.
• Analyzes how problem- and data-specific parameters influence the computational complexity of the problem.
Problem difficulty is analyzed not only in terms of the input size but also in terms of an additional parameter, typically an integer p.
Fixed-parameter tractability – A problem is fixed-parameter tractable if, for each fixed parameter value p, it can be solved in time O(n^α), where α is a constant independent of p.
Formally, a parameterized problem with parameter p is fixed-parameter tractable if there is an algorithm that decides an instance (I, p) in f(p) · |I|^O(1) time, where f is an arbitrary computable function depending only on p and I is the unparameterized instance.
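The distinction the definition draws can be written out side by side. The block below is only an illustration: c and g are placeholder symbols not used elsewhere on the poster, and the last line takes the full-DP bound for k sequences of equal length n, with the number of sequences k as the parameter.

```latex
\begin{align*}
\text{FPT form:} \quad & T(I, p) \le f(p) \cdot |I|^{c}, && c \text{ independent of } p \\
\text{parameter in the exponent:} \quad & T(I, p) \le |I|^{g(p)} \\
\text{full DP, } p = k\text{:} \quad & n^{k} \cdot 2^{k} = (2n)^{k}
\end{align*}
```

In the last line the parameter k sits in the exponent of the input length, so that bound is of the second form rather than the FPT form. This says only that the full DP is not itself a fixed-parameter algorithm for parameter k; it does not by itself settle whether some other algorithm or parameterization is.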
Possible parameterizations:
Instance: A set of k strings X_1, …, X_k over an alphabet Σ, and a positive integer m.
Parameter 1: k
Parameter 2: m
Parameter 3: (k, m)
Approaches Considered
Approaches to tackle MSA hardness:
• Heuristics
• Approximation
• Fixed Parameter Complexity
Curse of Dimensionality
Complexity
Tool/Algo.   Description
ClustalW     Progressive alignment, medium-large
MUSCLE       Iterative alignment, medium
dialign      Segment-based method
kalign       Progressive alignment, large
MAFFT        Iterative alignment, medium-large
probcons     Probabilistic/consistency
T-Coffee     Consistency-based, small
Data Deluge
