1. Analysis of Dynamic Programming Algorithms in
Bioinformatics – A Big Data Perspective
Vineetha V and Dr. Achuthsankar S. Nair
Department of Computational Biology & Bioinformatics, University of Kerala,
Thiruvananthapuram, Kerala
Vineetha V
Dept. of Computational Biology and Bioinformatics, University of Kerala
Email: vineevishnu@gmail.com
Phone: 9446175215
Contact
In bioinformatics, DP algorithms are used mainly in
problems such as:
• Pairwise Sequence Alignment
• Multiple Sequence Alignment
• Gene Prediction
DP Algorithms in Bioinformatics
Approximation with a performance guarantee, in polynomial time.
Approximation algorithms run in polynomial time but do not
necessarily find an optimal solution. Instead, for a
minimization problem, they guarantee:
(objective value of the solution found) ≤ α · (optimal
objective value)
where α ≥ 1 is the approximation ratio; the closer α is to 1,
the better.
Center Star Method with approximation ratio 2
Let D(X, Y) be the optimal pairwise alignment distance
between X and Y.
For a set S = {S_1, ..., S_k} of strings, S_c ∈ S is said to
be the center string of S if it minimizes:
$\sum_{i=1}^{k} D(S_c, S_i)$
- Identify the center string S_c of S.
- Use the alignments of S_c with each S_i to create a
multiple alignment.
Running time for k strings of length n:
- Computing the optimal distance D(S_i, S_j) for all i, j
=> O(k^2 n^2)
- Computing the center string S_c => O(k^2)
- Generating the k pairwise alignments with S_c => O(k n^2)
- Inserting spaces into S_c so that all pairwise alignments
are satisfied simultaneously => O(k^2 n)
Total running time: O(k^2 n^2)
Near-optimal solution in polynomial time, with a guarantee
that the solution is not too far from the optimal one.
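
The first step of the Center Star Method (choosing the center string) can be sketched as follows. This is a minimal illustration, not the poster's implementation: it assumes D is the unit-cost edit distance, and the function names `edit_distance` and `center_string` are hypothetical. Building the final multiple alignment from the pairwise alignments is not shown.

```python
from itertools import combinations


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance via the standard O(|x||y|) DP (assumed choice of D)."""
    prev = list(range(len(y) + 1))           # row for the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]                           # cost of deleting the first i chars of x
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete a
                curr[j - 1] + 1,             # insert b
                prev[j - 1] + (a != b),      # match or substitute
            ))
        prev = curr
    return prev[-1]


def center_string(seqs):
    """Return (index, total) of the string minimizing the sum of distances to all others."""
    k = len(seqs)
    total = [0] * k
    # All O(k^2) pairwise distances, each costing O(n^2).
    for i, j in combinations(range(k), 2):
        d = edit_distance(seqs[i], seqs[j])
        total[i] += d
        total[j] += d
    c = min(range(k), key=total.__getitem__)
    return c, total[c]


if __name__ == "__main__":
    idx, cost = center_string(["ACGT", "ACGA", "TCGT", "ACGG"])
    print(idx, cost)
```

The double loop over pairs is where the O(k^2 n^2) term in the total running time comes from.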
Big Data Challenges
No performance guarantee, but effective in practice;
polynomial time.
Progressive Alignment – Compute pairwise distance
scores for all pairs of sequences, build a guide tree from
them, and align the sequences in the order given by the
guide tree. The root node then represents a complete
multiple alignment of the input sequences.
ClustalW – for S = {S_1, S_2, ..., S_k}, each of length n:
- Optimal alignment of every pair S_i, S_j => O(k^2 n^2)
- Build guide tree from distance matrix => O(k^3)
- Alignments based on the guide tree, using profile-profile
alignments => O(k^2 n + k n^2)
Total time => O(k^2 n^2 + k^3)
Additional heuristics, such as weighting sequences by
branch length and adjusting the guide tree on the fly, are
also applied.
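
The guide-tree step can be illustrated with a small average-linkage (UPGMA-style) clustering of a distance matrix. This is a sketch under assumptions: UPGMA is used here only because it is short, and is not necessarily the tree-building method of ClustalW or any other tool; the function name `upgma_guide_tree` and the toy matrix are hypothetical.

```python
def upgma_guide_tree(names, dist):
    """Build a guide tree by repeatedly merging the two closest clusters.

    names: list of k labels; dist: symmetric k x k matrix (list of lists).
    Returns nested tuples that record the merge order.
    """
    k = len(names)
    clusters = {i: (names[i], 1) for i in range(k)}          # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(k) for j in range(i + 1, k)}
    next_id = k
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                             # closest pair, O(k^2) scan
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        merged = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Size-weighted average distance to the new cluster.
            merged[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        # Drop every distance that involves a or b, then add the new ones.
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        d.update(merged)
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


if __name__ == "__main__":
    toy = [[0, 2, 6, 10],
           [2, 0, 6, 10],
           [6, 6, 0, 10],
           [10, 10, 10, 0]]
    print(upgma_guide_tree(["A", "B", "C", "D"], toy))
```

With k - 1 merges and an O(k^2) scan for the closest pair at each merge, this naive version matches the O(k^3) bound quoted for guide-tree construction. A progressive aligner would then align sequences and profiles bottom-up along the returned tree.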
Iterative Alignment – Start from an initial multiple
alignment (possibly produced by a progressive method), then
apply modifications to improve it.
MUSCLE – Uses the current alignment to compute more
accurate pairwise distances, then refines the multiple
alignment using the tree-dependent restricted partition
technique: delete an edge of the guide tree, re-combine the
alignments of the two disjoint subtrees, and keep the result
if it is better.
Other iterative methods – PRRP, MAFFT
Iterative algorithms offer improved alignment accuracy at
the expense of computation time.
Heuristics
Alignment quality and total run time
State of the Art
• There is a trade-off between solution optimality and
computation time.
• Fixed-parameter approaches appear promising for
finding tractable special cases of these problems.
• Employing more than one program, based on
different alignment techniques, might yield a
better result.
• Parallelization is also recommended in practice: it
reduces wall-clock execution time, although the total
amount of computation remains the same.
Conclusion & Future Work
Enhanced sequencing technologies produce
sequence data on an unparalleled scale, and
alignment solutions need to scale to handle huge
volumes of input data.
The computational complexity of DP algorithms
increases exponentially with the dimensionality of the
state, which makes them impractical in large-scale
applications.
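
To make the exponential growth concrete, the snippet below evaluates the n^k · 2^k work estimate that this poster quotes for the full k-dimensional DP table. It is a back-of-the-envelope sketch: the sequence length n = 100 is an arbitrary assumption, and the function name `dp_work` is hypothetical.

```python
def dp_work(n: int, k: int) -> int:
    """Estimated cell updates for DP over k sequences of length n (n^k cells, 2^k work each)."""
    return (n ** k) * (2 ** k)


if __name__ == "__main__":
    n = 100  # assumed sequence length, for illustration only
    for k in range(2, 7):
        print(f"k={k}: about {dp_work(n, k):.1e} cell updates")
```

Each extra sequence multiplies the work by roughly 2n, which is why the exact method stops being practical after only a handful of sequences.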
• MSA – MSA with the sum-of-pairs score is
NP-complete.
• Multiple tree alignment is MAX SNP-hard.
• DP on the full k-dimensional box of volume
n_1 × n_2 × n_3 × … × n_k takes
O(n_1 · n_2 · n_3 ⋯ n_k · 2^k) time.
• Running time is very slow even for k = 3, and
totally infeasible for k ≥ 6.
• Pairwise Sequence Alignment – Most
versions of pairwise sequence alignment have a
time complexity of O(mn) and a space complexity
of O(n).
• Specific cases such as LCS are solvable in
polynomial time.
• Gene Prediction – In general, the prediction
problem is NP-hard.
• There exist polynomial-time algorithms for
several special cases.
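
The O(mn) time and O(n) space bounds for pairwise alignment can be seen in a short global-alignment scoring routine that keeps only two rows of the DP table. This is a minimal sketch: the scoring values (match = 1, mismatch = -1, gap = -2) and the function name `global_alignment_score` are assumptions chosen for illustration, not values taken from the poster.

```python
def global_alignment_score(x: str, y: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score with a linear gap penalty.

    Time O(len(x) * len(y)); memory O(len(y)) because only two rows are kept.
    """
    prev = [j * gap for j in range(len(y) + 1)]      # aligning "" against y[:j]
    for i in range(1, len(x) + 1):
        curr = [i * gap]                             # aligning x[:i] against ""
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            up = prev[j] + gap                       # gap in y
            left = curr[j - 1] + gap                 # gap in x
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "ACG"))
```

Keeping two rows yields only the score. Recovering the alignment itself within linear space requires a divide-and-conquer scheme such as Hirschberg's algorithm.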
Approximation
Fixed Parameter Complexity
• Fine-grained complexity analysis of NP-hard
problems.
• Analyzes how problem- and data-specific
parameters influence the computational
complexity of the problem.
Problem difficulty is analyzed not only in terms of the
input size, but also in terms of an additional parameter,
typically an integer p.
Fixed-parameter tractability: the problem can be
solved in time O(n^α) for each fixed parameter value
p, where α is a constant independent of p.
A parameterized problem with parameter p is fixed-
parameter tractable if there is an algorithm that
decides an instance (I, p) in f(p)·|I|^{O(1)} time, where f
is an arbitrary computable function depending only
on p and I is the unparameterized instance.
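
As an illustration of the definition, the block below contrasts the required form of the bound with the exact DP bound quoted earlier for k sequences of length n. The last line is a purely hypothetical bound, shown only to indicate what an FPT-style running time would look like; it is not a known result for MSA.

```latex
\begin{align*}
\text{FPT form:}\quad
  & T(I, p) \le f(p)\cdot |I|^{c},
  && c \text{ a constant independent of } p \\
\text{Exact DP for MSA:}\quad
  & T(n, k) = O\!\left(2^{k}\, n^{k}\right),
  && \text{exponent of } n \text{ grows with } k \\
\text{Hypothetical FPT-style bound:}\quad
  & T(n, k) = O\!\left(2^{k}\, n^{2}\right),
  && f(k) = 2^{k},\ c = 2
\end{align*}
```

The exact DP is polynomial for every fixed k, but because the exponent of n depends on k, that bound alone does not have the fixed-parameter tractable form.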
Possible parameterizations:
Instance: A set of k strings X1, ..., Xk over an
alphabet Σ, and a positive integer m.
Parameter 1: k
Parameter 2: m
Parameter 3: (k, m)
Approaches Considered
Approaches to tackle MSA hardness:
• Heuristics
• Approximation
• Fixed Parameter Complexity
Curse of Dimensionality
Complexity
Tool/Algo.   Description
ClustalW     Progressive alignment, medium-large
MUSCLE       Iterative alignment, medium
dialign      Segment-based method
kalign       Progressive alignment, large
MAFFT        Iterative alignment, medium-large
probcons     Probabilistic/consistency
T-Coffee     Consistency-based, small
Data Deluge