Paper: http://drops.dagstuhl.de/opus/volltexte/2018/8952/
Code: https://github.com/odenas/indexed_ms
Computing the matching statistics of a string $S$ with respect to a string $T$ on an alphabet of size $\sigma$ is a fundamental primitive for a number of large-scale string analysis applications, including the comparison of entire genomes, for which space is a pressing issue. This paper takes from theory to practice an existing algorithm that uses just $O(|T|\log{\sigma})$ bits of space, and that computes a compact encoding of the matching statistics array in $O(|S|\log{\sigma})$ time. The techniques used to speed up the algorithm are of general interest, since they optimize queries on the existence of a Weiner link from a node of the suffix tree, and parent operations after unsuccessful Weiner links. Thus, they can be applied to other matching statistics algorithms, as well as to any suffix tree traversal that relies on such calls. Some of our optimizations yield a matching statistics implementation that is up to three times faster than a plain version of the algorithm, depending on the similarity between $S$ and $T$. In genomic datasets of practical significance we achieve speedups of up to 1.8, but our fastest implementations take on average twice the time of an existing code based on the LCP array. The key advantage is that our implementations need between one half and one fifth of the competitor's memory, and they approach comparable running times when $S$ and $T$ are very similar.
Fast matching statistics in small space
1. Fast matching statistics in small space
Djamal Belazzougui (1), Fabio Cunial (2,3), Olgert Denas (4)
(1) DTISI, CERIST, Algiers, Algeria.
(2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.
(3) Center for Systems Biology Dresden
(4) Adobe Inc., San Jose, California.
2. Matching statistics (1973)
Weiner. Linear pattern matching algorithms. Switching and Automata Theory, 1973.
T (text) = ttctttctgttcatgtgtatttgct
S (query) = gtctcttagcccagactt
MS = 232432212112111332
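The definition behind this slide can be checked with a naive sketch: MS[i] is the length of the longest prefix of S[i..] that occurs somewhere in T. The function name below is hypothetical, and the quadratic substring test is only a stand-in for the indexed algorithm. It uses the 25-character S that appears on the later slides, of which the query shown here is a prefix; the last two values shown on this slide (3 and 2) only come out with that longer query.

```python
def matching_statistics(s, t):
    """MS[i] = length of the longest prefix of s[i:] that occurs in t (naive)."""
    ms = []
    for i in range(len(s)):
        length = 0
        # Extend the match one character at a time while it still occurs in t.
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms


T = "ttctttctgttcatgtgtatttgct"
S = "gtctcttagcccagacttcccgtgt"  # full query, taken from the later slides

MS = matching_statistics(S, T)
print("".join(map(str, MS[:18])))  # the 18 values shown on this slide
```

Each value is small enough here to print as one digit; in general the values should be kept as a list of integers.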
3. Applications
Approximating the cross-entropy of random sources [1].
Whole-genome, alignment-free phylogeny reconstruction [2].
Substring kernels, indexing just one string [3].
Read sets: distinguishing between errors and mutations [4].
Variable-order Markov chains.
[1] Farach et al. On the entropy of DNA: algorithms and measurements based on memory and rapid convergence. SODA 1995.
[2] Ulitsky et al. The average common substring approach to phylogenomic reconstruction. JCB 2006.
[3] Teo, Vishwanathan. Fast and space efficient string kernels using suffix arrays. ICML 2006.
[4] Philippe et al. CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biology 2013.
Proteome, H. sapiens-P. troglodytes.
Proteome, H. sapiens-P. troglodytes.
Proteome, H. sapiens-bacteria.
13. [Table: space in bits and time for each approach; the bounds themselves are missing from this transcript.]
Suffix tree, or LCP array
Compressed suffix array
SPIRE 2014
No need for string depth
String depth of ST nodes
≤ one byte per value in some practical cases
Practical
[1] Ohlebusch, Gog, Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. SPIRE 2010.
[2] Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 2007.
[3] Belazzougui, Cunial. Indexed matching statistics and shortest unique substrings. SPIRE 2014.
14-15. Making SPIRE 2014 faster in practice
In this work (of independent interest):
Checking existence of a Weiner link
Parent after unsuccessful Weiner link
Belazzougui, Cunial. Indexed matching statistics and shortest unique substrings. SPIRE 2014.
16. Tools
BWT of T and of the reverse of T (wavelet tree)
Suffix tree topology
Maximal repeats
[Space in bits and time bounds are missing from this transcript.]
Navarro and Sadakane. Fully functional static and dynamic succinct trees. ACM TALG, 2014.
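As a rough illustration of the first tool, the sketch below builds the BWT of T and of its reverse by naive suffix sorting, and performs one backward step on a BWT interval, which is what a Weiner link amounts to in BWT space. The function names are hypothetical, and counting with plain string operations is only a stand-in for rank queries on a wavelet tree; `#` is assumed to be a terminator smaller than every other character.

```python
def bwt(text, sentinel="#"):
    """BWT of text + sentinel via naive suffix sorting (illustration only)."""
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The character preceding each suffix; s[-1] is the sentinel for suffix 0.
    return "".join(s[i - 1] for i in order)


def backward_step(bwt_str, lo, hi, c):
    """Map the interval [lo, hi) of W to the interval of cW, or None if cW is absent."""
    smaller = sum(1 for x in bwt_str if x < c)  # number of characters smaller than c
    new_lo = smaller + bwt_str[:lo].count(c)    # stand-in for rank(c, lo)
    new_hi = smaller + bwt_str[:hi].count(c)    # stand-in for rank(c, hi)
    return (new_lo, new_hi) if new_lo < new_hi else None


T = "ttctttctgttcatgtgtatttgct"
BWT_T = bwt(T)
BWT_T_REV = bwt(T[::-1])

whole = (0, len(BWT_T))
iv_t = backward_step(BWT_T, *whole, "t")    # interval of "t"
iv_gt = backward_step(BWT_T, *iv_t, "g")    # interval of "gt"
iv_cgt = backward_step(BWT_T, *iv_gt, "c")  # "cgt" does not occur in T
print(BWT_T, BWT_T_REV, iv_t, iv_gt, iv_cgt)
```

The two BWT strings produced this way agree with the ones displayed above the suffix tree figures on the scan slides.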
17. Matching statistics in small space
Belazzougui, Cunial. Indexed matching statistics and shortest unique substrings. SPIRE 2014.
18-28. Right-to-left scan
[Figure, animated over slides 18-28: suffix tree figure, traversed while S is scanned from right to left.]
BWT_T = tcttgttttttcgttttgacgt#tca
S = gtctcttagcccagacttcccgtgt
runs, filled in from the right end of S as the animation advances:
1
1 1
1 1 1
0 1 1 1
0 0 1 1 1
0 0 0 1 1 1
1 0 0 0 1 1 1
1 1 0 0 0 1 1 1
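A minimal sketch of this first pass, under one reading that is consistent with the bit rows on these slides: bit i of runs (for i = 1, ..., |S|-1) is 1 exactly when MS[i] = MS[i-1] - 1, that is, when the match starting at position i could be extended to the left by S[i-1]. The function name is hypothetical, and the substring tests stand in for Weiner links on BWT intervals and for parent moves in the suffix tree topology.

```python
def runs_right_to_left(s, t):
    """Return the bits for i = 1..len(s)-1: bit i is 1 iff MS[i] == MS[i-1] - 1."""
    runs = []
    match = ""  # longest prefix of s[i+1:] that occurs in t
    for i in range(len(s) - 1, -1, -1):
        c = s[i]
        if i + 1 < len(s):
            # Weiner link test: can the current match be extended to the left by c?
            runs.append(1 if (c + match) in t else 0)
        # On failure, drop characters from the right (the role of parent moves)
        # until c can be prepended or the match becomes empty.
        while match and (c + match) not in t:
            match = match[:-1]
        match = c + match if (c + match) in t else ""
    runs.reverse()  # bits were produced for i = len(s)-1 down to 1
    return runs


T = "ttctttctgttcatgtgtatttgct"
S = "gtctcttagcccagacttcccgtgt"
RUNS = runs_right_to_left(S, T)
print(" ".join(map(str, RUNS)))
```

Only one bit per position of S is kept by this pass; no match lengths are stored.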
29-38. Left-to-right scan
[Figure, animated over slides 29-38: suffix tree figure, traversed while S is scanned from left to right.]
BWT of the reverse of T = tttttattttctgt#tggtacttcgc
S = gtctcttagcccagacttcccgtgt
runs = 0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 1 1
MS, in its bitvector encoding, built from the left as the animation advances: 0 0 1, later 0 0 1 0 0 1 1
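The second pass can be sketched as follows, assuming the encoding that matches the bit rows on the slides: for each position i, write MS[i] - MS[i-1] + 1 zeros followed by a one, with the convention MS[-1] = 1. When the runs bit says that MS[i] = MS[i-1] - 1, no extension to the right is attempted. The function name is hypothetical, and the substring tests stand in for Weiner links on the BWT of the reverse of T.

```python
def ms_bits_left_to_right(s, t, runs):
    """Build the bitvector encoding of MS; runs[i-1] == 1 iff MS[i] == MS[i-1] - 1."""
    bits = []
    prev = 1  # convention: MS[-1] = 1
    for i in range(len(s)):
        if i > 0 and runs[i - 1] == 1:
            length = prev - 1  # known from the first pass, nothing to extend
        else:
            # At least prev - 1 characters of s[i:] are known to match already.
            length = max(prev - 1, 0)
            while i + length < len(s) and s[i:i + length + 1] in t:
                length += 1
        bits.append("0" * (length - prev + 1) + "1")
        prev = length
    return "".join(bits)


T = "ttctttctgttcatgtgtatttgct"
S = "gtctcttagcccagacttcccgtgt"
RUNS = [int(b) for b in "010110101001000011000111"]  # as shown on the slides

MS_BITS = ms_bits_left_to_right(S, T, RUNS)
print(MS_BITS)
```

The output has exactly 2|S| bits when the last value of MS is 1, as it is here: one 1 per position and as many zeros in total.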
39-41. Baseline implementation
[Figure, animated over slides 39-41: suffix tree figure.]
BWT of the reverse of T = tttttattttctgt#tggtacttcgc
47-51. Lazy Weiner links
[Figure, animated over slides 47-51: suffix tree figure.]
BWT_T = tcttgttttttcgttttgacgt#tca
S = gtctcttagcccagacttcccgtgt
Successful Weiner link: updating just the BWT interval.
Failed Weiner link: computing the node ID in the topology. Needed only for moving to the parent.
53-54. Maximal repeats
[Figure on slide 53: suffix tree figure.]
BWT_T = tcttgttttttcgttttgacgt#tca
Maximal repeats can be marked on a bitvector in BWT space.
BWT interval contains exactly one distinct character ⟹ not a maximal repeat.
Weiner link = one access operation, e.g. on the last position of the interval. Can fail early.
In repetitive strings, most nodes are not maximal repeats.
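The shortcut above can be sketched in a few lines. The function names are hypothetical, and the maximal-repeat test is computed here by scanning the interval, whereas the slides describe reading it from a precomputed bitvector. The two intervals used in the example are those of "gt" and "t" in the BWT of T shown on the slides.

```python
BWT_T = "tcttgttttttcgttttgacgt#tca"  # BWT of T#, as listed on the slides


def is_left_maximal(bwt, lo, hi):
    """True if the BWT interval [lo, hi) holds at least two distinct characters."""
    return len(set(bwt[lo:hi])) > 1


def weiner_link_exists(bwt, lo, hi, c):
    """Does cW occur in T, where [lo, hi) is the BWT interval of W?"""
    if not is_left_maximal(bwt, lo, hi):
        # Not a maximal repeat: every occurrence of W is preceded by the same
        # character, so a single access (here at the last position) decides.
        return bwt[hi - 1] == c
    # Maximal repeat: general test, a stand-in for rank queries.
    return c in bwt[lo:hi]


GT = (8, 11)       # interval of W = "gt": always preceded by "t" in T
T_ONLY = (11, 26)  # interval of W = "t": preceded by several characters

print(weiner_link_exists(BWT_T, *GT, "t"), weiner_link_exists(BWT_T, *GT, "c"))
print(weiner_link_exists(BWT_T, *T_ONLY, "a"))
```

For "gt" the answer comes from one character comparison, which is the early failure (or success) mentioned on the slide.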
57-58. If a Weiner link fails from a node, it also fails from all its non-maxrep ancestors.
If a node is a maxrep, all its ancestors are maxreps as well, so we don't need to check whether they are.
Jumping to the lowest maxrep ancestor
59-62. Jumping to the lowest ancestor with the WL
[Figure, animated over slides 59-62: suffix tree figure, with the call weinerLink(t).]
BWT of the reverse of T = tttttattttctgt#tggtacttcgc
72. We call selectNext (twice, right and left) if and only if a Weiner link fails.
We could merge selectNext with doubleRankAndFail: doubleRankAndSelectNext.
Adds small overhead.
Combining two selectNext at each WT node might save some operations.
73. Range MS queries
ms = 00100110001110110011010011010100010111010100001111
MS = 2324322121121113321114321
Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems 2007.
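The relation between the two rows on this slide can be sketched as follows, assuming the encoding in which position i contributes MS[i] - MS[i-1] + 1 zeros and a one (with MS[-1] = 1). Under that assumption the i-th one (counting from 0) sits at position MS[i] + 2i, so a single select recovers any value. The function names are hypothetical, and the linear-scan select is a stand-in for a constant-time select structure.

```python
def select1(bits, k):
    """Position of the k-th one in bits (k = 0, 1, ...), by linear scan."""
    seen = -1
    for pos, b in enumerate(bits):
        if b == "1":
            seen += 1
            if seen == k:
                return pos
    raise IndexError("fewer than k + 1 ones in the bitvector")


def ms_value(bits, i):
    """Decode MS[i] from the bitvector encoding: select1(i) - 2 * i."""
    return select1(bits, i) - 2 * i


MS_BITS = "00100110001110110011010011010100010111010100001111"
DECODED = [ms_value(MS_BITS, i) for i in range(MS_BITS.count("1"))]
print("".join(map(str, DECODED)))
```

Decoding a range of values this way needs one select per value, or one select followed by a scan of the bits in between.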
75. Biological strings
Genome H. sapiens ⟼ P. troglodytes ≈ 1h per scan
Genome H. sapiens ⟼ All NCBI bacteria ≈ 2.5h per scan
Proteome H. sapiens ⟼ P. troglodytes ≈ 30s per scan
Intel Xeon E5-2680v3 @ 2.50-3.30 GHz
77. [Figure: two panels, Time and Memory, each plotting the ratio Competitor / This paper for four datasets: genome, genome similar, proteome, proteome similar.]
Ohlebusch, Gog, Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. SPIRE 2010.
78. Fast matching statistics in small space
Suffix tree figures by M. Maso and A. Perissinotto, University of Padova.
https://github.com/odenas/indexed_ms
based on https://github.com/simongog/sdsl-lite