1. Advanced Algorithms – COMS31900
Pattern matching part four
Pattern matching with at most k mismatches
Benjamin Sach
2. Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (length m)
P
ba b c
a b d
a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
m
i.e. the number of distinct j such that P[j] = T[i + j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
3. Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (length m)
P
ba b c
a b d
a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
m
i.e. the number of distinct j such that P[j] = T[i + j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Ham(4) = 1
Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
4. Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (length m)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P[j] = T[i + j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(5) = 4
Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
5. Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (length m)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P[j] = T[i + j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(6) = 1
Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
6. Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (length m)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P[j] = T[i + j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(7) = 3
Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
7. Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (length m)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P[j] = T[i + j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(7) = 3
Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
this is alignment 7
8. Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (length m)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P[j] = T[i + j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(8) = 3
Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
this is alignment 8
9. Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (length m)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P[j] = T[i + j]
Last lecture we saw two algorithms for this problem:
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(8) = 3
Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
this is alignment 8
10. Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (length m)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P[j] = T[i + j]
Last lecture we saw two algorithms for this problem:
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(8) = 3
Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
this is alignment 8
One algorithm takes O(n|Σ| log m) time (where |Σ| is the alphabet size)
The other algorithm takes O(n
√
m log m) time (regardless of the alphabet size)
11. Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c
a b d
a a d ad a
Goal: For all i, output,
a a a
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
12. Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c
a b d
a a d ad a
Goal: For all i, output,
a a a
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
k = 2
13. Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c
a b d
a a d ad a
Goal: For all i, output,
a a a
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Hamk(4) = 1
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
k = 2
14. Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(5) = X
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
k = 2
15. Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(6) = 1
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
k = 2
16. Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(7) = 2
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
k = 2
17. Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(8) = X
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
k = 2
18. Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
• We could use the O(n
√
m log m) time algorithm for Hamming distance. . .
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(8) = X
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
k = 2
19. Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
• We could use the O(n
√
m log m) time algorithm for Hamming distance. . .
but when k is small we can do much better
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(8) = X
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
k = 2
20. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
a b c b a bc bP
nn nm
21. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
a b c b a bc bP
nn nm
22. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
i j
a b c b a bc bP
nn nm
23. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
i j
a b c b a bc bP
nn nm
LCP(i, j) returns 3
24. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
i j
a b c b a bc bP
nn nm
LCP(i, j) returns 4
25. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
a b c b a bc bP
nn nm
i j
LCP(i, j) returns 0
26. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
i j
a b c b a bc bP
nn nm
LCP(i, j) returns 4
27. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
i j
a
We can preprocess P and T for LCP queries in O(n) time and O(n) space
b c b a bc bP
nn nm
LCP(i, j) returns 4
28. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
i j
a
We can preprocess P and T for LCP queries in O(n) time and O(n) space
Each query then takes O(1) time
b c b a bc bP
nn nm
LCP(i, j) returns 4
29. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
i j
a
We can preprocess P and T for LCP queries in O(n) time and O(n) space
Each query then takes O(1) time
we’ll see how to do this later in this lecture
b c b a bc bP
nn nm
LCP(i, j) returns 4
30. LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1]
it’s the furthest you can go before hitting a mismatch
i j
a
We can preprocess P and T for LCP queries in O(n) time and O(n) space
Each query then takes O(1) time
we’ll see how to do this later in this lecture
we can use LCP queries to solve the k-mismatch problem. . .
b c b a bc bP
nn nm
First let’s see how
LCP(i, j) returns 4
31. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
a a b a
32. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
a a b a
33. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
a a b a
34. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
a a b a
35. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
a a b a
36. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
a a b a
37. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
a a b a
38. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
perform LCP(i, j)
a a b a
39. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
perform LCP(i, j)
a a b a
40. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
perform LCP(i, j)
a a b a which returns 3
41. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
perform LCP(i, j)
a a b a which returns 3
42. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
perform LCP(i, j)
a a b a which returns 3
mismatch!
43. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
a a b a
44. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
a a b a
skip over this!
45. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
i
j
a a b a
skip over this!
46. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
i
j
perform LCP(i , j)
a a b a
47. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
i
j
perform LCP(i , j)
a a b a which returns 0
48. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
i
j
a a b a
49. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
i
j
a a b a
50. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
perform LCP(i , j)i
j
a a b a
51. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
perform LCP(i , j)i
j
a a b a which returns 2
52. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
i
j
a a b a
53. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
perform LCP(i , j)
a a b a
i
j
54. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
perform LCP(i , j)
a a b a
i
j
which returns 3
55. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
(or we reach the end of P)
perform LCP(i , j)
a a b a
i
j
which returns 3
56. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
(or we reach the end of P)
a a b a
i
j
57. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
(or we reach the end of P)
We can therefore calculate Hamk(i) for a single alignment i in O(k) time
a a b a
i
j
58. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
(or we reach the end of P)
We can therefore calculate Hamk(i) for a single alignment i in O(k) time
Overall this takes O(nk) time (including the ‘preprocessing’ for LCP queries)
a a b a
i
j
59. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
(or we reach the end of P)
We can therefore calculate Hamk(i) for a single alignment i in O(k) time
Overall this takes O(nk) time (including the ‘preprocessing’ for LCP queries)
this is pretty good but we can do better
a a b a
i
j
60. k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch
(or we reach the end of P)
We can therefore calculate Hamk(i) for a single alignment i in O(k) time
Overall this takes O(nk) time (including the ‘preprocessing’ for LCP queries)
this is pretty good but we can do better but first. . . how do we answer those LCP queries?
a a b a
i
j
61. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
in O(n) prep. time and space
62. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
in O(n) prep. time and space
63. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j
3
in O(n) prep. time and space
64. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j
LCA(i, j)
3
in O(n) prep. time and space
65. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j
LCA(i, j)
3
in O(n) prep. time and space
66. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j
LCA(i, j)
3
it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]
in O(n) prep. time and space
67. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j
LCA(i, j)
3
it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]
in O(n) prep. time and space
For any pair of locations i, j in T, LCPT (i, j) is the largest such that
T[i . . . i + − 1] = T[j . . . j + − 1]
Single string LCP:
68. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]
LCA(i, j)
j
1
5
in O(n) prep. time and space
i
For any pair of locations i, j in T, LCPT (i, j) is the largest such that
T[i . . . i + − 1] = T[j . . . j + − 1]
Single string LCP:
69. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]
LCA(i, j)
j
4
i
2
in O(n) prep. time and space
For any pair of locations i, j in T, LCPT (i, j) is the largest such that
T[i . . . i + − 1] = T[j . . . j + − 1]
Single string LCP:
70. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]
LCA(i, j)
j
4
i
2
in O(n) prep. time and space
71. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]
LCA(i, j)
j
4
i
2
We store the root-to-node length at each internal node
so we can recover the length, LCPT (i, j) in O(1) time
in O(n) prep. time and space
72. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]
LCA(i, j)
j
4
i
2
We store the root-to-node length at each internal node
1
3
0
2
so we can recover the length, LCPT (i, j) in O(1) time
in O(n) prep. time and space
73. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]
LCA(i, j)
j
4
i
2
We store the root-to-node length at each internal node
1
3
0
2
so we can recover the length, LCPT (i, j) in O(1) time
in O(n) prep. time and space
So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string
74. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
1
5
LCA(i, j)
j
4
i
2
1
3
0
2
So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string
75. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
1
5
LCA(i, j)
j
4
i
2
1
3
0
2
So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string
We can extend this two strings (P and T) by first concatenating them together. . .
(and proceeding as for a single string)
76. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
1
5
LCA(i, j)
j
4
i
2
1
3
0
2
So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string
We can extend this two strings (P and T) by first concatenating them together. . .
So we also have O(n) space, O(n) prep. time and O(1) query time
(and proceeding as for a single string)
for the LCP problem on two strings
77. LCPs in Suffix Trees
TT b n aaa sn
n
$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7
3
0
2 4
6
nas$
This is the suffix tree
of this text
2
1
5
LCA(i, j)
j
4
i
2
1
3
0
2
So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string
We can extend this two strings (P and T) by first concatenating them together. . .
So we also have O(n) space, O(n) prep. time and O(1) query time
(and proceeding as for a single string)
for the LCP problem on two strings
I.e. for any i in T and j in P, LCP(i, j) is the largest such that
T[i . . . i + − 1] = P[j . . . j + − 1] (as we originally defined it)
79. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
and infrequent otherwise
k = 4
(
√
k = 2)
m = 9
80. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
and infrequent otherwise
k = 4
(
√
k = 2)
81. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent
and infrequent otherwise
k = 4
(
√
k = 2)
82. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
and infrequent otherwise
k = 4
(
√
k = 2)
83. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
and infrequent otherwise
k = 4
(
√
k = 2)
, d is frequent
84. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
k = 4
(
√
k = 2)
, d is frequent
85. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
How many frequent symbols can there be?
k = 4
(
√
k = 2)
, d is frequent
86. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
How many frequent symbols can there be? Lots!
k = 4
(
√
k = 2)
, d is frequent
87. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
How many frequent symbols can there be? Lots! there could be m√
k
frequent symbols
k = 4
(
√
k = 2)
, d is frequent
88. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
How many frequent symbols can there be? Lots! there could be m√
k
frequent symbols
Case 1: There are fewer than 2
√
k frequent symbols in P.
k = 4
(
√
k = 2)
, d is frequent
89. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
Algorithm summary
How many frequent symbols can there be? Lots! there could be m√
k
frequent symbols
Case 1: There are fewer than 2
√
k frequent symbols in P.
k = 4
(
√
k = 2)
, d is frequent
90. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√
k
frequent symbols
Case 1: There are fewer than 2
√
k frequent symbols in P.
k = 4
(
√
k = 2)
, d is frequent
91. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√
k
frequent symbols
Case 1: There are fewer than 2
√
k frequent symbols in P.
- O(m log m) time
k = 4
(
√
k = 2)
, d is frequent
92. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√
k
frequent symbols
Case 1: There are fewer than 2
√
k frequent symbols in P.
- O(m log m) time
- O(n
√
k log m) time
k = 4
(
√
k = 2)
, d is frequent
93. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√
k
frequent symbols
Case 1: There are fewer than 2
√
k frequent symbols in P.
- O(m log m) time
- O(n
√
k log m) time
- O(n
√
k) time
k = 4
(
√
k = 2)
, d is frequent
94. k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least
√
k times in P,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
c is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√
k
frequent symbols
Case 1: There are fewer than 2
√
k frequent symbols in P.
- O(m log m) time
- O(n
√
k log m) time
- O(n
√
k) time
- O(n
√
k log m) total time
k = 4
(
√
k = 2)
, d is frequent
95. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a d bc ab b da c bbfe
0 1 2 3 4 5 6 7 8 9 10 11 12 13
96. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a d bc ab b da c bbfe
0 1 2 3 4 5 6 7 8 9 10 11 12 13
a is frequent , b is frequent
e and f are infrequent
, d is frequent, c is frequent
97. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
0 1 2 3 4 5 6 7 8 9 10 11 12 13
a is frequent , b is frequent
e and f are infrequent
, d is frequent, c is frequent
98. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
0 1 2 3 4 5 6 7 8 9 10 11 12 13
a is frequent , b is frequent
e and f are infrequent
, d is frequent, c is frequent
99. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
0 1 2 3 4 5 6 7 8 9 10 11 12 13
a is frequent , b is frequent
e and f are infrequent
, d is frequent, c is frequent
100. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
101. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
102. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
103. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
104. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
105. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match
106. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match
Fact There are at most n/
√
k values of i with dk(i) k
107. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match
Fact There are at most n/
√
k values of i with dk(i) k
this follows from a counting argument
108. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/
√
k values of i with dk(i) k
109. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/
√
k values of i with dk(i) k
For any location i , T[i ] = P[j] for either 0 or
√
k distinct j ∈ J
110. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/
√
k values of i with dk(i) k
For any location i , T[i ] = P[j] for either 0 or
√
k distinct j ∈ J
This implies that i dk(i) i j∈J Eq(T[i ], P[j]) n
√
k
111. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/
√
k values of i with dk(i) k
For any location i , T[i ] = P[j] for either 0 or
√
k distinct j ∈ J
This implies that i dk(i) i j∈J Eq(T[i ], P[j]) n
√
k
Eq = 1 if T[i ] = P[j] (and
0 otherwise)
112. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/
√
k values of i with dk(i) k
For any location i , T[i ] = P[j] for either 0 or
√
k distinct j ∈ J
This implies that i dk(i) i j∈J Eq(T[i ], P[j]) n
√
k
113. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/
√
k values of i with dk(i) k
Assume that more than n/
√
k values of i have dk(i) k
114. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/
√
k values of i with dk(i) k
Assume that more than n/
√
k values of i have dk(i) k
So i dk(i) n√
k
+ 1 · k
115. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/
√
k values of i with dk(i) k
Assume that more than n/
√
k values of i have dk(i) k
So i dk(i) n√
k
+ 1 · k > n
√
k
116. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/
√
k values of i with dk(i) k
Assume that more than n/
√
k values of i have dk(i) k
So i dk(i) n√
k
+ 1 · k > n
√
k
Contradiction!
117. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match
Fact There are at most n/
√
k values of i with dk(i) k
this follows from a counting argument
118. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
We can filter the text, using the dk(i) values leaving only n/
√
k alignments to check
every other alignment has more than k mismatches
119. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
We can filter the text, using the dk(i) values leaving only n/
√
k alignments to check
every other alignment has more than k mismatches
Check each of the remaining alignments using LCP queries in O(k) time
per alignment
120. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
We can filter the text, using the dk(i) values leaving only n/
√
k alignments to check
every other alignment has more than k mismatches
Check each of the remaining alignments using LCP queries in O(k) time
This takes n/
√
k · O(k) = O(n
√
k) total time.
per alignment
121. Case 2: There are at least 2
√
k frequent symbols
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
We can filter the text, using the dk(i) values leaving only n/
√
k alignments to check
every other alignment has more than k mismatches
Check each of the remaining alignments using LCP queries in O(k) time
How do we compute all the dk(i) values?
This takes n/
√
k · O(k) = O(n
√
k) total time.
per alignment
122. Computing all the dk(i) values
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
For each text character T[i ],
For each j ∈ J such that P[j] = T[i ]
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
Let dk(i) = 0 for all i.
increase dk(i − j) by one
123. Computing all the dk(i) values
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
For each text character T[i ],
For each j ∈ J such that P[j] = T[i ]
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
Let dk(i) = 0 for all i.
increase dk(i − j) by one
T[i ] = P[j] for either 0 or
√
k distinct j ∈ J
(store a list of j values for each symbol)
124. Computing all the dk(i) values
Pick any 2
√
k frequent symbols and for each symbol pick
√
k occurences in P.
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
For each text character T[i ],
For each j ∈ J such that P[j] = T[i ]
Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
Let dk(i) = 0 for all i.
increase dk(i − j) by one
T[i ] = P[j] for either 0 or
√
k distinct j ∈ J
(store a list of j values for each symbol)
This takes O(n
√
k) total time.
126. Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
127. Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Count the number of frequent symbols in P - O(m log m) time
128. Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Count the number of frequent symbols in P - O(m log m) time
Case 1: P has at most 2
√
k frequent symbols
Case 2: P has more than 2
√
k frequent symbols
129. Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Count the number of frequent symbols in P - O(m log m) time
Case 1: P has at most 2
√
k frequent symbols
Count matches with frequent symbols using cross-correlations - O(n
√
k log m) time
Case 2: P has more than 2
√
k frequent symbols
130. Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Count the number of frequent symbols in P - O(m log m) time
Case 1: P has at most 2
√
k frequent symbols
Count matches with frequent symbols using cross-correlations - O(n
√
k log m) time
Count matches with infrequent symbols directly - O(n
√
k) time
Case 2: P has more than 2
√
k frequent symbols
131. Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Count the number of frequent symbols in P - O(m log m) time
Case 1: P has at most 2
√
k frequent symbols
Count matches with frequent symbols using cross-correlations - O(n
√
k log m) time
Count matches with infrequent symbols directly - O(n
√
k) time
Case 2: P has more than 2
√
k frequent symbols
Filter the text, leaving n/
√
k alignments - O(n
√
k) time
132. Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Count the number of frequent symbols in P - O(m log m) time
Case 1: P has at most 2
√
k frequent symbols
Count matches with frequent symbols using cross-correlations - O(n
√
k log m) time
Count matches with infrequent symbols directly - O(n
√
k) time
Case 2: P has more than 2
√
k frequent symbols
Filter the text, leaving n/
√
k alignments - O(n
√
k) time
Count mismatches at these alignments using LCP queries - O(n
√
k) time
133. Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Overall, we obtain a time complexity of O(n
√
k log m).
Count the number of frequent symbols in P - O(m log m) time
Case 1: P has at most 2
√
k frequent symbols
Count matches with frequent symbols using cross-correlations - O(n
√
k log m) time
Count matches with infrequent symbols directly - O(n
√
k) time
Case 2: P has more than 2
√
k frequent symbols
Filter the text, leaving n/
√
k alignments - O(n
√
k) time
Count mismatches at these alignments using LCP queries - O(n
√
k) time
134. Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Overall, we obtain a time complexity of O(n
√
k log m).
Count the number of frequent symbols in P - O(m log m) time
Case 1: P has at most 2
√
k frequent symbols
Count matches with frequent symbols using cross-correlations - O(n
√
k log m) time
Count matches with infrequent symbols directly - O(n
√
k) time
Case 2: P has more than 2
√
k frequent symbols
Filter the text, leaving n/
√
k alignments - O(n
√
k) time
Count mismatches at these alignments using LCP queries - O(n
√
k) time
- this can be improved to O(n
√
k log k)
135. Conclusion
T
Input A text string T (length n), a pattern string P (length m) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(6) = 1
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) k
X if Ham(i) > k
k = 2
One algorithm takes O(nk) time
The other algorithm takes O(n
√
k log m) time (improvable to O(n
√
k log k) time)
We saw two algorithms for this problem: