Advanced Algorithms – COMS31900
Pattern matching part four
Pattern matching with at most k mismatches
Benjamin Sach
Pattern matching with mismatches

Input: A text string T (length n) and a pattern string P (length m)

Goal: For every alignment i, output Ham(i), the Hamming distance between P and T[i . . . i + m − 1]

The Hamming distance is the number of mismatches. . .
i.e. the number of distinct j such that P[j] ≠ T[i + j]

[Figure: an example text T of length n = 13 (positions 0–12) with the pattern P placed at various alignments; for instance Ham(4) = 1, Ham(5) = 4, Ham(6) = 1, Ham(7) = 3 and Ham(8) = 3.]

Last lecture we saw two algorithms for this problem:
One algorithm takes O(n|Σ| log m) time (where |Σ| is the alphabet size)
The other algorithm takes O(n√m log m) time (regardless of the alphabet size)

Pattern matching with few mismatches (k-mismatch)

Input: A text string T (length n), a pattern string P (length m) and a positive integer k

Goal: For all i, output

    Hamk(i) =  Ham(i)   if Ham(i) ≤ k
               X        if Ham(i) > k

Output the number of mismatches. . . unless it's more than k
(we interpret the output X to mean "too many mismatches")

[Figure: the same example with k = 2; for instance Hamk(4) = 1, Hamk(5) = X, Hamk(6) = 1, Hamk(7) = 2 and Hamk(8) = X.]

• We could use the O(n√m log m) time algorithm for Hamming distance. . .
  but when k is small we can do much better

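For reference, here is a minimal sketch (not from the slides) that computes Hamk directly from the definition in O(nm) time; the rest of the lecture is about beating this bound. The function name hamming_k and the use of None to stand for the output X are illustrative choices.

    def hamming_k(text, pattern, k):
        """Naive k-mismatch: for each alignment i return Ham(i) if it is at most k,
        or None (standing in for the output 'X') if it exceeds k.  O(nm) time."""
        n, m = len(text), len(pattern)
        out = []
        for i in range(n - m + 1):
            mismatches = 0
            for j in range(m):
                if pattern[j] != text[i + j]:
                    mismatches += 1
                    if mismatches > k:      # stop early: the answer is already 'X'
                        break
            out.append(mismatches if mismatches <= k else None)
        return out

    # Example with k = 2: alignments with more than 2 mismatches are reported as None.
    print(hamming_k("abacaadbadaca", "aabd", 2))
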
LCP - the Longest Common Prefix

For any pair of locations i in T and j in P, LCP(i, j) is the largest ℓ such that
T[i . . . i + ℓ − 1] = P[j . . . j + ℓ − 1]
it's the furthest you can go before hitting a mismatch

[Figure: example queries on a text/pattern pair where LCP(i, j) returns 3, 4 and 0 for different choices of i and j.]

We can preprocess P and T for LCP queries in O(n) time and O(n) space
Each query then takes O(1) time
we'll see how to do this later in this lecture

First let's see how we can use LCP queries to solve the k-mismatch problem. . .

k-mismatch using LCP queries

Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i separately)

We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch (or we reach the end of P)

[Figure: worked example at one alignment. A first LCP query returns 3 - a mismatch! - which we skip over; the following queries return 0, then 2, then 3, and so on until the end of P is reached.]

We can therefore calculate Hamk(i) for a single alignment i in O(k) time
Overall this takes O(nk) time (including the 'preprocessing' for LCP queries)

this is pretty good but we can do better. . . but first, how do we answer those LCP queries?

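The "jump over each mismatch" idea can be sketched as follows. This is a hedged illustration: the lcp function below is a naive character-by-character scan standing in for the O(1) suffix-tree/LCA query described next, and the name hamming_k_at is an invented one.

    def lcp(text, i, pattern, j):
        """Stand-in LCP query: length of the longest common prefix of text[i:]
        and pattern[j:].  A naive scan here; the suffix-tree + LCA preprocessing
        described below answers this in O(1) time."""
        ell = 0
        while i + ell < len(text) and j + ell < len(pattern) and text[i + ell] == pattern[j + ell]:
            ell += 1
        return ell

    def hamming_k_at(text, pattern, i, k):
        """Hamk(i) via at most k+1 LCP queries: each query either finds the next
        mismatch (which we jump over) or reaches the end of the pattern."""
        m = len(pattern)
        j = 0                                   # current position within the pattern
        mismatches = 0
        while j < m:
            j += lcp(text, i + j, pattern, j)   # skip the matching prefix
            if j >= m:                          # reached the end of P: done
                break
            mismatches += 1                     # text[i + j] != pattern[j]
            if mismatches > k:
                return None                     # more than k mismatches: output 'X'
            j += 1                              # jump over the mismatch
        return mismatches

    # Running this for every alignment i gives the O(nk) algorithm
    # (once the LCP queries really are O(1)).
    text, pattern, k = "abcabaabcbbaaaba", "abcbabcb", 2
    print([hamming_k_at(text, pattern, i, k) for i in range(len(text) - len(pattern) + 1)])
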
LCPs in Suffix Trees

[Figure: the suffix tree of the text T = bananas$, with one leaf per suffix (leaves labelled 0–7) and edge labels such as "na", "nas$" and "$".]

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
in O(n) prep. time and space

Single string LCP:
For any pair of locations i, j in T, LCPT(i, j) is the largest ℓ such that
T[i . . . i + ℓ − 1] = T[j . . . j + ℓ − 1]

What is the LCA of the leaves representing suffixes i and j?
it's the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]

We store the root-to-node length at each internal node
so we can recover the length, LCPT(i, j), in O(1) time

So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string

We can extend this to two strings (P and T) by first concatenating them together. . .
(and proceeding as for a single string)

So we also have O(n) space, O(n) prep. time and O(1) query time
for the LCP problem on two strings
I.e. for any i in T and j in P, LCP(i, j) is the largest ℓ such that
T[i . . . i + ℓ − 1] = P[j . . . j + ℓ − 1] (as we originally defined it)

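A small sketch of the concatenation trick, under the assumption that the separator character (here '\x00') occurs in neither string; the inner lcp_s scan is again only a stand-in for the O(1) LCA-based query described above, and make_lcp is an illustrative name.

    def make_lcp(text, pattern):
        """Reduce two-string LCP queries to single-string LCP queries on the
        concatenation S = T + separator + P.  lcp_s is a naive scan standing in
        for the O(1) suffix-tree + LCA query."""
        sep = "\x00"                        # assumed to occur in neither string
        s = text + sep + pattern
        offset = len(text) + 1              # pattern position j lives at s[offset + j]

        def lcp_s(a, b):
            ell = 0
            while a + ell < len(s) and b + ell < len(s) and s[a + ell] == s[b + ell]:
                ell += 1
            return ell

        def lcp(i, j):
            # LCP(i, j) as defined above: the scan can never run past the
            # separator, because the separator matches nothing else in s.
            return lcp_s(i, offset + j)

        return lcp

    lcp = make_lcp("babcabaabcbab", "abcbabcb")
    print(lcp(4, 0))   # compares T[4..] = "abaabcbab" with P[0..] = "abcbabcb" -> 2
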
k-mismatch using frequent/infrequent symbols

Definition: A symbol is frequent if it occurs at least √k times in P,
and infrequent otherwise

[Example: a pattern P of length m = 9 (positions 0–8) with k = 4 (so √k = 2): a is frequent, b is frequent, d is frequent and c is infrequent.]

How many frequent symbols can there be? Lots! there could be m/√k frequent symbols

Case 1: There are fewer than 2√k frequent symbols in P.

Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent - O(m log m) time
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture) - O(n√k log m) time
Stage 2: Count all matches involving infrequent symbols (as in last lecture) - O(n√k) time
- O(n√k log m) total time

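A minimal sketch of the classification step (Stage 0 above); counting with a hash table as below takes O(m) expected time, so treat the O(m log m) quoted for a comparison-based implementation as an upper bound. The function name classify_symbols is illustrative, and the example pattern is chosen only to reproduce the symbol frequencies of the slides' example.

    from collections import Counter
    from math import isqrt

    def classify_symbols(pattern, k):
        """Split the pattern's alphabet into frequent symbols (at least sqrt(k)
        occurrences in P) and infrequent ones."""
        threshold = isqrt(k)            # floor of sqrt(k); exact when k is a square (here k = 4)
        counts = Counter(pattern)
        frequent = {c for c, cnt in counts.items() if cnt >= threshold}
        infrequent = set(counts) - frequent
        return frequent, infrequent

    frequent, infrequent = classify_symbols("acdabbbda", 4)
    print(sorted(frequent), sorted(infrequent))    # ['a', 'b', 'd'] ['c']
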
Case 2: There are at least 2√k frequent symbols

Pick any 2√k frequent symbols and for each symbol pick √k occurrences in P.
This gives us 2k interesting pattern locations, denoted J

[Example with k = 4: in a pattern of length 14 where a, b, c and d are frequent and e and f are infrequent, the chosen locations are J = {0, 2, 3, 4, 5, 7, 9, 10}.]

Let dk(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations

[Example: at alignment i = 4 of the text shown on the slide, dk(i) = 3.]

Fact: if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match

Fact: There are at most n/√k values of i with dk(i) ≥ k
this follows from a counting argument:
For any location i′, T[i′] = P[j] for either 0 or √k distinct j ∈ J
This implies that Σi dk(i) ≤ Σi′ Σj∈J Eq(T[i′], P[j]) ≤ n√k
(where Eq(T[i′], P[j]) = 1 if T[i′] = P[j] and 0 otherwise)
Assume that more than n/√k values of i have dk(i) ≥ k
Then Σi dk(i) ≥ (n/√k + 1) · k > n√k . . . Contradiction!

We can filter the text, using the dk(i) values, leaving only n/√k alignments to check
every other alignment has more than k mismatches
Check each of the remaining alignments using LCP queries in O(k) time per alignment
This takes n/√k · O(k) = O(n√k) total time.

How do we compute all the dk(i) values?

Computing all the dk(i) values

Recall that dk(i) is the number of j ∈ J such that P[j] = T[i + j],
i.e. the number of (single character) matches involving interesting pattern locations.

Let dk(i) = 0 for all i.
For each text character T[i′],
    For each j ∈ J such that P[j] = T[i′],
        increase dk(i′ − j) by one

T[i′] = P[j] for either 0 or √k distinct j ∈ J
(store a list of j values for each symbol)

This takes O(n√k) total time.

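The dk computation and the filtering step might be sketched like this (an illustration, not the lecture's code): positions_by_symbol is the per-symbol list of j values mentioned above, and each surviving alignment would then be checked with the O(k) LCP-based routine from earlier in the lecture. The set J below is hand-picked purely for the demo rather than derived from real frequent symbols.

    from collections import defaultdict

    def compute_dk(text, pattern, interesting):
        """dk(i) = number of j in the interesting set J with P[j] == T[i + j].
        Each text character contributes to either 0 or sqrt(k) counters, so the
        total work is O(n * sqrt(k))."""
        n, m = len(text), len(pattern)
        # store a list of interesting j values for each symbol of the pattern
        positions_by_symbol = defaultdict(list)
        for j in interesting:
            positions_by_symbol[pattern[j]].append(j)

        dk = [0] * (n - m + 1)
        for i_prime, c in enumerate(text):
            for j in positions_by_symbol.get(c, ()):
                i = i_prime - j                 # alignment at which this match lands
                if 0 <= i <= n - m:
                    dk[i] += 1
        return dk

    def surviving_alignments(dk, k):
        """Filter: any alignment with dk(i) < k has more than k mismatches, so only
        alignments with dk(i) >= k need the O(k) LCP-based check."""
        return [i for i, d in enumerate(dk) if d >= k]

    text, pattern, k = "bbadcaabedbafbcfcdcabaebe", "acdabbbda", 4
    J = [0, 2, 3, 4, 5, 7]      # hand-picked for illustration; the real algorithm picks 2k locations
    dk = compute_dk(text, pattern, J)
    print(surviving_alignments(dk, k))
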
Pattern matching with k-mismatches: putting it all together

Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Count the number of frequent symbols in P - O(m log m) time

Case 1: P has at most 2√k frequent symbols
    Count matches with frequent symbols using cross-correlations - O(n√k log m) time
    Count matches with infrequent symbols directly - O(n√k) time

Case 2: P has more than 2√k frequent symbols
    Filter the text, leaving n/√k alignments - O(n√k) time
    Count mismatches at these alignments using LCP queries - O(n√k) time

Overall, we obtain a time complexity of O(n√k log m).
- this can be improved to O(n√k log k)

Conclusion

Input: A text string T (length n), a pattern string P (length m) and a positive integer k

Goal: For all i, output

    Hamk(i) =  Ham(i)   if Ham(i) ≤ k
               X        if Ham(i) > k

Output the number of mismatches. . . unless it's more than k
(we interpret the output X to mean "too many mismatches")

[Figure: the running example with k = 2, e.g. Hamk(6) = 1.]

We saw two algorithms for this problem:
One algorithm takes O(nk) time
The other algorithm takes O(n√k log m) time (improvable to O(n√k log k) time)
  • 11. Pattern matching with few mismatches (k-mismatch) T Input A text string T (length n), a pattern string P (length m) and a positive integer k P ba b c a b d a a d ad a Goal: For all i, output, a a a m 0 1 2 3 4 5 6 7 8 9 10 11 12 n a Output the number of mismatches. . . unless it’s more than k (we interpret the output X to mean “too many mismatches”) Hamk(i) =    Ham(i) if Ham(i) k X if Ham(i) > k
  • 12. Pattern matching with few mismatches (k-mismatch) T Input A text string T (length n), a pattern string P (length m) and a positive integer k P ba b c a b d a a d ad a Goal: For all i, output, a a a m 0 1 2 3 4 5 6 7 8 9 10 11 12 n a Output the number of mismatches. . . unless it’s more than k (we interpret the output X to mean “too many mismatches”) Hamk(i) =    Ham(i) if Ham(i) k X if Ham(i) > k k = 2
  • 13. Pattern matching with few mismatches (k-mismatch) T Input A text string T (length n), a pattern string P (length m) and a positive integer k P ba b c a b d a a d ad a Goal: For all i, output, a a a m 0 1 2 3 4 5 6 7 8 9 10 11 12 n a Hamk(4) = 1 Output the number of mismatches. . . unless it’s more than k (we interpret the output X to mean “too many mismatches”) Hamk(i) =    Ham(i) if Ham(i) k X if Ham(i) > k k = 2
  • 14. Pattern matching with few mismatches (k-mismatch) T Input A text string T (length n), a pattern string P (length m) and a positive integer k P ba b c a a d ad a Goal: For all i, output, a a a 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Hamk(5) = X Output the number of mismatches. . . unless it’s more than k (we interpret the output X to mean “too many mismatches”) Hamk(i) =    Ham(i) if Ham(i) k X if Ham(i) > k k = 2
  • 15. Pattern matching with few mismatches (k-mismatch) T Input A text string T (length n), a pattern string P (length m) and a positive integer k P ba b c a a d ad a Goal: For all i, output, a a a 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Hamk(6) = 1 Output the number of mismatches. . . unless it’s more than k (we interpret the output X to mean “too many mismatches”) Hamk(i) =    Ham(i) if Ham(i) k X if Ham(i) > k k = 2
  • 16. Pattern matching with few mismatches (k-mismatch) T Input A text string T (length n), a pattern string P (length m) and a positive integer k P ba b c a a d ad a Goal: For all i, output, a a a 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Hamk(7) = 2 Output the number of mismatches. . . unless it’s more than k (we interpret the output X to mean “too many mismatches”) Hamk(i) =    Ham(i) if Ham(i) k X if Ham(i) > k k = 2
  • 17. Pattern matching with few mismatches (k-mismatch) T Input A text string T (length n), a pattern string P (length m) and a positive integer k P ba b c a a d ad a Goal: For all i, output, a a a 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Hamk(8) = X Output the number of mismatches. . . unless it’s more than k (we interpret the output X to mean “too many mismatches”) Hamk(i) =    Ham(i) if Ham(i) k X if Ham(i) > k k = 2
  • 18. Pattern matching with few mismatches (k-mismatch) T Input A text string T (length n), a pattern string P (length m) and a positive integer k P ba b c a a d ad a Goal: For all i, output, a a a • We could use the O(n √ m log m) time algorithm for Hamming distance. . . 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Hamk(8) = X Output the number of mismatches. . . unless it’s more than k (we interpret the output X to mean “too many mismatches”) Hamk(i) =    Ham(i) if Ham(i) k X if Ham(i) > k k = 2
  • 19. Pattern matching with few mismatches (k-mismatch) T Input A text string T (length n), a pattern string P (length m) and a positive integer k P ba b c a a d ad a Goal: For all i, output, a a a • We could use the O(n √ m log m) time algorithm for Hamming distance. . . but when k is small we can do much better 0 1 2 3 4 5 6 7 8 9 10 11 12 n a b d m a Hamk(8) = X Output the number of mismatches. . . unless it’s more than k (we interpret the output X to mean “too many mismatches”) Hamk(i) =    Ham(i) if Ham(i) k X if Ham(i) > k k = 2
  • 20. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch a b c b a bc bP nn nm
  • 21. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch a b c b a bc bP nn nm
  • 22. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch i j a b c b a bc bP nn nm
  • 23. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch i j a b c b a bc bP nn nm LCP(i, j) returns 3
  • 24. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch i j a b c b a bc bP nn nm LCP(i, j) returns 4
  • 25. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch a b c b a bc bP nn nm i j LCP(i, j) returns 0
  • 26. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch i j a b c b a bc bP nn nm LCP(i, j) returns 4
  • 27. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch i j a We can preprocess P and T for LCP queries in O(n) time and O(n) space b c b a bc bP nn nm LCP(i, j) returns 4
  • 28. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch i j a We can preprocess P and T for LCP queries in O(n) time and O(n) space Each query then takes O(1) time b c b a bc bP nn nm LCP(i, j) returns 4
  • 29. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch i j a We can preprocess P and T for LCP queries in O(n) time and O(n) space Each query then takes O(1) time we’ll see how to do this later in this lecture b c b a bc bP nn nm LCP(i, j) returns 4
  • 30. LCP - the Longest Common Prefix T ba b c a b aa b cb a b For any pair of locations i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] it’s the furthest you can go before hitting a mismatch i j a We can preprocess P and T for LCP queries in O(n) time and O(n) space Each query then takes O(1) time we’ll see how to do this later in this lecture we can use LCP queries to solve the k-mismatch problem. . . b c b a bc bP nn nm First let’s see how LCP(i, j) returns 4
  • 31. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn a a b a
  • 32. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn a a b a
  • 33. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) a a b a
  • 34. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries a a b a
  • 35. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch a a b a
  • 36. k-mismatch using LCP queries a b c aT a b c a b aa b cb b i a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch a a b a
  • 37. k-mismatch using LCP queries a b c aT a b c a b aa b cb b i j a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch a a b a
  • 38. k-mismatch using LCP queries a b c aT a b c a b aa b cb b i j a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch perform LCP(i, j) a a b a
  • 39. k-mismatch using LCP queries a b c aT a b c a b aa b cb b i j a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch perform LCP(i, j) a a b a
  • 40. k-mismatch using LCP queries a b c aT a b c a b aa b cb b i j a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch perform LCP(i, j) a a b a which returns 3
  • 41. k-mismatch using LCP queries a b c aT a b c a b aa b cb b i j a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch perform LCP(i, j) a a b a which returns 3
  • 42. k-mismatch using LCP queries a b c aT a b c a b aa b cb b i j a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch perform LCP(i, j) a a b a which returns 3 mismatch!
  • 43. k-mismatch using LCP queries a b c aT a b c a b aa b cb b i j a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch a a b a
  • 44. k-mismatch using LCP queries a b c aT a b c a b aa b cb b i j a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch a a b a skip over this!
  • 45. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch i j a a b a skip over this!
  • 46. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch i j perform LCP(i , j) a a b a
  • 47. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch i j perform LCP(i , j) a a b a which returns 0
  • 48. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch i j a a b a
  • 49. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch i j a a b a
  • 50. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch perform LCP(i , j)i j a a b a
  • 51. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch perform LCP(i , j)i j a a b a which returns 2
  • 52. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch i j a a b a
  • 53. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch perform LCP(i , j) a a b a i j
  • 54. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch perform LCP(i , j) a a b a i j which returns 3
  • 55. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch (or we reach the end of P) perform LCP(i , j) a a b a i j which returns 3
  • 56. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch (or we reach the end of P) a a b a i j
  • 57. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch (or we reach the end of P) We can therefore calculate Hamk(i) for a single alignment i in O(k) time a a b a i j
  • 58. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch (or we reach the end of P) We can therefore calculate Hamk(i) for a single alignment i in O(k) time Overall this takes O(nk) time (including the ‘preprocessing’ for LCP queries) a a b a i j
  • 59. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch (or we reach the end of P) We can therefore calculate Hamk(i) for a single alignment i in O(k) time Overall this takes O(nk) time (including the ‘preprocessing’ for LCP queries) this is pretty good but we can do better a a b a i j
  • 60. k-mismatch using LCP queries a b c aT a b c a b aa b cb b a b c b a bP nn Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P (we do this for each i seperately) We can do this using (at most) k + 1 LCP queries each query takes O(1) time and finds a new mismatch (or we reach the end of P) We can therefore calculate Hamk(i) for a single alignment i in O(k) time Overall this takes O(nk) time (including the ‘preprocessing’ for LCP queries) this is pretty good but we can do better but first. . . how do we answer those LCP queries? a a b a i j
  • 61. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 in O(n) prep. time and space
  • 62. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 in O(n) prep. time and space
  • 63. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 i j 3 in O(n) prep. time and space
  • 64. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 i j LCA(i, j) 3 in O(n) prep. time and space
  • 65. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 i j LCA(i, j) 3 in O(n) prep. time and space
  • 66. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 i j LCA(i, j) 3 it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1] in O(n) prep. time and space
  • 67. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 i j LCA(i, j) 3 it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1] in O(n) prep. time and space For any pair of locations i, j in T, LCPT (i, j) is the largest such that T[i . . . i + − 1] = T[j . . . j + − 1] Single string LCP:
  • 68. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1] LCA(i, j) j 1 5 in O(n) prep. time and space i For any pair of locations i, j in T, LCPT (i, j) is the largest such that T[i . . . i + − 1] = T[j . . . j + − 1] Single string LCP:
  • 69. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1] LCA(i, j) j 4 i 2 in O(n) prep. time and space For any pair of locations i, j in T, LCPT (i, j) is the largest such that T[i . . . i + − 1] = T[j . . . j + − 1] Single string LCP:
  • 70. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1] LCA(i, j) j 4 i 2 in O(n) prep. time and space
  • 71. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1] LCA(i, j) j 4 i 2 We store the root-to-node length at each internal node so we can recover the length, LCPT (i, j) in O(1) time in O(n) prep. time and space
  • 72. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1] LCA(i, j) j 4 i 2 We store the root-to-node length at each internal node 1 3 0 2 so we can recover the length, LCPT (i, j) in O(1) time in O(n) prep. time and space
  • 73. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 What is the LCA of the leaves representing suffixes i and j? Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries 1 5 it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1] LCA(i, j) j 4 i 2 We store the root-to-node length at each internal node 1 3 0 2 so we can recover the length, LCPT (i, j) in O(1) time in O(n) prep. time and space So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string
  • 74. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 1 5 LCA(i, j) j 4 i 2 1 3 0 2 So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string
  • 75. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 1 5 LCA(i, j) j 4 i 2 1 3 0 2 So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string We can extend this two strings (P and T) by first concatenating them together. . . (and proceeding as for a single string)
  • 76. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 1 5 LCA(i, j) j 4 i 2 1 3 0 2 So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string We can extend this two strings (P and T) by first concatenating them together. . . So we also have O(n) space, O(n) prep. time and O(1) query time (and proceeding as for a single string) for the LCP problem on two strings
  • 77. LCPs in Suffix Trees TT b n aaa sn n $ a s$ nas$ nas$ s$ na s$ bananas$ 7$ b n aaa sn n aaa sn n aa sn aa sn a sn a s s suffixes $ $ $ $ $ $ $ 0 1 2 3 4 5 6 $ 7 3 0 2 4 6 nas$ This is the suffix tree of this text 2 1 5 LCA(i, j) j 4 i 2 1 3 0 2 So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string We can extend this two strings (P and T) by first concatenating them together. . . So we also have O(n) space, O(n) prep. time and O(1) query time (and proceeding as for a single string) for the LCP problem on two strings I.e. for any i in T and j in P, LCP(i, j) is the largest such that T[i . . . i + − 1] = P[j . . . j + − 1] (as we originally defined it)
  • 78. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, and infrequent otherwise
  • 79. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, P a d bc ab b da 0 1 2 3 4 5 6 7 8 and infrequent otherwise k = 4 ( √ k = 2) m = 9
  • 80. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, P a d bc ab b da 0 1 2 3 4 5 6 7 8 and infrequent otherwise k = 4 ( √ k = 2)
  • 81. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, P a d bc ab b da 0 1 2 3 4 5 6 7 8 a is frequent and infrequent otherwise k = 4 ( √ k = 2)
  • 82. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, P a d bc ab b da 0 1 2 3 4 5 6 7 8 a is frequent , b is frequent and infrequent otherwise k = 4 ( √ k = 2)
  • 83. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, P a d bc ab b da 0 1 2 3 4 5 6 7 8 a is frequent , b is frequent and infrequent otherwise k = 4 ( √ k = 2) , d is frequent
  • 84. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, P a d bc ab b da 0 1 2 3 4 5 6 7 8 a is frequent , b is frequent c is infrequent and infrequent otherwise k = 4 ( √ k = 2) , d is frequent
  • 85. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, P a d bc ab b da 0 1 2 3 4 5 6 7 8 a is frequent , b is frequent c is infrequent and infrequent otherwise How many frequent symbols can there be? k = 4 ( √ k = 2) , d is frequent
  • 86. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, P a d bc ab b da 0 1 2 3 4 5 6 7 8 a is frequent , b is frequent c is infrequent and infrequent otherwise How many frequent symbols can there be? Lots! k = 4 ( √ k = 2) , d is frequent
  • 87. k-mismatch using frequent/infrequent symbols Definition: A symbol is frequent if it occurs at least √ k times in P, P a d bc ab b da 0 1 2 3 4 5 6 7 8 a is frequent , b is frequent c is infrequent and infrequent otherwise How many frequent symbols can there be? Lots! there could be m√ k frequent symbols k = 4 ( √ k = 2) , d is frequent
• 88-94. Case 1: There are fewer than 2√k frequent symbols in P. Algorithm summary: Stage 0: Classify each symbol as frequent or infrequent - O(m log m) time. Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture) - O(n√k log m) time. Stage 2: Count all matches involving infrequent symbols (as in last lecture) - O(n√k) time. This gives O(n√k log m) total time for Case 1.
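A hedged sketch of how Case 1 could be coded. numpy.correlate evaluates the per-symbol cross-correlation directly (so this version costs O(nm) per frequent symbol rather than the O(n log m) of the FFT-based cross-correlations from the last lecture), and the infrequent symbols are counted by walking over their few occurrences. The function name and the frequent argument are assumptions carried over from the earlier classification sketch.

```python
# A sketch (under the stated assumptions) of Case 1: for every alignment i,
# count the matching positions, split into frequent and infrequent symbols.
import numpy as np


def count_matches_case1(text, pattern, frequent):
    n, m = len(text), len(pattern)
    matches = np.zeros(n - m + 1, dtype=np.int64)

    # Stage 1: one cross-correlation per frequent symbol. np.correlate evaluates
    # it directly; the last lecture's FFT-based version costs O(n log m) per symbol.
    for c in frequent:
        t_mask = np.array([1 if ch == c else 0 for ch in text], dtype=np.int64)
        p_mask = np.array([1 if ch == c else 0 for ch in pattern], dtype=np.int64)
        matches += np.correlate(t_mask, p_mask, mode='valid')

    # Stage 2: each infrequent symbol occurs fewer than sqrt(k) times in P, so we
    # can afford to walk over every (text occurrence, pattern occurrence) pair.
    positions_in_p = {}
    for j, c in enumerate(pattern):
        if c not in frequent:
            positions_in_p.setdefault(c, []).append(j)
    for i_prime, c in enumerate(text):
        for j in positions_in_p.get(c, []):
            i = i_prime - j
            if 0 <= i <= n - m:
                matches[i] += 1

    return matches  # Ham(i) = m - matches[i]
```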
• 95-103. Case 2: There are at least 2√k frequent symbols. Pick any 2√k frequent symbols and for each symbol pick √k occurrences in P. This gives us 2k interesting pattern locations, denoted J. In the running example (k = 4), a, b, c and d are frequent while e and f are infrequent, and the chosen occurrences give J = {0, 2, 3, 4, 5, 7, 9, 10}. Let dk(i) be the number of j ∈ J such that P[j] = T[i + j], i.e. the number of (single character) matches involving interesting pattern locations; in the example text, at alignment i = 4 we have dk(i) = 3.
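A small sketch of how J could be chosen, assuming Case 2 holds and reusing the frequent set from the classification sketch; the function name is illustrative and, as in the slides, any choice of symbols and occurrences works.

```python
# A minimal sketch (illustrative names): pick 2*sqrt(k) frequent symbols and, for
# each of them, sqrt(k) of its occurrences in P, giving the 2k positions of J.
# Assumes Case 2 holds, i.e. there are at least 2*sqrt(k) frequent symbols.
from collections import defaultdict
from math import isqrt


def choose_interesting_positions(pattern, k, frequent):
    root_k = isqrt(k)                                  # k = 4 in the example, so root_k = 2
    occurrences = defaultdict(list)
    for j, c in enumerate(pattern):
        if c in frequent:
            occurrences[c].append(j)

    chosen_symbols = sorted(occurrences)[:2 * root_k]  # any 2*sqrt(k) frequent symbols will do
    J = []
    for c in chosen_symbols:
        J.extend(occurrences[c][:root_k])              # any sqrt(k) occurrences of c will do
    return sorted(J)                                   # |J| = 2k interesting pattern locations
```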
• 104-105. Fact: if dk(i) < k then there are more than k mismatches at alignment i (i.e. Hamk(i) = X), because there are 2k interesting positions and fewer than k of them match.
• 106-112. Fact: there are at most n/√k values of i with dk(i) ≥ k. This follows from a counting argument. For any text location i′, T[i′] = P[j] for either 0 or √k distinct j ∈ J (either T[i′] is not one of the chosen symbols, or it is and exactly √k of its occurrences were picked). This implies that ∑i dk(i) ≤ ∑i′ ∑j∈J Eq(T[i′], P[j]) ≤ n√k, where Eq = 1 if T[i′] = P[j] (and 0 otherwise).
• 113-117. Now assume that more than n/√k values of i have dk(i) ≥ k. Then ∑i dk(i) ≥ (n/√k + 1) · k > n√k. Contradiction!
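The same counting argument, typeset for readability (Eq and dk as defined above):

```latex
% The counting argument behind the second fact, with Eq(x, y) = 1 if x = y and 0 otherwise.
\[
  \sum_i d_k(i) \;=\; \sum_i \sum_{j \in J} \mathrm{Eq}\bigl(T[i+j], P[j]\bigr)
  \;\le\; \sum_{i'} \sum_{j \in J} \mathrm{Eq}\bigl(T[i'], P[j]\bigr)
  \;\le\; n\sqrt{k},
\]
since every text location $i'$ matches either $0$ or exactly $\sqrt{k}$ of the positions in $J$.
If more than $n/\sqrt{k}$ alignments had $d_k(i) \ge k$, we would get
\[
  \sum_i d_k(i) \;>\; \frac{n}{\sqrt{k}} \cdot k \;=\; n\sqrt{k},
\]
a contradiction.
```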
• 118-121. We can filter the text using the dk(i) values, leaving only n/√k alignments to check (every other alignment has more than k mismatches). Check each of the remaining alignments using LCP queries in O(k) time per alignment. This takes n/√k · O(k) = O(n√k) total time. How do we compute all the dk(i) values?
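A sketch of the O(k)-per-alignment verification, often called the kangaroo method: repeatedly jump over runs of matches using LCP queries and count the mismatches in between, giving up after k+1 of them. The naive lcp helper below stands in for the O(1) suffix-tree query from earlier, so the point here is only that each alignment needs at most k+1 queries.

```python
# A sketch of verifying one surviving alignment with at most k+1 LCP queries.
# The naive lcp() stands in for the O(1) suffix tree + LCA query; with that
# structure each call is O(1), so the whole check costs O(k) per alignment.

def lcp(text, pattern, i, j):
    """Largest l with text[i:i+l] == pattern[j:j+l] (naive stand-in)."""
    l = 0
    while i + l < len(text) and j + l < len(pattern) and text[i + l] == pattern[j + l]:
        l += 1
    return l


def mismatches_up_to_k(text, pattern, i, k):
    """Return Ham(i) if it is at most k, otherwise None (the slides' X)."""
    j, mismatches, m = 0, 0, len(pattern)
    while j < m:
        j += lcp(text, pattern, i + j, j)  # jump over the next run of matches
        if j >= m:
            break
        mismatches += 1                    # text[i+j] != pattern[j] here
        if mismatches > k:
            return None                    # more than k mismatches: report X
        j += 1                             # step past the mismatch and continue
    return mismatches


print(mismatches_up_to_k("abcde", "abdde", 0, 2))  # -> 1
```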
• 122-124. Computing all the dk(i) values: Let dk(i) = 0 for all i. For each text character T[i′], for each j ∈ J such that P[j] = T[i′], increase dk(i′ − j) by one. Recall that T[i′] = P[j] for either 0 or √k distinct j ∈ J (store a list of the relevant j values for each symbol), so this takes O(n√k) total time.
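A direct sketch of this procedure; compute_dk and its argument names are illustrative. Bucketing the positions of J by symbol is what guarantees the inner loop runs either 0 or √k times per text character.

```python
# A sketch of computing every dk(i) in O(n * sqrt(k)) total time, following the
# procedure above: bucket the positions of J by symbol, then scan the text once.
from collections import defaultdict


def compute_dk(text, pattern, J):
    alignments = len(text) - len(pattern) + 1
    dk = [0] * alignments

    # Positions of J grouped by the pattern symbol at that position; by the
    # construction of J each non-empty bucket holds exactly sqrt(k) entries.
    j_by_symbol = defaultdict(list)
    for j in J:
        j_by_symbol[pattern[j]].append(j)

    for i_prime, c in enumerate(text):        # for each text character T[i']
        for j in j_by_symbol.get(c, []):      # for each j in J with P[j] = T[i']
            i = i_prime - j
            if 0 <= i < alignments:
                dk[i] += 1                    # this single-character match helps alignment i
    return dk
```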
• 125-134. Pattern matching with k-mismatches: putting it all together. Algorithm summary: Preprocess P, T for LCP queries - O(n) time. Count the number of frequent symbols in P - O(m log m) time. Case 1: P has fewer than 2√k frequent symbols. Count matches with frequent symbols using cross-correlations - O(n√k log m) time. Count matches with infrequent symbols directly - O(n√k) time. Case 2: P has at least 2√k frequent symbols. Filter the text, leaving n/√k alignments - O(n√k) time. Count mismatches at these alignments using LCP queries - O(n√k) time. Overall, we obtain a time complexity of O(n√k log m) - this can be improved to O(n√k log k).
• 135. Conclusion. Input: a text string T (length n), a pattern string P (length m) and a positive integer k. Goal: for all i, output Hamk(i) = Ham(i) if Ham(i) ≤ k, and Hamk(i) = X if Ham(i) > k. That is, output the number of mismatches unless it is more than k (we interpret the output X to mean "too many mismatches"); in the running example with k = 2, Hamk(6) = 1. We saw two algorithms for this problem: one algorithm takes O(nk) time, and the other takes O(n√k log m) time (improvable to O(n√k log k) time).
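Finally, a naive O(nm) reference implementation of Hamk (not one of the two lecture algorithms) can be useful for sanity-checking either of them; None plays the role of the output X.

```python
# A naive O(nm) reference for Ham_k (not one of the lecture's algorithms), handy
# for sanity-checking: count mismatches at every alignment, None standing in for X.

def ham_k_naive(text, pattern, k):
    n, m = len(text), len(pattern)
    out = []
    for i in range(n - m + 1):
        mismatches = sum(1 for j in range(m) if text[i + j] != pattern[j])
        out.append(mismatches if mismatches <= k else None)
    return out


print(ham_k_naive("ababab", "abb", 1))  # -> [1, None, 1, None]
```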