Pattern Matching Part Two: k-mismatches

Advanced Algorithms – COMS31900
Pattern matching part four
Pattern matching with at most k mismatches
Benjamin Sach

Pattern matching with mismatches
Input: A text string T (length n) and a pattern string P (length m)
Goal: For every alignment i, output Ham(i), the Hamming distance between P and T[i . . . i + m − 1]
The Hamming distance is the number of mismatches,
i.e. the number of distinct j such that P[j] ≠ T[i + j]
[Figure: the text T (positions 0 to 12) with the pattern P aligned beneath it]

[Figure: P slides along T one alignment at a time]
Ham(4) = 1
Ham(5) = 4
Ham(6) = 1
Ham(7) = 3
Ham(8) = 3
Last lecture we saw two algorithms for this problem:
One algorithm takes O(n|Σ| log m) time (where |Σ| is the alphabet size)
The other algorithm takes O(n√m log m) time (regardless of the alphabet size)

Pattern matching with few mismatches (k-mismatch)
Input: A text string T (length n), a pattern string P (length m) and a positive integer k
Goal: For all i, output Ham_k(i), where
$$\mathrm{Ham}_k(i) = \begin{cases} \mathrm{Ham}(i) & \text{if } \mathrm{Ham}(i) \le k \\ X & \text{if } \mathrm{Ham}(i) > k \end{cases}$$
Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)
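The definition above can be sketched as a brute-force reference that compares P against every alignment directly. This is only a baseline taking O(nm) time, not one of the algorithms from the lecture; the function name `hamming_k` and the use of the string "X" for the "too many mismatches" output are assumptions made for illustration.

```python
def hamming_k(T, P, k):
    """Return [Ham_k(i)] for every alignment i of P in T.

    Brute force: O(m) work per alignment, so O(nm) overall.
    The string "X" stands for "more than k mismatches".
    """
    n, m = len(T), len(P)
    out = []
    for i in range(n - m + 1):
        # Ham(i): number of positions j with P[j] != T[i + j]
        mismatches = sum(1 for j in range(m) if P[j] != T[i + j])
        out.append(mismatches if mismatches <= k else "X")
    return out
```

A reference like this is mainly useful for checking faster implementations on small inputs.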

[Figure: P aligned against T at each alignment, with k = 2]
Ham_k(4) = 1
Ham_k(5) = X
Ham_k(6) = 1
Ham_k(7) = 2
Ham_k(8) = X
• We could use the O(n√m log m) time algorithm for Hamming distance. . .
but when k is small we can do much better

LCP - the Longest Common Prefix
For any pair of locations i in T and j in P, LCP(i, j) is the largest ℓ such that
T[i . . . i + ℓ − 1] = P[j . . . j + ℓ − 1]
it’s the furthest you can go before hitting a mismatch
[Figure: example strings T and P]

[Figure: example queries at different positions i and j, for which LCP(i, j) returns 3, 4 and 0]
We can preprocess P and T for LCP queries in O(n) time and O(n) space
Each query then takes O(1) time
we’ll see how to do this later in this lecture
First let’s see how we can use LCP queries to solve the k-mismatch problem. . .
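To make the definition of LCP(i, j) concrete, here is a minimal sketch that simply scans both strings until the first mismatch. It takes time proportional to the answer, so it is not the O(1)-time query described in the slides; the name `lcp_naive` is hypothetical.

```python
def lcp_naive(T, P, i, j):
    """Largest l such that T[i : i + l] == P[j : j + l].

    Scans character by character, stopping at the first mismatch
    or at the end of either string.
    """
    length = 0
    while (i + length < len(T) and j + length < len(P)
           and T[i + length] == P[j + length]):
        length += 1
    return length
```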

k-mismatch using LCP queries
[Figure: P aligned against T at alignment i]
Find the leftmost (at most) k + 1 mismatches between T[i . . . i + m − 1] and P
(we do this for each i separately)
We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch

[Figure: walkthrough for a single alignment i, with j the current position in P]
perform LCP(i, j), which returns 3
the next position is a mismatch! skip over this!
perform LCP(i′, j′), where i′ and j′ are the positions just after the mismatch, which returns 3
repeat until k + 1 mismatches have been found (or we reach the end of P)
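The walkthrough above can be sketched as a loop that alternates between an LCP query and skipping one mismatch. In this sketch the LCP query is a naive scan standing in for the O(1)-time query, so the running time here is not O(k) per alignment; the function names are hypothetical, and the alignment i is assumed to satisfy i + m ≤ n.

```python
def lcp_naive(T, P, i, j):
    """Stand-in for the O(1) LCP query: scan until the first mismatch."""
    length = 0
    while (i + length < len(T) and j + length < len(P)
           and T[i + length] == P[j + length]):
        length += 1
    return length


def hamming_k_at(T, P, i, k):
    """Ham_k(i) for one alignment, using at most k + 1 LCP queries.

    Returns the number of mismatches, or "X" if there are more than k.
    """
    m = len(P)
    mismatches = 0
    j = 0  # current position in P; the matching text position is i + j
    while j < m:
        j += lcp_naive(T, P, i + j, j)  # jump over the matching run
        if j >= m:
            break  # reached the end of P
        mismatches += 1  # P[j] != T[i + j]
        if mismatches > k:
            return "X"
        j += 1  # skip over the mismatch
    return mismatches
```

Swapping `lcp_naive` for a constant-time LCP structure would give the O(k) bound per alignment stated in the slides.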

We can therefore calculate Ham_k(i) for a single alignment i in O(k) time
Overall this takes O(nk) time (including the ‘preprocessing’ for LCP queries)
this is pretty good but we can do better
but first. . . how do we answer those LCP queries?

LCPs in Suffix Trees
[Figure: the suffix tree of the text T = bananas$, with one leaf for each of the suffixes 0 to 7]
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
in O(n) prep. time and space

What is the LCA of the leaves representing suffixes i and j?
[Figure: leaves i and j and the node LCA(i, j) highlighted in the suffix tree]
it’s the node representing the longest common prefix of T[i . . . n − 1] and T[j . . . n − 1]

Single string LCP:
For any pair of locations i, j in T, LCP_T(i, j) is the largest ℓ such that
T[i . . . i + ℓ − 1] = T[j . . . j + ℓ − 1]
[Figure: LCA(i, j) highlighted for two example pairs of leaves i and j]

We store the root-to-node length at each internal node
so we can recover the length, LCP_T(i, j), in O(1) time
[Figure: the internal nodes of the suffix tree labelled with their root-to-node lengths 0, 1, 2 and 3]
So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string
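A compact way to experiment with constant-time single-string LCP queries is sketched below. Note that it swaps in a different technique from the slides: instead of a suffix tree with LCA queries, it uses a suffix array, the adjacent-suffix LCP array (Kasai's method) and a sparse table for range minima. The preprocessing here is not O(n) (the suffix sort is naive and the sparse table uses O(n log n) space); only the O(1) query matches the slides. All function names are hypothetical.

```python
def build_lcp_index(text):
    """Preprocess text for LCP_T(i, j) queries (suffix array + sparse table)."""
    n = len(text)
    # Naive suffix sort; fine for small examples, not linear time.
    sa = sorted(range(n), key=lambda s: text[s:])
    rank = [0] * n
    for r, s in enumerate(sa):
        rank[s] = r
    # Kasai: lcp[r] = LCP of the suffixes sa[r - 1] and sa[r].
    lcp = [0] * n
    h = 0
    for s in range(n):
        if rank[s] == 0:
            h = 0
            continue
        p = sa[rank[s] - 1]
        while s + h < n and p + h < n and text[s + h] == text[p + h]:
            h += 1
        lcp[rank[s]] = h
        if h > 0:
            h -= 1
    # Sparse table: table[t][r] = min(lcp[r .. r + 2**t - 1]).
    table = [lcp]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[r], prev[r + span])
                      for r in range(n - 2 * span + 1)])
        span *= 2
    return n, rank, table


def lcp_query(index, i, j):
    """LCP_T(i, j) in O(1) time, given the index from build_lcp_index."""
    n, rank, table = index
    if i == j:
        return n - i
    lo, hi = sorted((rank[i], rank[j]))
    lo += 1  # the answer is min(lcp[lo .. hi])
    level = (hi - lo + 1).bit_length() - 1
    return min(table[level][lo], table[level][hi - (1 << level) + 1])
```

For the text bananas$ this gives, for example, `lcp_query(index, 1, 3) == 3` (the common prefix "ana") and `lcp_query(index, 2, 4) == 2` (the common prefix "na").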

We can extend this to two strings (P and T) by first concatenating them together. . .
(and proceeding as for a single string)
So we also have O(n) space, O(n) prep. time and O(1) query time
for the LCP problem on two strings
I.e. for any i in T and j in P, LCP(i, j) is the largest ℓ such that
T[i . . . i + ℓ − 1] = P[j . . . j + ℓ − 1] (as we originally defined it)
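The concatenation idea can be sketched as follows. Two details are assumptions not stated in the slides: a separator symbol that occurs in neither string is placed between P and T so that a match cannot run across the boundary, and the single-string query is a naive scan standing in for the O(1)-time structure. The names `lcp_single` and `make_two_string_lcp` are hypothetical.

```python
def lcp_single(S, a, b):
    """Naive stand-in for an O(1) single-string LCP query on S."""
    length = 0
    while (a + length < len(S) and b + length < len(S)
           and S[a + length] == S[b + length]):
        length += 1
    return length


def make_two_string_lcp(T, P, sep="#"):
    """Return a function lcp(i, j) answering LCP queries between T and P.

    Builds S = P + sep + T, so position j of P is position j of S and
    position i of T is position len(P) + 1 + i of S.
    """
    assert sep not in T and sep not in P  # assumption: sep is a fresh symbol
    S = P + sep + T
    offset = len(P) + 1

    def lcp(i, j):
        return lcp_single(S, offset + i, j)

    return lcp
```

Because the separator never occurs in T, a common prefix starting at position j of P can never extend past the end of P.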

k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least √k times in P,
and infrequent otherwise
[Figure: an example pattern P with m = 9 (positions 0 to 8) and k = 4 (√k = 2)]
a is frequent, b is frequent, d is frequent
c is infrequent
How many frequent symbols can there be? Lots! there could be m/√k frequent symbols
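The classification can be sketched with a single counting pass over P. The function name `classify_symbols` is hypothetical, and the comparison is written as count² ≥ k to avoid floating-point square roots, which is equivalent to count ≥ √k for non-negative counts.

```python
from collections import Counter


def classify_symbols(P, k):
    """Split the symbols of P into (frequent, infrequent) sets.

    A symbol is frequent if it occurs at least sqrt(k) times in P.
    """
    counts = Counter(P)
    frequent = {c for c, cnt in counts.items() if cnt * cnt >= k}
    infrequent = set(counts) - frequent
    return frequent, infrequent
```

For instance, with a made-up pattern "adbcabbda" and k = 4, the symbols a, b and d each occur at least twice and are frequent, while c occurs once and is infrequent.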

Case 1: There are fewer than 2√k frequent symbols in P.
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent - O(m log m) time
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture) - O(n√k log m) time
Stage 2: Count all matches involving infrequent symbols (as in last lecture) - O(n√k) time
- O(n√k log m) total time
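The three stages can be sketched as below. This is an illustration of the structure only: the per-symbol cross-correlation in Stage 1 is computed naively here, whereas the stated O(n√k log m) bound relies on computing each cross-correlation with the FFT as in the previous lecture. The function name `hamming_case1` is hypothetical, and the sketch returns Ham(i) for every alignment (thresholding at k gives Ham_k(i)).

```python
from collections import Counter, defaultdict


def hamming_case1(T, P, k):
    """Ham(i) for every alignment, split by frequent/infrequent symbols."""
    n, m = len(T), len(P)
    num_align = n - m + 1
    matches = [0] * num_align

    # Stage 0: classify symbols (frequent means count >= sqrt(k)).
    counts = Counter(P)
    frequent = {c for c, cnt in counts.items() if cnt * cnt >= k}

    # Stage 1: one cross-correlation per frequent symbol.
    # Computed naively here; an FFT would be used to meet the stated bound.
    for c in frequent:
        t_ind = [1 if x == c else 0 for x in T]
        p_ind = [1 if x == c else 0 for x in P]
        for i in range(num_align):
            matches[i] += sum(p_ind[j] * t_ind[i + j] for j in range(m))

    # Stage 2: infrequent symbols, handled directly. Each text character
    # meets fewer than sqrt(k) pattern positions, so this is O(n sqrt(k)).
    positions = defaultdict(list)
    for j, c in enumerate(P):
        if c not in frequent:
            positions[c].append(j)
    for t, c in enumerate(T):
        for j in positions.get(c, ()):
            if 0 <= t - j < num_align:
                matches[t - j] += 1

    return [m - x for x in matches]
```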

Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick √k occurrences in P.
This gives us 2k interesting pattern locations, denoted J
[Figure: an example pattern P with positions 0 to 13 and k = 4]
a, b, c and d are frequent; e and f are infrequent
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let d_k(i) be the number of j ∈ J such that P[j] = T[i + j]
i.e. the number of (single character) matches involving interesting pattern locations
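Choosing the set J can be sketched as follows. The slides treat √k as an integer; as an assumption, this sketch rounds √k up, so J has at least 2k positions when k is not a perfect square. The function name `interesting_positions` is hypothetical, and which frequent symbols and occurrences are picked (here, the first ones encountered) is arbitrary.

```python
import math
from collections import defaultdict


def interesting_positions(P, k):
    """Pick 2*sqrt(k) frequent symbols and sqrt(k) occurrences of each.

    Returns the sorted list J, or None if P has too few frequent
    symbols (in which case Case 2 does not apply).
    """
    r = math.isqrt(k)
    if r * r < k:
        r += 1  # assumption: round sqrt(k) up when k is not a square

    occurrences = defaultdict(list)
    for j, c in enumerate(P):
        occurrences[c].append(j)

    frequent = [c for c, occ in occurrences.items() if len(occ) >= r]
    if len(frequent) < 2 * r:
        return None

    J = []
    for c in frequent[: 2 * r]:
        J.extend(occurrences[c][:r])  # first r occurrences of symbol c
    return sorted(J)
```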

[Figure: P aligned against an example text T at alignment i = 4, where d_k(i) = 3]
Fact: if d_k(i) < k then there are more than k mismatches (i.e. Ham_k(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match

Fact: There are at most n/√k values of i with d_k(i) ≥ k
this follows from a counting argument
For any location i′, T[i′] = P[j] for either 0 or √k distinct j ∈ J
This implies that
$$\sum_i d_k(i) \le \sum_{i'} \sum_{j \in J} \mathrm{Eq}(T[i'], P[j]) \le n\sqrt{k}$$
where Eq(T[i′], P[j]) = 1 if T[i′] = P[j] (and 0 otherwise)
Assume that more than n/√k values of i have d_k(i) ≥ k
So
$$\sum_i d_k(i) \ge \left(\frac{n}{\sqrt{k}} + 1\right) \cdot k > n\sqrt{k}$$
Contradiction!

We can filter the text, using the d_k(i) values, leaving only n/√k alignments to check
every other alignment has more than k mismatches
Check each of the remaining alignments using LCP queries in O(k) time per alignment
This takes n/√k · O(k) = O(n√k) total time.
How do we compute all the d_k(i) values?

Computing all the d_k(i) values
Let d_k(i) = 0 for all i.
For each text character T[i′],
    For each j ∈ J such that P[j] = T[i′]
        increase d_k(i′ − j) by one
T[i′] = P[j] for either 0 or √k distinct j ∈ J
(store a list of j values for each symbol)
This takes O(n√k) total time.
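The procedure above, together with the filtering step it feeds, can be sketched like this. The set J is taken as an input, and the function names `compute_dk` and `surviving_alignments` are hypothetical. Updates that would fall outside the valid alignments 0 . . . n − m are skipped, a boundary detail the slides leave implicit.

```python
from collections import defaultdict


def compute_dk(T, P, J):
    """d_k(i) for every alignment i: matches at interesting positions J."""
    n, m = len(T), len(P)
    # Store a list of j values for each symbol.
    by_symbol = defaultdict(list)
    for j in J:
        by_symbol[P[j]].append(j)

    dk = [0] * (n - m + 1)
    for t, c in enumerate(T):
        for j in by_symbol.get(c, ()):
            i = t - j  # alignment at which P[j] sits under T[t]
            if 0 <= i <= n - m:
                dk[i] += 1
    return dk


def surviving_alignments(dk, k):
    """Alignments that pass the filter; all others have Ham_k(i) = X."""
    return [i for i, d in enumerate(dk) if d >= k]
```

Only the alignments returned by `surviving_alignments` then need to be checked with LCP queries.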

Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries - O(n) time
Count the number of frequent symbols in P - O(m log m) time
Case 1: P has fewer than 2√k frequent symbols
    Count matches with frequent symbols using cross-correlations - O(n√k log m) time
    Count matches with infrequent symbols directly - O(n√k) time
Case 2: P has at least 2√k frequent symbols
    Filter the text, leaving n/√k alignments - O(n√k) time
    Count mismatches at these alignments using LCP queries - O(n√k) time
Overall, we obtain a time complexity of O(n√k log m).
- this can be improved to O(n√k log k)

Conclusion
[Figure: the k-mismatch example with k = 2, where Ham_k(6) = 1]
We saw two algorithms for this problem:
One algorithm takes O(nk) time
The other algorithm takes O(n√k log m) time (improvable to O(n√k log k) time)
