SlideShare ist ein Scribd-Unternehmen logo
1 von 171
Downloaden Sie, um offline zu lesen
Advanced Algorithms – COMS31900
Pattern Matching part two
Suffix Arrays
Benjamin Sach
Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
e.g. 4, 6, 10
4 6 10
Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
• Last lecture we saw the that text indexing problem can be solved using a suffix tree
e.g. 4, 6, 10
4 6 10
which uses O(n) space (when it’s stored compacted)
Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
• Last lecture we saw the that text indexing problem can be solved using a suffix tree
• Queries take O(m + occ) time when the alphabet size is constant
- occ is the number of occurences (matches)
e.g. 4, 6, 10
4 6 10
which uses O(n) space (when it’s stored compacted)
Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
• Last lecture we saw the that text indexing problem can be solved using a suffix tree
• Queries take O(m + occ) time when the alphabet size is constant
- occ is the number of occurences (matches)
• Suffix trees can be constructed in O(n) time (but we only saw how to achieve O(n2) time)
e.g. 4, 6, 10
4 6 10
which uses O(n) space (when it’s stored compacted)
The suffix array
T b n aaT a sn
n
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa snsuffix
1
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c<
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c<
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c<
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
just a fancy name for the order the strings would appear in the dictionary
In lexicographical ordering we sort strings based on the first symbol that differs:
The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
If the symbols don’t have a natural order, we use their binary representation in memory
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
just a fancy name for the order the strings would appear in the dictionary
In lexicographical ordering we sort strings based on the first symbol that differs:
The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
10 2 3 4 5 6
The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
10 2 3 4 5 6
The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
10 2 3 4 5 6
The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
10 2 3 4 5 6
The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
The suffix array is much smaller than the suffix tree (in terms of constants)
10 2 3 4 5 6
The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
The suffix array is much smaller than the suffix tree (in terms of constants)
10 2 3 4 5 6
From Suffix Trees to Suffix Arrays
a
s$
nas$
nas$
nas$
s$
na
s$
bananas$
1 3
5
0
2 4
6
T b n aaT a sn
n
Suffix Array
Suffix Tree
1 0 625 43
n
Recall that we can get get the Suffix Array from the Suffix Tree
1 3
5
0
2 4
6
10 2 3 4 5 6
using depth-first search in O(n) time
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
157 100 1 43 112 1396 1485 12
c db cb a c bc c d adb
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
c db cb a c bc c d adb
m
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences could start anywhere
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences could start anywhere
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences could start anywhere
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c d adb>
c db cb a c bc c d adb
m
c db cb
occurences could start anywhere
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c d adb>
c db cb a c bc c d adb
m
c db cb
occurences could start anywhere
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c d adb>
c db cb a c bc c d adb
m
c db cb
occurences could start anywhere
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c d adb>
c db cb a c bc c d adb
m
c db cb
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb < c bc db
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb < c bc db
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb < c bc db
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
=
c db cb a c bc c d adb
c db cb
m
c db cb
occurences must start in here
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
=
c db cb a c bc c d adb
c db cb
m
c db cb
occurences must start in here
we found a match!
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
How long does this take?
m
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
How long does this take?
O(m) time to compare two strings
m
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
How long does this take?
O(m) time to compare two strings
so O(m log n) time in total
m
Find an occurence of P
using binary search
Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
How long does this take?
O(m) time to compare two strings
so O(m log n) time in total
This method generalises to O(m log n + occ) time
m
to find all occ occurences.
Find an occurence of P
using binary search
by continuing the binary search
(we will skip the details)
The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
P aa n
m
an
Finding an occurrence of a pattern (length m) takes O(m log n) time
Finding all occurrences takes O(m log n + occ) time
where occ is the number of occurences
The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
P aa n
m
an
Finding an occurrence of a pattern (length m) takes O(m log n) time
Finding all occurrences takes O(m log n + occ) time
where occ is the number of occurences
This can be further improved to O(m + log n + occ) time
(using LCP queries which we will see in a future lecture)
The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
P aa n
m
an
Finding an occurrence of a pattern (length m) takes O(m log n) time
Finding all occurrences takes O(m log n + occ) time
where occ is the number of occurences
Do we really need to build the suffix tree to construct the suffix array?
This can be further improved to O(m + log n + occ) time
(using LCP queries which we will see in a future lecture)
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
R1 is split into blocks of length 3
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
R2 is also split into blocks of length 3
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 3
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
we assume that the bit representation of each symbol uses O(log n) bits.
(which is a common and realistic assumption)
This can be done by sorting the blocks in O(n) time using radix sort
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
1 2 3 4 5 6 70
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
How do we compute the suffix array for R ?
Recursion!
(Notice that R has length 2n/3)
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
what use1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
is this?!
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
what use1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
is this?!
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
is this?!
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2
is this?!
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2
a b d ob a a b b a d
d oa b b a d
is this?!
1
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
is this?!
1
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
is this?!
1
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
is this?!
1
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
is this?!
1
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
d oa b b a d $ $
is this?!
1
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
d oa b b a d $ $
their order is given by the suffix array of R :
is this?!
1
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
d oa b b a d $ $
their order is given by the suffix array of R :
is this?!
1
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
d oa b b a d $ $
their order is given by the suffix array of R :
is this?!
1
5
Suffix is smaller
than suffix
1
55
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
ob b a d $ b d ob a a b b a d $ $
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
d oa b b a d $ $
ob b a d $ b d ob a a b b a d $ $
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
d oa b b a d $ $
ob b a d $ b d ob a a b b a d $ $
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
d oa b b a d $ $
ob b a d $ b d ob a a b b a d $ $
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
d oa b b a d $ $
ob b a d $ b d ob a a b b a d $ $
Suffix is smaller
than suffix
7
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
d oa b b a d $ $
b d ob a a b b a d $ $
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
d oa b b a d $ $
b d ob a a b b a d $ $
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
d oa b b a d $ $
b d ob a a b b a d $ $
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
d oa b b a d $ $
b d ob a a b b a d $ $
Suffix is smaller
than suffix
2
5
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
is this?!
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling,
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling, 1 24 111058 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling,
we have the suffix array of just the suffixes from B1 ∪ B2
1 24 111058 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling,
we have the suffix array of just the suffixes from B1 ∪ B2
1 24 111058 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling,
we have the suffix array of just the suffixes from B1 ∪ B2
1 24 111058 7
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2 1 24 111058 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
1 24 111058 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
1 24 111058 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
suffix 0 is a y
1 24 111058 7
followed by suffix 1
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
1 24 111058 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
1 24 111058 7
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
Suffix array for just B0 6 09 3
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
Suffix array for just B0 6 09 3 How do we merge these?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?1 6
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?
+ 7a=
1 6
6
1 +a= 2
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?
+ 7a=
1 6
6
1 +a= 2
(a, 4)
(a, 3)
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?
+ 7a=
1 6
6
1 +a= 2
(a, 4)
(a, 3)
It takes O(1) time to decide
that 1 is smaller
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?
+ 7a=
1 6
6
1 +a= 2
(a, 4)
(a, 3)
It takes O(1) time to decide
that 1 is smaller
1
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1
which is smaller, suffix or4 6 ?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1
which is smaller, suffix or4 6
+ 7a=6
4 +a= 5
?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1
which is smaller, suffix or4 6
+ 7a=6
4 +a= 5
(a, 4)
(a, 5)
?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1
which is smaller, suffix or4 6
+ 7a=6
4 +a= 5
(a, 4)
(a, 5)
Again, it takes O(1) time to decide
that is smaller6
?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6
which is smaller, suffix or4 9 ?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6
)4(which is smaller, suffix or4 9 ? is smaller
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
)4(which is smaller, suffix or4 9 ? is smaller
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
8which is smaller, suffix or 9 ?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
8
+a=
+b=
9 10
8 9
which is smaller, suffix or 9 ?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
8
+a=
+b=
9 10
8 9
Uh oh! how do we compare 9 10to ?
which is smaller, suffix or 9 ?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
8
Uh oh! how do we compare 9 10to ?
+a=
+b=
9
8
d
a
+
+
11
10
It still takes O(1) time to decide
that is smaller9
which is smaller, suffix or 9 ?
The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
Overall this merging phase takes O(n) time
(because processing each suffix takes O(1) time)
The DC3 method
Theorem
The DC3 algorithm constructs a suffix array in O(n) time.
The DC3 method
Theorem
The DC3 algorithm constructs a suffix array in O(n) time.
Proof
Suppose T(n) is the running time. We have
T(n) = T(2n/3) + O(n)
The DC3 method
Theorem
The DC3 algorithm constructs a suffix array in O(n) time.
Proof
Suppose T(n) is the running time. We have
T(n) = T(2n/3) + O(n)
radix sorting and merging
recursion to construct
a suffix array of size 2n/3
The DC3 method
Theorem
The DC3 algorithm constructs a suffix array in O(n) time.
Proof
Suppose T(n) is the running time. We have
T(n) = T(2n/3) + O(n)
Solving this recurrence gives T(n) ∈ O(n).
radix sorting and merging
recursion to construct
a suffix array of size 2n/3
The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
P aa n
m
an
Finding an occurrence of a pattern (length m) takes O(m log n) time
Finding all occurrences takes O(m log n + occ) time
where occ is the number of occurences
We can construct the suffix array in O(n) time
This can be further improved to O(m + log n + occ) time
(using LCP queries which we will see in a future lecture)

Weitere ähnliche Inhalte

Was ist angesagt?

the fourier series
the fourier seriesthe fourier series
the fourier series
safi al amu
 
Application of tries
Application of triesApplication of tries
Application of tries
Tech_MX
 
Cut and Count
Cut and CountCut and Count
Cut and Count
ASPAK2014
 
15 practical grep command examples in linux
15 practical grep command examples in linux15 practical grep command examples in linux
15 practical grep command examples in linux
Teja Bheemanapally
 

Was ist angesagt? (20)

International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
POST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEMPOST’s CORRESPONDENCE PROBLEM
POST’s CORRESPONDENCE PROBLEM
 
Application of fourier series to differential equations
Application of fourier series to differential equationsApplication of fourier series to differential equations
Application of fourier series to differential equations
 
Lex analysis
Lex analysisLex analysis
Lex analysis
 
Boyer more algorithm
Boyer more algorithmBoyer more algorithm
Boyer more algorithm
 
String matching algorithms
String matching algorithmsString matching algorithms
String matching algorithms
 
Parseval's Theorem
Parseval's TheoremParseval's Theorem
Parseval's Theorem
 
the fourier series
the fourier seriesthe fourier series
the fourier series
 
Theory of computation / Post’s Correspondence Problems (PCP)
Theory of computation / Post’s Correspondence Problems (PCP)Theory of computation / Post’s Correspondence Problems (PCP)
Theory of computation / Post’s Correspondence Problems (PCP)
 
Fuzzy α^m-Separation Axioms
Fuzzy α^m-Separation AxiomsFuzzy α^m-Separation Axioms
Fuzzy α^m-Separation Axioms
 
String matching algorithms
String matching algorithmsString matching algorithms
String matching algorithms
 
Application of tries
Application of triesApplication of tries
Application of tries
 
Object-Oriented Programming 5
Object-Oriented Programming 5Object-Oriented Programming 5
Object-Oriented Programming 5
 
Cut and Count
Cut and CountCut and Count
Cut and Count
 
Basics of Mathematical Cryptography
Basics of Mathematical CryptographyBasics of Mathematical Cryptography
Basics of Mathematical Cryptography
 
25 String Matching
25 String Matching25 String Matching
25 String Matching
 
15 practical grep command examples in linux
15 practical grep command examples in linux15 practical grep command examples in linux
15 practical grep command examples in linux
 
Fourier Series of Music by Robert Fustero
Fourier Series of Music by Robert FusteroFourier Series of Music by Robert Fustero
Fourier Series of Music by Robert Fustero
 
String matching Algorithm by Foysal
String matching Algorithm by FoysalString matching Algorithm by Foysal
String matching Algorithm by Foysal
 
Seq db searching
Seq db searchingSeq db searching
Seq db searching
 

Mehr von Benjamin Sach

Mehr von Benjamin Sach (20)

Approximation Algorithms Part Four: APTAS
Approximation Algorithms Part Four: APTASApproximation Algorithms Part Four: APTAS
Approximation Algorithms Part Four: APTAS
 
Approximation Algorithms Part Three: (F)PTAS
Approximation Algorithms Part Three: (F)PTASApproximation Algorithms Part Three: (F)PTAS
Approximation Algorithms Part Three: (F)PTAS
 
Approximation Algorithms Part Two: More Constant factor approximations
Approximation Algorithms Part Two: More Constant factor approximationsApproximation Algorithms Part Two: More Constant factor approximations
Approximation Algorithms Part Two: More Constant factor approximations
 
Approximation Algorithms Part One: Constant factor approximations
Approximation Algorithms Part One: Constant factor approximationsApproximation Algorithms Part One: Constant factor approximations
Approximation Algorithms Part One: Constant factor approximations
 
van Emde Boas trees
van Emde Boas treesvan Emde Boas trees
van Emde Boas trees
 
Orthogonal Range Searching
Orthogonal Range SearchingOrthogonal Range Searching
Orthogonal Range Searching
 
Pattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatchesPattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatches
 
Pattern Matching Part Three: Hamming Distance
Pattern Matching Part Three: Hamming DistancePattern Matching Part Three: Hamming Distance
Pattern Matching Part Three: Hamming Distance
 
Lowest Common Ancestor
Lowest Common AncestorLowest Common Ancestor
Lowest Common Ancestor
 
Range Minimum Queries
Range Minimum QueriesRange Minimum Queries
Range Minimum Queries
 
Hashing Part Two: Cuckoo Hashing
Hashing Part Two: Cuckoo HashingHashing Part Two: Cuckoo Hashing
Hashing Part Two: Cuckoo Hashing
 
Hashing Part Two: Static Perfect Hashing
Hashing Part Two: Static Perfect HashingHashing Part Two: Static Perfect Hashing
Hashing Part Two: Static Perfect Hashing
 
Hashing Part One
Hashing Part OneHashing Part One
Hashing Part One
 
Probability Recap
Probability RecapProbability Recap
Probability Recap
 
Bloom Filters
Bloom FiltersBloom Filters
Bloom Filters
 
Dynamic Programming
Dynamic ProgrammingDynamic Programming
Dynamic Programming
 
Minimum Spanning Trees (via Disjoint Sets)
Minimum Spanning Trees (via Disjoint Sets)Minimum Spanning Trees (via Disjoint Sets)
Minimum Spanning Trees (via Disjoint Sets)
 
Shortest Paths Part 1: Priority Queues and Dijkstra's Algorithm
Shortest Paths Part 1: Priority Queues and Dijkstra's AlgorithmShortest Paths Part 1: Priority Queues and Dijkstra's Algorithm
Shortest Paths Part 1: Priority Queues and Dijkstra's Algorithm
 
Depth First Search and Breadth First Search
Depth First Search and Breadth First SearchDepth First Search and Breadth First Search
Depth First Search and Breadth First Search
 
Shortest Paths Part 2: Negative Weights and All-pairs
Shortest Paths Part 2: Negative Weights and All-pairsShortest Paths Part 2: Negative Weights and All-pairs
Shortest Paths Part 2: Negative Weights and All-pairs
 

Kürzlich hochgeladen

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Kürzlich hochgeladen (20)

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 

Pattern Matching Part Two: Suffix Arrays

  • 1. Advanced Algorithms – COMS31900 Pattern Matching part two Suffix Arrays Benjamin Sach
  • 2. Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n
  • 3. Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m
  • 4. Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T.
  • 5. Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. e.g. 4, 6, 10 4 6 10
  • 6. Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • Last lecture we saw the that text indexing problem can be solved using a suffix tree e.g. 4, 6, 10 4 6 10 which uses O(n) space (when it’s stored compacted)
  • 7. Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • Last lecture we saw the that text indexing problem can be solved using a suffix tree • Queries take O(m + occ) time when the alphabet size is constant - occ is the number of occurences (matches) e.g. 4, 6, 10 4 6 10 which uses O(n) space (when it’s stored compacted)
  • 8. Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • Last lecture we saw the that text indexing problem can be solved using a suffix tree • Queries take O(m + occ) time when the alphabet size is constant - occ is the number of occurences (matches) • Suffix trees can be constructed in O(n) time (but we only saw how to achieve O(n2) time) e.g. 4, 6, 10 4 6 10 which uses O(n) space (when it’s stored compacted)
  • 9. The suffix array T b n aaT a sn n
  • 10. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn
  • 11. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa snsuffix 1
  • 12. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn
  • 13. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn
  • 14. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order
  • 15. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs:
  • 16. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a
  • 17. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a
  • 18. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a
  • 19. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<
  • 20. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<
  • 21. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<
  • 22. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a
  • 23. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a
  • 24. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 25. The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 26. The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 27. The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 28. The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 29. The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 30. The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller) just a fancy name for the order the strings would appear in the dictionary In lexicographical ordering we sort strings based on the first symbol that differs:
  • 31. The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order If the symbols don’t have a natural order, we use their binary representation in memory b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller) just a fancy name for the order the strings would appear in the dictionary In lexicographical ordering we sort strings based on the first symbol that differs:
  • 32. The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn
  • 33. The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 10 2 3 4 5 6
  • 34. The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6
  • 35. The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6
  • 36. The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6
  • 37. The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n The suffix array is much smaller than the suffix tree (in terms of constants) 10 2 3 4 5 6
  • 38. The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n The suffix array is much smaller than the suffix tree (in terms of constants) 10 2 3 4 5 6
  • 39. From Suffix Trees to Suffix Arrays a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array Suffix Tree 1 0 625 43 n Recall that we can get get the Suffix Array from the Suffix Tree 1 3 5 0 2 4 6 10 2 3 4 5 6 using depth-first search in O(n) time
  • 40. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 157 100 1 43 112 1396 1485 12 c db cb a c bc c d adb
  • 41. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find c db cb a c bc c d adb m
  • 42. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m Find an occurence of P using binary search
  • 43. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences could start anywhere Find an occurence of P using binary search
  • 44. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences could start anywhere Find an occurence of P using binary search
  • 45. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences could start anywhere Find an occurence of P using binary search
  • 46. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences could start anywhere Find an occurence of P using binary search
  • 47. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences could start anywhere Find an occurence of P using binary search
  • 48. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences could start anywhere Find an occurence of P using binary search
  • 49. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences must start in here Find an occurence of P using binary search
  • 50. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 51. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 52. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 53. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb < c bc db c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 54. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb < c bc db c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 55. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb < c bc db c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 56. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 57. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 58. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: = c db cb a c bc c d adb c db cb m c db cb occurences must start in here Find an occurence of P using binary search
  • 59. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: = c db cb a c bc c d adb c db cb m c db cb occurences must start in here we found a match! Find an occurence of P using binary search
  • 60. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m Find an occurence of P using binary search
  • 61. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? m Find an occurence of P using binary search
  • 62. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? O(m) time to compare two strings m Find an occurence of P using binary search
  • 63. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? O(m) time to compare two strings so O(m log n) time in total m Find an occurence of P using binary search
  • 64. Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? O(m) time to compare two strings so O(m log n) time in total This method generalises to O(m log n + occ) time m to find all occ occurences. Find an occurence of P using binary search by continuing the binary search (we will skip the details)
  • 65. The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences
  • 66. The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences This can be further improved to O(m + log n + occ) time (using LCP queries which we will see in a future lecture)
  • 67. The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences Do we really need to build the suffix tree to construct the suffix array? This can be further improved to O(m + log n + occ) time (using LCP queries which we will see in a future lecture)
  • 68. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T =
  • 69. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1
  • 70. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1
  • 71. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2
  • 72. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2
  • 73. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 74. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 R1 is split into blocks of length 3
  • 75. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 76. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 77. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 R2 is also split into blocks of length 3
  • 78. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 79. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 80. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol)
  • 81. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1
  • 82. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2
  • 83. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 3
  • 84. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34
  • 85. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 5
  • 86. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56
  • 87. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7
  • 88. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 we assume that the bit representation of each symbol uses O(log n) bits. (which is a common and realistic assumption) This can be done by sorting the blocks in O(n) time using radix sort
  • 89. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7
  • 90. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7
  • 91. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7
  • 92. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7
  • 93. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70
  • 94. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70 0 1 6 4 2 5 3 7
  • 95. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70 0 1 6 4 2 5 3 7 How do we compute the suffix array for R ? Recursion! (Notice that R has length 2n/3)
  • 96. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70 0 1 6 4 2 5 3 7
  • 97. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : what use1 2 3 4 5 6 70 0 1 6 4 2 5 3 7 is this?!
  • 98. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : what use1 2 3 4 5 6 70 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 is this?!
  • 99. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 is this?!
  • 100. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 is this?!
  • 101. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 a b d ob a a b b a d d oa b b a d is this?! 1 5
  • 102. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d is this?! 1 5
  • 103. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d is this?! 1 5
  • 104. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ is this?! 1 5
  • 105. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ is this?! 1 5
  • 106. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ is this?! 1 5
  • 107. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ their order is given by the suffix array of R : is this?! 1 5
  • 108. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ their order is given by the suffix array of R : is this?! 1 5
  • 109. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ their order is given by the suffix array of R : is this?! 1 5 Suffix is smaller than suffix 1 55
  • 110. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?!
  • 111. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7
  • 112. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 ob b a d $ b d ob a a b b a d $ $
  • 113. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $
  • 114. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $
  • 115. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $
  • 116. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $ Suffix is smaller than suffix 7 5
  • 117. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?!
  • 118. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2
  • 119. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $
  • 120. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $
  • 121. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $
  • 122. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $ Suffix is smaller than suffix 2 5
  • 123. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 is this?!
  • 124. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7
  • 125. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7
  • 126. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling,
  • 127. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, 1 24 111058 7
  • 128. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, we have the suffix array of just the suffixes from B1 ∪ B2 1 24 111058 7
  • 129. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, we have the suffix array of just the suffixes from B1 ∪ B2 1 24 111058 7
  • 130. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, we have the suffix array of just the suffixes from B1 ∪ B2 1 24 111058 7 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
  • 131. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
  • 132. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 24 111058 7
  • 133. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7
  • 134. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 1 24 111058 7
  • 135. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y 1 24 111058 7
  • 136. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y suffix 0 is a y 1 24 111058 7 followed by suffix 1
  • 137. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y 1 24 111058 7
  • 138. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 1 24 111058 7
  • 139. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank:
  • 140. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank:
  • 141. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort
  • 142. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6)
  • 143. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6)
  • 144. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6)
  • 145. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Suffix array for just B0 6 09 3
  • 146. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Suffix array for just B0 6 09 3 How do we merge these?
  • 147. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these?
  • 148. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . .
  • 149. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ?1 6
  • 150. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2
  • 151. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2 (a, 4) (a, 3)
  • 152. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2 (a, 4) (a, 3) It takes O(1) time to decide that 1 is smaller
  • 153. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2 (a, 4) (a, 3) It takes O(1) time to decide that 1 is smaller 1
  • 154. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 ?
  • 155. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 + 7a=6 4 +a= 5 ?
  • 156. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 + 7a=6 4 +a= 5 (a, 4) (a, 5) ?
  • 157. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 + 7a=6 4 +a= 5 (a, 4) (a, 5) Again, it takes O(1) time to decide that is smaller6 ?
  • 158. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6
  • 159. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 which is smaller, suffix or4 9 ?
  • 160. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 )4(which is smaller, suffix or4 9 ? is smaller
  • 161. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 )4(which is smaller, suffix or4 9 ? is smaller
  • 162. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8which is smaller, suffix or 9 ?
  • 163. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8 +a= +b= 9 10 8 9 which is smaller, suffix or 9 ?
  • 164. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8 +a= +b= 9 10 8 9 Uh oh! how do we compare 9 10to ? which is smaller, suffix or 9 ?
  • 165. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8 Uh oh! how do we compare 9 10to ? +a= +b= 9 8 d a + + 11 10 It still takes O(1) time to decide that is smaller9 which is smaller, suffix or 9 ?
  • 166. The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 Overall this merging phase takes O(n) time (because processing each suffix takes O(1) time)
  • 167. The DC3 method Theorem The DC3 algorithm constructs a suffix array in O(n) time.
  • 168. The DC3 method Theorem The DC3 algorithm constructs a suffix array in O(n) time. Proof Suppose T(n) is the running time. We have T(n) = T(2n/3) + O(n)
  • 169. The DC3 method Theorem The DC3 algorithm constructs a suffix array in O(n) time. Proof Suppose T(n) is the running time. We have T(n) = T(2n/3) + O(n) radix sorting and merging recursion to construct a suffix array of size 2n/3
  • 170. The DC3 method Theorem The DC3 algorithm constructs a suffix array in O(n) time. Proof Suppose T(n) is the running time. We have T(n) = T(2n/3) + O(n) Solving this recurrence gives T(n) ∈ O(n). radix sorting and merging recursion to construct a suffix array of size 2n/3
  • 171. The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences We can construct the suffix array in O(n) time This can be further improved to O(m + log n + occ) time (using LCP queries which we will see in a future lecture)